Last changed: 28 February 1995
Introduction
Rationale
The objective is to propose multilingual extensions to the Web
that are as compatible as possible with the present Web.
The extensions should cover language engineering.
The term Web is used to name the clients, servers, HTTP, HTML, etc.
This vagueness is deliberate, as I do not wish to suggest too specifically
where to implement the mechanisms;
nevertheless, developing a multilingual client should be considered.
I give a concrete syntax,
but it should be considered mainly as an example to illustrate
the functional characteristics.
Suggestions
The concepts need maturing and suggestions are strongly encouraged,
particularly where requested with the label:
"Suggestions ?".
The broad areas are:
- Missing functionalities
- How to implement the mechanisms on the Web
Definitions
Language engineering
Language engineering covers anything related to languages:
- Terminology
- Translator's memory
- Multilingual documentary databases
- Aligned text
- Translator's Workbench
- Author's Workbench
- Machine translation
- Publishing (in particular, multilingual synchronized publishing)
Parallel Texts
Parallel texts are linguistic versions of the same text; for example,
the Treaty of Rome in English and Spanish are parallel texts.
Alignness
Alignness is a quality among linguistic versions of the same text; for example,
the Treaty of Rome in English and Spanish are parallel texts and they should
be aligned. Alignness can be guaranteed if the texts are kept as
Linguistic Objects; i.e., if they are kept in a structured database.
The interesting part is aligning parallel texts automatically.
Level of Alignness
Depending on the depth at which equivalent strings can be identified,
the texts are aligned at:
- Document level
- the trivial case; i.e., parallel texts
- Paragraph level
- not too hard
- Sentence level
- desirable and possible
- Term level
- it needs tagging (see below)
In this context, a sentence is a part of a text delimited by
a dot, semicolon or similar.
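As a rough illustration of sentence-level alignness, the sketch below splits two parallel texts into sentences and pairs them by position. The delimiter set and the function names (split_sentences, align_by_position) are assumptions made for this example; real alignment would need more than counting sentences.

```python
import re

# Assumption: a sentence ends at a dot, semicolon, question mark or
# exclamation mark. The exact delimiter set is illustrative only.
SENTENCE_END = re.compile(r"(?<=[.;?!])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the closing delimiter."""
    return [s.strip() for s in SENTENCE_END.split(text.strip()) if s.strip()]


def align_by_position(first, second):
    """Pair the sentences of two parallel texts by position.

    Returns None when the sentence counts differ, i.e. when this naive
    method cannot claim sentence-level alignness.
    """
    a, b = split_sentences(first), split_sentences(second)
    if len(a) != len(b):
        return None
    return list(zip(a, b))


pairs = align_by_position(
    "The white table. The black table. The green table.",
    "La mesa blanca. La mesa negra. La mesa verde.",
)
for english, spanish in pairs:
    print(english, "<->", spanish)
```

Texts whose sentence counts differ are reported as not aligned rather than being paired incorrectly.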
Character set
The present Web character set is probably sufficient for
English and for some non-English documents.
It is insufficient for complex multilingual environments;
for example, the European Institutions have eleven official languages.
For this type of environment, a large character set such as
Unicode (ISO 10646 BMP) is needed.
Note that the same document could have several languages with different alphabets;
for example, a document could have English and Greek:
The Greek Commissioner said: <something in Greek>.
Unicode is a 16-bit coded character set that includes most of the world's languages.
The 16 bits are both the strength and the weakness of Unicode:
the strength because one can represent all the characters;
the weakness because it is overkill for English.
Without compression, it doubles the disk space and the transmission volume.
The top 8 bits are not needed for English; i.e., they are set to zero.
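The size argument can be checked with a few lines of Python. This is only a sketch: it assumes that Python's "utf-16-be" codec is an acceptable stand-in for the 16-bit form discussed here (the two agree for characters that fit in 16 bits), and the sample strings are arbitrary.

```python
text = "The Greek Commissioner said:"

eight_bit = text.encode("latin-1")      # one byte per character
sixteen_bit = text.encode("utf-16-be")  # two bytes per character, no byte order mark

# The 16-bit form is exactly twice as long.
print(len(eight_bit), len(sixteen_bit))

# In big-endian order the top 8 bits come first; for English they are all zero.
top_bytes = sixteen_bit[0::2]
print(all(b == 0 for b in top_bytes))

# Greek text does use the top 8 bits.
greek = "Ελληνικά".encode("utf-16-be")
print(any(b != 0 for b in greek[0::2]))
```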
A Multilingual Web must be able to process Unicode and the present character set.
There should be a way to indicate a Unicode file;
for example, a file extension or a magic number (Suggestions ?).
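One possible magic number is the Unicode byte order mark (U+FEFF) placed at the start of the file. The sketch below assumes that convention, which this proposal does not itself choose, and falls back to an 8-bit character set otherwise; the function names and the fallback are illustrative.

```python
import codecs


def looks_like_unicode(data):
    """Return True if the bytes start with a 16-bit byte order mark."""
    return data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))


def read_text(data, fallback="latin-1"):
    """Decode as 16-bit Unicode when marked, else with the fallback set."""
    if looks_like_unicode(data):
        # The "utf-16" codec reads the mark, picks the byte order, and drops it.
        return data.decode("utf-16")
    return data.decode(fallback)


marked = "Ωmega".encode("utf-16")  # this codec writes a byte order mark first
print(read_text(marked))
print(read_text(b"plain text"))
```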
The programs should also be capable of mapping Unicode to other
character sets; for example, to ASCII, ISO 8859-1, PC1, etc.
When the target character set is poorer, the mapping should be reasonable;
for example, mapping "é" to "e", or an unmappable character to "_".
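A minimal sketch of such a mapping follows. It assumes that decomposing a character and dropping its accents counts as "reasonable"; the function name map_to_charset and its parameters are invented for this example, and the placeholder is assumed to exist in the target character set.

```python
import unicodedata


def map_to_charset(text, target="ascii", placeholder="_"):
    """Map text to a poorer character set, one character at a time."""
    out = []
    for ch in text:
        try:
            ch.encode(target)  # already representable: keep as is
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        # Decompose the character and drop combining marks (accents).
        base = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        try:
            base.encode(target)
        except UnicodeEncodeError:
            base = placeholder  # nothing reasonable is left
        out.append(base or placeholder)
    return "".join(out)


print(map_to_charset("café"))
print(map_to_charset("Ω"))
print(map_to_charset("café", target="latin-1"))
```

With a richer target such as ISO 8859-1 the accented character is kept, so the same routine serves several target sets.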
Tagging
Tagging in a Unicode document can be done
with a single Unicode (16-bit) character.
A region of the coding space should be dedicated to this purpose;
i.e., it could be considered a Web alphabet.
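To make the idea concrete, the sketch below borrows two code points from the Unicode Private Use Area (U+E000 to U+F8FF) as one-character tags. The choice of that region and the two tag assignments are assumptions for illustration only; no such Web alphabet has been defined.

```python
# Hypothetical tag assignments inside the Unicode Private Use Area.
TERM_START = "\uE000"
TERM_END = "\uE001"


def tag_term(text, term):
    """Wrap the first occurrence of term in the two one-character tags."""
    return text.replace(term, TERM_START + term + TERM_END, 1)


def strip_tags(text):
    """Remove every character that falls in the private use region."""
    return "".join(ch for ch in text if not "\uE000" <= ch <= "\uF8FF")


tagged = tag_term("The white table.", "table")
print(len(tagged) - len("The white table."))  # each tag costs one character
print(strip_tags(tagged))
```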
Multilingual Aligned Hypertext (MAH)
Multilingual Aligned Hypertext is an extension of the hypertext paradigm
to natural languages;
for example, a user looking at a document in English should be able
to obtain the Spanish equivalent in a transparent way.
For this, the Web must know about foreign languages.
A lot can be done without changing HTML and just by implementing clients
that know about the structure below.
A multilingual Web should have functionalities such as:
- Making a list of available languages for displaying in a drop-down menu or similar
- Indicating the level of alignness (paragraph, sentence, etc.)
- Displaying at least two aligned documents side by side and moving them in sync
- Giving polite error messages such as "the document is not available in Swedish"
- Implementing most of the Services proposed in Dragoman
- (Suggestions ? : more functionalities)
It is possible to have some Multilingual Aligned Hypertext with
the present Web using the structure below and the present character set,
provided that the end user is aware of the structure and
that the present character set is sufficient for the languages desired.
Data structure
A data structure is needed for Multilingual Aligned Hypertext.
The top of the structure is a mahName.html file.
The file can describe several schemes:
- Directory based scheme for a single set of files
- Directory based scheme for several sets of files
- SGML (Suggested by Jean Paoli, GRIF, abramatic@inria.fr)
- (Suggestions ? : other schemes such as WAIS, tar, cpio)
The mah files below could have any URL and
the name of the document could be different from
the file name (DocName).
(Suggestions ? : generalize grouping;
i.e. a groupingName.html, rather than a mahName.html file)
The default for a single set of files is:
mahDocName.html
DocName.mah (directory)
    /en.html    English
    /es.html    Spanish
    /de.html    German
The default for several sets of files is:
mahDatabaseName.html
DatabaseName.mah (directory)
    /en/DocName1.html    English
    /en/DocName2.html    English
    /es/DocName1.html    Spanish
    /es/DocName2.html    Spanish
    /de/DocName1.html    German
    /de/DocName2.html    German
(Suggestions ? : documentary indexing)
The mahName.html file should be usable directly
by the present clients (browsers)
and/or indirectly to generate HTML files on the fly.
Multilingual clients should use the information to access the documents in
a transparent way.
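The two default layouts can be sketched as path-building helpers, together with a language listing and the polite error message mentioned earlier. Everything here is an assumption for illustration: the function names, the language name table, and the idea that a client receives the list of existing files; the content of mahName.html itself is still to be defined.

```python
from pathlib import PurePosixPath

# Hypothetical table used only to word the error message.
LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "de": "German", "sv": "Swedish"}


def single_set_path(doc_name, lang):
    """Single set of files: DocName.mah/<lang>.html"""
    return PurePosixPath(f"{doc_name}.mah") / f"{lang}.html"


def multi_set_path(database_name, lang, doc_name):
    """Several sets of files: DatabaseName.mah/<lang>/DocName.html"""
    return PurePosixPath(f"{database_name}.mah") / lang / f"{doc_name}.html"


def available_languages(paths, doc_name):
    """List the language codes found directly under DocName.mah."""
    root = PurePosixPath(f"{doc_name}.mah")
    found = (PurePosixPath(p) for p in paths)
    return sorted(p.stem for p in found if p.parent == root and p.suffix == ".html")


def locate(paths, doc_name, lang):
    """Return the path of one linguistic version, or fail politely."""
    if lang in available_languages(paths, doc_name):
        return str(single_set_path(doc_name, lang))
    name = LANGUAGE_NAMES.get(lang, lang)
    raise LookupError(f"the document is not available in {name}")


files = ["Treaty.mah/en.html", "Treaty.mah/es.html", "Treaty.mah/de.html"]
print(available_languages(files, "Treaty"))
print(locate(files, "Treaty", "es"))
print(multi_set_path("Treaties", "de", "DocName1"))
```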
Anchoring Strategy
The anchoring strategy must minimize the number of anchors and it must allow
changing the defaults.
Only one linguistic version of the document should have explicit anchors (e.g. English);
the other linguistic versions would have implicit anchors;
i.e., the anchors should be calculated from the alignness of the different linguistic versions.
The anchors would have to be at least at sentence level.
It would be hard to place implicit anchors in part of a sentence without tagging,
so in that case the second text should have null anchors,
or named null anchors if there are several in one sentence.
Example:
- No need for null anchoring in the second text.
A whole sentence is anchored in the first text and
finding the place for the implicit anchor in the second text is easy.
The white table. The black table. The green table.
La mesa blanca. La mesa negra. La mesa verde.
(implicit anchor)
- It needs a null anchor in the second text.
Only part of a sentence is anchored in the first text and
finding the place for the implicit anchor in the second text is hard.
The white table. The black table. The green table.
La mesa blanca. La mesa negra. La mesa verde.
(null anchor)
(Suggestions ? : Module that should be in charge of implicit anchoring)
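The two cases above can be sketched as follows. The sketch assumes that anchors are given as character offsets and that sentences correspond one to one by position; the names sentence_spans and implicit_anchor are invented for this example. When the explicit anchor covers only part of a sentence, the function returns None, which is where a null anchor in the second text would be needed.

```python
import re

# Assumption: a sentence runs up to a dot, semicolon, question or exclamation mark.
SENTENCE = re.compile(r"[^.;?!]+[.;?!]")


def sentence_spans(text):
    """Return the (start, end) offsets of each sentence, skipping leading spaces."""
    spans = []
    for m in SENTENCE.finditer(text):
        leading = len(m.group()) - len(m.group().lstrip())
        spans.append((m.start() + leading, m.end()))
    return spans


def implicit_anchor(first, second, anchor):
    """Map an explicit anchor (start, end) in first to a span in second.

    Returns None when the anchor is not exactly one whole sentence, or
    when the two texts do not have the same number of sentences.
    """
    a, b = sentence_spans(first), sentence_spans(second)
    if len(a) != len(b) or anchor not in a:
        return None
    return b[a.index(anchor)]


english = "The white table. The black table. The green table."
spanish = "La mesa blanca. La mesa negra. La mesa verde."

start = english.index("The black table.")
whole = (start, start + len("The black table."))
span = implicit_anchor(english, spanish, whole)
print(spanish[span[0]:span[1]])

part = (english.index("black"), english.index("black") + len("black"))
print(implicit_anchor(english, spanish, part))
```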
Applications
- The requester of a translation could send only the URL of the document
to be translated, perhaps in a mah file.
The document would have to be in a document repository;
i.e., it must be guaranteed that no further changes are allowed.
Note that the repository does not necessarily have to be controlled
by the translation department.
- The requester can pass the background documentation as anchors.
- The translation folder is just a mah file; it could start as
the one sent by the requester and be augmented later.
- The translator could look at the work of his colleagues in other
languages; i.e., if an English document has to be translated into
Spanish and German, the Spanish translator could look at the ongoing work
of the German translator.
- The translator could verify the alignness of his translation.
- The requester of a publication could use a similar paradigm to the
requester of translation.
- The publishing department could generate camera-ready publications
in several linguistic versions.
Defaults
- Structure as above
- English documents have the explicit anchors
- The terminology database is DefaultTerminologyDatabase
Precedence
1. Defaults
2. Common preference.html (see present practice)
3. Document
4. Private preference.html
where 1 has the lowest precedence
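A layered lookup is one way to apply this precedence. In the sketch below the order of the four levels comes from the list above, while every key and value (other than DefaultTerminologyDatabase) is hypothetical.

```python
from collections import ChainMap

# Hypothetical settings for each of the four levels.
defaults = {
    "ExplicitAnchorLanguage": "en",
    "TerminologyDatabase": "DefaultTerminologyDatabase",
    "LanguagePreference": ["en"],
}
common_preference = {"LanguagePreference": ["en", "es"]}
document = {"TerminologyDatabase": "TreatyTerminology"}
private_preference = {"LanguagePreference": ["es", "en"]}


def resolve(defaults, common, document, private):
    """Combine the four levels; each later argument overrides the earlier ones."""
    # ChainMap looks keys up from left to right, so the level with the
    # highest precedence is listed first.
    return ChainMap(private, document, common, defaults)


settings = resolve(defaults, common_preference, document, private_preference)
print(settings["LanguagePreference"])      # taken from the private preference
print(settings["TerminologyDatabase"])     # taken from the document
print(settings["ExplicitAnchorLanguage"])  # falls through to the defaults
```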
preference.html
LanguagePreference=<a list of ISO-639 codes>
(incomplete)
ISO-639 codes are two-character codes for languages.
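Since the preference file is not yet complete, the sketch below has to assume a format: the codes are taken to be comma-separated, in order of preference. Both function names are invented for this example.

```python
def parse_language_preference(line):
    """Parse a line such as 'LanguagePreference=es,en' into ['es', 'en'].

    Assumption: the ISO-639 codes are separated by commas.
    """
    key, sep, value = line.partition("=")
    if not sep or key.strip() != "LanguagePreference":
        raise ValueError("not a LanguagePreference line")
    return [code.strip().lower() for code in value.split(",") if code.strip()]


def choose_language(preferred, available):
    """Return the first preferred language that is available, else None."""
    for code in preferred:
        if code in available:
            return code
    return None


preferred = parse_language_preference("LanguagePreference=sv, es, en")
print(preferred)
print(choose_language(preferred, {"en", "es", "de"}))
```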
Miscellaneous
Dragoman
Dragoman is a reference model for language engineering.
It uses the Multilingual Aligned Hypertext technique.
In essence, Dragoman describes a database
(part structured and part documentary)
and Services that can be implemented over the database.
The Web paradigm is particularly well adapted to Dragoman.
The term Dragoman has nothing to do with dragons; it means language interpreter.
To do list
- Define the mahName.html file
- Complete the preference file
- Multilingual indexing
- Connections between structured (database) and unstructured (documents) data
- Fixing the Web terminology in other languages
Disclaimer
This document represents only the views of the author.
The document does not engage in any way
the Commission of the European Communities.
Copyright 1995 the Conference Organizer for
the Third International World-Wide Web Conference
in case of acceptance by the International Program Committee,
otherwise the author.
- End -