On the Multilingual normalization of the Web

Mr. Manuel Tomas CARRASCO BENITEZ
Unicode, character, normalization, Dragoman

Last changed: 28 February 1995

Introduction

Rationale

The objective is to propose multilingual extensions to the Web that are as compatible as possible with the present Web. The extensions should cover language engineering.

The term Web is used to name the clients, servers, HTTP, HTML, etc. It is left undefined on purpose, as I do not wish to suggest too specifically where to implement the mechanisms, though developing a multilingual client should be considered. I give a concrete syntax, but it should be considered mostly as an example to illustrate the functional characteristics.

Suggestions

The concepts need maturing and suggestions are strongly encouraged, particularly where requested with the label:

"Suggestions ?".

The broad areas are:

Missing functionalities
How to implement the mechanisms on the Web

Definitions

Language engineering

Language engineering covers anything related to languages:

Terminology
Translator's memory
Multilingual documentary databases
Aligned text
Translator's Workbench
Author's Workbench
Machine translation
Publishing (in particular, multilingual synchronized publishing)

Parallel Texts

Parallel texts are linguistic versions of the same text; for example, the English and Spanish versions of the Treaty of Rome are parallel texts.

Alignness

Alignness is a quality among linguistic versions of the same text; for example, the English and Spanish versions of the Treaty of Rome are parallel texts and they should be aligned. Alignness can be guaranteed if the texts are kept as Linguistic Objects; i.e., kept in a structured database. The interesting part is aligning parallel texts automatically.

Level of Alignness

The texts are aligned at one of the following levels, according to the depth at which the equivalent string can be identified:

Document level: the trivial case; i.e., parallel texts
Paragraph level: not too hard
Sentence level: desirable and possible
Term level: it needs tagging (see below)

In this context, a sentence is a part of a text delimited by a dot, semicolon or similar.
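As a minimal sketch of sentence-level alignment under the definition above, the following Python splits two parallel texts at dots, semicolons and similar marks and pairs the sentences by position. The function names `split_sentences` and `align_by_position` are hypothetical, the sample sentences are invented for illustration, and pairing by position assumes that both versions contain the same number of sentences in the same order, which real parallel texts often do not.

```python
import re

# Assumption: a sentence ends at a dot, semicolon, question mark or
# exclamation mark that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.;?!])\s+")


def split_sentences(text):
    """Split a text into sentences, keeping the closing delimiter."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def align_by_position(first, second):
    """Pair the n-th sentence of one version with the n-th of the other.

    Returns None when the sentence counts differ, since position alone
    cannot align such texts.
    """
    a, b = split_sentences(first), split_sentences(second)
    if len(a) != len(b):
        return None
    return list(zip(a, b))


english = "The Council shall meet. It shall decide; the vote is public."
spanish = "El Consejo se reunirá. Decidirá; la votación es pública."
for en, es in align_by_position(english, spanish):
    print(en, "|", es)
```

A mismatch in sentence counts is reported rather than guessed at, because a wrong pairing would silently misplace every later sentence.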

Character set

The present Web character set is probably sufficient for English and for some non-English documents. It is insufficient for complex multilingual environments; for example, the European Institutions have eleven official languages. For this type of environment, a large character set such as Unicode (ISO 10646 BMP) is needed. Note that the same document could contain several languages with different alphabets; for example, a document could have English and Greek:

The Greek Commissioner said: <something in Greek>.

Unicode is a 16-bit character set that includes most of the world's languages. The 16 bits are both the strength and the weakness of Unicode: the strength because one can represent all the characters; the weakness because it is overkill for English. Without compression, it doubles the disk space and the transmission volume. The top 8 bits are not needed for English; i.e., they are set to zero.
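The cost can be seen with a short sketch. It uses Python's `utf-16-be` codec as a stand-in for a plain 16-bit encoding; that choice is an assumption of this example and matches the description only for characters that fit in 16 bits.

```python
text = "Treaty of Rome"

eight_bit = text.encode("ascii")        # one byte per character
sixteen_bit = text.encode("utf-16-be")  # two bytes per character, high byte first

# The 16-bit form takes twice the space of the 8-bit form.
print(len(eight_bit), len(sixteen_bit))

# For English text every high-order byte is zero.
high_bytes = sixteen_bit[0::2]
print(all(b == 0 for b in high_bytes))

# A Greek letter, by contrast, needs the high-order byte.
print("α".encode("utf-16-be").hex())
```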

A Multilingual Web must be able to process Unicode and the present character set. There should be a way to indicate a Unicode file; for example, a file extension or a magic number (Suggestions ?). The programs should also be capable of mapping Unicode to other character sets; for example, to ASCII, ISO 8859-1, PC1, etc. When the target character set is poorer, the mapping should be reasonable; for example, mapping "é" to "e", or an unmappable character to "_".
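One possible way to obtain such a reasonable mapping is sketched below. It is an illustration, not a proposal for the mechanism itself: the function name `map_to_charset` is hypothetical, and stripping accents through Unicode decomposition (`unicodedata.normalize` with the NFKD form) is an assumption of this example. The target is named by a Python codec, for example `ascii` or `latin-1` (ISO 8859-1).

```python
import unicodedata


def map_to_charset(text, target="ascii", placeholder="_"):
    """Map text to a poorer character set, degrading gracefully."""
    out = []
    for ch in text:
        try:
            ch.encode(target)          # representable as-is in the target
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        # Decompose (e.g. "é" -> "e" + combining acute) and drop the marks.
        base = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        try:
            base.encode(target)
            out.append(base if base else placeholder)
        except UnicodeEncodeError:
            out.append(placeholder)    # nothing reasonable: use the placeholder
    return "".join(out)


print(map_to_charset("café"))             # accent dropped for ASCII
print(map_to_charset("café", "latin-1"))  # kept: ISO 8859-1 has "é"
print(map_to_charset("αβγ"))              # Greek has no ASCII equivalent
```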

Tagging

The tagging in a Unicode document can be done with one Unicode (16-bit) character. A region of the coding space should be dedicated to this purpose; i.e., it could be considered a Web alphabet.

Multilingual Aligned Hypertext (MAH)

Multilingual Aligned Hypertext is an extension of the hypertext paradigm to natural languages; for example, a user looking at a document in English should be able to obtain the Spanish equivalent in a transparent way. For this, the Web must know about foreign languages. A lot can be done without changing HTML and just by implementing clients that know about the structure below. A multilingual Web should have functionalities such as:

Making a list of available languages for displaying in a drop-down menu or similar
Indicating the level of alignness (paragraph, sentence, etc.)
Displaying at least two aligned documents side by side and moving them in sync
Giving polite error messages such as "the document is not available in Swedish"
Implementing most of the Services proposed in Dragoman
(Suggestions ? : more functionalities)

It is possible to have some Multilingual Aligned Hypertext with the present Web using the structure below and the present character set, but the end user must be aware of the structure, and the present character set must be sufficient for the desired languages.

Data structure

A data structure is needed for Multilingual Aligned Hypertext. The top of the structure is a mahName.html file. The file can describe several schemes:

Directory based scheme for a single set of files
Directory based scheme for several sets of files
SGML (Suggested by Jean Paoli, GRIF, abramatic@inria.fr)
(Suggestions ? : other schemes such as WAIS, tar, cpio)

The mah files below could have any URL, and the name of the document could be different from the file name (DocName). (Suggestions ? : generalize grouping; i.e., a groupingName.html, rather than a mahName.html file)

The default for a single set of files is:

mahDocName.html
DocName.mah (directory)
    /en.html    English
    /es.html    Spanish
    /de.html    German

The default for several sets of files is:

mahDatabaseName.html
DatabaseName.mah (directory)
    /en/DocName1.html    English
    /en/DocName2.html    English
    /es/DocName1.html    Spanish
    /es/DocName2.html    Spanish
    /de/DocName1.html    German
    /de/DocName2.html    German

(Suggestions ? : documentary indexing)
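The two default layouts above can be sketched as path-building functions. All function names, the `VersionNotAvailable` exception, and the document names `TreatyOfRome` and `Treaties` are hypothetical and chosen only for illustration; the content of the mahName.html file itself is not defined yet, so it is not modelled here.

```python
from pathlib import PurePosixPath


class VersionNotAvailable(LookupError):
    """Raised when a document has no version in the requested language."""


def single_set_path(doc_name, lang):
    """Path of one linguistic version in the single-set layout."""
    return PurePosixPath(f"{doc_name}.mah") / f"{lang}.html"


def multi_set_path(database_name, doc_name, lang):
    """Path of one linguistic version in the several-sets layout."""
    return PurePosixPath(f"{database_name}.mah") / lang / f"{doc_name}.html"


def resolve(available, lang, names=None):
    """Return the path for `lang`, or raise with a polite message.

    `available` maps a language code to the path of that version;
    `names` optionally maps codes to display names for the message.
    """
    if lang not in available:
        label = (names or {}).get(lang, lang)
        raise VersionNotAvailable(f"the document is not available in {label}")
    return available[lang]


versions = {code: single_set_path("TreatyOfRome", code) for code in ("en", "es", "de")}
print(sorted(versions))              # language codes a client could list in a menu
print(resolve(versions, "es"))
try:
    resolve(versions, "sv", {"sv": "Swedish"})
except VersionNotAvailable as exc:
    print(exc)
print(multi_set_path("Treaties", "DocName1", "en"))
```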

The mahName.html should be usable directly by the present clients (browsers) and/or indirectly to generate HTML files on the fly. Multilingual clients should use the information to access the documents in a transparent way.

Anchoring Strategy

The anchoring strategy must minimize the number of anchors and it must allow changing the defaults. Only one linguistic version of the document should have explicit anchors (e.g. English); the other linguistic versions would have implicit anchors; i.e., the anchors should be calculated from the alignness of the different linguistic versions. The anchors would have to be at least at sentence level. Without tagging, it would be hard to place an implicit anchor on part of a sentence, so in that case the second text should carry null anchors; named null anchors if there are several in one sentence.

example:

No need for null anchoring in the second text. A whole sentence is anchored in the first text and finding the place for the implicit anchor in the second text is easy.
It needs a null anchor in the second text. Only part of a sentence is anchored in the first text and finding the place for the implicit anchor in the second text is hard.
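The two cases above can be sketched as follows. This is only one way the calculation might work: the `Anchor` class, the `implicit_anchor` function, and the representation of a null anchor as a character offset inside a sentence of the second text are all assumptions of this example, and the module that should own this logic is still an open question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Anchor:
    sentence: int               # index of the sentence in the first text
    whole_sentence: bool        # True if the anchor spans the whole sentence
    name: Optional[str] = None  # name shared with a null anchor, if any


def implicit_anchor(anchor, null_anchors):
    """Locate an explicit anchor of the first text in the second text.

    `null_anchors` maps (sentence index, name) to a character offset
    inside that sentence of the second text. Returns (sentence, offset),
    where an offset of None means "the whole sentence", or None when the
    place cannot be calculated.
    """
    if anchor.whole_sentence:
        # Sentence-level alignness is enough: use the same sentence index.
        return (anchor.sentence, None)
    offset = null_anchors.get((anchor.sentence, anchor.name))
    if offset is None:
        return None             # partial anchor without a null anchor
    return (anchor.sentence, offset)


nulls = {(1, "a"): 12}
print(implicit_anchor(Anchor(0, True), nulls))        # whole sentence, no null anchor needed
print(implicit_anchor(Anchor(1, False, "a"), nulls))  # placed at the named null anchor
print(implicit_anchor(Anchor(2, False, "b"), nulls))  # cannot be placed
```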

(Suggestions ? : Module that should be in charge of implicit anchoring)

Applications

The requester of a translation could send only the URL of the document to be translated, perhaps in a mah file. The document would have to be in a document repository; i.e., it must be guaranteed that no further changes are allowed. Note that the repository does not necessarily have to be controlled by the translation department.
The requester can pass the background documentation as anchors.
The translation folder is just a mah file; it could originally be the one sent by the requester, augmented later.
The translator could look at the work of his colleagues in other languages; i.e., if an English document has to be translated into Spanish and German, the Spanish translator could look at the ongoing work of the German translator.
The translator could verify the alignness of his translation.
The requester of a publication could use a paradigm similar to that of the requester of a translation.
The publishing department could generate camera-ready publications in several linguistic versions.

Defaults

Structure as above
English documents have the explicit anchors
The terminology database is DefaultTerminologyDatabase

Precedence

1. Defaults
2. Common preference.html (see present practice)
3. Document
4. Private preference.html

where 1 has the lowest precedence
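The precedence order can be sketched as a layered lookup. The key names (`ExplicitAnchors`, `TerminologyDatabase`, `LanguagePreference`) and the sample values are hypothetical; only the order of the four levels and the default values for English anchors and DefaultTerminologyDatabase come from the text.

```python
from collections import ChainMap

# The four levels, listed from lowest to highest precedence.
defaults = {
    "ExplicitAnchors": "en",
    "TerminologyDatabase": "DefaultTerminologyDatabase",
}
common = {"LanguagePreference": ["en", "es"]}    # common preference.html
document = {"ExplicitAnchors": "es"}             # set by the document itself
private = {"LanguagePreference": ["de", "en"]}   # private preference.html

# ChainMap searches its first mapping first, so the highest
# precedence level is passed first.
settings = ChainMap(private, document, common, defaults)

print(settings["LanguagePreference"])   # taken from the private preferences
print(settings["ExplicitAnchors"])      # the document overrides the default
print(settings["TerminologyDatabase"])  # nothing overrides the default
```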

preference.html

LanguagePreference=<a list of ISO-639 codes>
(incomplete)

ISO-639 codes are two-character codes for languages; for example, "en" for English and "es" for Spanish.
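Since the preference file is not yet fully defined, the following is only a sketch of how a client might read the LanguagePreference entry. It assumes that the codes are separated by commas, which the text does not specify, and the function names `parse_language_preference` and `choose_language` are hypothetical.

```python
def parse_language_preference(line):
    """Parse 'LanguagePreference=en,es' into ['en', 'es'].

    Assumption: codes are separated by commas; the list syntax is
    not defined in the text.
    """
    key, _, value = line.partition("=")
    if key.strip() != "LanguagePreference":
        return []
    codes = [c.strip().lower() for c in value.split(",")]
    # Keep only well-formed two-letter codes.
    return [c for c in codes if len(c) == 2 and c.isascii() and c.isalpha()]


def choose_language(preferred, available):
    """Return the first preferred language that is available, else None."""
    for code in preferred:
        if code in available:
            return code
    return None


preferred = parse_language_preference("LanguagePreference=sv, es ,EN")
print(preferred)
print(choose_language(preferred, {"en", "es", "de"}))
```

Returning None when no preferred language is available leaves the client free to fall back on the defaults or to show a polite error message.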

Miscellaneous

Dragoman

Dragoman is a reference model for language engineering. It uses the Multilingual Aligned Hypertext technique. In essence, Dragoman describes a database (part structured and part documentary) and Services that can be implemented over the database. The Web paradigm is particularly well adapted to Dragoman. The term Dragoman has nothing to do with dragons; it means language interpreter.

To do list

Define the mahName.html file
Complete the preference file
Multilingual indexing
Connections between structured (database) and unstructured (documents) data
Fixing the Web terminology in other languages

Disclaimer

This document represents only the views of the author. The document does not in any way commit the Commission of the European Communities.

Copyright 1995 the Conference Organizer for the Third International World-Wide Web Conference in case of acceptance by the International Program Committee, otherwise the author.

- End -