December 2010
The current specifications constitute a comprehensive set of guidelines and standards for the development and implementation of the Universal Networking Language (UNL). Prepared by the UNDL Foundation, they result from an extensive revision of the previous specifications and incorporate the outcomes of several UNL projects, as well as the experience gained through UNL annotation and tool development. These specifications strengthen the language-independency features of UNL and have been thoroughly discussed within the UNL community, particularly through the UNLweb platform. They replace the earlier specifications and provide a more consistent and robust framework for the effective use of UNL across different applications and platforms.
The UNL, an acronym for “Universal Networking Language”, is a computer language designed to represent the meaning conveyed by natural languages in a machine-readable and language-independent way. It does not aim to replicate the functions of natural languages in human communication, but rather to provide a formal and computational framework through which the semantics of any utterance can be explicitly encoded and processed by computers. By enabling machines to handle information at the level of meaning, the UNL supports the emulation of human linguistic abilities that rely on interpretation and comprehension, thus offering a foundation for multilingual understanding, knowledge processing, and intelligent communication.
The UNL is a declarative language designed to express information and knowledge in the form of a semantic hypergraph. A semantic hypergraph is a structured representation made of interconnected pieces of meaning, where each node (or hyper-node) corresponds to a concept, and each arc corresponds to a semantic relation between concepts. In the UNL framework, meaning can be codified at three different levels, according to its nature: conceptual, relational, and attributive. Accordingly, the UNL semantic hypergraph is composed of three types of discrete semantic units: Universal Words (conceptual), Universal Relations (relational), and Universal Attributes (attributive).
Consider the following English sentence:
{eng} The cat is on the mat. {/eng}
This sentence can be represented in UNL as follows:
{unl:eng} plc( cat(icl>feline).@def , mat(icl>floor cover).@def.@on.@present ) {/unl}
In this UNL expression, {unl:eng} indicates that we are using the UCN schema for English to annotate the sentence (see below). The strings "cat(icl>feline)" and "mat(icl>floor cover)" are UWs representing the concepts conveyed by the English words "cat" and "mat". The suffixes "(icl>feline)" and "(icl>floor cover)" are restrictions to avoid lexical ambiguity. The relation "plc" (place) specifies the relationship between the two concepts. The attribute "@def" indicates that the cat is a specific cat known to both the speaker and the listener. The same applies to the mat. The attribute "@present" indicates that the action is taking place in the present time. The attribute "@on" indicates the position of the cat in relation to the mat. Accordingly, this UNL expression can be read as "The specific cat is located on the specific mat in the present time".
This representation captures the meaning of the original English sentence while providing a structured format that can be easily processed by computers.
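The reading given above can be reproduced mechanically. The following is a minimal sketch, not part of the specifications, of how a single binary relation written in this notation might be split into its relation label, its two UWs and their attributes. The names `Node`, `Relation`, `split_top_level`, `parse_node` and `parse_relation` are hypothetical, and the sketch assumes that the sequence ".@" never occurs inside a UW.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """A UW together with the attributes attached to it (hypothetical type)."""
    uw: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class Relation:
    """A labelled, directed arc between two nodes (hypothetical type)."""
    label: str
    source: Node
    target: Node


def split_top_level(text: str, sep: str = ",") -> list[str]:
    """Split on `sep`, ignoring separators nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_node(text: str) -> Node:
    """Separate a UW from its trailing '.@attribute' annotations.

    Assumption: '.@' does not appear inside the UW itself.
    """
    uw, *attrs = text.strip().split(".@")
    return Node(uw=uw, attributes=["@" + a for a in attrs])


def parse_relation(expr: str) -> Relation:
    """Parse 'rel( source , target )' into a Relation."""
    match = re.fullmatch(r"\s*([a-z]+)\s*\((.*)\)\s*", expr, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"not a binary relation: {expr!r}")
    label, body = match.groups()
    args = split_top_level(body)
    if len(args) != 2:
        raise ValueError(f"expected two arguments, got {len(args)}")
    return Relation(label, parse_node(args[0]), parse_node(args[1]))


relation = parse_relation(
    "plc( cat(icl>feline).@def , mat(icl>floor cover).@def.@on.@present )"
)
print(relation.label, relation.source, relation.target)
```

Tracking the parenthesis depth keeps the comma inside a restriction or a nested expression from being mistaken for the separator between the two arguments.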
The UNL Programme started in 1996, as an initiative of the Institute of Advanced Studies of the United Nations University in Tokyo, Japan. In January 2001, the United Nations University set up an autonomous organization, the UNDL Foundation, to be responsible for the development and management of the UNL Programme. The Foundation, a non-profit international organisation, has an independent identity from the United Nations University, although it has special links with the United Nations. It inherited from the UNU/IAS the mandate of implementing the UNL Programme. Its headquarters are based in Geneva, Switzerland.
The UNL Programme has already crossed important milestones. The overall architecture of the UNL System has been developed with a set of basic software and tools necessary for its functioning. These are being tested and improved. A vast amount of linguistic resources from the various native languages already under development has been accumulated in the last few years. Moreover, the technical infrastructure for expanding these resources is already in place, thus facilitating the participation of many more languages in the UNL system from now on. A growing number of scientific papers and academic dissertations on the UNL are being published every year.
The most visible accomplishment so far is the recognition by the Patent Co-operation Treaty (PCT) of the innovative character and industrial applicability of the UNL, which was obtained in May 2002 through the World Intellectual Property Organisation (WIPO). Acquiring the patent for the UNL is a completely novel achievement within the United Nations.
The main goal of the UNL Programme is to construct the UNL, an artificial language that can be used to process information across the language barriers.
The major commitments of the UNL are the following:
The basic assumption of the UNL approach is that the information conveyed by natural languages can be formally represented through a semantic network made of three different types of discrete semantic units: Universal Words, Universal Relations and Universal Attributes.
Universal Words, or simply UW's, are the words of UNL, and correspond to nodes - to be interlinked by Universal Relations and specified by Universal Attributes - in a UNL graph. They correspond to semantic discrete units conveyed by natural language open lexical categories (noun, verb, adjective and adverb). Any other semantic content (such as the ones conveyed by articles, prepositions, conjunctions etc.) is represented as attributes or relations. This criterion is not language-biased: if a given semantic value proves to be conveyed, in any language, by a closed class, it should not be represented as a UW, regardless of its realisation in other languages.
As the name indicates, Universal Words are expected to be "universal". This does not mean that they represent a sort of common lexical denominator to all languages or a semantic primitive. The concept of "universal", in UNL, must be understood in the sense of "capable of being used and understood by all" (as in "Coordinated Universal Time (UTC)", or in "universal adapter"), rather than "common to all" (as in "Universal Grammar"). They are "universal" in the sense that they are uniform identifiers to the entities defined in the UNL Knowledge Base, which is expected to map everything that we know about the world, and that is used to assign translatability to any concept.
UW's may represent concepts that are believed to be lexicalized[5] in most languages (such as "cause to die"); concepts that are lexicalized only in a few languages (such as "to execute someone by suffocation so as to leave the body intact and suitable for dissection"); concepts that are lexicalized in one single language (such as "a person who is ready to forgive any transgression a first time and then to tolerate it for a second time, but never for a third time"); and concepts that are not lexicalized in any language (such as "women that normally wear red hats and white shoes in big theaters").
The universality of a UW does not come from the type of concept that it represents, but from the way it does that: the UW provides a method for processing the concept, so that any natural language would be able to deal with it, either as a single node, if lexicalized, or as a hyper-node (i.e., a sub-graph), otherwise.
UW's can be permanent or temporary.
Permanent UWs can be simple, compound or complex.
A Uniform Concept Identifier (UCI) is used to identify a concept. It is a URI (Uniform Resource Identifier) for Universal Words (UW's). In the UNL framework, UCI's are represented either as UCL (Uniform Concept Locator) or UCN (Uniform Concept Name).
The UCI follows the generic syntax defined for URI's:
<scheme name>:<hierarchical part>
Where:
Uniform Concept Locators (UCL), as URL's, provide a method for finding the concept in the UNL Knowledge Base. They are represented as:
ucl://<AUTHORITY>/<ID>
Where:
For instance, the concept "a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs", which is lexicalized in English through the noun "table", may be located through "ucl://unlkb.unlarchive.org/104379964". This address is expected to bring all the information concerning the concept, i.e., its definition in UNL, which may be used by the languages where this concept is not lexicalized.
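As an illustration of the pattern "ucl://<AUTHORITY>/<ID>", the sketch below splits a UCL into its two components. It is only a string-level decomposition and does not contact the UNL Knowledge Base. The names `UCL` and `parse_ucl` are hypothetical, and the set of characters accepted for the authority and the ID is an assumption, since it is not stated here.

```python
import re
from typing import NamedTuple


class UCL(NamedTuple):
    """The two components of a Uniform Concept Locator (hypothetical type)."""
    authority: str
    concept_id: str


# Assumption: neither component contains a slash or whitespace.
_UCL_PATTERN = re.compile(r"ucl://(?P<authority>[^/\s]+)/(?P<id>[^/\s]+)")


def parse_ucl(text: str) -> UCL:
    """Split 'ucl://<AUTHORITY>/<ID>' into authority and ID."""
    match = _UCL_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a UCL: {text!r}")
    return UCL(match["authority"], match["id"])


ucl = parse_ucl("ucl://unlkb.unlarchive.org/104379964")
print(ucl.authority, ucl.concept_id)
```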
Uniform Concept Names (UCN) use the ucn scheme and, as URN's, do not imply availability of the identified resource. They are represented as:
ucn:<LID>:<NSS>
Where:
For instance, the concept "a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs", which is lexicalized in English through the noun "table" may be associated to several different names:
UCN's must be unique and the namespace-specific string is normally split into two different parts: a root and a suffix, as exemplified above. The root can be a word or a multi-word expression. The suffix, which is always introduced by a UNL relation, is used to disambiguate the root.
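The split between root and suffix can be sketched as follows. This is an illustrative reading rather than a normative one: it assumes that the suffix is written between parentheses after the root, as in "cat(icl>feline)", and that the three parts of "ucn:<LID>:<NSS>" are separated by the first two colons. The names `UCN` and `parse_ucn`, and the value "eng" used for the LID, are hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class UCN(NamedTuple):
    """Components of a Uniform Concept Name (hypothetical type)."""
    lid: str
    root: str
    suffix: Optional[str]


def parse_ucn(text: str) -> UCN:
    """Split 'ucn:<LID>:<NSS>' and divide the NSS into root and suffix."""
    parts = text.strip().split(":", 2)
    if len(parts) != 3 or parts[0] != "ucn":
        raise ValueError(f"not a UCN: {text!r}")
    _, lid, nss = parts
    # Assumption: the suffix, when present, is the parenthesised tail of the NSS.
    root, opening, rest = nss.partition("(")
    if not opening:
        return UCN(lid, root.strip(), None)
    if not rest.endswith(")"):
        raise ValueError(f"unbalanced suffix in: {text!r}")
    return UCN(lid, root.strip(), rest[:-1])


name = parse_ucn("ucn:eng:mat(icl>floor cover)")
print(name.root, "|", name.suffix)
```

Because the root may be a multi-word expression, the sketch only strips surrounding blanks and leaves internal spaces untouched.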
UCL and UCN are both used to identify UW's. The difference is that UCL is an address to the position of the UW in the UNL Knowledge Base, whereas the UCN is only the name of the UW. The same address (i.e., UCL) may be associated to different UCN's, but a single UCN may not have more than one UCL. A UCL always describes an available UW, i.e., a UW that has been already defined in the UNL KB, whereas a UCN is not necessarily linked to an address. In that sense, UCL's are more "official" than UCN's, which are normally used in order to preserve the readability of the UNL code.
In the UNL Document Structure, UCI's are always abbreviated to the last part, because the scheme, the authority and the namespace may be inferred from the document header. For instance:
Universal Words are represented as follows:
| Type | Concept (in English) | Lexicalization (in English) | UCL | UCN |
|---|---|---|---|---|
| Simple UW | above average | big | 301382086 | big(icl>size) grande(icl>tamaño) groß(icl>Größe) grand(icl>taille) ... |
| Compound UW | comparative of above average | bigger | 301382086.@more | big(icl>size).@more ... |
| Complex UW | affix a stamp to | to stamp | obj(201356370,106796119) | obj(to affix(icl>to attach), stamp(icl>seal)) |
| Temporary UW | UNDL Foundation | UNDL Foundation | "UNDL Foundation" | "UNDL Foundation" |
Permanent UW's are classified in four different categories, depending on their semantic values:
These categories are semantically-based. They are related to the UW's and are not oriented to any particular language.
In that sense, adjectival UW's (such as "300217728" = "delighting the senses or exciting intellectual or emotional admiration") tend to be associated to English adjectives ("beautiful"), but they can also be realised as prepositional phrases ("with beauty"), verbal phrases ("possessing beauty"), etc.
The UNL representation is expected to be as semantically saturated as possible, and deictics are supposed to be substituted during the UNLization process. In that sense, ellipses and natural language pro-forms (such as "he", "she", "it", "they" etc.) are expected to be replaced by their corresponding antecedents. In many cases, however, it is not possible to find a substitute for words requiring information that is not available inside natural language texts. In these cases, we use pro-UWs, which are represented by the null UW "00" combined with attributes, when applicable.
The main cases are:
It is important to stress that all cases above refer to situations where the semantic content cannot be fully saturated. Whenever possible, pro-forms and ellipses are expected to be replaced by their referents. For instance, the pro-UW "00.@3" is not supposed to be used in the case of "Peter said that he will not come", if we are sure that "he" is "Peter". In this case, this sentence is expected to be represented as "Peter(i) said that Peter(i) will not come".
It should also be stressed that, in the UNL approach, pronouns should be differentiated from determiners. The word "which" in "which is that?" is an interrogative pronoun and should be represented, therefore, by the pro-UW "00.@wh", if we cannot determine what we are referring to; but the word "which" in "which book is that?" is a determiner, to be represented as an attribute (.@wh) assigned to "book" ("book.@wh").
Most named entities (names of people, of places, of brands etc.) are represented as temporary UW's, because it would not be feasible to include them all in the UNL Dictionary. Nevertheless, some named entities of widespread use (such as "England", "William Shakespeare", "Romeo and Juliet", "Romeo" etc.) have been already included in the UNL Dictionary and are treated as permanent UW's. Our current criterion is Wikipedia: if a proper name is defined as an entry in Wikipedia, then it should be defined as a permanent UW and included in the UNL Unabridged Dictionary.
UW's are grouped in several different lexical databases:
Universal Attributes are arcs linking a node to itself. In opposition to Universal Relations, they correspond to one-place predicates, i.e., functions that take a single argument. In UNL, attributes have been normally used to represent information conveyed by natural language grammatical categories (such as tense, mood, aspect, number, etc).
The set of attributes, which is claimed to be universal, is not open to frequent additions.
The syntax of attributes is defined as follows:
<attribute> ::= "@"<attribute_name>
<attribute_name> ::= <character>+
<character> ::= {"a",...,"z","_"}
where:
< > variable
" " terminal symbol
::= ... is defined as ...
{ } disjunction ("or")
+ to be used one or more times
... to be repeated more than 0 times
Attribute names are always lower case words or expressions.
Normally, English words ("past", "will") or mnemonic abbreviations ("def", "pl") are used for attribute labelling.
No blank space is allowed inside an attribute name.
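The grammar and the three rules above can be checked with a single regular expression. The sketch below is a literal reading of that grammar ("@" followed by one or more characters from "a" to "z" or "_"); the name `is_valid_attribute` is hypothetical.

```python
import re

# "@" followed by one or more lower-case letters or underscores,
# as stated by the <attribute> grammar.
_ATTRIBUTE = re.compile(r"@[a-z_]+")


def is_valid_attribute(text: str) -> bool:
    """Return True if `text` is a well-formed attribute under the grammar."""
    return _ATTRIBUTE.fullmatch(text) is not None


for candidate in ["@def", "@present", "@Def", "@not yet", "@", "def"]:
    print(candidate, is_valid_attribute(candidate))
```

Under this strict reading, an attribute containing a digit, such as the "@3" used in the pro-UW "00.@3", would be rejected, so the character set applied in practice may be broader than the one the grammar states.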
Attributes are annotations made to nodes or hypernodes of a UNL hypergraph. They denote the circumstances under which these nodes (or hypernodes) are used.
Attributes may convey three different kinds of information:
Universal Relations, formerly known as "links", are labelled arcs connecting a node to another node in a UNL graph. They correspond to two-place semantic predicates holding between two Universal Words. In UNL, universal relations have been normally used to represent semantic cases or thematic roles (such as agent, object, instrument, etc.) between UWs. The repertoire of universal relations is defined in the UNL Specs and it is not open to frequent additions.
In the UNL framework, universal relations describe semantic functions between two UWs. These functions are binary and directed (from a source to a target) and are claimed to be universal. Because of their similarity in name and function to syntactic relations, it may seem that the labels used for relations are different names for special grammatical functions. This is emphatically not the case. The intention is that the labels used denote specific ideas rather than grammatical structures: the idea of “something that initiates an event,” or “agent” for example, is quite different from “grammatical subject of a sentence”, even though many times the subject of a sentence will indicate the agent of the event. The agent of an event may also appear as an adjective or noun modifier, with the preposition “by” or embedded in nouns with “er” suffixes. The whole point of the conceptual relations is to have a name for these very different grammatical structures which are conceptually quite the same. Thus, the conceptual relations used in UNL are much more abstract than the grammatical relations found in sentences.
Universal relations are represented as follows:
<rel>:<scope>(<source>,<target>)
where:
Universal Relations are organized in a hierarchy where upper nodes subsume lower nodes. The topmost level is the relation "rel", which simply indicates that there is a semantic relation between two elements.
rel
The same relation may play different syntactic roles. Consider, for instance, the case of the relation 'gol' (goal):
Consider, for instance, the case of the English preposition "in", as in 'Peter works in X'.
The same relation may be used to describe nominal and verbal structures:
Consider, for instance, the verbs "to kill", "to love" and "to give":
| Tag | Relation | Definition | Example |
|---|---|---|---|
| agt | agent | A participant in an action or process that provokes a change of state or location. | John killed Mary = agt(killed;John) Mary was killed by John = agt(killed;John) arrival of John = agt(arrival;John) |
| and | conjunction | Used to state a conjunction between two entities. | John and Mary = and(John;Mary) both John and Mary = and(John;Mary) neither John nor Mary = and(John;Mary) John as well as Mary = and(John;Mary) |
| ant | opposition or concession | Used to indicate that two entities do not share the same meaning or reference. Also used to indicate concession. | John is not Peter = ant(Peter;John) 3 + 2 != 6 = ant(6;3+2) Although he's quiet, he's not shy = ant(he's not shy;he's quiet) |
| aoj | object of an attribute | The subject of a stative verb. Also used to express the predicative relation between the predicate and the subject. | John has two daughters = aoj(have;John) the book belongs to Mary = aoj(belong;book) the book contains many pictures = aoj(contain;book) John is sad = aoj(sad;John) John looks sad = aoj(sad;John) |
| ben | beneficiary | A participant who is advantaged or disadvantaged by an event. | John works for Peter = ben(works;Peter) John gave the book to Mary for Peter = ben(gave;Peter) |
| cnt | content or theme | The object of a stative or experiential verb, or the theme of an entity. | John has two daughters = cnt(have;two daughters) the book belongs to Mary = cnt(belong;Mary) the book contains many pictures = cnt(contain;many pictures) John believes in Mary = cnt(believe;Mary) John saw Mary = cnt(saw;Mary) John loves Mary = cnt(love;Mary) The explosion was heard by everyone = cnt(hear;explosion) a book about Peter = cnt(book;Peter) |
| con | condition | A condition of an event. | If I see him, I will tell him = con(I will tell him;I see him) I will tell him if I see him = con(I will tell him;I see him) |
| dur | duration or co-occurrence | The duration of an entity or event. | John worked for five hours = dur(worked;five hours) John worked hard the whole summer = dur(worked;the whole summer) John completed the task in ten minutes = dur(completed;ten minutes) John was reading while Peter was cooking = dur(John was reading;Peter was cooking) |
| equ | synonym or paraphrase | Used to indicate that two entities share the same meaning or reference. Also used to indicate semantic apposition. | The morning star is the evening star = equ(evening star;morning star) 3 + 2 = 5 = equ(5;3+2) UN (United Nations) = equ(UN;United Nations) John, the brother of Mary = equ(John;the brother of Mary) |
| exp | experiencer | A participant in an action or process who receives a sensory impression or is the locus of an experiential event. | John believes in Mary = exp(believe;John) John saw Mary = exp(saw;John) John loves Mary = exp(love;John) The explosion was heard by everyone = exp(hear;everyone) |
| fld | field | Used to indicate the semantic domain of an entity. | sentence (linguistics) = fld(sentence;linguistics) |
| gol | final state, place, destination or recipient | The final state, place, destination or recipient of an entity or event. | John received the book = gol(received;John) John won the prize = gol(won;John) John changed from poor to rich = gol(changed;rich) John gave the book to Mary = gol(gave;Mary) He threw the book at me = gol(threw;me) John goes to NY = gol(go;NY) train to NY = gol(train;NY) |
| icl | hyponymy, is a kind of | Used to refer to a subclass of a class. | Dogs are mammals = icl(mammal;dogs) |
| ins | instrument or method | An inanimate entity or method that an agent uses to implement an event. It is the stimulus or immediate physical cause of an event. | The cook cut the cake with a knife = ins(cut;knife) She used a crayon to scribble a note = ins(used;crayon) That window was broken by a hammer = ins(broken;hammer) He solved the problem with a new algorithm = ins(solved;a new algorithm) He solved the problem using an algorithm = ins(solved;using an algorithm) He used Mathematics to solve the problem = ins(used;Mathematics) |
| iof | is an instance of | Used to refer to an instance or individual element of a class. | John is a human being = iof(human being;John) |
| lpl | logical place | A non-physical place where an entity or event occurs or a state exists. | John works in politics = lpl(works;politics) John is in love = lpl(John;love) officer in command = lpl(officer;command) |
| man | manner | Used to indicate how the action, experience or process of an event is carried out. | John bought the car quickly = man(bought;quickly) John bought the car in equal payments = man(bought;in equal payments) John paid in cash = man(paid;in cash) John wrote the letter in German = man(wrote;in German) John wrote the letter in a bad manner = man(wrote;in a bad manner) |
| mat | material | Used to indicate the material of which an entity is made. | A statue in bronze = mat(statue;bronze) a wood box = mat(box;wood) a glass mug = mat(mug;glass) |
| mod | modifier | A general modification of an entity. | a beautiful book = mod(book;beautiful) an old book = mod(book;old) a book with 10 pages = mod(book;with 10 pages) a book in hard cover = mod(book;in hard cover) a poem in iambic pentameter = mod(poem;in iambic pentameter) a man in an overcoat = mod(man;in an overcoat) |
| nam | name | The name of an entity. | The city of New York = nam(city;New York) my friend Willy = nam(friend;Willy) |
| obj | patient | A participant in an action or process undergoing a change of state or location. | John killed Mary = obj(killed;Mary) Mary died = obj(died;Mary) The snow melts = obj(melts;snow) |
| opl | objective place | A place affected by an action or process. | John was hit in the face = opl(hit;face) John fell in the water = opl(fell;water) |
| or | disjunction | Used to indicate a disjunction between two entities. | John or Mary = or(John;Mary) either John or Mary = or(John;Mary) |
| per | proportion, rate, distribution, measure or basis for a comparison | Used to indicate a measure or quantification of an event. | The course was split in two parts = per(split;in two parts) twice a week = per(twice;week) The new coat costs $70 = per(cost;$70) John is more beautiful than Peter = per(beautiful;Peter) John is as intelligent as Mary = per(intelligent;Mary) John is the most intelligent of us = per(intelligent;we) |
| plc | place | The location or spatial orientation of an entity or event. | John works here = plc(work;here) John works in NY = plc(work;NY) John works in the office = plc(work;office) John is in the office = plc(John;office) a night in Paris = plc(night;Paris) |
| pof | is part of | Used to refer to a part of a whole. | John is part of the family = pof(family;John) |
| pos | possessor | The possessor of a thing. | the book of John = pos(book;John) John's book = pos(book;John) his book = pos(book;he) |
| ptn | partner | A secondary (non-focused) participant in an event. | John fights with Peter = ptn(fight;Peter) John wrote the letter with Peter = ptn(wrote;Peter) John lives with Peter = ptn(live;Peter) |
| pur | purpose | The purpose of an entity or event. | John left early in order to arrive early = pur(John left early;arrive early) You should come to see us = pur(you should come;see us) book for children = pur(book;children) |
| qua | quantity | Used to express the quantity of an entity. | two books = qua(book;2) a group of students = qua(students;group) |
| res | result or factitive | A referent that results from an entity or event. | The cook bakes a cake = res(bake;cake) They built a very nice building = res(built;a very nice building) |
| rsn | reason | The reason of an entity or event. | John left because it was late = rsn(John left;it was late) John killed Mary because of John = rsn(killed;John) |
| seq | consequence | Used to express consequence. | I think therefore I am = seq(I think;I am) |
| src | initial state, place, origin or source | The initial state, place, origin or source of an entity or event. | John came from NY = src(came;NY) John is from NY = src(John;NY) train from NY = src(train;NY) John changed from poor into rich = src(changed;poor) John received the book from Peter = src(received;Peter) John withdrew the money from the cashier = src(withdrew;cashier) |
| tim | time | The temporal placement of an entity or event. | The whistle will sound at noon = tim(sound;noon) John came yesterday = tim(came;yesterday) |
| tmf | initial time | The initial time of an entity or event. | John worked since early = tmf(worked;early) |
| tmt | final time | The final time of an entity or event. | John worked until late = tmt(worked;late) |
| via | intermediate state or place | The intermediate place or state of an entity or event. | John went from NY to Geneva through Paris = via(went;Paris) The baby crawled across the room = via(crawled;across the room) |
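The examples in the table use a simplified notation, "tag(first;second)", with a semicolon between the two arguments. As a rough sketch, such an example can be split into its three parts and the tag looked up in a small dictionary. The dictionary below copies only a handful of rows from the table, and the names `RELATION_NAMES`, `parse_example` and `describe` are hypothetical. Calling the two arguments "source" and "target" follows the general pattern given for relations and is an assumption about how the table examples are ordered.

```python
from __future__ import annotations

import re

# A subset of the table above: tag -> relation name.
RELATION_NAMES = {
    "agt": "agent",
    "obj": "patient",
    "plc": "place",
    "tim": "time",
    "ins": "instrument or method",
}

_EXAMPLE = re.compile(r"([a-z]+)\(([^;]*);([^;]*)\)")


def parse_example(text: str) -> tuple[str, str, str]:
    """Split 'tag(source;target)' into (tag, source, target)."""
    match = _EXAMPLE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a relation example: {text!r}")
    tag, source, target = match.groups()
    return tag, source.strip(), target.strip()


def describe(text: str) -> str:
    """Render a relation example as a short English gloss."""
    tag, source, target = parse_example(text)
    name = RELATION_NAMES.get(tag, "unknown relation")
    return f"{target!r} is the {name} of {source!r}"


print(describe("agt(killed;John)"))
print(describe("ins(solved;a new algorithm)"))
```

The same relation is returned for "John killed Mary" and "Mary was killed by John", which is the point made above: the label names a semantic role, not a grammatical function.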
UNL sentences, or UNL expressions, are sentences of UNL. They are hypergraphs made out of nodes (Universal Words) interlinked by binary semantic Universal Relations and modified by Universal Attributes. UNL sentences have been the basic unit of representation inside the UNL framework.
According to the UNL Specs, there are two different ways of representing UNL sentences: the table format and the list format. In the list format, UWs and relations are represented separately; in the table format, they constitute a single structure.
The syntax for UNL sentences in the list format is the following:
<UNL sentence> ::= "[W]" <list of UWs> "[/W]" [ "[R]" <list of relations> "[/R]" ]
<list of UWs> ::= <UW+attributes> [<UW+attributes>...]
<UW+attributes> ::= <UW>{:<Scope-ID>}[<attribute list>]:<UW-ID>
<list of relations> ::= <binary relation>[<binary relation>...]
<binary relation> ::= <source node><relation>[":"<Scope-ID>]<target node>
<source node> ::= <UW-ID>
<target node> ::= <UW-ID>
The syntax for UNL sentences in the table format is the following:
<UNL sentence> ::= <list of relations>
<list of relations> ::= <binary relation>[<binary relation>...]
<binary relation> ::= <relation> [":"<Scope-ID>] "(" <source node> , <target node> ")"
<source node> ::= <UW+attributes>
<target node> ::= <UW+attributes>
<UW+attributes> ::= <UW>{:<Scope-ID>}[<attribute list>]:<UW-ID>
Where
" and " indicate a predefined delimiter
< and > indicate a non-terminal symbol
{ and } indicate a range
[ and ] indicate an omissible part
... indicates more than 0 times repetition of the front part
::= indicates the left part can be replaced by the right part
UNL documents are documents written in UNL. They are plain text files that include UNL Sentences and some special tags. They are the output of the UNLization process and the input of the NLization process.
A UNL document is enclosed in the tags “[D:<id>]” and “[/D]”. Within these tags, each paragraph is enclosed in a pair of tags “[P:<id>]” and “[/P]”, and each sentence in a pair of tags “[S:<id>]” and “[/S]”. Inside a sentence, the text of the original sentence is enclosed in “{org:<lang>}” and “{/org}”, and its UNL expression in “{unl:<id>}” and “{/unl}”. Sentences in target languages can also be stored in the UNL document: each target sentence is enclosed in a pair of language tags “{<lang>}” and “{/<lang>}” and follows the UNL expression of the corresponding sentence.
| Tag | Description |
|---|---|
| [D:<id>] | indicates the beginning of a document. |
| [/D] | indicates the end of a document |
| [P:<id>] | indicates the beginning of a paragraph. |
| [/P] | indicates the end of a paragraph |
| [S:<id>] | indicates the beginning of a sentence. |
| [/S] | indicates the end of a sentence |
| {org:<lang>=<code>} | indicates the beginning of an original/source sentence |
| {/org} | indicates the end of an original sentence |
| {unl:<id>} | indicates the beginning of the UNL expressions of a sentence. |
| {/unl} | indicates the end of the UNL expressions of a sentence |
| {<lang>} | indicates the beginning of a target sentence of the language indicated by <lang> |
| {/<lang>} | indicates the end of a target sentence of the language indicated by <lang> |
Where
For the time being, a UNL document is simply a collection of UNL sentences. However, it can also be treated as a hypergraph itself, comprising several subhypergraphs (the UNL sentences) inter-related by a special relation "nxt" (for "next"), which indicates sequential order. In the XUNL Project, we have been proposing some other strategies for representing cross-sentential relations, which are, however, still under discussion.
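The nesting of document, paragraph and sentence tags described above can be illustrated with a small generator. This is a sketch under several assumptions: each tag is written on its own line, paragraphs and sentences are numbered from 1, and the source tag takes the form "{org:<lang>}". The names `Sentence` and `render_document`, and the identifier "doc1", are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sentence:
    """One source sentence with its UNL expression (hypothetical type)."""
    original: str
    lang: str
    unl: str
    unl_id: str


def render_document(doc_id: str, paragraphs: list[list[Sentence]]) -> str:
    """Wrap sentences in [D], [P], [S], {org} and {unl} tags."""
    lines = [f"[D:{doc_id}]"]
    for p_index, paragraph in enumerate(paragraphs, start=1):
        lines.append(f"[P:{p_index}]")
        for s_index, sentence in enumerate(paragraph, start=1):
            lines += [
                f"[S:{s_index}]",
                f"{{org:{sentence.lang}}}",
                sentence.original,
                "{/org}",
                f"{{unl:{sentence.unl_id}}}",
                sentence.unl,
                "{/unl}",
                "[/S]",
            ]
        lines.append("[/P]")
    lines.append("[/D]")
    return "\n".join(lines)


example = Sentence(
    original="The cat is on the mat.",
    lang="eng",
    unl="plc( cat(icl>feline).@def , mat(icl>floor cover).@def.@on.@present )",
    unl_id="eng",
)
print(render_document("doc1", [[example]]))
```

Target-language sentences, which would follow each "{/unl}" line inside their own "{<lang>}" tags, are left out to keep the sketch short.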