Semantic annotation of WordNet glosses
Introduction
The Extended WordNet project [1] aims to transform the
WordNet glosses into a format that allows the derivation of additional semantic
and logic relations. The latest release of Extended WordNet is based on WordNet 2.0, and its construction has three stages: part
of speech tagging and parsing, logic form transformation, and semantic
disambiguation. This paper presents the semantic
disambiguation of the WordNet glosses. The first section presents statistics
on the disambiguation of WordNet glosses, the second section describes
the format of the files, and the third section briefly presents the
methods used for the semantic annotation.
Statistics
WordNet 2.0 contains a total of 115,424 glosses, divided into 79,689 noun
synset glosses, 13,508 verb synset glosses, 18,563 adjective synset glosses
and 3,664 adverb glosses. To be consistent with the logic form transformation
and the parse trees, we removed from each gloss the examples and the comments
in parentheses. This resulted in 637,067 open class words. Of these, 160,879 are monosemous, leaving 476,188 words to be disambiguated.
To disambiguate these open class words we used both manual and automatic
annotation. Automatic annotation was done using two programs: XWN_WSD, which was specially
designed to disambiguate the WordNet glosses, and an in-house
system for WSD of open text. The two systems voted on each word, and we estimate a precision of 90% for the words tagged with the same sense by both systems. The quality of each annotation was classified as "gold" for manually
checked words, "silver" for words automatically tagged with the same sense
by both disambiguation systems, and "normal" for the rest of the words,
which were automatically annotated by the XWN_WSD system. Word forms corresponding
to the verbs "to be" and "to have" were not disambiguated automatically. Table
1 presents the number of open class words in each category for the sets of glosses
corresponding to each part of speech in the XWN2.0-1.1 release of XWN.
| Set of glosses    | Number of glosses | Open class words | Monosemous words | "Gold" words | "Silver" words | "Normal" words |
| Noun glosses      | 79,689            | 505,946          | 138,274          | 10,142       | 45,015         | 296,045        |
| Verb glosses      | 13,508            | 48,200           | 6,903            | 2,212        | 5,193          | 30,813         |
| Adjective glosses | 18,563            | 74,108           | 14,142           | 263          | 6,599          | 50,359         |
| Adverb glosses    | 3,664             | 8,998            | 1,605            | 1,829        | 385            | 4,920          |
Table 1. Disambiguated words in each category.
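The quality categories used in Table 1 follow a simple decision rule, which can be sketched as below. This is a minimal illustration of the rule as described in the text, not the project's actual code; the function name `annotation_quality` and its parameters are hypothetical.

```python
from typing import Optional


def annotation_quality(manually_checked: bool,
                       xwn_wsd_sense: Optional[int],
                       open_text_sense: Optional[int]) -> str:
    """Return the quality label for one annotated open class word.

    Hypothetical helper: the names and the use of None for "no sense
    produced" are assumptions made for this sketch.
    """
    # Manually disambiguated or checked words are always "gold".
    if manually_checked:
        return "gold"
    # Both automatic systems produced a sense and the senses agree.
    if xwn_wsd_sense is not None and xwn_wsd_sense == open_text_sense:
        return "silver"
    # Otherwise the sense assigned by XWN_WSD is kept and labeled "normal".
    return "normal"


if __name__ == "__main__":
    print(annotation_quality(True, 2, 3))
    print(annotation_quality(False, 2, 2))
    print(annotation_quality(False, 2, 3))
```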
File Format
We release the semantically annotated glosses in an XML format. The part of the XML schema definition file that concerns the semantic disambiguation is shown below:
<xsd:simpleType name="puncType">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="([^a-zA-Z0-9])+"/>
  </xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="wfType">
  <xsd:simpleContent>
    <xsd:extension base="xsd:string">
      <xsd:attribute name="pos" type="wPosType" use="required"/>
      <xsd:attribute name="lemma" type="xsd:string" use="optional"/>
      <xsd:attribute name="quality" type="qualityType" use="optional" default="normal"/>
      <xsd:attribute name="wnsn" type="senseType" use="optional"/>
    </xsd:extension>
  </xsd:simpleContent>
</xsd:complexType>
<xsd:complexType name="wsdType">
  <xsd:all>
    <xsd:element name="punc" type="puncType" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="wf" type="wfType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:all>
</xsd:complexType>
<xsd:element name="xwn">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="gloss" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="synonymSet" type="xsd:string"/>
            <xsd:element name="text" type="xsd:string"/>
            <xsd:element name="wsd" type="wsdType"/>
            <xsd:element name="parse" type="parseType" minOccurs="1" maxOccurs="unbounded"/>
            <xsd:element name="lft" type="lftType" minOccurs="1" maxOccurs="unbounded"/>
          </xsd:sequence>
          <xsd:attribute name="synsetID" type="synsetIDType" use="required"/>
          <xsd:attribute name="pos" type="glossPosType" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
    <xsd:attribute name="ver" type="xsd:string"/>
    <xsd:attribute name="wnver" type="xsd:string"/>
  </xsd:complexType>
</xsd:element>
Each file is enclosed in the <xwn> tag. This tag carries the
attribute "ver", representing the current release version (2.0-1), and the attribute "wnver", representing
the WordNet version (2.0). The glosses are represented by the <gloss>
tag, which includes the synonym set, the text of the gloss, the semantic disambiguation of the gloss, the parse tree, and the logic form transformation.
The semantic disambiguation part is marked by the <wsd> tag and includes words represented by the <wf> tag and punctuation
represented by the <punc> tag.
The <punc> tag does not have any attributes.
The <wf> tag has the following attributes:
- pos - the part of speech as given by the Brill tagger
[2]. This attribute is required.
- quality - the quality of the semantic annotation as described
above. This attribute can take three values: "gold", "silver" and "normal" (the default).
- lemma - the stem of a word in the open class category.
- wnsn - the annotated sense, or several senses separated by commas.
The senses stored in the "wnsn" attribute were obtained using several methods
of semantic disambiguation. The following section gives an overview of the process
of semantically disambiguating the WordNet glosses.
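To make the format concrete, the sketch below reads the <wf> and <punc> children of a <wsd> element with Python's standard xml.etree.ElementTree module. The sample fragment is invented for illustration and is not taken from the released files; in particular, the Penn Treebank style "pos" values and the sense numbers are assumptions, and the function name `read_wsd` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical <wsd> fragment shaped after the schema above (not real XWN data).
SAMPLE_WSD = """
<wsd>
  <wf pos="DT">a</wf>
  <wf pos="JJ" lemma="domesticated" quality="silver" wnsn="1">domesticated</wf>
  <wf pos="NN" lemma="animal" quality="gold" wnsn="1">animal</wf>
  <punc>;</punc>
  <wf pos="VBZ" lemma="bark" wnsn="1,2">barks</wf>
</wsd>
"""


def read_wsd(wsd_xml):
    """Return the tokens of one <wsd> element as a list of dictionaries."""
    root = ET.fromstring(wsd_xml.strip())
    tokens = []
    for child in root:
        if child.tag == "punc":
            tokens.append({"kind": "punc", "text": child.text})
        elif child.tag == "wf":
            wnsn = child.get("wnsn")
            tokens.append({
                "kind": "wf",
                "text": child.text,
                "pos": child.get("pos"),
                "lemma": child.get("lemma"),
                # ElementTree does not apply schema defaults, so the
                # default value "normal" is supplied here by hand.
                "quality": child.get("quality", "normal"),
                # "wnsn" may hold several senses separated by commas.
                "senses": [s.strip() for s in wnsn.split(",")] if wnsn else [],
            })
    return tokens


if __name__ == "__main__":
    for token in read_wsd(SAMPLE_WSD):
        print(token)
```

The senses are kept as strings because the definition of senseType is not shown in the schema excerpt.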
Semantic Disambiguation of WordNet glosses
The semantic disambiguation of WordNet glosses consists of two phases:
- The first phase is preprocessing, which separates the
WordNet glosses into definitions and examples, and performs tokenization, part of speech
tagging using Brill's tagger [2], and identification of compound
concepts.
- The second phase is the actual disambiguation, which consists of assigning
to each open class word the correct sense using its part of speech. The senses
were assigned using both manual and automatic procedures.
Human annotators disambiguated the open class words in a set of glosses that serve as a
gold standard for checking the accuracy of the disambiguation system. These
disambiguated glosses were integrated into the files of this release of the Extended
WordNet package.
The disambiguation software is based on several heuristics:
- The Monosemous Words method identifies all the words with only one sense
and marks them with sense #1.
- The Same Hierarchy method identifies the gloss words that belong
to the same hierarchy as the synset of the gloss.
- The Lexical Parallelism method identifies the words with
the same part of speech separated by commas or conjunctions and marks them with
senses that belong to the same hierarchy, when this is possible.
- Given a word in a gloss, the Semcor bigrams method forms
two pairs, one with the previous word and the other with the next word, and
searches for these pairs in the Semcor corpus [4]. If the given word has
the same sense in all the occurrences of these pairs, and the number
of occurrences is greater than a threshold, then we assign that sense to the
word.
- Given an ambiguous word W in the gloss of a synset S, the Cross-Reference
method looks for a reference to the synset S in the glosses of all
the senses of W.
- The Reversed Cross-Reference method checks whether there
are two words in the gloss that belong to the same synset.
- The Distance among Glosses method determines the number of
common words between the glosses of two synsets. For an ambiguous word W in a gloss G, this
method selects the sense of the word whose gloss has the greatest number of words
in common with the gloss G.
- Some of the WordNet glosses have an associated domain written
in parentheses. Magnini [3] assigned a domain to all the
noun synsets in WordNet. The Common Domain method selects the sense of a
word that has the same domain as the synset of the gloss.
- The "Patterns" method [5] exploits
the idiosyncratic nature of the WordNet glosses by identifying repetitive
expressions.
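As an illustration of one heuristic from the list above, the Semcor bigrams method can be sketched as follows. This is a minimal sketch under stated assumptions, not the XWN_WSD implementation: the corpus is a toy list of sense-tagged sentences with invented sense numbers, occurrences of the two pairs are pooled before the threshold is applied, and the names `senses_in_pair` and `bigram_sense` are hypothetical.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, Optional[int]]  # (word, sense number or None if untagged)

# Toy stand-in for a sense-tagged corpus; words and senses are invented.
TOY_CORPUS: List[List[Token]] = [
    [("river", 1), ("bank", 2), ("erode", 1)],
    [("steep", 1), ("river", 1), ("bank", 2)],
    [("bank", 1), ("account", 1)],
    [("bank", 2), ("account", 1)],
]


def senses_in_pair(corpus, left, right, target_is_right):
    """Collect the senses of the target word wherever left and right are adjacent."""
    found = []
    for sentence in corpus:
        for (w1, s1), (w2, s2) in zip(sentence, sentence[1:]):
            if w1 == left and w2 == right:
                sense = s2 if target_is_right else s1
                if sense is not None:
                    found.append(sense)
    return found


def bigram_sense(corpus, prev_word, word, next_word, threshold=1):
    """Return a sense for `word`, or None if the bigram evidence is insufficient."""
    senses = []
    if prev_word is not None:
        senses += senses_in_pair(corpus, prev_word, word, target_is_right=True)
    if next_word is not None:
        senses += senses_in_pair(corpus, word, next_word, target_is_right=False)
    # Assign only if every occurrence agrees and there are more than
    # `threshold` occurrences (pooling both pairs is an assumption).
    if len(senses) > threshold and len(set(senses)) == 1:
        return senses[0]
    return None


if __name__ == "__main__":
    print(bigram_sense(TOY_CORPUS, "river", "bank", "erode"))
    print(bigram_sense(TOY_CORPUS, None, "bank", "account"))
```

In the toy corpus the pairs around "bank" in "river bank erode" always carry the same sense, so a sense is returned, while "bank account" occurs with two different senses and therefore yields no decision.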
These methods disambiguate 64% of the words in WordNet glosses with 75% accuracy.
The rest of the words were tagged with the first sense.
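The first-sense fallback can be pictured with the sketch below. It assumes that the heuristics are tried one after another and that the first one to return a sense wins; the text does not state how the heuristics are ordered or combined, so this ordering, the function name `disambiguate`, and the two toy heuristics are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# A heuristic takes a word and returns a sense number, or None if it abstains.
Heuristic = Callable[[str], Optional[int]]


def disambiguate(word: str, heuristics: Iterable[Heuristic]) -> Tuple[int, bool]:
    """Return (sense, decided_by_heuristic) for one word."""
    for heuristic in heuristics:
        sense = heuristic(word)
        if sense is not None:
            return sense, True
    # No heuristic produced a sense: fall back to the first sense.
    return 1, False


def toy_monosemous(word: str) -> Optional[int]:
    # Hypothetical stand-in for the Monosemous Words method.
    return 1 if word == "aardvark" else None


def toy_lookup(word: str) -> Optional[int]:
    # Hypothetical stand-in for any of the other heuristics.
    return {"bank": 2}.get(word)


if __name__ == "__main__":
    for w in ("aardvark", "bank", "run"):
        print(w, disambiguate(w, [toy_monosemous, toy_lookup]))
```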
To disambiguate the WordNet glosses we also used another WSD system, designed for
open text. The glosses were transformed into sentences and disambiguated
using this system with 100% coverage and 70% accuracy.
About 10% of the words were tagged with the same sense by both systems; for these words we estimate an accuracy of 90%.
Conclusion
The semantic disambiguation of WordNet glosses is part of the Extended WordNet project.
The definitions were first separated from comments and examples, tokenized
and part of speech tagged. This preprocessing stage resulted in 637,067 open
class words to be disambiguated, for which we used both human and automatic
annotation. The words manually disambiguated or checked were labeled as "gold". We held
a vote between two disambiguation systems, one specially designed for disambiguating
glosses, and one for disambiguating open text. The words that have the same
sense assigned by both systems were labeled as "silver". The rest of the words
are labeled as "normal". The disambiguated glosses are presented in an XML
format. The disambiguated words in WordNet can be used to derive new semantic
relations and build lexical chains [6].
References
[1] S. Harabagiu, G. Miller, D. Moldovan. WordNet2 - a morphologically
and semantically enhanced resource. In Proceedings of SIGLEX-99, pages 1-8,
University of Maryland, 1999.
[2] E. Brill. Transformation-based error-driven
learning and natural language processing: a case study in part of speech tagging.
Computational Linguistics, 21(4):543-566, 1995.
[3] B. Magnini, C. Strapparava. Experiments
in Word Domain Disambiguation for Parallel Texts. Proceedings of the ACL
Workshop on Word Senses and Multilinguality, pages 27-33, 2000.
[4] G. A. Miller, C. Leacock, R. Tengi,
and R. Bunker. A semantic concordance. Proceedings of the ARPA Human Language
Technology Workshop (Princeton, NJ, March 21-23), pages 303-308, 1993.
[5] A. Novischi. Accurate Semantic Annotations
via Pattern Matching. Proceedings of the Florida Artificial Intelligence Research
Society, 2002.
[6] D. Moldovan, A. Novischi. Lexical Chains
for Question Answering. Proceedings of COLING 2002.