In our discussion of semantics up to now, we have focused on structural issues: how to represent the relations of predicates and events to their arguments and modifiers; how to represent quantification; and how to convert syntactic structure into semantic structure. As our predicates we used words, but words are problematic as a semantic representation: one word may have several meanings (polysemy), and several words may have the same or nearly the same meaning (synonymy). In this section we take a closer look at word meanings.
Terminology [J&M 19.1, 2]
- multiple senses of a word: polysemy (homonymy for totally unrelated senses, e.g. "bank")
- metonymy for certain types of regular, productive polysemy ("the White House", "Washington")
- zeugma (a conjunction combining distinct senses) as a test for polysemy ("serve")
- synonymy: two words mean (more or less) the same thing
- hyponymy: X is a hyponym of Y if X denotes a more specific subclass of Y (X is the hyponym, Y is the hypernym)
WordNet [J&M 19.3]
- large-scale database of lexical relations
- freely available for interactive use or download
- organized as a graph whose nodes are synsets (synonym sets)
- each synset consists of one or more word senses which are considered synonymous
- primary relation: hyponym / hypernym (see the access sketch below)
- very fine sense distinctions
- sense-annotated corpus (SemCor, a subset of the Brown corpus)
- similar wordnets have been developed for many other languages: Global WordNet Association
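As a concrete illustration, here is a minimal sketch of looking up synsets, hypernym chains, and synonymous lemmas through NLTK's WordNet interface. It assumes nltk is installed and the WordNet data has been fetched (nltk.download('wordnet')); the sense number chosen for the financial sense of "bank" may differ across WordNet versions.

```python
# Minimal sketch of exploring WordNet with NLTK (assumes nltk is installed
# and the WordNet data has been downloaded via nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# All synsets (senses) for "bank" -- illustrates the fine sense distinctions.
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())

# The hypernym chain for one sense of "bank" (the financial institution).
bank = wn.synset('bank.n.02')  # sense numbers may differ across WordNet versions
chain = bank.hypernym_paths()[0]
print(' -> '.join(s.name() for s in chain))

# Synonymous lemmas grouped in the same synset.
print(bank.lemma_names())
```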
Word Sense Disambiguation [J&M 20.1]
- process of identifying the sense of a word in context
- WSD evaluation: either using WordNet senses or coarser senses (e.g., main senses from a dictionary)
- local cues (Weaver): train a classifier using nearby words as features
  - either treat words at specific positions relative to the target word as separate features
  - or treat all words within a given window (e.g., 10 words wide) as a 'bag of words'
- simple demo for 'interest'
selected sense s' = argmax_s P(s | F), where F is the set of context features (n different features)
s' = argmax_s P(F | s) P(s) / P(F) = argmax_s P(F | s) P(s)
If we now assume the features are independent: P(F | s) = product_i P(f_i | s)
s' = argmax_s P(s) product_i P(f_i | s)
Maximum likelihood estimates for P(s) and P(f_i | s) can easily be obtained by counting
- some smoothing (e.g., add-one smoothing) is needed
Works quite well at selecting the best sense (not at estimating probabilities)
But needs substantial annotated training data for each word
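Below is a minimal sketch of this naive Bayes sense classifier with bag-of-words context features and add-one smoothing. The tiny training set for 'interest' and the two sense labels are invented for illustration.

```python
# Minimal naive Bayes WSD sketch: bag-of-words context features,
# maximum-likelihood estimates with add-one smoothing.
# The toy training data for 'interest' is invented for illustration.
from collections import Counter, defaultdict
import math

train = [
    ('financial', 'the bank pays interest on the deposit'.split()),
    ('financial', 'interest rates rose after the loan announcement'.split()),
    ('attention', 'she showed great interest in the painting'.split()),
    ('attention', 'his interest in music began early'.split()),
]

sense_counts = Counter()
feature_counts = defaultdict(Counter)
vocab = set()
for sense, context in train:
    sense_counts[sense] += 1
    for w in context:
        feature_counts[sense][w] += 1
        vocab.add(w)

def disambiguate(context):
    """Pick argmax_s P(s) * prod_i P(f_i | s), computed in log space."""
    total = sum(sense_counts.values())
    best_sense, best_score = None, float('-inf')
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)            # log P(s)
        denom = sum(feature_counts[sense].values()) + len(vocab)
        for w in context:
            # add-one smoothed estimate of P(f_i | s)
            score += math.log((feature_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(disambiguate('the interest on the loan was high'.split()))
```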
Semi-supervised WSD algorithm [J&M 20.5]
Based on Gale / Yarowsky's "one sense per discourse" observation (generally true for coarse word senses)
Allows bootstrapping from a small set of sense-annotated seeds
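The sketch below illustrates only the bootstrapping loop, not Yarowsky's full algorithm: it scores senses with simple collocation counts rather than a decision list, ignores the discourse constraint, and the toy data for 'plant' is invented.

```python
# Minimal bootstrapping sketch for one target word: start from a few
# sense-annotated seed contexts, repeatedly collect collocational evidence,
# label the unlabeled contexts that are clear-cut, and repeat.
from collections import Counter, defaultdict

target = 'plant'
seeds = [
    ('living', 'the plant grows best in full sunlight'.split()),
    ('factory', 'workers at the plant assemble car engines'.split()),
]
unlabeled = [
    'the plant needs water and sunlight every day'.split(),
    'the manufacturing plant hired more workers'.split(),
    'engines roll off the line at the plant'.split(),
]

labeled, pool = list(seeds), list(unlabeled)
for _ in range(5):                                   # a few bootstrapping rounds
    # Collect collocation counts per sense from everything labeled so far.
    colloc = defaultdict(Counter)
    for sense, context in labeled:
        for w in context:
            if w != target:
                colloc[sense][w] += 1
    newly, rest = [], []
    for context in pool:
        # Score each sense by how much of its known collocational evidence appears.
        scores = {s: sum(colloc[s][w] for w in context) for s in colloc}
        best = max(scores, key=scores.get)
        runner_up = max((v for s, v in scores.items() if s != best), default=0)
        if scores[best] > runner_up:                 # label only clear-cut contexts
            newly.append((best, context))
        else:
            rest.append(context)
    if not newly:                                    # nothing confident left to add
        break
    labeled += newly
    pool = rest

for sense, context in labeled:
    print(sense, '-', ' '.join(context))
```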
Identifying similar words
Distance metric for WordNet [J&M 20.6]
Simplest metrics just use path length in WordNet.
More sophisticated metrics take into account that going 'up' (to a hypernym) may represent different degrees of generalization in different cases.
Resnik introduced P(c): for each concept (synset) c, P(c) is the probability that a word in a corpus is an instance of the concept (matches the synset c or one of its hyponyms).
Information content of a concept: IC(c) = -log P(c)
If LCS(c1, c2) is the lowest common subsumer of c1 and c2, the Jiang-Conrath (JC) distance between c1 and c2 is IC(c1) + IC(c2) - 2 IC(LCS(c1, c2))
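The following sketch computes information content and the JC distance over a toy concept hierarchy. The hierarchy and the probabilities P(c) are invented for illustration; in practice P(c) is estimated from corpus counts over WordNet.

```python
# Sketch of information content and Jiang-Conrath distance over a toy
# concept hierarchy. The probabilities P(c) are invented for illustration;
# in practice they come from counting corpus words under each synset.
import math

# P(c): probability mass of each concept (a concept subsumes its hyponyms,
# so probabilities grow as we move up the hierarchy).
p = {'entity': 1.0, 'animal': 0.2, 'dog': 0.05, 'cat': 0.04}
# Hypernym (parent) links of the toy hierarchy.
parent = {'dog': 'animal', 'cat': 'animal', 'animal': 'entity'}

def ic(c):
    """Information content IC(c) = -log P(c)."""
    return -math.log(p[c])

def ancestors(c):
    """The concept itself plus its hypernym chain up to the root."""
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: first shared concept on the hypernym chains."""
    up2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in up2)

def jc_distance(c1, c2):
    """Jiang-Conrath distance: IC(c1) + IC(c2) - 2 * IC(LCS(c1, c2))."""
    return ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))

print(lcs('dog', 'cat'))                      # animal
print(round(jc_distance('dog', 'cat'), 3))
```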
Similarity metric from corpora [J&M 20.7]
Basic idea: characterize words by their contexts; words sharing more contexts are more similar.
Contexts can be defined either in terms of adjacency or in terms of dependency (syntactic relations).
Given a word w and a context feature f, define the pointwise mutual information PMI(w, f) = log( P(w, f) / (P(w) P(f)) )
Given a list of contexts (words to the left and right) we can compute a context vector for each word.
The similarity of two vectors (representing two words) can be computed in many ways; a standard way is the cosine (normalized dot product).
See the Thesaurus demo by Patrick Pantel.
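A minimal sketch of this pipeline follows: PMI-weighted context vectors built from a tiny corpus and compared with the cosine. The corpus and the one-word adjacency window are chosen only to keep the example small.

```python
# Sketch of corpus-based word similarity: PMI-weighted context vectors,
# compared with the cosine. The tiny corpus and the +/-1-word context
# window are chosen only to keep the example small.
from collections import Counter
import math

corpus = [
    'the cat drank the milk'.split(),
    'the dog drank the water'.split(),
    'the cat chased the dog'.split(),
]

window = 1
word_counts, pair_counts = Counter(), Counter()
total = 0
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                pair_counts[(w, sent[j])] += 1   # (word, context-word) events
                word_counts[w] += 1
                total += 1

def pmi(w, f):
    """PMI(w, f) = log( P(w, f) / (P(w) * P(f)) ), or 0 if the pair never occurs."""
    if pair_counts[(w, f)] == 0:
        return 0.0
    p_wf = pair_counts[(w, f)] / total
    return math.log(p_wf / ((word_counts[w] / total) * (word_counts[f] / total)))

def vector(w):
    """Context vector of w: one PMI weight per possible context word."""
    return {f: pmi(w, f) for f in word_counts}

def cosine(u, v):
    """Normalized dot product of two context vectors over the same feature set."""
    dot = sum(u[f] * v[f] for f in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(vector('cat'), vector('dog')))
```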