A final drawback to the linguistic approach is that it is heavily language dependent. Term-formation patterns differ from language to language. For instance, term-formation patterns that are typical in English (e.g., ADJECTIVE+NOUN, NOUN+NOUN) are not the same as term-formation patterns that are common in French (e.g., NOUN+ADJECTIVE, NOUN+PREPOSITION+NOUN). Consequently, term-extraction tools that use a linguistic approach are generally designed to work in a single language (or closely related languages) and cannot easily be extended to work with other languages.
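The language dependence described above can be made concrete with a minimal sketch. It assumes the text has already been part-of-speech tagged by some other component; the `PATTERNS` table, the tag names, the `match_patterns` function, and the two hand-tagged examples are hypothetical illustrations, not the design of any particular tool.

```python
# Hypothetical per-language pattern tables, built from the examples in the text.
PATTERNS = {
    "en": [("ADJ", "NOUN"), ("NOUN", "NOUN")],
    "fr": [("NOUN", "ADJ"), ("NOUN", "PREP", "NOUN")],
}

def match_patterns(tagged, language):
    """Return word sequences whose tag sequence matches a pattern for the language."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    found = []
    for pattern in PATTERNS[language]:
        n = len(pattern)
        # Slide a window of the pattern's length over the tagged tokens.
        for i in range(len(tagged) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                found.append(" ".join(words[i:i + n]))
    return found

# Hand-tagged input (an assumption; a real tool would run a tagger first).
english = [("hard", "ADJ"), ("disk", "NOUN")]
french = [("disque", "NOUN"), ("dur", "ADJ")]

print(match_patterns(english, "en"))  # the English patterns find "hard disk"
print(match_patterns(french, "fr"))   # the French patterns find "disque dur"
print(match_patterns(french, "en"))   # the English patterns miss the French term
```

The last call returns an empty list: patterns written for English do not carry over to French, so each new language needs its own pattern table (and its own tagger).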
2. Statistical approach
The most straightforward statistical approach to term extraction is for a tool to look for repeated series of lexical items. The frequency threshold (the number of times that a series of items must be repeated) can often be specified by the user. For example, as illustrated in figure 3, if the minimum frequency threshold is set at two, a given series of lexical items must appear at least twice in the text in order to be recognized as a candidate term by the term-extraction tool.
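A minimal sketch of this repeated-sequence strategy is given here. The function name `repeated_sequences`, its parameters, the simple regular-expression tokenizer, and the sample text are all assumptions made for illustration; commercial tools may segment and count text quite differently.

```python
import re
from collections import Counter

def repeated_sequences(text, min_freq=2, min_len=2, max_len=4):
    """Return word sequences that occur at least min_freq times in the text."""
    # Split into rough sentences so that a sequence never crosses a boundary.
    sentences = re.split(r"[.!?]+", text.lower())
    counts = Counter()
    for sentence in sentences:
        # Keep hyphenated words such as "push-technology" as single tokens.
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence)
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    # Apply the user-specified minimum frequency threshold.
    return {seq: c for seq, c in counts.items() if c >= min_freq}

# Hypothetical sample text (not the text shown in the figures).
sample = ("Antivirus software must be updated. "
          "Old antivirus software misses new threats.")

print(repeated_sequences(sample, min_freq=2))
```

With the threshold set at two, only "antivirus software" qualifies in this sample, because every other sequence of two to four words occurs just once.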
Based on a minimum-frequency threshold of two, the text in figure 3 yielded two potential terms: "antivirus software" and "virus signature files." Unfortunately, this simple strategy often leads to problems because language is full of repetition, but not all repeated series of lexical items qualify as terms. For instance, consider the slightly modified version of the text shown in figure 4.
Working solely on the basis of identifying repeated series of lexical items, the term-extraction software has identified two additional candidates: "developers are" and "as often as."
Figure 3 A short text that has been processed by a statistical term-extraction tool using a minimum frequency threshold of two.
Figure 4 A slightly modified version of the text that has been processed by a statistical term-extraction tool using a minimum frequency threshold of two.
These candidates constitute "noise" rather than terms, and they would need to be eliminated from the list of potential terms by a human. Stop lists can be used to reduce the number of unlikely terms that may otherwise be identified as candidates. For instance, a stop list could be implemented to instruct the term-extraction tool to ignore series that begin or end with function words, such as prepositions, articles, and conjunctions.
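The stop-list idea can be sketched as a simple filter over the candidate list. The `STOP_WORDS` set is a deliberately tiny, hypothetical list (it also includes the verb forms "are" and "is", which is an assumption beyond the prepositions, articles, and conjunctions mentioned above), and `filter_candidates` is an illustrative name.

```python
# Hypothetical, deliberately small stop list of English function words.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "as", "and", "or",
              "are", "is", "to"}

def filter_candidates(candidates, stop_words=STOP_WORDS):
    """Drop candidates whose first or last word is on the stop list."""
    kept = []
    for candidate in candidates:
        words = candidate.lower().split()
        if not words:
            continue
        if words[0] in stop_words or words[-1] in stop_words:
            continue
        kept.append(candidate)
    return kept

# The four candidates discussed in the text.
candidates = ["antivirus software", "virus signature files",
              "developers are", "as often as"]

print(filter_candidates(candidates))
```

Under these assumptions, "developers are" is dropped because it ends with a stop word and "as often as" because it begins (and ends) with one, leaving the two genuine candidates.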
Another drawback to the statistical approach is that not all of the terms that appear in a given text will be repeated, which may lead to "silence." For instance, in figure 4, the term "push-technology updating" was not identified as a candidate because it appeared only once in the text and the minimum frequency threshold was set to two.
A related statistical approach to identifying candidate terms is to calculate mutual information (MI). The premise here is that if two lexical items appear together more often than they appear separately, the multi-word unit in question may be a potential term. Once again, however, this approach is not foolproof, and noise and silence may occur.
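One common way to put this premise into numbers is pointwise mutual information, log2(P(x, y) / (P(x) P(y))), computed here for adjacent word pairs. This is a sketch under that assumption; the text does not say which formula a given tool uses, and the function name `pointwise_mi` and the sample sentence are hypothetical.

```python
import math
from collections import Counter

def pointwise_mi(tokens):
    """Score each adjacent word pair with pointwise mutual information (base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi        # probability of seeing the pair together
        p_x = unigrams[x] / n_uni  # probability of the first word on its own
        p_y = unigrams[y] / n_uni  # probability of the second word on its own
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical sample, already lowercased and tokenized.
tokens = ("virus signature files are updated and the virus signature "
          "files are checked and the files are safe").split()

scores = pointwise_mi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{' '.join(pair)}: {score:.2f}")
```

In this sample, "virus signature" scores higher than "the files," as hoped. However, a pair that occurs only once between two rare words, such as "updated and," scores about as high as "virus signature," which is one way that noise creeps into MI-based candidate lists.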
Nevertheless, the use of statistics as a basis for term extraction does have one clear strength: it is not language dependent. This means that a statistical term-extraction tool can, in principle, be used to process texts in multiple languages.