Introduction of Term Extraction

2023-07-18 09:25:34 etogether.net

A final drawback to the linguistic approach is that it is heavily language dependent. Term-formation patterns differ from language to language. For instance, term-formation patterns that are typical in English (e.g., ADJECTIVE+NOUN, NOUN+NOUN) are not the same as term-formation patterns that are common in French (e.g., NOUN+ADJECTIVE, NOUN+PREPOSITION+NOUN). Consequently, term-extraction tools that use a linguistic approach are generally designed to work in a single language (or closely related languages) and cannot easily be extended to work with other languages.
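The language dependence described above can be illustrated with a minimal sketch. It assumes the tokens have already been part-of-speech tagged by hand; the tag names, the `PATTERNS` table, the `match_patterns` helper, and the sample sentences are all hypothetical and are not taken from any particular term-extraction tool.

```python
# Hypothetical pattern table: the term-formation patterns named in the text.
PATTERNS = {
    "en": [("ADJ", "NOUN"), ("NOUN", "NOUN")],
    "fr": [("NOUN", "ADJ"), ("NOUN", "PREP", "NOUN")],
}

def match_patterns(tagged, lang):
    """Return word sequences whose tag sequence matches a pattern for lang."""
    tags = [tag for _, tag in tagged]
    found = []
    for pattern in PATTERNS[lang]:
        n = len(pattern)
        for i in range(len(tagged) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                found.append(" ".join(word for word, _ in tagged[i:i + n]))
    return found

# Hand-tagged sample sentences (assumed tags, for illustration only).
english = [("the", "DET"), ("hard", "ADJ"), ("disk", "NOUN"),
           ("stores", "VERB"), ("the", "DET"),
           ("word", "NOUN"), ("processor", "NOUN")]
french = [("le", "DET"), ("disque", "NOUN"), ("dur", "ADJ"),
          ("contient", "VERB"), ("le", "DET"),
          ("traitement", "NOUN"), ("de", "PREP"), ("texte", "NOUN")]

en_terms = match_patterns(english, "en")       # ['hard disk', 'word processor']
fr_terms = match_patterns(french, "fr")        # ['disque dur', 'traitement de texte']
fr_with_en_rules = match_patterns(french, "en")  # []
```

Applying the English patterns to the French sentence finds nothing, which is the practical consequence of the point made above: the pattern inventory has to be rebuilt for each language.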

2. Statistical approach

The most straightforward statistical approach to term extraction is for a tool to look for repeated series of lexical items. The frequency threshold (the number of times that a series of items must be repeated) can often be specified by the user. For example, as illustrated in figure 3, if the minimum frequency threshold is set at two, a given series of lexical items must appear at least twice in the text in order to be recognized as a candidate term by the term-extraction tool.
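As a rough sketch of this idea, the following counts every series of two to four words and keeps those that reach the threshold. The function name `candidate_terms`, the `max_len` limit, and the sample text are assumptions made for illustration; the sample is not the text shown in figure 3.

```python
import re
from collections import Counter

def candidate_terms(text, min_freq=2, max_len=4):
    """Return word series (2..max_len words) repeated at least min_freq times."""
    counts = Counter()
    # Split into sentences first so that no series spans a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", sentence)
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {series: c for series, c in counts.items() if c >= min_freq}

# Invented sample text (an assumption, not the text of figure 3).
sample = (
    "Antivirus software relies on virus signature files. "
    "Update your antivirus software whenever new virus signature files appear."
)
result = candidate_terms(sample)
# result contains 'antivirus software', 'virus signature',
# 'signature files' and 'virus signature files', each with a count of 2.
```

Note that this naive version also reports the shorter series nested inside "virus signature files". A fuller tool would presumably keep only the longest repeated series, but that step is left out here.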

Based on a minimum-frequency threshold of two, the text in figure 3 yielded two potential terms: "antivirus software" and "virus signature files." Unfortunately, this simple strategy often leads to problems because language is full of repetition, but not all repeated series of lexical items qualify as terms. For instance, consider the slightly modified version of the text shown in figure 4.

Working solely on the basis of identifying repeated series of lexical items, the term-extraction software has identified two additional candidates: "developers are" and "as often as."

Figure 3. A short text that has been processed by a statistical term-extraction tool using a minimum frequency threshold of two.

Figure 4. A slightly modified version of the text that has been processed by a statistical term-extraction tool using a minimum frequency threshold of two.

These candidates constitute "noise" rather than terms, and they would need to be eliminated from the list of potential terms by a human. Stop lists can be used to reduce the number of unlikely terms that may otherwise be identified as candidates. For instance, a stop list could be implemented to instruct the term-extraction tool to ignore series that begin or end with function words, such as prepositions, articles, and conjunctions.
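A stop-list filter of this kind might look like the sketch below. The word list `FUNCTION_WORDS` and the helper `apply_stop_list` are hypothetical; a real stop list would be longer and tuned to the language and domain.

```python
# Small illustrative stop list of prepositions, articles, and conjunctions
# (an assumption; not the list used by any specific tool).
FUNCTION_WORDS = {
    "a", "an", "the",                              # articles
    "of", "in", "on", "to", "for", "with", "by", "at", "as",  # prepositions
    "and", "or", "but",                            # conjunctions
}

def apply_stop_list(candidates, stop_words=FUNCTION_WORDS):
    """Drop candidates whose first or last word is on the stop list."""
    kept = []
    for candidate in candidates:
        words = candidate.lower().split()
        if words and (words[0] in stop_words or words[-1] in stop_words):
            continue
        kept.append(candidate)
    return kept

candidates = ["antivirus software", "virus signature files",
              "developers are", "as often as"]
filtered = apply_stop_list(candidates)
# filtered == ['antivirus software', 'virus signature files', 'developers are']
```

With this particular list, "as often as" is removed but "developers are" survives, because "are" is a verb rather than a preposition, article, or conjunction. Catching it would require extending the stop list (for example, with auxiliary verbs), which shows that a stop list reduces noise without eliminating it.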

Another drawback to the statistical approach is that not all of the terms that appear in a given text will be repeated, which may lead to "silence." For instance, in figure 4, the term "push-technology updating" was not identified as a candidate because it appeared only once in the text and the minimum frequency threshold was set to two.

A related statistical approach to identifying candidate terms is to calculate mutual information (MI). The premise here is that if two lexical items appear together more often than they appear separately, the multi-word unit in question may be a potential term. Once again, however, this approach is not foolproof, and noise and silence may occur.
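One common way to quantify this is pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), which compares how often two items occur together with how often they would be expected to co-occur if they were independent. The sketch below uses that formulation as an assumption; the text does not say which MI variant a given tool uses, and the function name `pmi_scores` and the sample sentence are invented for illustration.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_freq=2):
    """Score each adjacent word pair seen at least min_freq times by PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_freq:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Invented sample (an assumption, for illustration only).
tokens = ("the virus signature is the key and "
          "the virus signature is the file").split()
scores = pmi_scores(tokens)
```

Here "virus signature" scores higher than "the virus", because "the" also occurs frequently without "virus". However, "signature is" receives the same score as "virus signature" in this small sample, which illustrates why the approach still produces noise.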

Nevertheless, the use of statistics as a basis for term extraction does have one clear strength: it is not language dependent. This means that a statistical term-extraction tool can, in principle, be used to process texts in multiple languages.

Editor: admin
