
Introduction of Term Extraction

Published: 2023-07-18 09:25:34   Author: etogether.net   Source: Internet



A final drawback to the linguistic approach is that it is heavily language dependent. Term-formation patterns differ from language to language. For instance, term-formation patterns that are typical in English (e.g., ADJECTIVE+NOUN, NOUN+NOUN) are not the same as term-formation patterns that are common in French (e.g., NOUN+ADJECTIVE, NOUN+PREPOSITION+NOUN). Consequently, term-extraction tools that use a linguistic approach are generally designed to work in a single language (or closely related languages) and cannot easily be extended to work with other languages.


2. Statistical approach

The most straightforward statistical approach to term extraction is for a tool to look for repeated series of lexical items. The frequency threshold (the number of times that a series of items must be repeated) can often be specified by the user. For example, as illustrated in figure 3, if the minimum frequency threshold is set at two, a given series of lexical items must appear at least twice in the text in order to be recognized as a candidate term by the term-extraction tool.
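The repeated-series strategy can be sketched in a few lines of Python. This is only a minimal illustration, not the algorithm of any particular tool: the function name `repeated_ngrams`, the `max_len` parameter, the tokenization rule, and the sample text are all assumptions made for the example (the sample is not the text from figure 3).

```python
import re
from collections import Counter

def repeated_ngrams(text, min_freq=2, max_len=3):
    """Return word series of 2..max_len items that occur at least min_freq times.

    Hypothetical helper for illustration; real tools differ in tokenization
    and in how they treat overlapping series.
    """
    counts = Counter()
    # Split on sentence punctuation so that a series never spans two sentences.
    for sentence in re.split(r"[.!?;:]+", text.lower()):
        # Keep hyphenated words such as "push-technology" as single items.
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence)
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    # Apply the user-specified minimum frequency threshold.
    return {series: c for series, c in counts.items() if c >= min_freq}

# Invented sample text, used only to exercise the function.
sample = (
    "Antivirus software relies on virus signature files. "
    "Developers update virus signature files daily. "
    "Install antivirus software today."
)
candidates = repeated_ngrams(sample, min_freq=2)
```

With a threshold of two, `candidates` contains "antivirus software" and "virus signature files", but also the embedded series "virus signature" and "signature files". A practical tool would presumably keep only the longest series when a shorter one never occurs outside it; this sketch leaves that step out.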

Based on a minimum-frequency threshold of two, the text in figure 3 yielded two potential terms: "antivirus software" and "virus signature files." Unfortunately, this simple strategy often leads to problems because language is full of repetition, but not all repeated series of lexical items qualify as terms. For instance, consider the slightly modified version of the text shown in figure 4.


Working solely on the basis of identifying repeated series of lexical items, the term-extraction software has identified two additional candidates: "developers are" and "as often as."



Figure 3 A short text that has been processed by a statistical term-extraction tool using a minimum frequency threshold of two.




Figure 4 A slightly modified version of the text that has been processed by a statistical term-extraction tool using a minimum frequency threshold of two.



These candidates constitute "noise" rather than terms, and they would need to be eliminated from the list of potential terms by a human. Stop lists can be used to reduce the number of unlikely terms that may otherwise be identified as candidates. For instance, a stop list could be implemented to instruct the term-extraction tool to ignore series that begin or end with function words, such as prepositions, articles, and conjunctions.
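A stop-list filter of this kind might look like the following sketch. The contents of `STOP_LIST` and the function name `filter_candidates` are assumptions for illustration. Note that the auxiliary verbs "is" and "are" have been added to the list here, beyond the prepositions, articles, and conjunctions mentioned above, because otherwise "developers are" would pass the filter.

```python
# Illustrative stop list; a real tool would ship a much longer one.
STOP_LIST = {
    "a", "an", "the",                                         # articles
    "as", "at", "by", "for", "in", "of", "on", "to", "with",  # prepositions
    "and", "but", "or",                                       # conjunctions
    "is", "are",                                              # auxiliaries (added assumption)
}

def filter_candidates(candidates, stop_words=STOP_LIST):
    """Drop candidates whose first or last word is on the stop list."""
    kept = []
    for candidate in candidates:
        words = candidate.lower().split()
        if not words:
            continue  # ignore empty strings
        if words[0] in stop_words or words[-1] in stop_words:
            continue  # the series begins or ends with a function word
        kept.append(candidate)
    return kept

raw = ["antivirus software", "virus signature files", "developers are", "as often as"]
filtered = filter_candidates(raw)
```

Here `filtered` keeps "antivirus software" and "virus signature files" and discards the two noise candidates. Function words inside a series are deliberately left alone, so patterns such as NOUN+PREPOSITION+NOUN can still be retained.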


Another drawback to the statistical approach is that not all of the terms that appear in a given text will be repeated, which may lead to "silence" (valid terms that the tool fails to identify). For instance, in figure 4, the term "push-technology updating" was not identified as a candidate because it appeared only once in the text and the minimum frequency threshold was set to two.


A related statistical approach to identifying candidate terms is to calculate mutual information (MI). The premise here is that if two lexical items appear together more often than they appear separately, the multi-word unit in question may be a potential term. Once again, however, this approach is not foolproof, and noise and silence may occur.
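The text does not say which formula is meant. One common formulation is pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), which compares how often two items occur together with how often they would co-occur if they were independent. The sketch below assumes that formulation and estimates the probabilities from raw counts; `pmi_scores` and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_freq=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than min_freq times are skipped, because PMI is
    unreliable for very rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_freq:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        # A high score means the pair occurs together more than chance predicts.
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Invented token sequence, used only to exercise the function.
tokens = (
    "the virus signature files and the virus scanner "
    "and the virus signature files"
).split()
scores = pmi_scores(tokens)
```

In this sample, "signature files" scores higher than "the virus": "signature" and "files" never occur apart, whereas "virus" also appears in other combinations. The pair "and the" nevertheless receives the same score as "virus signature", which shows how noise can survive this measure.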


Nevertheless, the use of statistics as a basis for term extraction does have one clear strength: it is not language dependent. This means that a statistical term-extraction tool can, in principle, be used to process texts in multiple languages.



Editor: admin


