
Benefits and Drawbacks of Working with Corpus-Analysis Tools

Published: 2023-07-15 10:00:33   Author: etogether.net   Source: Internet



Finally, multi-word units may be difficult to identify with a feature such as a word-frequency list. Word-frequency lists generally treat white space as the boundary between words, but some concepts can be expressed only by a multi-word unit (e.g., "operating system," "boot sector virus"). A word-frequency list therefore cannot identify multi-word units or report their frequency, although other features, such as concordancers or collocation generators, can help to identify and count them. Alternatively, another type of software, known as term-extraction software, may prove useful for identifying them.
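The limitation can be illustrated with a minimal Python sketch. The sample text and the variable names are invented for illustration, and counting adjacent word pairs is only a rough stand-in for what a collocation generator or term extractor does; real tools typically add statistical filtering and respect sentence boundaries.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for this illustration.
text = (
    "The operating system loads first. A boot sector virus can damage "
    "the operating system before the operating system starts."
)

# Tokenise on runs of letters, roughly as a word-frequency list would.
tokens = re.findall(r"[a-z]+", text.lower())

# Single-word frequency list: "operating" and "system" are counted separately.
word_freq = Counter(tokens)

# Adjacent word pairs (bigrams): here "operating system" surfaces as one unit.
# Simplification: pairs are also formed across sentence boundaries.
bigram_freq = Counter(zip(tokens, tokens[1:]))

print(word_freq["operating"], word_freq["system"])
print(bigram_freq[("operating", "system")])
```

The single-word list reports "operating" and "system" as two unrelated entries, whereas the pair count shows that they occur together as a unit.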


6. Character sets and language-related difficulties

Some technical difficulties may arise for translators working with certain languages. Not all corpus-analysis tools come equipped with the character sets for all languages. While many languages written in the Roman alphabet can be processed without difficulty, some tools may not be able to handle languages that use other scripts, such as Arabic, Greek, Hebrew, and Russian.


Asian languages, such as Chinese, Japanese, and Korean, present further difficulties. Whereas the characters of many languages can each be stored in one byte (one unit of storage), a single byte can distinguish only 256 values, which is far too few for the thousands of characters these languages use. The encodings traditionally used for them therefore require two bytes to store a single character, and such a language is said to have a double-byte character set (DBCS).
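The difference in storage can be checked with a short Python sketch. The sample characters and the choice of legacy encodings (Latin-1, GBK, Shift JIS, EUC-KR) are assumptions made for illustration; other encodings exist for each language.

```python
# Each sample character is paired with a legacy encoding commonly used for it.
samples = {
    "a": "latin-1",     # Roman letter, single-byte character set
    "中": "gbk",         # Chinese
    "あ": "shift_jis",   # Japanese
    "한": "euc_kr",      # Korean
}

# Number of bytes needed to store each character in its encoding.
byte_lengths = {ch: len(ch.encode(codec)) for ch, codec in samples.items()}

for ch, n in byte_lengths.items():
    print(ch, samples[ch], n)
```

The Roman letter occupies one byte, while each of the Chinese, Japanese, and Korean characters occupies two.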

Unfortunately, many computer applications, including many corpus-analysis tools, have been written in such a way that they can process only single-byte characters. Therefore, translators who work with double-byte languages may not be able to use certain corpus-analysis tools. Fortunately, Unicode, a single standard for encoding the characters of all languages that was originally designed around two-byte (16-bit) units, is now emerging as an industry standard (Unicode Consortium, 2000), and it is hoped that applications developers will build Unicode support into all future products and releases.
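What goes wrong in a single-byte tool can be sketched in Python as follows. The sample term and the use of GBK and Latin-1 are assumptions chosen for illustration; the point is only that a program assuming one byte per character miscounts and misreads double-byte text.

```python
term = "操作系统"  # "operating system" in Chinese

# The term as stored in a double-byte encoding (GBK is assumed here).
raw = term.encode("gbk")

# A tool that assumes one byte per character sees twice as many "characters",
# none of which is the intended text.
as_single_byte = raw.decode("latin-1")

# Decoding with the correct double-byte codec recovers the four characters.
as_double_byte = raw.decode("gbk")

print(len(raw), len(as_single_byte), len(as_double_byte))  # prints: 8 8 4
```

Any character count, frequency list, or concordance built on the single-byte reading would be based on eight meaningless symbols rather than on the four characters of the term.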


Another problem that may arise for some languages is alignment. In order to create a bilingual parallel corpus, the alignment tool must be able to divide the source and target texts into segments (e.g., sentences, paragraphs). This means that the system must be able to recognize which elements indicate the end of a segment (e.g., punctuation). When working with languages that do not use Indo-European-style punctuation, alignment tools may have difficulty determining where one segment ends and the next begins, which makes it difficult to align the corresponding segments of the source and target texts.
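The segmentation problem can be shown with a minimal Python sketch. The function `split_segments`, the sample sentences, and the terminator lists are hypothetical and greatly simplified; an actual alignment tool would also have to deal with abbreviations, decimal points, and similar cases.

```python
import re

def split_segments(text, terminators):
    """Split text into segments ending at any of the given terminator characters."""
    cls = re.escape(terminators)
    # A segment is a run of non-terminators followed by an optional terminator.
    pieces = re.findall(rf"[^{cls}]+[{cls}]?", text)
    return [p.strip() for p in pieces if p.strip()]

english = "The system boots. The virus is removed."
chinese = "系统启动。病毒被清除。"

# Western punctuation only: works for English, fails to split the Chinese text.
en_segments = split_segments(english, ".!?")
zh_western = split_segments(chinese, ".!?")

# Adding the full-width Chinese terminators restores correct segmentation.
zh_full = split_segments(chinese, ".!?。！？")

print(len(en_segments), len(zh_western), len(zh_full))
```

With only Western terminators, the two Chinese sentences come back as a single segment, so they cannot be paired one-to-one with the two English segments.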


7. Economic aspects

For the most part, corpus-analysis tools are very reasonably priced, many costing less than a few hundred dollars, which is within the budget of many translators. In addition, they do not typically have excessive hardware requirements, though as corpora grow in size, translators will need sufficient disk space to store them.

