
Natural and Enhanced Corpora

Published: 2023-04-27 09:22:42   Author: etogether.net   Source: Internet


Corpus-based methods will enable new approaches to internationalization and thus significantly improve the speed, efficiency, and accuracy of computer-assisted translation and localization. Using corpora as part of an internationalization strategy implies that they can be manipulated or engineered in some way to become more effective tools for translation and localization. Just as internationalization in software engineering calls for reengineering the kernel of a software application, using corpora in internationalization implies developing or compiling special-purpose corpora whose contents and linguistic and textual characteristics are analyzed and annotated (marked or tagged in some way) so that later translation and subsequent authoring become faster, more accurate, and more efficient.


Varantola (1997) has referred to specialized corpora created and targeted for a given translation task as precision corpora. While she has referred, in the main, to smaller corpora compiled by individual translators, her basic idea could be extrapolated to include the engineering of large-scale corpora on an organization- or industry-wide basis, specifically to improve translation and localization in particular domains. Both Ahmad et al. and Varantola have referred to the ephemeral quality of the special corpora constructed to assist in translation activity. Ahmad et al. (1994) write of virtual corpora, ephemeral constructs created to help translators complete their translation tasks. Varantola (2000) has also used the interesting phrase disposable corpora in the same context. Given the current practice in the language industry of retaining and aggregating all translation resources produced for a company by its translators, it would seem illogical to discard the information gathered in the course of translation research, or to ignore the potentially relevant information that could be gathered and stored if bilingual corpora could be discovered (or constructed) and exploited. This argues for making Ahmad's and Varantola's ephemeral corpora permanent and integrating them into the translation resources and translation technology of an organization or localization vendor.


Varantola's conception raises a question. How would one begin to compile a precision corpus large enough, organized enough, and comprehensive enough to be useful in computer-assisted translation and viable as an internationalization strategy? One answer would be to begin by analyzing an organization's naturally occurring collection of documents, what we might call a natural corpus, and then use it as a seed corpus to construct a large-scale precision corpus. The natural corpus is not representative of the entire language or textual system, but is, as Noam Chomsky noted so long ago, linguistically skewed (1957: 159). For the purposes at hand, the skewed nature of the corpus is desirable. We are not interested in general language or in discovering the formal characteristics of the language system as a whole, but in those linguistic and textual features that are domain- or language community-bound: special language and text. The natural corpus can be assumed to contain exemplars of the specialized linguistic and textual preferences of a specific and well-defined language community. In some sense, it is a circumscribed text world, a finite repertoire of textual interaction structures used in a particular communicative community (Neubert and Shreve 1992: 41). For the sake of conceptual completeness, we can define a natural corpus as the entire set of documents produced and stored in an organization. An intranet-bounded natural corpus is that subset of an organization's natural corpus which is in machine-readable format and discoverable by computational means.
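
The idea of an intranet-bounded natural corpus, the machine-readable and computationally discoverable subset of an organization's documents, can be illustrated with a minimal sketch. The article does not describe any implementation; the function name `discover_seed_corpus`, the `MACHINE_READABLE` set of file extensions, and the use of a file-system directory as a stand-in for an intranet are all assumptions made purely for illustration.

```python
from pathlib import Path

# Assumption: these suffixes stand in for "machine-readable format".
# A real organization would choose its own list.
MACHINE_READABLE = {".txt", ".html", ".xml", ".md"}


def discover_seed_corpus(root, extensions=MACHINE_READABLE):
    """Return a sorted list of files under `root` with a machine-readable suffix.

    `root` is a hypothetical document store (here just a directory tree)
    representing the organization's natural corpus.
    """
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")  # walk the whole tree recursively
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Calling `discover_seed_corpus("/path/to/documents")` would return the candidate files for a seed corpus; documents in other formats (scanned images, for example) remain part of the natural corpus but fall outside the discoverable subset until they are converted.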


