
Natural and Enhanced Corpora

Published: 2023-04-27 09:22:42   Author: etogether.net   Source: the Web


A corpus-based approach to internationalization would entail analysis of the natural corpus to construct models of its specialized content (domain model) and its range of document types (document structure model). These two models reflect the contents (topics, subject areas, or domains, as well as specialized linguistic objects such as terms or phrases) and the kinds of document types of greatest import and utility to the organization. Once constructed, these models can be used to provide parameters to intelligent agents, such as Web spiders (automated Internet search programs that "crawl" the Web looking for documents), so that they may acquire new documents in a specific, targeted manner from the Internet and/or other document repositories outside the original boundaries of the organization's corpus, and integrate them into it.

The construction of the domain and document structure models is the mechanism for using the natural corpus as the seed for a larger precision corpus. New documents can be added to the original seed corpus if they meet certain criteria; for instance, if the distribution of diagnostic terminology in target documents meets certain thresholds. The new corpus thus constructed can be a significant enhancement over the original, as it can be assumed to contain a more complete set of the prototypical instances of the specialized vocabulary, semantic relations, linguistic usages, phraseology, and document formats and document types that are of greatest import and utility to the organization. This enhanced corpus can be taken to reflect more accurately the existing practices in the written communications of the linguistic community to which the organization belongs (see Figure 1).
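The threshold-based admission criterion described above can be sketched in Python. The diagnostic term list, the density measure, and the 5% threshold are illustrative assumptions for this sketch, not values taken from the text:

```python
import re
from collections import Counter

# Hypothetical diagnostic terms, assumed to have been extracted from the
# seed (natural) corpus by prior corpus analysis.
DIAGNOSTIC_TERMS = {"localization", "terminology", "markup", "corpus"}

def term_density(text: str, terms: set[str]) -> float:
    """Fraction of word tokens in `text` that are diagnostic terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[t] for t in terms)
    return hits / len(tokens)

def accept_into_corpus(text: str, threshold: float = 0.05) -> bool:
    """Admit a crawled document into the enhanced corpus only if the
    distribution of diagnostic terminology meets the threshold."""
    return term_density(text, DIAGNOSTIC_TERMS) >= threshold

doc = "Corpus analysis supports terminology extraction for localization."
print(accept_into_corpus(doc))  # → True (3 of 7 tokens are diagnostic)
```

A Web spider would apply such a filter to each candidate document it retrieves, so that only documents resembling the seed corpus in their terminology profile enter the precision corpus.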


The natural corpus prior to enhancement is typically not annotated; it is a raw corpus. As SGML (Standard Generalized Markup Language) and its simplified subset, XML (Extensible Markup Language), become more commonly used in business, natural corpora will increasingly contain preexisting annotation. Even where such annotation exists, application-specific annotation will most likely have to be added to make the corpus useful for computer-assisted translation and an effective tool for internationalization. This implies tagging the results of localization- and translation-specific corpus analysis with metadata expressed in a markup language such as XML. Many translation scholars have recognized the utility of markup languages in translation corpora (Luz and Baker 2000).
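As an illustration of adding application-specific annotation to a raw corpus segment, the following Python sketch wraps a discovered term in XML metadata using the standard library. The element and attribute names (`seg`, `term`, `domain`, `status`) are invented for this example, not drawn from any published annotation standard:

```python
import xml.etree.ElementTree as ET

# Annotate one raw corpus sentence: the term "translation memory" is tagged
# with hypothetical application-specific metadata.
seg = ET.Element("seg", attrib={"id": "s1", "lang": "en"})
seg.text = "The "
term = ET.SubElement(seg, "term", attrib={"domain": "localization", "status": "approved"})
term.text = "translation memory"
term.tail = " stores aligned segments."

xml_out = ET.tostring(seg, encoding="unicode")
print(xml_out)
# <seg id="s1" lang="en">The <term domain="localization" status="approved">
#   translation memory</term> stores aligned segments.</seg>
```

Because the annotation is inline XML, later retrieval of all approved terms in a given domain reduces to a simple query over the marked-up corpus.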



Figure 1. Domain modeling using intelligent agents.


Markup allows for the description and later retrieval of linguistic, semantic, and textual objects of relevance to translation and localization that are discovered by corpus analysis. Non-linguistic information related to the parameters of the localization or translation task can also be stored. Markup objects are not limited to terminology and aligned translation units: collocation and phrase collections, term contexts, thesaurus or concept relationships, style and usage patterns, recurrent text segments, and textual superstructures diagnostic of particular textual forms could also be discovered and annotated. Metadata schemes for annotating these elements could be developed anew or adapted from existing schemas. Yves Savourel (2000: 67) and others have also argued for the inclusion of localization information in documents, using a kind of Localization Markup Language. Markup and metadata schemas are already in widespread use for translation memories and terminology management (TMX, XLT).
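Since TMX is cited as an established markup scheme for translation memories, a minimal sketch of building a TMX-style translation unit (one aligned segment per language) may be useful. The helper function and the example sentence pair are assumptions for illustration, though the `tu`/`tuv`/`seg` element names follow the TMX convention:

```python
import xml.etree.ElementTree as ET

def make_tu(pairs: dict[str, str]) -> ET.Element:
    """Build a TMX-style translation unit (<tu>) holding one aligned
    segment (<tuv>/<seg>) per language."""
    tu = ET.Element("tu")
    for lang, text in pairs.items():
        tuv = ET.SubElement(tu, "tuv", attrib={"xml:lang": lang})
        seg = ET.SubElement(tuv, "seg")
        seg.text = text
    return tu

tu = make_tu({"en": "Save your changes.", "fr": "Enregistrez vos modifications."})
out = ET.tostring(tu, encoding="unicode")
print(out)
```

In a full translation memory, such units would sit inside a TMX `<body>` with a `<header>` carrying corpus-level metadata; the point here is simply that aligned translation units are themselves markup objects and can be generated and queried with ordinary XML tooling.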






