Benefits and Drawbacks of Working with Corpus-Analysis Tools

Published: 2023-07-15 10:00:33 | Author: etogether.net | Source: Internet

Finally, multi-word units may be difficult to identify when using a feature such as a word-frequency list. Word-frequency lists generally treat white space as a boundary between words, but some coherent concepts can be expressed only using a multi-word unit (e.g., "operating system," "boot sector virus"). A word-frequency list on its own therefore cannot identify multi-word units or determine their frequency, although other features such as concordancers or collocation generators may be useful for helping to identify and count these units. Alternatively, another type of software, known as term-extraction software, may prove useful for identifying them.
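The limitation can be illustrated with a minimal Python sketch. It is not how any particular corpus-analysis tool works: the function names `word_frequencies` and `ngram_frequencies` and the sample sentence are assumptions made up for illustration. The first function splits on white space only, as a simple word-frequency list would; the second counts runs of adjacent words, which is roughly the starting point for a collocation generator.

```python
from collections import Counter

PUNCTUATION = '.,;:!?"()'

def tokenize(text):
    """Split on white space only and strip surrounding punctuation."""
    tokens = [t.strip(PUNCTUATION).lower() for t in text.split()]
    return [t for t in tokens if t]

def word_frequencies(text):
    """Hypothetical word-frequency list: one entry per single word."""
    return Counter(tokenize(text))

def ngram_frequencies(text, n):
    """Count runs of n adjacent words as candidate multi-word units."""
    tokens = tokenize(text)
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)

sample = ("The operating system loads first. A boot sector virus can "
          "infect the operating system before it starts.")

words = word_frequencies(sample)
bigrams = ngram_frequencies(sample, 2)
trigrams = ngram_frequencies(sample, 3)

print(words["operating"], words["system"])   # each word counted separately
print("operating system" in words)           # the unit itself is never listed
print(bigrams["operating system"])           # visible only as an n-gram
print(trigrams["boot sector virus"])
```

Note that the n-gram counts also include meaningless sequences (for example, pairs that straddle a sentence boundary), which is one reason a human still has to review the candidates.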

6. Character sets and language-related difficulties

Some technical difficulties may arise for translators working with certain languages. Not all corpus-analysis tools come equipped with the character sets for all languages. While many languages written in the Roman alphabet can be processed without difficulty, some tools may not be able to handle languages that use other scripts, such as Arabic, Greek, Hebrew, and Russian.

Asian languages, such as Chinese, Japanese, and Korean, present further difficulties. Whereas the characters of many languages can each be stored using one byte (one unit of storage), Asian languages with large inventories of complex characters have traditionally required two bytes to store a single character. Such a language is therefore said to have a double-byte character set (DBCS).
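The difference in storage can be checked with a short Python sketch. The choice of sample characters and of the legacy encodings `latin-1`, `shift_jis`, `gbk`, and `euc_kr` is an assumption made for illustration; the helper name `bytes_per_character` is hypothetical.

```python
# One sample character per legacy encoding (illustrative choices only).
samples = {
    "latin-1": "é",     # French, single-byte encoding
    "shift_jis": "語",  # Japanese
    "gbk": "语",        # Simplified Chinese
    "euc_kr": "어",     # Korean
}

def bytes_per_character(char, encoding):
    """Return how many bytes one character occupies in the given encoding."""
    return len(char.encode(encoding))

for encoding, char in samples.items():
    print(encoding, char, bytes_per_character(char, encoding))
```

In this sketch the Roman-alphabet character occupies one byte, while each of the Chinese, Japanese, and Korean characters occupies two.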

Unfortunately, many computer applications, including many corpus-analysis tools, have been written in such a way that they can process only single-byte characters. Therefore, translators who work with double-byte languages may not be able to use certain corpus-analysis tools. Fortunately, Unicode, a standard designed to encode the characters of all languages and originally built around two-byte (16-bit) units, is now emerging as an industry standard (Unicode Consortium, 2000), and it is hoped that application developers will incorporate Unicode support into all future products and releases.
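A rough sketch of what goes wrong when a tool assumes one byte per character is given below. The function `naive_character_count` is a hypothetical stand-in for such a tool, not the behavior of any real product, and the sample phrase and the use of Shift JIS as the file encoding are assumptions.

```python
text = "日本語のコーパス"          # 8 characters ("Japanese-language corpus")
raw = text.encode("shift_jis")    # the bytes as stored in a legacy file

def naive_character_count(data: bytes) -> int:
    """Hypothetical single-byte tool: treats every byte as one character."""
    return len(data)

def decoded_character_count(data: bytes, encoding: str) -> int:
    """Decode with the correct encoding first, then count characters."""
    return len(data.decode(encoding))

print(naive_character_count(raw))                 # counts bytes, not characters
print(decoded_character_count(raw, "shift_jis"))  # counts actual characters

# Reading the same bytes as a single-byte encoding yields one unrelated
# symbol per byte instead of the original text.
misread = raw.decode("latin-1")
print(len(misread), misread == text)
```

Here the byte-oriented count is twice the true character count, so any frequency list or concordance built on it would be unreliable.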

Another problem that may arise for some languages is alignment. In order to create a bilingual parallel corpus, the alignment tool must be able to divide the source and target texts into segments (e.g., sentences, paragraphs). This means that the system must be able to recognize which elements indicate the end of a segment (e.g., punctuation). When working with languages that do not use Indo-European-style punctuation, alignment tools may have difficulty determining where one segment ends and the next begins, which in turn makes it difficult to align the corresponding segments of the source and target texts.
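The segmentation problem can be sketched in Python as follows. This is an illustrative assumption, not the method used by any actual alignment tool: the two patterns, the helper `split_segments`, and the sample sentences are all hypothetical, and the final pairing assumes a simple one-to-one correspondence between segments, which real texts do not always have.

```python
import re

# Boundary rule tuned for English-style text: ., ! or ? followed by white space.
WESTERN_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

# Extended rule (an assumption) that also accepts full-width CJK marks,
# which are normally not followed by any white space.
EXTENDED_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")

def split_segments(text, boundary):
    """Split text at segment boundaries and drop empty pieces."""
    return [s.strip() for s in boundary.split(text) if s.strip()]

english = "The file was saved. Restart the computer."
chinese = "文件已保存。请重新启动计算机。"

# The Western rule finds no boundary in the Chinese text at all.
print(split_segments(chinese, WESTERN_BOUNDARY))

# With the extended rule both texts yield two segments, so they can be paired.
source_segments = split_segments(english, EXTENDED_BOUNDARY)
target_segments = split_segments(chinese, EXTENDED_BOUNDARY)
for source, target in zip(source_segments, target_segments):
    print(source, "<->", target)
```

With only the Western rule, the Chinese text comes back as a single segment, so two English sentences would have to be matched against one undivided block.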

7. Economic aspects

For the most part, corpus-analysis tools are very reasonably priced, many costing less than a few hundred dollars, which is within the budget of many translators. In addition, they do not typically have excessive hardware requirements, though as corpora grow in size, translators will need sufficient disk space to store them.

Editor: admin