Another important consideration when building a corpus is whether or not a text can legally or ethically be included in a corpus. Like printed texts, electronic texts are subject to copyright laws, and if a user wishes to hold a text in a corpus, it is first necessary to establish the precise details of the text's ownership and obtain the owner's permission. Because the Internet is such a new phenomenon, copyright laws in many countries predate this technology, which means that they are sometimes unclear about ownership of electronic texts. Nevertheless, many countries are in the process of updating their laws to address this issue, and translators would be wise to investigate the ownership of any text and obtain permission before including it in a corpus. If the corpus is strictly for personal use, it may be acceptable to include a text (or a portion of a text) in it without obtaining permission, in the same way that it is legal to produce photocopies of documents (or parts of documents) for personal use; however, if the corpus is going to be used for commercial purposes, it is absolutely essential to obtain copyright permission.
4. Pre-processing
Different software applications work with different file formats, so the files in the corpus must be converted to the format used by the corpus-analysis tool in question. Many corpus-analysis tools process plain text (ASCII) files, which require little pre-processing, although unwanted line or paragraph breaks may need to be deleted. Furthermore, the ASCII character set is limited, so some accented letters or foreign characters may not be represented. Other tools may require the corpus to be converted into a special format. In addition, we have already discussed other types of pre-processing that may be required, for example, converting printed text into electronic form using OCR or voice-recognition technology, annotating or marking up a corpus, and aligning texts in the case of bilingual parallel corpora. The more pre-processing that is required, the more time the translator must devote to carrying it out and to verifying that it has been done correctly (proofreading and editing).
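To give a concrete, if simplified, illustration of this kind of pre-processing, the following Python sketch removes unwanted line breaks within paragraphs from a set of plain text files before they are added to a corpus. It assumes that paragraphs are separated by blank lines, and the directory names "corpus_raw" and "corpus_clean" are purely hypothetical; any real pre-processing routine would need to be adapted to the conventions of the files and tools actually being used.

```python
import re
from pathlib import Path

def normalize(text: str) -> str:
    """Join hard line breaks inside paragraphs; keep blank-line paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)               # blank lines mark paragraph ends
    rejoined = (" ".join(p.split()) for p in paragraphs)  # collapse breaks and extra spaces
    return "\n\n".join(p for p in rejoined if p)

# Hypothetical directory names: raw files in, cleaned files out.
Path("corpus_clean").mkdir(exist_ok=True)
for path in Path("corpus_raw").glob("*.txt"):
    cleaned = normalize(path.read_text(encoding="utf-8", errors="replace"))
    Path("corpus_clean", path.name).write_text(cleaned, encoding="utf-8")
```

Even with such a routine in place, the output still needs to be proofread, since automatic clean-up can occasionally join lines (for example, headings or list items) that were meant to stay separate.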
5. Speed and information-retrieval issues
Once a corpus has been compiled, translators will find that they can typically work more quickly with electronic media than with printed media. They can therefore consult a greater number of documents, and the consultation process itself can be much faster: corpus-analysis tools allow users to focus their research by accessing relevant sections of a document directly, rather than reading it in a linear fashion from beginning to end.
Nevertheless, translators must develop sensible search strategies for consulting corpora. Corpus-analysis tools are not intelligent – they work using pattern-matching techniques. Therefore, they will retrieve exactly what users ask them to retrieve, even if this is not necessarily what the users want to retrieve. Common problems include "silence" and "noise." In the case of silence, a pattern that is of interest to the user is not retrieved because the search string is not comprehensive enough. For example, a translator may be interested in examining all contexts that contain any form of the verb "to go." The wildcard search pattern "go*" will retrieve most forms of the verb "to go," including "go," "goes," "going," and "gone," but it will not retrieve the simple past form "went." Conversely, a search pattern that is too broad will retrieve noise (patterns that are not of interest). For example, if a translator wants to examine all contexts that contain a form of the verb "to enter," he or she might try using the wildcard search pattern "enter*." This will retrieve forms such as "enter," "entered," "entering," and "enters," but it will also retrieve all forms of "entertain," "enterprise," and even "enterogastritis" if these terms happen to appear anywhere in the corpus. To reduce both noise and silence, translators must think carefully about the search strategies they use. It may even be necessary to develop different search techniques for working with different languages or subject fields.
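The difference between noise and silence can be illustrated with a short Python sketch that simulates a wildcard search using regular expressions. The sample sentence and patterns are invented for illustration, and actual corpus-analysis tools differ in the wildcard syntax they support.

```python
import re

sample = ("She went home before he goes out. They enter the hall, "
          "entered quietly, and the entertainment enterprise entertains guests.")

# Wildcard-style search "enter*" translated into a regular expression:
# it retrieves noise such as "entertainment" and "enterprise" alongside
# the verb forms the user actually wants.
noisy = re.findall(r"\benter\w*", sample, flags=re.IGNORECASE)

# A tighter pattern listing only the inflected forms of "to enter" cuts the
# noise; by contrast, a pattern like "go*" would still miss "went" (silence).
precise = re.findall(r"\benter(?:s|ed|ing)?\b", sample, flags=re.IGNORECASE)

print(noisy)    # ['enter', 'entered', 'entertainment', 'enterprise', 'entertains']
print(precise)  # ['enter', 'entered']
```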
Other potential retrieval problems include homographs and homonyms. Homographs are words that have the same spelling but have different parts of speech; for example, "cooks" can be a noun ("Too many cooks spoil the broth") or a verb ("He cooks dinner for his mother every Wednesday"). In contrast, homonyms are words that have the same spelling and the same part of speech, but have different meanings; for instance, "ball" can be a noun that refers to a round object used in sports (e.g., "golf ball," "tennis ball") or a noun that refers to a large formal gathering for social dancing (e.g., "masquerade ball," "debutante ball"). If a user is working with an unannotated corpus, there will be no way for the computer to distinguish between homographs and homonyms, and so the data may be slightly distorted. If it is important for users to be able to automatically distinguish between words having different senses or parts of speech, it will be necessary to annotate the corpus accordingly.
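The following Python sketch suggests how part-of-speech annotation makes such distinctions possible. It assumes a simple word/TAG annotation scheme (Penn Treebank-style tags are shown purely for illustration); real annotated corpora and corpus-analysis tools use a variety of encoding schemes.

```python
# Two annotated contexts containing the homograph "cooks": once as a noun (NNS)
# and once as a verb (VBZ).
annotated = [
    "Too/RB many/JJ cooks/NNS spoil/VBP the/DT broth/NN",
    "He/PRP cooks/VBZ dinner/NN for/IN his/PRP$ mother/NN every/DT Wednesday/NNP",
]

def concordance(corpus, word, pos_prefix):
    """Return only the lines in which `word` carries the requested part of speech."""
    hits = []
    for line in corpus:
        for token in line.split():
            form, _, tag = token.rpartition("/")
            if form.lower() == word and tag.startswith(pos_prefix):
                hits.append(line)
                break
    return hits

print(concordance(annotated, "cooks", "NN"))  # noun use only: the proverb
print(concordance(annotated, "cooks", "VB"))  # verb use only: "He cooks dinner..."
```

Without the tags, both contexts would be retrieved for either query, and the noun and verb uses of "cooks" could not be separated automatically.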