OCR software takes the scanned image and, through a process of pattern matching, converts the stored image of the text into a form that is truly machine-readable and can be processed by other types of software (e.g., word processors, concordancers, translation memories).
At its most basic, OCR software examines each character in the scanned image and compares it to a series of character patterns stored in a database. When the software perceives that it has made a match, it stores the matched character in a new file and moves on to the next character. Once all the characters in the scanned image have been processed in this way, the new file can be saved in an appropriate format (e.g., as a text file) and opened in an application such as a word processor, where it can be edited or manipulated as desired.
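The character-by-character matching described above can be illustrated with a minimal sketch. It assumes each scanned character has already been isolated as a tiny bitmap; the 3x5 patterns in `TEMPLATES` and the function names are hypothetical and stand in for the much larger pattern database a real OCR package would use.

```python
# Hypothetical 3x5 bitmaps ('#' = ink, '.' = background). A real OCR
# database would hold far more patterns at a much higher resolution.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def similarity(glyph, template):
    """Count the pixels on which two equal-sized bitmaps agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize_character(glyph):
    """Return the stored character whose pattern best matches the glyph."""
    return max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))


def recognize_text(glyphs):
    """Match each glyph in turn and collect the results in a new string."""
    return "".join(recognize_character(glyph) for glyph in glyphs)


if __name__ == "__main__":
    # A "T" with one pixel of ink missing still matches "T" most closely.
    noisy_t = ("###", ".#.", ".#.", "...", ".#.")
    print(recognize_text([TEMPLATES["L"], TEMPLATES["I"], noisy_t]))
```

Because the best match is chosen even when it is imperfect, a slightly damaged glyph can still be recognized, but a badly damaged one may be closer to the wrong pattern.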
1. Factors affecting the accuracy of OCR
Not surprisingly, there is scope for error during the character-recognition process. Sometimes the OCR software makes an incorrect match. For instance, the letter "e" might be mistaken for a "c," the number "5" might be mistaken for the letter "S," or the letters "c" and "l" might be mistakenly combined to form a "d." A number of factors can affect the accuracy of OCR, one of the most important being the quality of the hard copy.
Figure 1 Sample texts of differing quality.
On the one hand, if the document being scanned is faded (e.g., a poor-quality photocopy or fax), the intensity of the light that is reflected from the faded characters may be too similar to the level of intensity that is reflected from the background. In such circumstances, the OCR software does not pick out all the characters. On the other hand, if the document is blurred, smudged, stained, or creased, parts of it that should actually comprise the background may be incorrectly recognized as characters. Other factors include the size and style of the font on the page (e.g., small characters and characters in fancy scripts are more difficult to process, as are texts that contain a mixture of fonts), the layout of the text (e.g., columns and tables can be difficult to process), and the character set used (e.g., it may be necessary to purchase different OCR packages to process different languages or to handle texts containing mathematical formulas). For the best results, texts should be clean original laser printouts with limited formatting.
A selection of texts of differing qualities is shown in figure 1.
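The effect of faded or smudged originals can be sketched with a simple brightness threshold. This is only an illustration under stated assumptions: pixels are grayscale values from 0 (black) to 255 (white), the cutoff of 128 is arbitrary, and the function names are hypothetical. Real packages use more sophisticated methods to separate text from background.

```python
def binarize(gray_rows, threshold=128):
    """Mark a pixel as ink (True) when it is darker than the threshold."""
    return [[pixel < threshold for pixel in row] for row in gray_rows]


def ink_count(gray_rows, threshold=128):
    """Count how many pixels would be treated as part of a character."""
    return sum(sum(row) for row in binarize(gray_rows, threshold))


# A vertical stroke three pixels tall, scanned under three conditions.
crisp = [[240, 20, 240], [240, 20, 240], [240, 20, 240]]
faded = [[240, 200, 240], [240, 200, 240], [240, 200, 240]]
smudged = [[240, 20, 110], [240, 20, 240], [100, 20, 240]]

if __name__ == "__main__":
    for name, image in (("crisp", crisp), ("faded", faded), ("smudged", smudged)):
        print(name, ink_count(image))
```

In the faded sample the stroke is too close in brightness to the background and disappears entirely, while in the smudged sample two background pixels are dark enough to be counted as ink.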
Sophisticated OCR packages use more complex processing techniques than those described here. One alternative to the classic approach, which focuses on isolated characters, is a technique that makes use of context. For example, if an OCR system looked at the surrounding characters before making a decision, it would be easier to tell that the first character in the string "Sir" should be interpreted as the letter "S" and not as the number "5." Another technique is to integrate a dictionary-checking stage. For example, if a pattern is initially interpreted as "hcuse," a dictionary check would reveal that this is not a legitimate combination and would suggest viable alternatives, such as "house." This is similar to the way many spellchecker programs work in word-processing packages. It does not, however, solve all misinterpretation problems. For example, if an OCR program mistakenly identifies the "e" in "read" as an "o," the resulting word will be "road," which will not be identified as an error during a dictionary check because "road" is a legitimate word. In the future, OCR software may need to take even larger contexts into account, perhaps parsing entire sentences in order to determine whether a given word makes sense in this larger context.
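A dictionary-checking stage of this kind might look roughly like the following sketch. The word list, the table of easily confused characters, and the function name `suggest` are all hypothetical and deliberately tiny; a real package would rely on a full lexicon and a much richer model of likely misreadings.

```python
# Hypothetical miniature lexicon and confusion table, for illustration only.
WORDS = {"house", "read", "road", "sir"}
CONFUSABLE = {
    "c": "eo",  # an "e" or "o" may have been misread as "c"
    "5": "S",   # an "S" may have been misread as "5"
}


def suggest(word):
    """Return the word itself if it is in the lexicon; otherwise return
    every lexicon word reachable by swapping one confusable character."""
    if word.lower() in WORDS:
        return [word]
    suggestions = []
    for i, ch in enumerate(word):
        for alternative in CONFUSABLE.get(ch, ""):
            candidate = word[:i] + alternative + word[i + 1:]
            if candidate.lower() in WORDS:
                suggestions.append(candidate)
    return suggestions


if __name__ == "__main__":
    for token in ("hcuse", "5ir", "road"):
        print(token, suggest(token))
```

Here "hcuse" and "5ir" are repaired because neither is in the lexicon, but "road" is accepted unchanged even if the original text said "read", which is exactly the limitation noted above.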