Optical character recognition

Optical character recognition, usually abbreviated to OCR, involves computer systems designed to translate images of typewritten text (usually captured by a scanner) into machine-editable text, that is, to translate pictures of characters into a standard encoding scheme representing them (ASCII or Unicode). OCR began as a field of research in artificial intelligence and machine vision; though academic research in the field continues, the focus of OCR work has shifted to the implementation of proven techniques.

Originally, optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were considered separate fields. Since very few applications survive that use true optical techniques, the term OCR has been broadened to cover digital character recognition as well.

Early systems required "training" (essentially, the provision of known samples of each character) to read a specific font. "Intelligent" systems that can recognize most fonts with a high degree of accuracy are now common. Some systems are even capable of reproducing formatted output that closely approximates the original scanned page, including images, columns and other non-textual components.
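The idea of "training" on known samples can be illustrated with a minimal sketch of template matching: each stored sample is a small bitmap, and a scanned glyph is assigned the label of the sample it differs from in the fewest pixels. The 3x3 bitmaps, the names TEMPLATES, hamming and classify, and the choice of pixel-difference matching are all assumptions made for illustration; this is not a description of how any particular early system worked.

```python
# Hypothetical 3x3 bitmaps standing in for the "known samples" of one font.
TEMPLATES = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}


def hamming(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(glyph, templates=TEMPLATES):
    """Return the label of the stored sample closest to the scanned glyph."""
    return min(templates, key=lambda label: hamming(glyph, templates[label]))


# A "T" with one stray pixel in the bottom-right corner.
noisy_t = ("111",
           "010",
           "011")
print(classify(noisy_t))
```

A system built this way can only read the font whose samples it was given, which is the limitation described above.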

The United States Postal Service has been using OCR machines to sort mail since 1965. Mail sorting now plays a small role in OCR research; OCR systems need only read the postal code on each envelope. After the postal code has been read, a bar code with the same information can be printed on the envelope. To avoid interference with the human-readable address field, which can be located anywhere on the letter, a special ink is used that is clearly visible under UV light. This ink looks orange in normal lighting conditions. Envelopes marked with the machine-readable bar code may then be processed; machine-readable codes can be decoded more quickly than human-readable letters and numbers.

While the accurate recognition of Latin-script typewritten text is now considered largely a solved problem, the recognition of hand printing and handwriting in general, and of printed text in some other scripts (particularly those with a very large number of characters), is still the subject of active research.

Systems for recognizing hand-printed text on the fly have enjoyed commercial success in recent years. Among these are the input devices for the Palm Pilot and other personal digital assistants. The algorithms used in these devices take advantage of the fact that the order, speed and direction of the individual line segments are known as they are input. In addition, the user is retrained to use specific alphabetic shapes. These constraints do not apply to algorithms that scan paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved fairly easily, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited contexts.
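The step from a per-character accuracy rate to errors per page is simple arithmetic, sketched below. The figure of 300 characters per page is an assumption (roughly a sparsely filled hand-printed form); the article itself gives no page size, and a denser page would yield proportionally more errors. The name expected_errors is likewise illustrative.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of misread characters on one page."""
    return round(chars_per_page * (1.0 - accuracy))


# Assumption: about 300 characters on a hand-printed page.
CHARS_PER_PAGE = 300

for accuracy in (0.80, 0.90):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.0%} accuracy -> about {errors} errors per page")
```

Under that assumption, 80% accuracy gives about 60 errors per page and 90% accuracy about 30.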

Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the amount line of a check (where the information is known to be a written-out number) is an example of a smaller dictionary, where accuracy rates can be increased greatly. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to recognize all handwritten cursive script accurately (better than 98%).
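One simple way to use a small dictionary, such as the words that can appear on the amount line of a check, is to replace each raw character-level reading with the closest word in the vocabulary. The sketch below uses Levenshtein edit distance as the closeness measure; that choice, the word list AMOUNT_WORDS, and the sample misreadings are assumptions for illustration, not a description of any particular check-reading system.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical small vocabulary for the amount line of a check.
AMOUNT_WORDS = ["one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "twenty", "thirty", "forty", "fifty",
                "hundred", "thousand", "and", "dollars"]


def snap_to_dictionary(raw_word, vocabulary=AMOUNT_WORDS):
    """Replace a raw character-level reading with the closest vocabulary word."""
    return min(vocabulary, key=lambda word: edit_distance(raw_word, word))


# Made-up misreadings, each one character away from the intended word.
raw_reading = ["tw0", "hundrcd", "fitty", "dollors"]
print([snap_to_dictionary(w) for w in raw_reading])
# prints ['two', 'hundred', 'fifty', 'dollars']
```

The smaller the vocabulary, the fewer near neighbors each word has, which is why a restricted context such as a check amount can tolerate far noisier character readings than free text.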

A particularly difficult problem for both computers and humans is that of old church baptismal and marriage records, which contain mostly names, where the pages may be damaged by age, water or fire and the names may be obsolete or contain rare spellings. Computer image processing techniques can assist humans in reading extremely difficult texts such as the Archimedes Palimpsest or the Dead Sea Scrolls. Cooperative approaches, in which computers assist humans and vice versa, are an interesting area of research.

Character recognition has been an active area of computer science research since the late 1950s. It was initially perceived as an easy problem, but it turned out to be a much more interesting problem. It will be many decades, if ever, before computers will be able to read all documents with the same accuracy as human beings.

One area where the accuracy and speed of computer input of character information exceed those of humans is MICR (magnetic ink character recognition), where error rates are around one read error for every 20,000 to 30,000 checks.
