Optical Character Recognition - OCR
Optical Character Recognition
software used
imagemagick (convert) version 6.8.8.10-r1
tesseract version 3.03_rc1
source file: PDF made up entirely of scanned images of a book.
destination format: text file
problem
I had scanned images of books that I wanted in electronic, text format. They were manuals for a computer course I was taking. The pwd
solution
step 1 - break up pdf into individual files using convert
The convert utility gets installed along with the imagemagick package.
$ convert -limit memory 1 -monitor -verbose -density 300 -colorspace Gray source_file.pdf output_file.png
The command above does not write a single file; it writes one PNG file per page of the source PDF.
example...
output_file-0.png, output_file-1.png, output_file-2.png, etc. (ImageMagick numbers the pages starting at 0)
step 2 - translate image text into plain text using OCR
tesseract will perform the OCR. We feed it the individual page images and append the recognized text to a single output file. A bash loop processes all of the PNG files from step 1 for us.
$ ls -tr1 output_file*.png | while read -r line; do tesseract "$line" stdout >> output.txt; done
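The loop above relies on ls -tr, that is, on file modification times, to keep the pages in order. If the files have been copied or touched since step 1, that order can be lost. The following is a minimal sketch of an alternative that orders the pages by the number in the file name instead. It assumes the output_file-N.png naming from step 1; the function names pages_in_order and ocr_pages are made up for this example.

```shell
# List page images in numeric page order, so that
# output_file-2.png comes before output_file-10.png.
# Assumes files are named <prefix>-<number>.png as produced in step 1.
pages_in_order() {
    local prefix
    local f
    local n
    prefix="$1"
    for f in "$prefix"-*.png; do
        [ -e "$f" ] || continue      # nothing matched the pattern
        n="${f#"$prefix"-}"          # strip the leading "<prefix>-"
        n="${n%.png}"                # strip the trailing ".png"
        printf '%s\t%s\n' "$n" "$f"  # emit "<number><TAB><file>"
    done | sort -n | cut -f2-
}

# Run tesseract on every page, in order, appending the text to one file.
ocr_pages() {
    local prefix
    local out
    local page
    prefix="$1"
    out="$2"
    pages_in_order "$prefix" | while IFS= read -r page; do
        tesseract "$page" stdout >> "$out"
    done
}
```

Usage would look like: ocr_pages output_file output.txt. Because the text is appended with >>, remove any old output.txt before running it a second time.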
step 3 - cleanup
Invariably, the OCR output will contain mistranslations. For example, the document I was translating had "open quote" and "close quote" characters in it; however, ASCII only has the straight double quote character. When viewing the text file using less, the hex codes for the non-ASCII characters should be displayed. I would examine the file and identify patterns. If a pattern had a reliable translation, I would perform it using the stream editor sed.
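As an illustration of this kind of cleanup, here is a minimal sketch that replaces typographic quotes with their ASCII equivalents. It assumes the OCR output is UTF-8 encoded and that straight quotes are an acceptable substitute; the function name ascii_quotes is made up for this example, and the substitutions actually needed will depend on the patterns found in your own file.

```shell
# Replace typographic (curly) quotes with plain ASCII quotes.
# Reads text on stdin and writes the cleaned text to stdout.
ascii_quotes() {
    sed -e 's/“/"/g' -e 's/”/"/g' \
        -e "s/‘/'/g" -e "s/’/'/g"
}
```

Usage would look like: ascii_quotes < output.txt > output_clean.txt. Writing to a new file keeps the original OCR output around in case a substitution turns out to be wrong.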