Difference between revisions of "Optical Character Recognition - OCR"
Latest revision as of 06:17, 19 September 2014
Optical Character Recognition
software used
imagemagick (convert) version 6.8.8.10-r1
tesseract version 3.03_rc1
source file: PDF made up entirely of scanned images of a book.
destination format: text file
problem
I had scanned images of books that I wanted in electronic text format. They were manuals for a computer course I was taking. The company offering the course would not provide a PDF copy of the book, so I found one that someone else had scanned into a PDF. The PDF was entirely images, with no text. They probably used the format only because PDF supports multipage documents. I needed a real electronic copy: while taking the course, I wanted to be able to copy and paste commands into my terminal and into my wiki notes.
solution
step 1 - break up pdf into individual files using convert
The convert utility gets installed along with the imagemagick package.
$ convert -limit memory 1 -monitor -verbose -density 300 -colorspace Gray source_file.pdf output_file.png
The command above does not write a single file. It writes one file per page of the source PDF, numbered from 0.
example...
output_file-0.png, output_file-1.png, output_file-2.png, etc...
step 2 - translate image text into plain text using OCR
tesseract performs the OCR. We feed it the individual page images and append the recognized text to a single output file. A bash loop processes all the files produced in step 1. Here ls -tr lists them oldest first, so the pages stay in the order convert wrote them.
$ ls -tr1 output_file*.png | while read -r line; do tesseract "$line" stdout >> output.txt; done
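The loop above relies on file modification times to keep the pages in order. If the images have been copied or touched since step 1, that order can be lost. Below is a minimal sketch of an alternative that sorts by the page number in the filename instead. The function names pages_in_order and ocr_pages are made up for illustration, and the sketch assumes GNU sort (for the -V option) and filenames without newlines.

```shell
# Hypothetical helper: print the page images for a given prefix,
# sorted by the number in the filename (so -10 comes after -2).
# Assumes GNU sort, which provides -V (version sort).
pages_in_order() {
    printf '%s\n' "$1"-*.png | sort -V
}

# Hypothetical helper: OCR every page in order and write all of the
# recognized text to one file.
#   $1 = filename prefix, e.g. output_file
#   $2 = destination text file, e.g. output.txt
ocr_pages() {
    pages_in_order "$1" | while IFS= read -r page; do
        tesseract "$page" stdout
    done > "$2"
}
```

With these defined, ocr_pages output_file output.txt would take the place of the one-line loop. Note that it overwrites output.txt rather than appending to it.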
step 3 - cleanup
Invariably, the OCR will produce some mistranslations. For example, the document I was translating had "open quote" and "close quote" characters in it. However, ASCII only has the plain double quote character. When viewing the text file using less, the hex codes for the non-ASCII characters should be displayed. I would examine the file and identify patterns. Where a pattern had a reliable translation, I would apply it using the stream editor sed.
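As an illustration of the kind of sed cleanup described here, this is a minimal sketch that maps typographic quotes to their ASCII equivalents. It assumes the OCR output is UTF-8 and that the offending characters are the usual curly quotes. The function name ascii_quotes is made up for this example, and the actual substitutions needed depend on what turns up in the file.

```shell
# Hypothetical helper: read text on stdin, write it to stdout with
# curly quotes replaced by plain ASCII quotes.
# One substitution per character, so it does not depend on the locale
# handling multibyte characters inside a bracket expression.
ascii_quotes() {
    sed -e 's/“/"/g' \
        -e 's/”/"/g' \
        -e "s/‘/'/g" \
        -e "s/’/'/g"
}
```

Usage would look like ascii_quotes < output.txt > output_clean.txt. Writing to a new file keeps the raw OCR output around in case a substitution turns out to be wrong.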