
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://www.thelinuxwiki.com/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://www.thelinuxwiki.com/index.php?action=history&amp;feed=atom&amp;title=Optical_Character_Recognition_-_OCR</id>
		<title>Optical Character Recognition - OCR - Revision history</title>
		<link rel="self" type="application/atom+xml" href="http://www.thelinuxwiki.com/index.php?action=history&amp;feed=atom&amp;title=Optical_Character_Recognition_-_OCR"/>
		<link rel="alternate" type="text/html" href="http://www.thelinuxwiki.com/index.php?title=Optical_Character_Recognition_-_OCR&amp;action=history"/>
		<updated>2026-04-29T02:46:04Z</updated>
		<subtitle>Revision history for this page on the wiki</subtitle>
		<generator>MediaWiki 1.21.5</generator>

	<entry>
		<id>http://www.thelinuxwiki.com/index.php?title=Optical_Character_Recognition_-_OCR&amp;diff=874&amp;oldid=prev</id>
		<title>Nighthawk: /* problem */</title>
		<link rel="alternate" type="text/html" href="http://www.thelinuxwiki.com/index.php?title=Optical_Character_Recognition_-_OCR&amp;diff=874&amp;oldid=prev"/>
				<updated>2014-09-19T06:17:59Z</updated>
		
		<summary type="html">&lt;p&gt;‎&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;problem&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class='diff diff-contentalign-left'&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
			&lt;tr style='vertical-align: top;'&gt;
			&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
			&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 06:17, 19 September 2014&lt;/td&gt;
			&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 15:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 15:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== problem ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== problem ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;I had scanned images of books that I wanted in electronic, text format.&amp;#160; They were manuals for a computer course I was taking.&amp;#160; The &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;pwd&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;I had scanned images of books that I wanted in electronic, text format.&amp;#160; They were manuals for a computer course I was taking.&amp;#160; The &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;company offering the course would not provide a pdf copy of the book.&amp;#160; So, I found one someone else had scanned into a pdf.&amp;#160; The pdf was entirely images, no text.&amp;#160; They probably just used the format because pdf supports multipage documents.&amp;#160; I needed the electronic copy.&amp;#160; When taking the course, I wanted to be able to copy and paste commands into my terminal.&amp;#160; Also, I wanted to copy and paste into my wiki notes.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#160;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#160;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== solution ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== solution ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Nighthawk</name></author>	</entry>

	<entry>
		<id>http://www.thelinuxwiki.com/index.php?title=Optical_Character_Recognition_-_OCR&amp;diff=873&amp;oldid=prev</id>
		<title>Nighthawk at 06:14, 19 September 2014</title>
		<link rel="alternate" type="text/html" href="http://www.thelinuxwiki.com/index.php?title=Optical_Character_Recognition_-_OCR&amp;diff=873&amp;oldid=prev"/>
				<updated>2014-09-19T06:14:57Z</updated>
		
		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class='diff diff-contentalign-left'&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
			&lt;tr style='vertical-align: top;'&gt;
			&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
			&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 06:14, 19 September 2014&lt;/td&gt;
			&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 12:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 12:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;destination format: text file&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;destination format: text file&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== problem ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;I had scanned images of books that I wanted in electronic, text format.&amp;#160; They were manuals for a computer course I was taking.&amp;#160; The pwd&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Nighthawk</name></author>	</entry>

	<entry>
		<id>http://www.thelinuxwiki.com/index.php?title=Optical_Character_Recognition_-_OCR&amp;diff=872&amp;oldid=prev</id>
		<title>Nighthawk: Created page with &quot;=Optical Character Recognition=   == software used ==   imagemagick (convert) version 6.8.8.10-r1  tesseract version 3.03_rc1  source file: PDF made up entirely of scanned ima...&quot;</title>
		<link rel="alternate" type="text/html" href="http://www.thelinuxwiki.com/index.php?title=Optical_Character_Recognition_-_OCR&amp;diff=872&amp;oldid=prev"/>
				<updated>2014-09-19T06:06:31Z</updated>
		
		<summary type="html">&lt;p&gt;Created page with &amp;quot;=Optical Character Recognition=   == software used ==   imagemagick (convert) version 6.8.8.10-r1  tesseract version 3.03_rc1  source file: PDF made up entirely of scanned ima...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;=Optical Character Recognition=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== software used ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
imagemagick (convert) version 6.8.8.10-r1&lt;br /&gt;
&lt;br /&gt;
tesseract version 3.03_rc1&lt;br /&gt;
&lt;br /&gt;
source file: PDF made up entirely of scanned images of a book.&lt;br /&gt;
&lt;br /&gt;
destination format: text file&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== solution ==&lt;br /&gt;
&lt;br /&gt;
=== step 1 - break up pdf into individual files using convert ===&lt;br /&gt;
&lt;br /&gt;
The convert utility gets installed along with the imagemagick package.  &lt;br /&gt;
&lt;br /&gt;
 $ convert -limit memory 1 -monitor -verbose -density 300 -colorspace Gray source_file.pdf output_file.png&lt;br /&gt;
&lt;br /&gt;
The command above will not output just one file, but 1 file per page of the source pdf.&lt;br /&gt;
&lt;br /&gt;
example...&lt;br /&gt;
&lt;br /&gt;
output_file-1.png, output_file-2.png,output_file-3.png, etc...&lt;br /&gt;
&lt;br /&gt;
=== step 2 - translate image text into plain text using OCR ===&lt;br /&gt;
&lt;br /&gt;
tesseract will perform the OCR.  We feed it the individual files and append the text to an output file.  A bash loop will process all the output files for us.&lt;br /&gt;
&lt;br /&gt;
 $ ls -tr1 output_file*.png | while read line; do tesseract $line stdout &amp;gt;&amp;gt; output.txt; done&lt;br /&gt;
&lt;br /&gt;
=== step 3 - cleanup ===&lt;br /&gt;
&lt;br /&gt;
Invariably, we will have mistranslations by the OCR. For example, the document I was translating had &amp;quot;open quote&amp;quot; and &amp;quot;close quote&amp;quot; characters in it.  However; ascii only has the double quote character.  When viewing the text file using less, the hex codes for the non-ascii characters should be displayed.  I would examine the file and identify patterns.  If one had a reliable traslation, I would perform it using the line editor sed.&lt;/div&gt;</summary>
		<author><name>Nighthawk</name></author>	</entry>

	</feed>