<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  
  
  <channel>
    <title>chrisjrob: ocr</title>
    <link>https://chrisjrob.com</link>
    <atom:link href="https://chrisjrob.com/tag/ocr/feed/index.xml" rel="self" type="application/rss+xml" />
    <description>GNU Linux, Perl and FLOSS</description>
    <language>en-gb</language>
    <pubDate>Fri, 13 Feb 2026 17:22:31 +0000</pubDate>
    <lastBuildDate>Fri, 13 Feb 2026 17:22:31 +0000</lastBuildDate>
    
    <item>
      <title>How To Scan to OCR From The Command Line</title>
      <link>https://chrisjrob.com/2011/10/24/how-to-scan-to-ocr-from-the-command-line/</link>
      <pubDate>Mon, 24 Oct 2011 00:00:00 +0000</pubDate>
      <author>chrisjrob@gmail.com (Chris Roberts)</author>
      <guid>https://chrisjrob.com/2011/10/24/how-to-scan-to-ocr-from-the-command-line</guid>
      <description>
       <![CDATA[
         
         <p>I just had to remind myself how to scan to OCR, and thought I would
share the results.</p>

<p>Before you start, you need to have sane installed, and you also need
tesseract-ocr - both should be available in your distro's repositories.</p>

<!--more-->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-get install sane-utils tesseract-ocr
</code></pre></div></div>

<p>Next you need to find out what scanners you have available, and you do
this with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ scanimage -L
device `v4l:/dev/video0' is a Noname Vimicro USB Camera (Altair) virtual device
device `plustek:libusb:004:002' is a Epson Perfection 1250/Photo flatbed scanner
</code></pre></div></div>

<p>Obviously the latter is my scanner.</p>
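
<p>If you would rather pick the device names out in a script than by eye,
something like the following might do. It is only a sketch:
<code class="language-plaintext highlighter-rouge">list_scan_devices</code> is a name I have made up, and it assumes the
output always follows the <code class="language-plaintext highlighter-rouge">device `NAME' is a ...</code> pattern shown above.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: read "scanimage -L" style output on standard input
# and print just the device names, one per line.
list_scan_devices() {
    sed -n "s/^device \`\(.*\)' is a .*/\1/p"
}

# Assumed usage: scanimage -L | list_scan_devices
</code></pre></div></div>

<p>Any line that does not match the pattern is silently dropped, so an
empty result means either no scanners or a different output format.</p>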

<p>Assuming you have a working scanner, the following is a simple two-liner
to scan and OCR. First scan the page to a TIFF, using the device name
reported above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ scanimage -d 'plustek:libusb:004:002' --mode Lineart \
--format tiff -x 215 -y 297 --resolution 200 &gt; /tmp/example.tif
</code></pre></div></div>

<p>And finally convert to text with tesseract:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tesseract /tmp/example.tif example
</code></pre></div></div>

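<p>Tesseract appends .txt to the output name, so you should now have a file
<code class="language-plaintext highlighter-rouge">example.txt</code> in your current directory, which you can open in any
text editor.</p>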
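
<p>If you do this often, the two steps could be wrapped in a small shell
function. This is a rough sketch rather than anything polished:
<code class="language-plaintext highlighter-rouge">scan_to_text</code> and its arguments are my own invention, and the scan
settings are simply the ones used above.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical wrapper around the two commands above.
# $1 = sane device name, $2 = output base name (default: example)
scan_to_text() {
    device=$1
    base=${2:-example}
    tif="/tmp/$base.tif"

    # Step 1: scan one page to a TIFF; give up if the scan fails.
    scanimage -d "$device" --mode Lineart --format tiff \
        -x 215 -y 297 --resolution 200 &gt; "$tif" || return 1

    # Step 2: OCR the TIFF; tesseract writes $base.txt in the current directory.
    tesseract "$tif" "$base"
}
</code></pre></div></div>

<p>You would then call it with something like
<code class="language-plaintext highlighter-rouge">scan_to_text 'plustek:libusb:004:002' letter</code>, which should leave
you with letter.txt.</p>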

<p>Obviously this has limitations - the scan area given by -x and -y is in
millimetres and covers a single A4 portrait page, and it works best on typed
documents - but it gives you the basics.</p>

<p>You could experiment with the resolution. 200 dpi worked for me,
so I didn’t bother trying anything else. Traditionally the higher the
resolution the better, but I seem to recall that tesseract works better
at 300 dpi and below.</p>
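
<p>To get a feel for what a given resolution means, you can work out the
pixel size of the scan: the -x and -y values are in millimetres, and there
are 25.4 mm to the inch. A rough sketch using shell integer arithmetic
(<code class="language-plaintext highlighter-rouge">mm_to_pixels</code> is a made-up helper, and the result is rounded down):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: pixels = millimetres / 25.4 * dpi, rounded down.
# $1 = length in mm, $2 = resolution in dpi
mm_to_pixels() {
    echo $(( $1 * $2 * 10 / 254 ))
}

mm_to_pixels 215 200   # width of the scan above, prints 1692
mm_to_pixels 297 200   # height of the scan above, prints 2338
</code></pre></div></div>

<p>So the scan above comes out at roughly 1692 by 2338 pixels.</p>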

<p>On my Epson Perfection 1250 I found that I needed to add the switch
<code class="language-plaintext highlighter-rouge">--warmup-time 0</code> to the scanimage command, as otherwise the scanner never finished warming up.</p>

<p>If you would prefer to OCR an existing PDF, which is another thing that
I find myself doing from time to time, then first convert it to a TIFF
with ImageMagick's convert:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ convert -density 200 example.pdf -depth 8 /tmp/example.tif
</code></pre></div></div>

<p>And then run the above tesseract command.</p>
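
<p>Again, the two commands could be rolled into one function. This is a
sketch under the same assumptions as before:
<code class="language-plaintext highlighter-rouge">pdf_to_text</code> is a hypothetical name, and it names the output after
the PDF rather than always using example.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical wrapper: OCR an existing PDF.
# $1 = path to the PDF; the text ends up in NAME.txt in the current directory.
pdf_to_text() {
    pdf=$1
    base=$(basename "$pdf" .pdf)
    tif="/tmp/$base.tif"

    # Convert the PDF to an 8-bit TIFF at 200 dpi; give up if that fails.
    convert -density 200 "$pdf" -depth 8 "$tif" || return 1

    # Then the same tesseract step as for a scanned page.
    tesseract "$tif" "$base"
}
</code></pre></div></div>

<p>For example, <code class="language-plaintext highlighter-rouge">pdf_to_text example.pdf</code> should leave you with
example.txt.</p>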


       ]]>
      </description>
    </item>
    
  </channel> 
</rss>
