There are a few different OCR engines/utilities for Linux. I tested tesseract, gocr, cuneiform, ocrad, and gimagereader. I had crappy results with gocr, ocrad, and cuneiform. I like gimagereader the most, as it’s basically just a GUI frontend to tesseract, which I also got good results from.
Cool features of gimagereader are scanner support, selecting different sections of text off the image, and the ability to edit the output before saving it to a txt file.
Here are the steps I used:
- Scanned the images using xsane v0.999.
- xsane options: multipage, choose # of pages, true gray, 600 dpi (I tried different values up to 1200 dpi and couldn’t see an improvement beyond 600 dpi), created new project.
- After all the PNM images were saved, I ran them through a tool called ‘unpaper’. It cleans up scanned pages (for example by deskewing them and removing dark edges) before OCR, which helps improve your results. See the unpaper man page for the available options.
- for i in image.*.pnm; do unpaper "$i" "new-$i"; done
- Then I converted the PNM files to TIFF using ImageMagick’s convert tool. See the convert man page for details.
- for i in new-*.pnm; do convert "$i" "final.images-$i.tif"; done
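The two loops above can also be folded into one small shell function. This is only a sketch, assuming unpaper and ImageMagick’s convert are on your PATH and the scans are named image.*.pnm as in the steps above; `prepare_scans` is a made-up helper name, not part of either tool. It keeps the same output names as the one-liners.

```shell
#!/bin/sh
# Sketch: clean each scan with unpaper, then convert the result to TIFF.
# prepare_scans is a hypothetical helper; file names follow the loops above.
prepare_scans() {
    dir=${1:-.}
    for src in "$dir"/image.*.pnm; do
        # Skip the literal pattern when no scans match.
        [ -e "$src" ] || continue
        name=$(basename "$src")
        cleaned="$dir/new-$name"
        # Skip pages that were already cleaned on an earlier run.
        [ -e "$cleaned" ] || unpaper "$src" "$cleaned" || return 1
        # Same naming as the one-liner: final.images-new-image.N.pnm.tif
        convert "$cleaned" "$dir/final.images-new-$name.tif" || return 1
    done
}
```

Run it as `prepare_scans /path/to/scans`, or with no argument to work in the current directory. It stops at the first page that fails, so a bad scan does not go unnoticed.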
Using TIFF images is important when you’re attempting to extract text with OCR software.
Then I opened up gimagereader and loaded these TIFF files into its images section. From there you can select each file and mark the text you want extracted. You can use a wizard to auto-select the sections for you, or you can select them manually. Once that is done, click the “ABC” button, which opens the output pane, where you can simply append the new text to the end of the file or edit the file on the fly. Once you’re satisfied, you can save the txt file.
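If you don’t need to pick regions by hand, the same TIFF files can be fed straight to tesseract from the command line. The sketch below assumes the tesseract binary is installed, that the English language data (`eng`) is what you want, and that the files are named final.images-*.tif as above; `ocr_all` and `all-pages.txt` are names I made up for illustration.

```shell
#!/bin/sh
# Sketch: OCR every TIFF with tesseract and collect the text in one file.
# ocr_all and all-pages.txt are hypothetical names; "eng" is an assumed default.
ocr_all() {
    dir=${1:-.}
    lang=${2:-eng}
    out="$dir/all-pages.txt"
    : > "$out"   # start with an empty combined file
    for tif in "$dir"/final.images-*.tif; do
        # Skip the literal pattern when no TIFFs match.
        [ -e "$tif" ] || continue
        # tesseract adds the .txt extension to the output base name itself.
        tesseract "$tif" "${tif%.tif}" -l "$lang" || return 1
        cat "${tif%.tif}.txt" >> "$out"
    done
}
```

Pages are appended in the shell’s glob order, so check that your file names sort the way the pages should read. You lose gimagereader’s region selection and on-the-fly editing this way, so it is best for clean, single-column pages.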