I wanted to translate two-digit years into four-digit years, for the 1900s only.
Year ranges were written with two digits in the spreadsheet, such as 35-37 or 40-50. These needed to be expanded to 1935 1936 1937 or 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950.
for i in $(seq 35 37); do echo "$i + 1900" | bc; done
for i in $(seq 40 50); do echo "$i + 1900" | bc; done
To do several ranges at once, create a file named ‘years’ with one two-digit range on each line, written as the start and end separated by a space (for example, 35 37), then loop over it:
while read -r line; do for j in $(seq $line); do echo "$j + 1900" | bc; done; echo ""; done < years
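If the ranges are still in the spreadsheet's dashed form (35-37), the loop above can be adapted to split on the dash first. This is only a rough sketch: expand_years is a made-up helper name, and it assumes every value is a 1900s year.

```shell
# expand_years (hypothetical helper): turn "35-37" into "1935 1936 1937".
# Assumes both ends of the range are two-digit years from the 1900s.
expand_years() {
    start=${1%-*}   # text before the dash
    end=${1#*-}     # text after the dash (same as start if there is no dash)
    out=""
    for y in $(seq "$start" "$end"); do
        out="$out $((1900 + y))"
    done
    echo "${out# }" # drop the leading space
}
```

With one dashed range per line in the ‘years’ file, it could be called as: while read -r range; do expand_years "$range"; done < years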
I wanted to have Excel do this automatically, but for now I’m just using this method and copy/pasting the results into the spreadsheet. If anyone has any suggestions, please let me know.
There are a few different OCR engines/utilities for Linux. I tested tesseract, gocr, cuneiform, ocrad, and gimagereader. I had crappy results with gocr, ocrad, and cuneiform. I like gimagereader the most, as it’s basically just a GUI frontend to tesseract, which I also got good results from.
Cool features of gimagereader are scanner support, selecting different sections of text off the image, and the ability to edit the output before saving it to a txt file.
Here are the steps I used:
- Scanned images using xsane v.999.
- xsane options: multipage, choose # of pages, true gray, 600 dpi (tried different values up to 1200 and couldn’t see an improvement beyond 600 dpi), created new project.
- After all the PNM images were saved, I used a tool called ‘unpaper’. This tool cleans up the scanned images before OCR and helps improve your results. Here’s the man page for unpaper.
- for i in image.*.pnm; do unpaper "$i" "new-$i"; done
- Then I converted the PNM files to TIFF using the ImageMagick convert tool. Here’s the man page for convert.
- for i in new-*.pnm; do convert "$i" "final.images-$i.tif"; done
Using TIFF images is important when you’re attempting to use OCR extraction software.
Then I opened gimagereader and loaded these TIFF files into its images section. From there you can select each file and choose the text you want extracted. You can use a wizard to auto-select the sections for you, or you can select them manually. Once that is done, click the “ABC” button to open the output pane, where you can append the new data to the end of the file or edit the file on the fly. Once you’re satisfied, you can save the txt file.
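Since gimagereader is a frontend to tesseract, the same TIFF files could probably also be run through the tesseract command line in one batch, without the manual region selection. This is a sketch rather than the method I used: ocr_batch is a made-up function name, and it relies on the basic "tesseract IMAGE OUTPUTBASE" form, which writes OUTPUTBASE.txt.

```shell
# ocr_batch (hypothetical helper): run tesseract on each TIFF given as an argument.
# For page.tif the recognized text ends up in page.txt.
ocr_batch() {
    for f in "$@"; do
        [ -e "$f" ] || continue        # skip names that do not exist
        tesseract "$f" "${f%.tif}"     # output base = file name without .tif
    done
}
```

It would be called as: ocr_batch final.images-*.tif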
Navigate to chrome://net-internals/#dns and press the “Clear host cache” button.