View Full Version : Optical Scanning, Text Conversion ...


Farnsworth,Luther P.
02-09-2008, 07:54 AM
... software and/or equipment. Is there anything decent and affordable out there yet? Sites like Cornell's HEARTH and associated sites are increasing the numbers of books scanned and posted online whose copyrights have expired, and other sites are doing the same with free stuff. I have some cheesy software that supposedly does this that came with my scanner, but of course it's crap, really. It's also some 7 years old, and I was wondering if things have gotten better and cheaper yet in that technology?

Evil Elmo
02-09-2008, 12:14 PM
What you want is an OCR program. The most widely used is OmniPage. It is for OCR what Microsoft Word is for word processing. OmniPage pretty much dominates the market.

Manu
02-09-2008, 07:10 PM
OmniPage is decent.

Also, Acrobat Standard does OCR.

Farnsworth,Luther P.
02-10-2008, 10:25 AM
Have you guys used those packages?

I was concerned that even great software wouldn't be much use if the scanner hardware isn't very good. The scanner gives me good image copies when I print them out, but the output from my converter software is barely legible at best. My scanner seems to work fine resolution-wise when directly copying, so I'm at a loss as to what the software interface is actually doing here.

tinhorn
02-11-2008, 02:11 AM
I've used a few different OCR programs that came with scanners, on dozens of old books, with pretty consistent and pretty good results. Even the best of programs are reputed to generate a few errors, so any text will have to be proofread.

I'm not sure I understand your dilemma, but is there a possibility that in setting up your OCR, you selected a greyscale printer setting instead of b/w?
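A rough illustration of the greyscale vs. b/w point, not taken from any particular OCR package: a b/w scan forces every pixel to pure black or pure white, which tends to give the recognizer cleaner letter shapes than a range of greys. The function name, the cutoff of 128, and the sample pixel values below are all made up for the sketch.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Map each 0-255 greyscale value to 0 (black) or 255 (white).

    The threshold of 128 is an arbitrary midpoint chosen for illustration;
    real scanner software may pick the cutoff differently.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray_rows]


# A tiny invented 2x4 "scan": dark ink, smudges, and paper background.
scan = [
    [20, 90, 200, 250],
    [35, 140, 180, 245],
]

bw = to_black_and_white(scan)
for row in bw:
    print(row)
```

Values near the cutoff (90 and 140 here) land on opposite sides, which is why faded old pages can come out either too heavy or too faint depending on the setting.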

I just discovered that books.google.com includes many old volumes from libraries. I've heard of different groups who were scanning public domain texts, but I haven't figured out how to find their stashes. Got any links for us (or keywords)?

Evil Elmo
02-11-2008, 11:32 AM
Yes, I have. Also, what Manu said. Acrobat (not Acrobat Reader) also does OCR pretty well.

Gibson
02-11-2008, 11:39 AM
Acrobat 7/8 = Win.

Farnsworth,Luther P.
02-12-2008, 05:04 AM
Thanks for the input, folks.

I'm not sure I understand your dilemma, but is there a possibility that in setting up your OCR, you selected a greyscale printer setting instead of b/w?

I was thinking it was something like that; I've played with the DPI and all that stuff. It just may be the quality of the old scans themselves.

I just discovered that books.google.com includes many old volumes from libraries. I've heard of different groups who were scanning public domain texts, but I haven't figured out how to find their stashes. Got any links for us (or keywords)?

http://hearth.library.cornell.edu/h/hearth/browse.html

The above is great, IMHO; 'home economics' in the Progressive era, i.e. 1890-1920, covered a lot of subjects.

http://chla.library.cornell.edu/

A lot of good old stuff here, including a circa-1910s text on the costs of building your own dams and electrifying your farm.

These are all texts whose copyrights have expired.

There are a few magazines, and the old periodicals are pretty cool, too. These are the main scans I'm trying to get decent conversions of. Many of them have text versions, but those versions mangle the tables, numbers and other characters don't come out well, and of course the illustrations are left out.

Manu
02-13-2008, 02:17 AM
Acrobat Standard is definitely a cheaper way to go, and does good OCR.

Farnsworth,Luther P.
02-13-2008, 07:32 AM
Acrobat Standard is definitely a cheaper way to go, and does good OCR.

Yes. It certainly seems to be the most popular. Thanks for the input, people.

Farnsworth,Luther P.
02-18-2008, 09:07 AM
Cornell has updated their Digital Collections page, if anybody is interested. Here is the complete list. MOA would be of interest to Americans, as would the ones I already linked to above. They've added quite a bit to it since the last time I hit this page:

Cornell Digital Collection (http://rdc.library.cornell.edu/search/index.php?mode=browse&type=Collection)
