pdftotext extracting from image files mystery

gaudencio · 2010-10-15 11:40:20

hello all, just had a bit of a shock when I ran pdftotext (accidentally) on an un-OCR'd pdf file, and it extracted all the text. Running pdftohtml (as I'd intended) produced the expected output, i.e. png dumps. Curious, I then tried running it on a bunch of other downloaded files, and with the exception of one, pdftotext extracted the text from ALL of them. The mystery then is that it isn't using any OCR (no tesseract dependency, plus it's way too quick), so clearly it must be pulling the text directly from the files. But if the text really is there in the original, then presumably the people who scanned them didn't know, else they'd have left it there. The only thing I can think is that some common piece of pdf software (probably acrobat) has OCR built in, so it's scanning them automatically and then encoding the hidden text in the pdf.

take for example (legal):
http://www.cd3wd.com/cd3wd_40/JF/JF_VE/SMALL/27-714.pdf

open up in a normal pdfreader - very clearly a (poorly) scanned document. Now run 'pdftotext -layout' on it and you get a pretty impressive text file, considering the source. Sure, some of the formatting is messed up, but I'm sure anyone with sed knowledge could quickly sort most of that out. Besides, that's one of the worst documents - on most others it was perfect, maintaining columns (even tesseract can't do that) and everything.

It makes for a fantastic command line pdf reader that works on almost all the files I've thrown at it, and since pdfs are almost the only reason I ever have to load up X, consider me tickled pink.
Just thought I'd mention it here, as a google search brings up nothing on the issue. Hope someone finds it useful.
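
For anyone wanting to repeat the check over a whole directory of downloads, here is a minimal Python sketch. It assumes poppler's pdftotext is on the PATH; the function names and the 20-word threshold are made up for illustration, and the threshold is only a rough heuristic for "this file has a hidden text layer".

```python
import subprocess


def extract_text(pdf_path, layout=True):
    """Run pdftotext on pdf_path and return whatever text it finds.

    Passing "-" as the output file sends the text to stdout instead of
    writing a .txt file next to the PDF.
    """
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical layout (columns etc.)
    cmd += [pdf_path, "-"]
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text, min_words=20):
    """Heuristic: the PDF probably carries real text if we see enough words.

    A purely scanned PDF usually yields little more than form feeds, so
    count alphabetic tokens of two or more letters. min_words is an
    arbitrary cut-off, not anything pdftotext defines.
    """
    words = [w for w in text.split() if len(w) > 1 and w.isalpha()]
    return len(words) >= min_words
```

Usage would be something like `has_text_layer(extract_text("27-714.pdf"))`. Without the `-`, pdftotext writes a .txt file next to the PDF and prints nothing to the terminal.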

Army · 2010-10-15 12:46:12

Interesting!!!

But it doesn't work here with the file you provided. pdftotext is included in poppler, do you have poppler from the repos or a custom build?
edit: Sorry my mistake, I just saw that this command creates a text file. Works! Great!!

By the way, if you want to view pdf files in tty, there's an amazing software collection called fbida (available in the repos) which includes fbi for images and fbgs for pdf files. Give it a try!

Last edited by Army (2010-10-15 12:48:30)

frabjous · 2010-10-15 12:59:50

Yes, Adobe Acrobat has a feature that allows you to OCR the text and insert it as a separate layer behind the image. Such PDFs are called "Searchable image PDFs". There are no doubt some other commercial software options that can do this too. The copier/scanner/multifunction device in my office actually does this automatically when you scan a text document to PDF.

Theoretically, it is even possible to create PDFs like this using free software on linux (a combination of ExactImage and Cuneiform; or WatchOCR).

See, e.g., this blog post:

Searchable PDFs with linux

and this slashdot story:

Open Source OCR that makes Searchable PDFs

My own experiments with trying to get something like this to work have not been very successful--at least the quality of the result is nowhere near what I'm getting with our work printer.

If you like pdftohtml, you should also know about pdfreflow, which takes the XML output of pdftohtml with the -xml flag and creates an html file that can be reflowed (i.e., it is smart about paragraph breaks, removes page numbers, recombines words broken through hyphenation, and so on).

pdfreflow
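
To give an idea of what a tool like that has to work with, here is a rough Python sketch that reads pdftohtml -xml style output and returns the text lines per page in reading order. This is not how pdfreflow itself works. The SAMPLE string is hand-written from memory of the format (page elements containing text elements with top/left attributes), so treat the exact attribute set as an assumption and check it against real output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of `pdftohtml -xml` output (assumed shape).
SAMPLE = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="118" left="90" width="280" height="16" font="0">graph on page one.</text>
<text top="100" left="90" width="300" height="16" font="0">First line of a para-</text>
<text top="1100" left="450" width="10" height="16" font="0">1</text>
</page>
</pdf2xml>"""


def lines_by_page(xml_text):
    """Return {page_number: [text, ...]} sorted top-to-bottom, left-to-right."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        rows = []
        for node in page.iter("text"):
            # itertext() also picks up text nested inside <b>/<i> children.
            content = "".join(node.itertext()).strip()
            if content:
                rows.append((int(node.get("top")), int(node.get("left")), content))
        rows.sort()
        pages[int(page.get("number"))] = [content for _, _, content in rows]
    return pages
```

With the positions available, dropping page numbers (the lone "1" near the bottom of the page in the sample) and rejoining hyphenated words becomes a matter of filtering and merging these rows.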

Last edited by frabjous (2010-10-15 13:03:00)

gaudencio · 2010-10-16 11:46:04

yeah, fbi is great for images, but I never use it for pdfs - it's a pain to have to sit and wait five minutes for it to render every page in the book before you can view anything, and my little netbook is way too slow for it really. I much prefer being able to view, search and extract plain text on a console the way god intended. I'm completely blown away by how many of my pdfs have the ocr'd text hidden away in them. I sat down yesterday evening and went through hundreds of them - and only a half dozen or so were plain images. I'm guessing the reason is what frabjous hinted at - that people are scanning in books in their offices, completely unaware they're being ocr'd at the same time.
What surprises me is that pdftohtml doesn't extract this text, even with the '-hidden' switch. If it did I could read ~90% of my pdf files at the console with images through w3m. But anyway, what I really need now is a little bit of sed magic to format the text file. So far I've got it to take out any extra spaces, but dealing with new lines is tricky.

If anyone wants to have a play, here's the file I'm working on formatting: (and yes I know they have a plain text version too....that's not the point)
Take for example http://www.archive.org/download/physics … 428mbp.pdf
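
On the formatting problem: a minimal Python sketch of one way to handle the newlines, offered as an alternative to sed rather than a finished solution. It assumes plain pdftotext output (not -layout) where blank lines separate paragraphs, and the function name reflow is made up here.

```python
import re


def reflow(text):
    """Join hard-wrapped lines into paragraphs separated by one blank line."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    out = []
    for para in paragraphs:
        # Rejoin words split across a line break: "para-\ngraph" -> "paragraph".
        para = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", para)
        # Any newline still inside the paragraph becomes a space.
        para = para.replace("\n", " ")
        # Collapse runs of spaces and tabs left over from the scan.
        para = re.sub(r"[ \t]+", " ", para).strip()
        if para:
            out.append(para)
    return "\n\n".join(out)
```

One known weakness: a genuinely hyphenated compound that happens to break at the hyphen (e.g. "well-" / "known") gets merged into one word, so a dictionary check would be needed to do this properly.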
