I just had a colleague return a PDF marked up with "annotations".
I first struggled to find a viewer that displays the annotations, and those that could (evince, okular) all have more dependencies than I'd want or need.
I decided there must be a lighter-weight way to extract the text of annotations from PDFs. I searched, read, and learned about PDF annotation formats, and I figured out how to extract Adobe annotations from a PDF. Only then did I realize that my colleague didn't use Adobe. The annotations were created in Mac OS X's Preview. Preview, it would seem, does not use Adobe's XFDF format for annotations; it uses some other means to embed them.
I searched the PDF file in text and hex editors, but I couldn't find anything resembling what I knew to be in some of the annotations. However Preview does it, the text must be encoded and unreadable in the "raw" file. This is in contrast, it would seem, with Adobe's annotations.
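One plausible reason the text is invisible in a hex editor: PDF writers often compress stream data with Flate (zlib), and text strings may be stored as UTF-16BE rather than plain ASCII. Whether Preview stores its annotations this way is an assumption, not something confirmed in this thread. The following is a rough sketch only: the helper names (inflate_streams, decode_pdf_string, find_contents) are made up for illustration, it handles only literal `/Contents (...)` strings with simple escapes, and it does not parse the PDF object structure properly.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# Literal-string /Contents entries; allows backslash escapes, not nested parentheses.
CONTENTS_RE = re.compile(rb"/Contents\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def inflate_streams(raw):
    """Return the decompressed body of every stream that zlib can inflate."""
    inflated = []
    for match in STREAM_RE.finditer(raw):
        try:
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # stream is not Flate-compressed (or uses another filter)
    return inflated


def decode_pdf_string(data):
    """Decode a PDF literal string; only \\( \\) and \\\\ escapes are handled."""
    data = re.sub(rb"\\([()\\])", rb"\1", data)
    if data.startswith(b"\xfe\xff"):  # byte-order mark: UTF-16BE text string
        return data[2:].decode("utf-16-be", errors="replace")
    return data.decode("latin-1")


def find_contents(raw):
    """Collect /Contents strings from the raw file and from its inflated streams."""
    chunks = [raw] + inflate_streams(raw)
    return [
        decode_pdf_string(match.group(1))
        for chunk in chunks
        for match in CONTENTS_RE.finditer(chunk)
    ]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as pdf:
        for note in find_contents(pdf.read()):
            print(note)
```

If this prints nothing for a given file, the annotations are probably stored in a form this sketch does not cover (hex strings, other filters, or encryption), and a real PDF library is the better route.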
My question, then, is two-fold. First, are there any text-annotation extractor tools that can get notes created in OS X's Preview? If not, does anyone know of any documentation outlining how Preview embeds this information? I've been googling the latter question, but all I'm getting is "where-to-click" level tutorials on how to DO annotations in Preview. I can't find any documentation of how that information is embedded into the file.
Note: evince does get the text of the annotations, but I'd prefer not to keep that installed. Evince-gtk is only ever so slightly lighter on dependencies. Also, in evince I get flooded with "sticky notes" with all the annotation text. It would take a while to move those around and close them one at a time to be able to actually read them. I'm hoping to be able to extract all annotation text and dump it into a text file.
Last edited by Trilby (2012-05-29 00:38:32)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
You aroused my curiosity and I found this. It seems to say that annotations are stored as "pdf objects" (however they are defined) and you probably need to use some API to extract them. If you have good python-fu, that would be a start?
Last edited by /dev/zero (2012-05-28 21:54:53)
Thanks. I did run across that post. I think I disregarded it because the documentation is all for the Objective-C API. I dabbled very briefly in Objective-C before running away screaming. Getting a Linux Objective-C compiler and libraries would probably be more trouble than using one of the bigger PDF apps.
However, looking through the docs gives me hope that there would be a C API out there somewhere. Now to find it.
EDIT: I think I'm on to something, but if anyone more familiar comes along let me know if I'm wasting my time reading the poppler docs.
Last edited by Trilby (2012-05-28 22:07:35)
Holy crap, it works. A couple hours of reading ugly docs and 32 lines of code later, I have exactly what I was looking for.
I'd love it if some volunteers would try this out, beat it up, and see what breaks first so I can improve it. After a little polishing I might put it up in the AUR.
Get the code from my dropbox here
The Ugly Patchwork Makefile and a very brief TODO list are also posted.
I'll put together a PKGBUILD once this is in better order for distribution and installation. I just got the darn thing to work, it's time to celebrate, not code more.
Note to Mods: as in my "report", please move this thread to Community Contributions.
Last edited by Trilby (2012-05-29 00:34:38)
Nice job.
You could mark this thread as [Solved] and point to a new thread in Community Contributions with the details of your app.
Will do, thanks.
Edit: cross-reference to the Community Contributions thread
Last edited by Trilby (2012-05-29 20:45:25)