Leela: a poppler library CLI tool

Trilby · 2012-05-29 15:45:15

The other day I was frustrated with a colleague who gave feedback on a 7MB pdf document by adding "annotations" to it and emailing it back. The equivalent of 1KB of text information (the feedback) was wrapped inaccessibly in a 7MB package sent via email.

I opened the pdf in my favorite lightweight viewers but I could not see the comments. Pdftotext, an otherwise wonderful tool, also failed to extract the annotations. Evince was successful, but it has more dependencies and features than I need or want. Also, in evince, I was confronted with dozens of overlapping "Sticky notes" with the annotations. I had to sort through them, move them, and close them one at a time to get all the information written down. I thought there must be a better way to do this - but the annotations are encoded and not visible as text in the file (opened with a text editor, a hex editor, and scanned with GNU strings). My frustrations were expressed here where I ended up making a tool to do just what I wanted.
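
One plausible reason the comments stayed hidden from a text editor and from strings, though nothing in this thread confirms it for this particular file, is that PDF writers often store objects in Flate (zlib) compressed streams. The Python sketch below is only an illustration of that effect: the note text and the annotation dictionary are made up, and `printable_runs` is a hypothetical stand-in for what GNU strings does.

```python
import re
import zlib

# Hypothetical annotation text; not taken from the PDF discussed here.
NOTE = b"Please tighten the argument in section 2; the second paragraph repeats the first."


def printable_runs(data):
    """Return runs of 4+ printable ASCII bytes, roughly what GNU strings reports."""
    return re.findall(rb"[\x20-\x7e]{4,}", data)


# A made-up text annotation dictionary, as it would look uncompressed.
raw_object = b"<< /Type /Annot /Subtype /Text /Contents (" + NOTE + b") >>"

# The same bytes after Flate compression, as they might sit inside a stream.
compressed = zlib.compress(raw_object)

# The note is easy to find in the raw object but not in the compressed bytes.
visible_before = any(NOTE in run for run in printable_runs(raw_object))
visible_after = any(NOTE in run for run in printable_runs(compressed))

# Decompressing brings the original bytes, and the note, back.
recovered = zlib.decompress(compressed)

print("visible in raw object:", visible_before)
print("visible after compression:", visible_after)
print("visible after decompression:", NOTE in recovered)
```

A library such as poppler handles this decoding internally, which is why a viewer built on it can show notes that a plain byte scan misses.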

I realized, though, that having sifted through the poppler docs and figured out this problem, I am now equipped to make my tool into something much more versatile: a CLI tool to access the various poppler functions. This tool has been named Leela, and will gradually grow into a more powerful (but never more bloated) tool.

Edit 3 Jun 2012: Infant Leela is now crawling along nicely. I have removed some previous usage notes that were here, as after a complete rewrite Leela developed in her own way. She developed in a way that was both easier to code and, I suspect, will be easier to use. See the README on github for the most up to date notes.

I will create a PKGBUILD in the next few days for anyone who is willing to try her out.

Edit 5 June 2012: The PKGBUILD is up on git, and in the AUR

EDIT: (28 May 2013): Just in time for her one-year anniversary, Leela has been given a substantial makeover. Based on tricks I've learned from other cairo/pdf tools I've been working on, I've been able to come back to Leela (the project which launched those others) to give her the benefits of my learning. I've started revising the man page, but it needs a good bit of work yet. The builtin help (`leela help`) is the most up-to-date reference. I've revised annotation extraction to output in a more xml-ish format that should be parseable by any xml tools. I've also ditched the ghostscript dependency as I've learned cairo can render PDFs, and it does it quite a bit better.

Last edited by Trilby (2013-05-28 18:54:32)

cfr · 2012-09-08 23:23:57

I'm not sure this is the right place to post but I'm having some trouble getting this to work. I'm not sure if this is because (a) leela can't do what I want; (b) leela can do what I want but I'm asking all wrong; or (c) something else.

I found this thread because I was looking for a way to extract annotations from a PDF and I found your thread on annotations created in Preview. The PDF I've got was created on OS X but not, probably, in Preview. It was probably annotated on Windows but I'm not sure what software was used. Okular can see the annotations but when I try to open the annotation/see the contents of the text box/note, the notes are all empty. acroread can both see the annotations and reveal the contents of the notes but I can't figure out a good way to dump them all to a list of comments which is what I'd like to do.

What I hoped to do was write a quick script with leela which would basically take a pdf, find the number of pages and then dump the annotations to a file, printing the page number for references as it worked through them. However, I have fallen at the first hurdle...

Following the man page:

      Print out annotations found on pages 4 and 5 of filename.pdf:
              leela page 4 5 filename.pdf | leela annot

I tried this:

$ leela page 16 /tmp/x.pdf | leela annot 
Missing required options.
annot requires 0 additional argument.

So then I tried:

$ leela page 16 /tmp/x.pdf | leela annot -

This is successful but I get no output even though I know that there is definitely an annotation on page 16.
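
The script described above (count the pages, then dump each page's annotations under a page heading) could be sketched roughly as follows in Python. This is a sketch under assumptions, not a tested recipe: it takes the page count from the `Pages:` line that poppler's `pdfinfo` prints, and it reuses the 2012 `leela page N file | leela annot -` form exactly as tried above, which may not match other leela versions. The function names are made up for illustration.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Pull the page count out of the "Pages: N" line printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def dump_annotations(pdf_path, out):
    """Write each page's annotations to `out`, preceded by a page heading."""
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True, check=True)
    for page in range(1, parse_page_count(info.stdout) + 1):
        # Assumption: same syntax as the man page example quoted above.
        extracted = subprocess.run(["leela", "page", str(page), pdf_path],
                                   capture_output=True)
        annots = subprocess.run(["leela", "annot", "-"],
                                input=extracted.stdout, capture_output=True)
        out.write("--- page %d ---\n" % page)
        out.write(annots.stdout.decode("utf-8", errors="replace"))
        if annots.returncode != 0:
            # Keep going so one bad page does not lose the rest.
            out.write("(leela exited with status %d)\n" % annots.returncode)
```

Calling `dump_annotations("/tmp/x.pdf", sys.stdout)` after `import sys` would print everything to the terminal; passing an open file instead collects the comments in one place.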

Last edited by cfr (2012-09-08 23:26:00)

Trilby · 2012-09-09 18:24:37

Two points:

1) If it is a text annotation it should work just like that. If it is anything other than a text annotation, Leela will not (yet) handle it.

2) ~~I have effectively abandoned leela. I suppose it should be removed from the AUR (have to look into that).~~ I put leela up in the AUR when some of the basic functions were working and it was doing what I needed it to do. I anticipated completing all the unfinished bits and adding appropriate error handling to make it more stable. However, I then found that I was reinventing the wheel. I foolishly never noticed that essentially all the features I was going to add to leela already exist in the poppler package (see "pacman -Ql poppler | grep bin" for the list of tools already there). Though it was somewhat appealing to be able to create my own unified interface to all of these capabilities of the poppler library, when I found that I'd be adding no functionality I lost most of the wind in the sails.

If you believe the annotation is a text annotation, you can see if it shows when you run just "leela annot <filename>". Also, if you email me a copy (at address in PKGBUILD) I'd happily take a look to see why/where leela failed. But my best advice is to check out the man pages on those tools already included in poppler.

EDIT: I am rather "fond" of leela as it was the reason I learned about a pretty cool lib API which allowed for the development of a separate project. This, paired with the annotation functions which may not be available in the default poppler package, makes leela worth continuing to develop. This development, however, will be secondary to two other projects I've prioritized (TTWM and Slider).

Last edited by Trilby (2012-09-09 20:11:17)

cfr · 2012-09-09 19:24:47

Thanks, Trilby.

Using

leela annot /tmp/x.pdf

I get two of the annotations out and then a segmentation fault. This is definitely a more successful method than my previous efforts, though.

I tried to figure out how to do this with the poppler tools as you suggested but I can't see which of them might be able to do the job. pdftotext just gives me the original text of the PDF; pdfinfo just gives me the original creation information; and the others don't seem to be relevant.

There's a tool called pdfannotextractor which I might try. It requires pdfbox from sourceforge. Unfortunately, it is java which I don't get on well with but I might give it a whirl. Should be OK if it has decent install instructions. pdfannotextractor can install it but I'm not terribly happy giving it the run of my machine.

Trilby · 2012-09-09 19:51:58

Ah yes, it seems most of the poppler capabilities are accessed through those tools, but perhaps annotations are not.

I will see if I can track down this bug and ensure leela is at least pretty good at the one feature that is unique to it.

cfr · 2012-09-09 22:59:32

That would be fantastic.

I tried getting the pdfannotextractor tool from TeX Live to work but I've not had much luck. I got the version of pdfbox it likes installed OK by finding the package README online but running it just runs me into further errors:

$ pdfannotextractor /tmp/x.pdf 
PDFAnnotExtractor 0.1l, 2012/04/18 - Copyright (c) 2008, 2011, 2012 by Heiko Oberdiek.
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on -Dswing.aatext=true
* Processing file `/tmp/x.pdf' ...
!!! Error during write of entry `null'!

java.lang.NullPointerException
        at org.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:116)
        at org.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:772)
        at pax.PDFAnnotExtractor.cmd_pagenum(Unknown Source)
        at pax.PDFAnnotExtractor.parse(Unknown Source)
        at pax.PDFAnnotExtractor.processFile(Unknown Source)
        at pax.PDFAnnotExtractor.main(Unknown Source)
Exception in thread "main" java.lang.NullPointerException
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
        at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
        at pax.PDFAnnotExtractor.parse_pages(Unknown Source)
        at pax.PDFAnnotExtractor.parse(Unknown Source)
        at pax.PDFAnnotExtractor.processFile(Unknown Source)
        at pax.PDFAnnotExtractor.main(Unknown Source)

I'm wondering if it is because Arch's java is cutting edge and the relevant version of pdfbox (0.7.3) is definitely anything but. However, java is even more of an opaque mystery to me than, say, C or python or something (and this is saying quite a lot).

So leela is still more successful than anything else I've tried bar acroread and pdfedit. pdfedit shows the comments buried deep in a hierarchy of content but is therefore even less convenient than acroread. It also doesn't indicate whether there are any annotations, where they might be or how many there might be so although the information is there, it is virtually useless as a way of extracting feedback on a paper. (I realise this isn't what pdfedit tries to be so this isn't really a criticism of the application - it just isn't much good for this particular purpose.)

EDIT: can you reproduce this or do you want the pdf? I can email it to you provided you won't circulate it publicly etc. (The actual paper is online but the annotations are comments from a colleague who might not be very pleased if I distributed his initial reactions to it for general consumption.)

Last edited by cfr (2012-09-09 23:04:39)

Trilby · 2013-05-28 18:58:31

Leela has just been given a substantial revision.

I had not done much with this project in a while - which is a shame, as tinkering with Leela is what spurred me to learn about the poppler and cairo libs which I've now used to much better effect in other projects.

It was time to come back to Leela and revise based on what I've since learned while working on those other tools (e.g., Slider). The syntax for using leela has changed a bit (see `leela help`), but leela can do all the same old tricks, plus several new ones.

I've implemented attachment extraction, but I've been unable to test it as I don't have pdfs with attachments. If anyone has any pdfs with all sorts of *stuff* (attachments, annotations of various sorts, embedded bits and bobs) I would be interested in getting some better test fodder.

Last edited by Trilby (2013-05-28 18:59:12)

deeenes · 2015-09-22 13:25:30

I am happy to report that Leela is very cool: it extracted all annotations properly, even those that Evince could not.

Trilby · 2015-09-22 14:18:14

Thanks deeenes. As a coder I always appreciate any feedback.

But now as a forum moderator here I'd encourage you to peruse the forum etiquette, particularly on bumping old threads. "Community Contribution" threads may not really go stale - and Leela is still out in the wild and functioning - but please try to only revive old threads when there is new content/discussion (like if you had questions about Leela).

manouchk · 2020-12-11 03:10:05

I usually use zotero to export my pdf annotations but I need a more agile pdf annotation exporting tool and leela seems the perfect fit for me. I tried it but I found that what I need is not just annotations but highlighted text too, and leela seems not to be doing this. I usually need annotations and the corresponding highlighted text.

Last edited by manouchk (2020-12-11 03:14:01)

Trilby · 2020-12-11 03:55:01

I've not done anything with this in seven years or so, but it was successfully extracting highlighted text previously. Do you have an example pdf with the type of annotation you need extracted?

I just tested that it still compiles and runs ok. Basic functions that I can test still work fine - but I very rarely come across pdfs with annotations, so my ability to actually test and patch this up is limited to pdfs others can provide that leela may not handle properly.

Last edited by Trilby (2020-12-12 13:38:09)

GSMiller · 2020-12-12 09:35:13

This is a great tool. I just tried it out, it is minimal and effective and does what it says it does. It is "simple and coherent"!
What language did you use? How did you structure your project?

schard · 2020-12-12 13:15:46

Trilbyfan wrote:

What language did you use? How did you structure your project?

@Trilbyfan: According to the GitHub page, it's written in C and contains one C file.
A single click on that link in the initial post (c|w)ould have answered your question.

GSMiller · 2020-12-12 21:18:01

Thank you schard, but I asked Trilby how they structured their project. That is not visible.

Trilby · 2020-12-12 21:27:56

I'm not sure what you mean by "structured" - I thought Schard's response was appropriate, but if you clarify your question perhaps I could address it. Though if it's not about this project, but just about coding in general, then this is not the right thread for it.

Further, I've tried to avoid engaging with you - you have no PM link to send my curiosity / concern to and either a moderator did not pass on to you my request for information / clarification or you did not respond to it, so I can only address this publicly: your approach to the forum, singling me out in your name /avatar / and maybe signature and commenting "+1" style posts to much of my activity is creepy to say the least.

Last edited by Trilby (2020-12-12 21:29:23)

GSMiller · 2020-12-16 18:28:37

Trilby wrote:

I'm not sure what you mean by "structured" - I thought Schard's response was appropriate, but if you clarify your question perhaps I could address it. Though if it's not about this project, but just about coding in general, then this is not the right thread for it.

I mean, how did you structure your work for coding this project, from an initial plan maybe to dividing it into blocks and such!

Trilby wrote:

Further, I've tried to avoid engaging with you - you have no PM link to send my curiosity / concern to and either a moderator did not pass on to you my request for information / clarification or you did not respond to it, so I can only address this publicly: your approach to the forum, singling me out in your name /avatar / and maybe signature and commenting "+1" style posts to much of my activity is creepy to say the least.

I am very sorry you feel creeped out by my posts.
I admit they can never be as sharp as yours, but I just try as well as I can.
I don't understand "signature and commenting "+1"?
Do you want me to change my avatar? Do you want me to change my username?

And I apologize for taking this thread "off topic", but I think if Trilby did it then it is the right way to respond.

Trilby · 2020-12-16 19:14:54

Trilbyfan wrote:

I mean, how did you structure your work for coding this project, from an initial plan maybe to dividing it into blocks and such!

No structure. I read the poppler-glib documentation and opened a text editor.
