Stapler - A python utility for manipulating PDF docs based on pypdf

Heller_Barde · 2009-08-05 20:52:49

Link to github:
http://github.com/hellerbarde/stapler/tree/master

* Dependencies *
Stapler depends only on the packages python and python-pypdf, both of
which can be found in the Arch Linux repositories.

* History *

Stapler is a pure Python replacement for PDFtk, a tool for manipulating PDF
documents from the command line. PDFtk was written in Java and natively
compiled with gcj. It was discontinued a few years ago and bitrot is
setting in (e.g. it no longer compiles on Arch Linux).
Since I used it quite a lot, I decided to look for an alternative and found
pypdf, a PDF library written in pure Python. I couldn't find a tool which
actually uses the library, so I started writing my own.
At some point I plan on providing a GUI, but the command line version will
always exist.

* License *

A simplified BSD-style license describes the terms under which Stapler is
distributed. A copy of the license is found in the file "LICENSE".

* Usage *

I am too lazy at the moment to learn how to create a proper man page so this has
to suffice.

There are 4 modes in Stapler:
- cat:
Works like the normal unix utility "cat", meaning it con_cat_enates files.
The syntax is delightfully simple:
Syntax:
stapler cat input1 [input2 input3 ...] output
Example:
stapler cat a.pdf b.pdf c.pdf output.pdf
# this would append "b.pdf" and "c.pdf" to "a.pdf" and write the whole
# thing to "output.pdf"

You can specify as many input files as you want. Stapler always concatenates
all but the last file specified and writes the result to the last file specified.
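
The "all but the last file" rule can be sketched in a few lines of Python. This is only an illustration of the rule, not Stapler's actual code: the helper name split_cat_args is made up, and the check that refuses an output file which is also an input is an assumption that the README does not state.

```python
def split_cat_args(args):
    """Split the arguments of `stapler cat` into (inputs, output).

    Hypothetical helper: the last argument is the output file and
    everything before it is an input file.
    """
    if len(args) < 2:
        raise ValueError("cat needs at least one input file and an output file")
    inputs, output = list(args[:-1]), args[-1]
    # Assumption (not from the README): never overwrite one of the inputs.
    if output in inputs:
        raise ValueError("output file %r is also listed as an input" % output)
    return inputs, output


if __name__ == "__main__":
    print(split_cat_args(["a.pdf", "b.pdf", "c.pdf", "output.pdf"]))
```

The returned input list keeps the command-line order, which is also the page order of the result.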

- split:
Splits the specified pdf files into their single pages and writes each page
into its own pdf file with this naming scheme:

${origname}p${zeropaddedpagenr}.pdf

Syntax:
stapler split input1 [input2 input3 ...]

Example for a file foobar.pdf with 20 pages:

$ stapler split foobar.pdf
$ ls
foobarp01.pdf foobarp02.pdf ... foobarp19.pdf foobarp20.pdf

Multiple files can be specified; each one is processed as if you had called
stapler on it separately.
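
The naming scheme can be sketched like this. The function name is hypothetical, and two details are assumptions drawn only from the 20-page example above: the page number is padded to the width of the highest page number (at least two digits), and the output files land in the current directory.

```python
import os


def split_output_names(path, num_pages):
    """Return the per-page file names for one input pdf (hypothetical helper)."""
    # ${origname}: the input file name without directory and extension.
    base = os.path.splitext(os.path.basename(path))[0]
    # Assumption: pad to the width of the last page number, minimum 2 digits.
    width = max(2, len(str(num_pages)))
    return [
        "%sp%s.pdf" % (base, str(page).zfill(width))
        for page in range(1, num_pages + 1)
    ]


if __name__ == "__main__":
    names = split_output_names("foobar.pdf", 20)
    print(names[0], "...", names[-1])
```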

- select/delete (called with sel and del respectively)
These are the most sophisticated modes. With select you can cherrypick pages
out of pdfs and concatenate them into a new pdf file.

Syntax:
stapler sel input1 page_or_range [page_or_range ...] [input2 page_or_range ...] output
Example:
stapler sel a.pdf 1 4-8 20-40 b.pdf 1-5 output.pdf
# this generates a pdf called output.pdf with the following pages:
# 1, 4-8, 20-40 from a.pdf, 1-5 from b.pdf in this order

What you _cannot_ do yet is omit the ranges. I will probably merge
select and cat at some point in the future so that you can specify pages and
ranges, and if you don't, it just uses the whole file.
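
One way to read the sel arguments is sketched below. It is a guess at the parsing logic, not Stapler's implementation: the names PAGE_OR_RANGE, expand and parse_selection are invented, and the sketch assumes that anything that does not look like "N" or "N-M" is a file name (so a file literally named "1" would be misread).

```python
import re

# A single page ("4") or an inclusive range ("4-8").
PAGE_OR_RANGE = re.compile(r"^(\d+)(?:-(\d+))?$")


def expand(token):
    """Turn "4" into [4] and "4-8" into [4, 5, 6, 7, 8]."""
    match = PAGE_OR_RANGE.match(token)
    if match is None:
        raise ValueError("not a page or range: %r" % token)
    first = int(match.group(1))
    last = int(match.group(2) or first)
    if first < 1 or last < first:
        raise ValueError("invalid page or range: %r" % token)
    return list(range(first, last + 1))


def parse_selection(args):
    """Split sel arguments into ([(input, [pages...]), ...], output)."""
    if len(args) < 3:
        raise ValueError("need an input file, a page or range, and an output file")
    *body, output = args
    selection = []
    for token in body:
        if PAGE_OR_RANGE.match(token):
            if not selection:
                raise ValueError("page or range given before any input file")
            selection[-1][1].extend(expand(token))
        else:
            selection.append((token, []))
    for name, pages in selection:
        if not pages:
            # Mirrors the limitation described above: ranges are mandatory.
            raise ValueError("no pages or ranges given for %r" % name)
    return selection, output


if __name__ == "__main__":
    print(parse_selection(["a.pdf", "1", "4-8", "b.pdf", "1-5", "output.pdf"]))
```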

The delete command works almost exactly like select, but inverted:
it cherrypicks the pages and ranges which you _didn't_ specify out of the
pdfs.
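
The inversion itself is small. The sketch below assumes the page numbers to delete have already been expanded from ranges into a plain list, and that pages are numbered from 1; pages_to_keep is a made-up name, not a Stapler function.

```python
def pages_to_keep(deleted, total_pages):
    """Return the 1-based page numbers that survive a delete, in order."""
    drop = set(deleted)
    return [page for page in range(1, total_pages + 1) if page not in drop]


if __name__ == "__main__":
    # Deleting "1 4-8" from a 10-page document.
    print(pages_to_keep([1, 4, 5, 6, 7, 8], 10))
```

Under this reading, deleting "1 4-8" from a 10-page file keeps pages 2, 3, 9 and 10.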

Contact me if you have questions about usage or anything, really.
2009, Philip Stark (heller <dot> barde <at> gmail <dot> com)

Last edited by Heller_Barde (2009-08-05 21:02:54)

Sakurina · 2009-08-05 22:40:55

Ooh, I can see this being pretty useful. Thanks!

Heller_Barde · 2009-08-05 23:10:46

You're welcome. I'm glad someone finds this useful.

firecat53 · 2009-08-06 03:03:28

Are you planning on eventually adding the full functionality of pdftk -- rotate, watermark, encrypt, etc? I occasionally use pdftk, and I'd like to see a replacement since it's apparently having trouble keeping up to date. Great work!

Scott

Heller_Barde · 2009-08-06 06:40:08

firecat53 wrote:

Are you planning on eventually adding the full functionality of pdftk -- rotate, watermark, encrypt, etc? I occasionally use pdftk, and I'd like to see a replacement since it's apparently having trouble keeping up to date. Great work!
Scott

If you help me with ideas for the command syntax, sure, why not. pypdf supports these things. Can you make a complete list of the things pdftk does that would be important to port over, and of how the command line syntax for these features works? I'll then get working.
EDIT: I just had a look at the pdftk man page (didn't think of that when I wrote the above...). There are some things that will not be possible with the current version of pypdf (and frankly I doubt there is going to be a new release):
update_info (because there is no way to write document properties with pypdf; you can read them just fine, though)
fill_form (similar reason: it's just not supported)

The rest will be fine. I'll rename split to burst and then I'll mimic the cat function. The others should be no problem either.
Although... I didn't find anything on rotate in the pdftk manual. How should I implement a rotate function? Rotate complete documents or single pages?

cheers
Phil

Last edited by Heller_Barde (2009-08-06 07:08:42)

firecat53 · 2009-08-06 16:12:09

From pdftk --help:

Rotate the first PDF page to 90 degrees clockwise
    pdftk in.pdf cat 1E 2-end output out.pdf

Rotate an entire PDF document to 180 degrees
    pdftk in.pdf cat 1-endS output out.pdf
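
For anyone mimicking this syntax, a page spec such as "1E" or "1-endS" could be parsed roughly as follows. This is only a sketch: parse_page_spec is a hypothetical name, and while the help text above confirms E (90 degrees clockwise) and S (180 degrees), treating N as 0 and W as 270 is an assumption by compass analogy.

```python
import re

# E and S are taken from the help text above; N and W are assumed by analogy.
ROTATIONS = {"N": 0, "E": 90, "S": 180, "W": 270}
PAGE_SPEC = re.compile(r"^(\d+|end)(?:-(\d+|end))?([NESW])?$")


def parse_page_spec(spec, num_pages):
    """Parse "1E", "2-end" or "1-endS" into (first, last, degrees_clockwise)."""
    match = PAGE_SPEC.match(spec)
    if match is None:
        raise ValueError("cannot parse page spec: %r" % spec)

    def page(token):
        # "end" stands for the last page of the document.
        return num_pages if token == "end" else int(token)

    first = page(match.group(1))
    last = page(match.group(2)) if match.group(2) else first
    rotation = ROTATIONS[match.group(3)] if match.group(3) else 0
    return first, last, rotation


if __name__ == "__main__":
    for spec in ("1E", "2-end", "1-endS"):
        print(spec, parse_page_spec(spec, 10))
```

Because the rotation is attached to a page range, the same syntax covers both single pages and whole documents.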

Is pdftk well and truly dead? Since it does have the ability to write metadata, and some other functionality that pypdf doesn't have, is it worth it to instead put time into bringing pdftk and its dependencies current so we can continue to use it, rather than reinventing the wheel? Not being a programmer, I don't necessarily have a good idea of what that would entail, so please don't take my comments too seriously!! I don't know enough about the differences of the code being python vs whatever pdftk was coded with.

Thanks for your work on this!
Scott

Heller_Barde · 2009-08-06 19:30:52

pdftk was coded in Java and compiled with gcc-gcj. pdftk hasn't been maintained for 3-4 years and hasn't compiled on Arch for as long as I can remember (1 1/2 years or more), and gcc-gcj recently fell out of the official repositories, probably because it didn't compile anymore or the maintainer decided to give it up. As far as I can tell, bitrot took its toll on pdftk, and I for one certainly don't have the expertise to rescue it. Additionally, I was always pissed to have to install a huge eclipse package as a dependency for gcc-gcj, which in turn was quite big, just to have a pdf utility. Stapler/pypdf, on the other hand, are pure Python, which cuts back on the deps quite a bit.

cheers
Phil

Last edited by Heller_Barde (2009-08-07 02:16:29)

iBertus · 2009-08-06 19:59:52

I really like this idea and think it will be handy for me since I always have tons of PDFs lying around for different papers. Now I can easily merge or split them to send just the important parts to co-workers without having to fuss with pdftk...

tinhtruong · 2009-08-08 02:42:37

My most wanted feature of pdftk to port to your new app is to be able to uncompress a pdf file to a text file and then compress it back. That feature is really useful for removing watermarks in PDFs.

Violin · 2009-08-08 03:05:36

This is great work. Please implement as many features as pyPdf supports. From my point of view, encryption and decryption are especially important. Thank you for your work. I hope it becomes a good replacement for pdftk.

Violin · 2009-08-08 03:06:18

tinhtruong wrote:

My most wanted feature of pdftk to port to your new app is to be able to uncompress a pdf file to a text file and then compress it back. That feature is really useful for removing watermarks in PDFs.

I think that's just impossible.

ninian · 2009-08-08 19:54:21

@Philip:

Many thanks for creating this very handy little utility. I had already given up on pdftk and was using jpdftweak from AUR as an alternative (albeit it also uses Java). However your low overhead Stapler is most welcome and does most things I need.

One problem using Stapler: I get a few warning messages of the form:

/usr/lib/python2.6/site-packages/pyPdf/pdf.py:52: DeprecationWarning: the sets module is deprecated
  from sets import ImmutableSet

which can be easily suppressed using 2>/dev/null, but is there a cleaner way I wonder?

Heller_Barde · 2009-08-08 20:34:42

Violin wrote:

tinhtruong wrote:
My most wanted feature of pdftk to port to your new app is to be able to uncompress a pdf file to a text file and then compress it back. That feature is really useful for removing watermarks in PDFs.
I think that's just impossible.

I just thought exactly the same thing. Converting it to pure text is possible, but the other route isn't. I'm pretty sure that pdftk doesn't do that either
(although I haven't even seen the pdf-to-text feature in the pdftk documentation, but I'll implement it).

It will be a week or two until I can work on it (exams coming up). Be patient, I'll do it.

cheers
Phil

Heller_Barde · 2009-08-08 20:50:02

ninian wrote:

@Philip:
/usr/lib/python2.6/site-packages/pyPdf/pdf.py:52: DeprecationWarning: the sets module is deprecated
  from sets import ImmutableSet
which can be easily suppressed using 2>/dev/null, but is there a cleaner way I wonder?

Unfortunately, that is out of my hands. It's a message from pypdf, which is no longer maintained (meaning I can't reach the author). I patched it so it doesn't use these deprecated things. I will file a bug report at some point. For the moment, I have a source package with the (actually very small) patch here and a prepackaged (i686) version of it here.
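
Short of patching the library, the warning can also be filtered from the calling side with Python's standard warnings module. The sketch below is not what Stapler does; the helper name is hypothetical, and noisy_import is only a stand-in for the real `import pyPdf` so that the example runs without the library installed.

```python
import warnings


def call_without_deprecation_warnings(func, *args, **kwargs):
    """Run func with DeprecationWarning silenced, then restore the filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return func(*args, **kwargs)


def noisy_import():
    # Stand-in for `import pyPdf`, which triggers the warning quoted above.
    warnings.warn("the sets module is deprecated", DeprecationWarning)
    return "module"


if __name__ == "__main__":
    print(call_without_deprecation_warnings(noisy_import))
```

The interpreter option `-W ignore::DeprecationWarning` applies the same filter to a whole run without touching any code, and unlike 2>/dev/null it leaves real error messages visible.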

cheers
Barde

ninian · 2009-08-09 18:17:39

Heller_Barde wrote:

I patched it so it doesn't use these deprecated things. I will file a bugreport at some point. for the moment, i have a source package with the (actually very small) patch here and a prepackaged (i686) version of it here.

Great stuff - thanks for those links!

Stefan Husmann · 2009-08-09 21:49:22

I made a PKGBUILD . Look into AUR.

Heller_Barde · 2009-08-10 01:19:01

Stefan Husmann wrote:

I made a PKGBUILD . Look into AUR.

Thank you very much, I didn't actually want to make an AUR pkg (yet) because it's still kind of experimental... but I appreciate it

By the way:
there is a mistake in the PKGBUILD
you copy the license to /usr/share/icences instead of /usr/share/licenses

Stefan Husmann · 2009-08-10 04:30:56

Typo fixed. I think it is always good to have a PKGBUILD, so that not-so-experienced people can also test a program.

Heller_Barde · 2009-08-10 07:20:17

Stefan Husmann wrote:

Typo fixed. I think it is always good to have a PKGBUILD, so that not-so-experienced people can also test a program.

thanks and you're probably right about that

and now for something completely different:
I filed a bug for pypdf. Let's hope the patch gets incorporated, because the author didn't respond to my email inquiry.

cheers
Barde

Last edited by Heller_Barde (2009-08-10 07:20:32)

aktinos · 2009-08-11 07:23:57

Nice job!
Another pretty good alternative is qpdf.

Heller_Barde · 2009-08-11 08:46:05

aktinos wrote:

Nice job!
Another pretty good alternative is qpdf.

oooooh, shiny *chasing butterflies*
I didn't know that existed. It looks really nice, though there seems to be no concatenation of files and that sort of thing. I will surely keep qpdf close by for when I need to (un)set those restrictions on a pdf once in a while. The option to compress and uncompress streams seems nice too. I'll have to experiment with that.

thanks for the tip aktinos

cheers
Barde

tinhtruong · 2009-08-14 02:23:39

Heller_Barde wrote:

Violin wrote:
tinhtruong wrote:
My most wanted feature of pdftk to port to your new app is to be able to uncompress a pdf file to a text file and then compress it back. That feature is really useful for removing watermarks in PDFs.
I think that's just impossible.
I just thought exactly the same thing. Converting it to pure text is possible, but the other route isn't. I'm pretty sure that pdftk doesn't do that either
(although I haven't even seen the pdf-to-text feature in the pdftk documentation, but I'll implement it).
It will be a week or two until I can work on it (exams coming up). Be patient, I'll do it.
cheers
Phil

pdftk can do that (uncompress and then compress) just fine. On the front page of pdftk there is a line that says:
Uncompress and Re-Compress Page Streams
I have used it many times on Windows to remove watermarks on PDF files (because it's broken on Arch right now).
But I'm closely watching your progress.
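
For context, page streams in PDFs are commonly compressed with the Flate filter, which is plain zlib/deflate, so the uncompress, edit, re-compress cycle can be illustrated on a single stream with Python's standard zlib module. This is a toy sketch, not how pdftk or pypdf work internally: the function names and the "(DRAFT)" marker are made up, and a real PDF would also need its stream lengths and cross-reference table updated.

```python
import zlib


def edit_compressed_stream(data, edit):
    """Uncompress a Flate stream, apply edit() to the bytes, compress again."""
    plain = zlib.decompress(data)
    return zlib.compress(edit(plain))


def drop_watermark_lines(plain, marker=b"(DRAFT)"):
    """Hypothetical edit: drop every line of the stream containing marker."""
    kept = [line for line in plain.split(b"\n") if marker not in line]
    return b"\n".join(kept)


if __name__ == "__main__":
    original = b"BT (Hello) Tj ET\nBT (DRAFT) Tj ET\nBT (World) Tj ET"
    packed = zlib.compress(original)
    cleaned = edit_compressed_stream(packed, drop_watermark_lines)
    print(zlib.decompress(cleaned).decode("ascii"))
```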

Heller_Barde · 2009-08-14 08:15:56

tinhtruong:
I think we didn't quite understand you, I apologize. I thought you meant dumping the text that gets displayed in the pdf, not its "source" code. I doubt that pypdf can do that, but I'll look into it. I now see all the advanced features of pdftk and the respective shortcomings of pypdf.

I'll do my best. It'll have to wait another week though, first i have to recover from a backup-snafu

cheers
Barde

ursa65 · 2009-08-15 21:30:14

I have a quick question. I used to use pdftk on debian to turn a bunch of single page pdf's in a directory into one pdf. Page order wasn't important. I just do it so I am sending one file instead of several and compressed files are bounced by the firewall. The syntax I used was pdftk *.pdf cat output funnies-$(date +%Y%m%d).pdf. Is there a way to read all the pdf files in a directory with stapler and turn them into one PDF?

Thanks in advance

Heller_Barde · 2009-08-16 07:58:01

ursa65 wrote:

I have a quick question. I used to use pdftk on debian to turn a bunch of single page pdf's in a directory into one pdf. Page order wasn't important. I just do it so I am sending one file instead of several and compressed files are bounced by the firewall. The syntax I used was pdftk *.pdf cat output funnies-$(date +%Y%m%d).pdf. Is there a way to read all the pdf files in a directory with stapler and turn them into one PDF?
Thanks in advance

Of course: it's the cat mode, which is detailed in the readme (and in the first post of this thread).
It works like this:

stapler cat *.pdf your_output_file.pdf

The cat mode concatenates all but the last file specified on the command line into the last file specified on the command line.
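
One thing to watch with the glob: the shell expands *.pdf before stapler ever sees it, so an output file that already exists in the same directory would be passed as an input as well. A small wrapper can build the argument list explicitly. The sketch below is an illustration only; build_cat_command is a made-up name, and it just assembles the command rather than running it.

```python
import glob
import os


def build_cat_command(directory, output_name):
    """Build a `stapler cat` argument list for every pdf in directory."""
    output = os.path.join(directory, output_name)
    # glob.glob returns files in arbitrary order, so sort for a stable result.
    inputs = sorted(glob.glob(os.path.join(directory, "*.pdf")))
    # Leave out a previous output file so it is not merged into itself.
    inputs = [p for p in inputs if os.path.abspath(p) != os.path.abspath(output)]
    if not inputs:
        raise ValueError("no pdf files found in %r" % directory)
    return ["stapler", "cat"] + inputs + [output]


if __name__ == "__main__":
    try:
        print(build_cat_command(".", "your_output_file.pdf"))
    except ValueError as error:
        print(error)
```

The resulting list can be handed to subprocess.call, or printed and checked first.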

EDIT: @tinhtruong:
I saw that there is no uncompress action in pypdf, but I think it needs to do it anyway, because pypdf can do the watermark thingie, thus it needs to uncompress the stuff, concatenate it, and compress it again. I'll take a look at the library, but I can't promise anything.

Cheerio
Barde

Last edited by Heller_Barde (2009-08-16 08:00:22)
