Stapler - A python utility for manipulating PDF docs based on pypdf

robmaloy · 2009-08-18 23:43:41

Heller_Barde wrote:

ursa65 wrote:
I have a quick question. I used to use pdftk on debian to turn a bunch of single page pdf's in a directory into one pdf. Page order wasn't important. I just do it so I am sending one file instead of several and compressed files are bounced by the firewall. The syntax I used was pdftk *.pdf cat output funnies-$(date +%Y%m%d).pdf. Is there a way to read all the pdf files in a directory with stapler and turn them into one PDF?
Thanks in advance
of course, it's the cat option and it's detailed in the readme (and the first post of this thread)
it works like this:
stapler cat *.pdf your_output_file.pdf
the cat option concatenates all but the last file specified on the command line into the last file specified on the command line
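To make that convention concrete, here is a minimal Python 3 sketch of how the arguments divide. It is not stapler's actual code, and `split_cat_args` is a made-up helper name used only for illustration:

```python
def split_cat_args(args):
    """Split ['a.pdf', 'b.pdf', 'out.pdf'] into (inputs, output).

    Hypothetical helper: everything but the last argument is an input,
    the last argument is the file to write, as with cp or mv.
    """
    if len(args) < 2:
        raise ValueError("need at least one input file and one output file")
    return list(args[:-1]), args[-1]


# The shell expands *.pdf before the program starts, so the program
# only ever sees a plain list of names.
inputs, output = split_cat_args(["a.pdf", "b.pdf", "c.pdf", "funnies.pdf"])
print(inputs, "->", output)
```

One consequence of the shell doing the expansion: if the output file already exists in the directory, `*.pdf` will match it as well.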
EDIT: @tinhtruong:
I saw that there is no uncompress action in pypdf, but I think it needs to do it anyway, because pypdf can do the watermark thingie, thus it needs to uncompress the stuff, concatenate it, and compress it again. I'll take a look at the library, but I can't promise anything.
Cheerio
Barde

maybe having an -o|--output $FILE option and printing to stdout by default would prevent that kind of confusion

edit: and maybe split (or all operations) could read from stdin if no input file specified

Last edited by robmaloy (2009-08-18 23:46:19)

Heller_Barde · 2009-08-19 07:29:02

robmaloy wrote:

maybe having an -o|--output $FILE option and printing to stdout by default would prevent that kind of confusion

edit: and maybe split (or all operations) could read from stdin if no input file specified

I don't want to type a separate command line option for output, I follow the same scheme like "mv" and "cp" where you can specify several "froms" and the last is the "to".
And frankly, i fail to see the point for reading from stdin or writing to stdout when dealing with pdfs.
(i understand of course the need for this in usual unix utils, but they deal usually with ascii stuff or at least encoded binary in base64 (which is ascii again) and pdf is a binary format, thus it's not very clever to write it to stdout, it can mess up your console)
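For reference, the console problem can be avoided by checking whether the output is a terminal before writing. The following is only a Python 3 sketch under that assumption; `write_pdf_bytes` is a hypothetical helper and not part of stapler:

```python
import io
import sys


def write_pdf_bytes(data, stream=None):
    """Write raw PDF bytes to a binary stream, refusing interactive terminals.

    `stream` defaults to sys.stdout.buffer (the binary layer of stdout in
    Python 3). This function is a hypothetical helper, not stapler code.
    """
    if stream is None:
        stream = sys.stdout.buffer
    if stream.isatty():
        raise RuntimeError("refusing to write binary PDF data to a terminal")
    stream.write(data)
    stream.flush()
    return len(data)


# An in-memory buffer stands in for a pipe or a redirected file here.
sink = io.BytesIO()
write_pdf_bytes(b"%PDF-1.4\n%%EOF\n", sink)
print(len(sink.getvalue()), "bytes written")
```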

I hope you understand.

cheers
Barde

PS: or if I'm wrong and everybody pipes pdfs around all day, i can of course still implement it, I just reasoned for my decision not to initially implement it.

kyle_white · 2009-09-15 04:19:52

Very useful program. I'm glad it's been placed on AUR too, or I wouldn't have known about it.
I updated the program to work with some nautilus scripts I made to join, burst, and collate pages. Here's the features I've added so far, and I'm thinking of adding others

allow for use of end as page number. eg "stapler sel in.pdf 10-end out.pdf" would select all but pages 1-9
parsing reverse ranges. eg "stapler sel in.pdf end-1 out.pdf" would reverse the order of the pages
burst mode, with printf formatted output name. eg "stapler burst in.pdf out%02d.pdf" would return out01.pdf, out02.pdf, ...
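
The burst naming in the last item amounts to filling a printf-style pattern once per page. A minimal Python 3 sketch, with `burst_names` as a hypothetical helper rather than the code in the diff:

```python
def burst_names(pattern, page_count):
    """Return one output name per page by filling a printf-style pattern.

    `pattern` must contain exactly one integer conversion such as %02d.
    Hypothetical helper for illustration only.
    """
    return [pattern % number for number in range(1, page_count + 1)]


print(burst_names("out%02d.pdf", 3))
```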

I'm not sure if I'll keep burst mode or try to combine it with split (the final filename could be ambiguous whether an output name or an input file). If you'd like the diff to add to the official let me know. I'll probably be changing up some more but am trying to keep the style similar.

Last edited by kyle_white (2009-09-15 04:21:53)

kyle_white · 2009-09-16 09:18:17

I've rewritten almost all of the code into a new program now, and I don't know if it would be possible to merge anymore.
(If you'd like to merge or use parts, feel free to ask)

I'm calling it pydftk for the moment, as I'm using it in my scripts in place of pdftk
Here's what I've changed from the stapler functions

# code restructure
# added mode col: collate pages from selected files/ranges (file1p1, file2p1, file1p2, file2p2...)
# replace filename with "-" to use stdin or stdout for piping. selects based on context
# can use page ranges for all modes (including cat and burst) (this makes cat function the same as sel)
# if no page range is given for a file, all pages are assumed (again, sel=cat)
# added burst mode, originally used this for formatted output, but now it works the same as split
# burst/split mode now has optional printf formatted output name:
# %n filename, %e extension, %d printf format iterator, %p printf formatted pagenumber (use %0p for stapler formatted)
# increased page range options
# "end" is replaced with the number of last page
# reverse ordering allowed, e.g. 3-1 selects the pages 3, 2, 1
# append "e" for even, "o" for odd, e.g. 5-1o selects pages 5, 3, 1
# verbose mode disabled by default (to avoid issues with using stdout), add -v to commandline to enable
# lots of bugs have been added and well hidden?
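
The page range rules in that list ("end", reverse ordering, "e" and "o") can be sketched in a few lines. This is a Python 3 illustration written from the description, not the pydftk source; `expand_range` is an assumed name:

```python
import re


def expand_range(spec, last_page):
    """Expand a spec like '3', '1-5', 'end-1' or '5-1o' into page numbers.

    Sketch of the rules listed above, not the pydftk source: 'end' stands
    for last_page, a descending range counts down, and a trailing 'e' or
    'o' keeps only even or odd page numbers.
    """
    spec = spec.replace("end", str(last_page))
    match = re.fullmatch(r"(\d+)(?:-(\d+))?([eo]?)", spec)
    if match is None:
        raise ValueError("bad page range: %r" % spec)
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else first
    step = 1 if first <= last else -1
    pages = list(range(first, last + step, step))
    if match.group(3) == "e":
        pages = [p for p in pages if p % 2 == 0]
    elif match.group(3) == "o":
        pages = [p for p in pages if p % 2 == 1]
    return pages


for example in ("3", "3-1", "5-1o", "end-1"):
    print(example, expand_range(example, last_page=6))
```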

Usage: pydftk mode arguments [output]
modes:
sel: <inputfile> <pagenr>|<pagerange> ... (output needed)
cat: same as sel
del: <inputfile> <pagenr>|<pagerange> ... (output needed)
col: <inputfile> <pagenr>|<pagerange> ... (output needed)
split: <inputfile> <pagenr>|<pagerange> ... (formatted output optional)
burst: same as split
-h: this help
-v: verbose
arguments:
can be file names or - for <stdin> <stdout> (determined by context)
file names can be followed by page numbers and ranges. e.g. 1 2 4-6
keyword end can be used for the last page
page ranges can end with e for even, o for odd
formatting (for burst/split):
%n: is replaced by current filename being processed (no ext)
%e: is replaced by current file's extension being processed
%p: is replaced by stapler formatted page number

Heller_Barde · 2009-09-16 15:02:59

the forum just eated my message I already wrote. silly thing

anyway
@kyle_white:
thanks for your effort. maybe i can merge parts of your program into stapler, but I doubt it. I was going to rewrite it anyway, since the current form is not very well extendable. this first version was more of a proof of concept anyway (and i needed a pdf cat like _NOW_ )

I am not sure whether i correctly understand what the "col" mode does. as far as i understand it's an exact replica of the "sel" mode, but then again, i didn't look at your code yet (no time, hopefully this weekend i can spare some). maybe you can enlighten me on that part.

cheers Barde

PS:
fun fact: I thought about naming my utility pydftk too, but then decided not to.

kyle_white · 2009-09-17 08:34:54

col is for collating

if input is...
$pydftk col 1.pdf 1-4 2.pdf 1-4 out.pdf
...then out.pdf will have pages ordered like:
1.pdf page 1
2.pdf page 1
1.pdf page 2
2.pdf page 2
1.pdf page 3
2.pdf page 3
1.pdf page 4
2.pdf page 4

it could be duplicated with...
$pydftk cat 1.pdf 1 2.pdf 1 1.pdf 2 ... etc
...but this wouldn't allow for collating all pages in a file
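
The ordering can be pictured as a round-robin over the selected pages of each file. A small Python 3 sketch, using plain labels instead of real PDF pages; the `collate` function here is illustrative and not the one in pydftk:

```python
from itertools import zip_longest


def collate(*page_lists):
    """Interleave several page lists: first of each, then second of each...

    Lists may differ in length; a list that runs out is simply skipped.
    Illustrative sketch only, labels stand in for real PDF pages.
    """
    missing = object()  # sentinel, so that None could still be a valid item
    ordered = []
    for group in zip_longest(*page_lists, fillvalue=missing):
        ordered.extend(item for item in group if item is not missing)
    return ordered


first = ["1.pdf p%d" % n for n in range(1, 5)]
second = ["2.pdf p%d" % n for n in range(1, 5)]
for label in collate(first, second):
    print(label)
```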

My overall goal was to build a program that could take manually scanned 2 sided documents and turn them into one document, I was able to do it with stapler but it took a dash of BashMagic(tm) and lots of temporary files
My process is to scan a stack of pages, flip the entire stack and scan the back side (with the pages in reverse order of course), and then combine them
I should be able to do the entire combining with one command now...
$pydftk col front.pdf back.pdf end-1 combined.pdf
...or...
$pydftk cat back.pdf end-1 - | pydftk col front.pdf - combined.pdf
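
To check the page order this should produce, here is a short Python 3 sketch. It assumes every sheet was scanned on both sides, so front.pdf and back.pdf have the same number of pages; `duplex_order` is a made-up name:

```python
def duplex_order(sheet_count):
    """Return (source, page) pairs in reading order for a two-pass scan.

    Assumes front.pdf holds the front sides in order 1..N and back.pdf
    holds the back sides in reverse sheet order, as described above.
    Hypothetical helper; the file names are just labels.
    """
    fronts = range(1, sheet_count + 1)   # front.pdf pages 1..N
    backs = range(sheet_count, 0, -1)    # back.pdf pages N..1, i.e. "end-1"
    order = []
    for front_page, back_page in zip(fronts, backs):
        order.append(("front.pdf", front_page))
        order.append(("back.pdf", back_page))
    return order


print(duplex_order(3))
```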

I'm going to sleep now and test it tomorrow (it took several hours to work out some problems between pyPdf and stdin/stdout)
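
The usual sticking point with stdin is that a pipe cannot seek, while a PDF reader has to jump around the file (the pointer to the cross-reference table sits at the end). One common workaround is to read everything into memory first. A Python 3 sketch of that idea, with `seekable_copy` as an assumed helper name; it is not claimed to be what pydftk does internally:

```python
import io
import sys


def seekable_copy(stream=None):
    """Read a binary stream to the end and return a seekable in-memory copy.

    `seekable_copy` is a hypothetical name; `stream` defaults to the
    binary layer of stdin. The whole input is held in memory.
    """
    if stream is None:
        stream = sys.stdin.buffer
    return io.BytesIO(stream.read())


# An in-memory stream stands in for the pipe in this demonstration.
piped = io.BytesIO(b"%PDF-1.4 ... %%EOF")
buffered = seekable_copy(piped)
buffered.seek(-5, io.SEEK_END)   # jump to the trailer marker at the end
print(buffered.read())
```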

Heller_Barde · 2009-09-17 19:45:37

aaahh, i see very clever.

kyle_white · 2009-09-20 04:39:22

I think I've put everything into pydftk I can now:

# added col and zip: collate pages from selection (file1p1, file2p1, file1p2, file2p2...)
# replace filename with "-" to use stdin or stdout. selects based on context
# can use page ranges for all modes (including cat and burst) (this makes cat function the same as sel)
# if no page range is given for a file, all pages are assumed
# increased page range options
# "end" is replaced with the number of last page
# reverse ordering allowed, e.g. 3-1 returns the pages 3, 2, 1
# append "e" for even, "o" for odd, e.g. 1-5o returns 1, 3, 5
# append x# for each # page, e.g. 1-7x3 returns 1,4,7
# append :(command) to rotate pages, command can be r,l,u,d,cw#,ccw#. cw and ccw can take # in multiple of 90 degrees
# all modes now have optional printf formatted output name:
# split/burst: %n filename, %e extension, %i printf formatted iterator, %p printf formatted pagenumber (use %0p for stapler formatted)
# all others: %n or %1n for first input filename, %2n for second... same for %e extensions, no %p or %i allowed
# verbose mode disabled by default, add -v to commandline to enable, -q to disable, forced disabled if writing to stdout
# -p=(password) following an input file to decrypt file with (password)
# -u=(password) set user password for output
# -o=(password) set owner password for output
# lots of bugs have been added and well hidden?
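
As an illustration of the x# and :(command) suffixes in that list, here is a Python 3 sketch that splits a page spec into its range, step and rotation angle. It is written from the description above and is not the pydftk parser; `parse_modifiers` and `NAMED_ANGLES` are assumed names:

```python
import re

# Fixed rotations named in the list above, in degrees clockwise.
NAMED_ANGLES = {"u": 0, "r": 90, "d": 180, "l": -90}


def parse_modifiers(spec):
    """Split '1-7x3:d' into (range_text, step, angle_in_degrees).

    Covers only the x# and :ROTATION suffixes; illustrative, not the
    pydftk parser. A missing step means 1, a missing rotation means 0.
    """
    match = re.fullmatch(r"([^x:]+)(?:x(\d+))?(?::(r|l|u|d|ccw\d+|cw\d+))?", spec)
    if match is None:
        raise ValueError("bad page spec: %r" % spec)
    range_text, step, rotation = match.groups()
    step = int(step) if step else 1
    if step < 1:
        raise ValueError("step must be at least 1")
    if rotation is None:
        angle = 0
    elif rotation in NAMED_ANGLES:
        angle = NAMED_ANGLES[rotation]
    else:
        sign = -1 if rotation.startswith("ccw") else 1
        angle = sign * int(rotation.lstrip("cw"))
        if angle % 90:
            raise ValueError("rotation must be a multiple of 90 degrees")
    return range_text, step, angle


range_text, step, angle = parse_modifiers("1-7x3:d")
print(range_text, step, angle)
print(list(range(1, 8))[::step])   # every 3rd page of 1..7
```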
---------------------------------------------------
here's the help output for an overview:
---------------------------------------------------
Usage: pydftk MODE [-u=userpw] [-o=ownerpw] ARGUMENTS [OUTPUT]
modes, and required arguments:
sel,cat inputfile <pagerange> [inputfile2 <pagerange2>]..(output needed)
del inputfile <pagerange> [inputfile2 <pagerange2>]..(output needed)
col,zip inputfile <pagerange> [inputfile2 <pagerange2>]..(output needed)
split,burst inputfile <pagerange> [inputfile2 <pagerange2>]..(output opt)
-h,--help this help
-v verbose
-q quiet, disable verbose (forced when using <stdout>)
**note: - is replaced by <stdin> or <stdout> (determined by context)
<stdout> is disabled for split/burst

encrypt/decrypt:
follow filename with -p=(pw) to decrypt using (pw)
-u=(pw) in arguments will set user password to (pw) for output
-o=(pw) in arguments will set owner password to (pw) for output

pagerange formatting:
range[SELECTION][:ROTATION]
selection can be e for even, o for odd, x# for every #th page
rotation can be r for right, l for left, u for up, d for down (180 deg)
can also be cw# for clockwise, ccw# for counterclockwise
cw and ccw # must be a multiple of 90 degrees
pagerange examples:
3 process page 3
1-5 process pages 1,2,3,4,5
5-1 process pages 5,4,3,2,1
end-1 process all pages in reverse order (end replaced by last page)
1-5o:cw90 process odd pages 1,3,5 and rotates each 90 degrees right
1-7x3:d process every 3rd page 1,4,7 and rotates each 180 degrees

output formatting:
sel/cat/del/col/zip:
%n %n or %1n is first file's path/name, %2n is second...
%e %e or %1e is first file's extension, %2e is second...
split/burst only:
%n is replaced by current file's path/name
%e is replaced by current file's extension
%p is replaced by page number, can be printf formatted
%0p auto selects # of digits based on # pages in current file
%i is replaced by index number, can be printf formatted
output examples:
sel/cat/del/col/zip:
$pydftk cat f1.e1 f2.e2 f3.e3 %3n-o.%e
--would return a file named f3-o.e1
$pydftk cat input.pdf -
--would write input.pdf to <stdout>
split/burst only:
$pydftk burst input.pdf 5 %n.p%04p.%e
--would return a file named input.p0005.pdf
$pydftk burst input.pdf 5-6 %n.%03i.%e
--would return input.001.pdf for page 5, input.002.pdf for page 6
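
The %n, %e and %p substitutions for split/burst can be approximated as follows. This is a Python 3 sketch based only on the help text above; `burst_name` is a hypothetical helper, it ignores %i, and it is not the code pydftk uses:

```python
import os
import re


def burst_name(pattern, source, page, page_count):
    """Fill %n, %e and %p (optionally %0Np or %0p) in a burst output pattern.

    Sketch of the substitutions in the help text; not the pydftk code and
    it leaves %i alone.
    """
    stem, ext = os.path.splitext(source)
    auto_width = len(str(page_count))

    def fill_page(match):
        width = match.group(1)
        if width == "0":              # %0p: pick the width automatically
            width = "0%d" % auto_width
        return ("%" + width + "d") % page

    name = re.sub(r"%(\d*)p", fill_page, pattern)
    return name.replace("%n", stem).replace("%e", ext.lstrip("."))


print(burst_name("%n.p%04p.%e", "input.pdf", 5, 20))
print(burst_name("%np%0p.%e", "input.pdf", 7, 120))
```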

ninian · 2009-09-21 15:13:18

kyle_white wrote:

I think I've put everything into pydftk I can now

Looks good - will you be sharing the program?

kyle_white · 2009-09-22 03:14:30

Update 10/22/09 - fixed bug in passing single page selections, thanx achecht

I don't know how to get it onto git. I could just paste the whole (still a bit messy) source here...

#!/usr/bin/python
# -*- coding: utf-8 -*-

# pydftk
# Copyright 2009 W. Kyle White
# Rewrite of stapler; Copyright 2009 Philip Stark
#   original stapler license is found in the file "LICENSE"
#   if this file is missing, you can find a copy at
#   http://stuff.codechaos.ch/stapler_license
# Features added:
#   code cleanup and restructure
#   added col and zip: collate pages from selection (file1p1, file2p1, file1p2, file2p2...)
#   replace filename with "-" to use stdin or stdout. selects based on context
#   can use page ranges for all modes (including cat and burst) (this makes cat function the same as sel)
#   if no page range is given for a file, all pages are assumed
#   increased page range options
#      "end" is replaced with the number of last page
#      reverse ordering allowed, e.g. 3-1 returns the pages 3, 2, 1
#      append "e" for even, "o" for odd, e.g. 1-5o returns 1, 3, 5
#      append x# for each # page, e.g. 1-7x3 returns 1,4,7
#      append :(command) to rotate pages, command can be r,l,u,d,cw#,ccw#. cw and ccw can take # in multiple of 90 degrees
#   all modes now have optional printf formatted output name:
#      split/burst: %n filename, %e extension, %i printf formatted iterator, %p printf formatted pagenumber (use %0p for stapler formatted)
#       all others: %n or %1n for first input filename, %2n for second... same for %e extensions, no %p or %i allowed
#   verbose mode disabled by default, add -v to commandline to enable, -q to disable
#   -p=(password) following an input file to decrypt file with (password)
#   -u=(password) set user password for output
#   -o=(password) set owner password for output
#   lots of bugs have been added and well hidden? :)
#
# Todo:
#   add N-up to sel/cat
#   possibly remove duplicate modes?
#   add watermark/merge page
#   add attach/extract files
#   add metadata reading/manipulation
#   add render PDF<>Image
#   add extract embedded object
#
# --Feature comparison to pdftk: *added, +todo
#  *Merge PDF Documents
#  *Split PDF Pages into a New Document
#  *Rotate PDF Pages or Documents
#  *Decrypt Input as Necessary (Password Required)
#  *Encrypt Output as Desired
#   Fill PDF Forms with FDF Data or XFDF Data and/or Flatten Forms
#  +Apply a Background Watermark or a Foreground Stamp
#   Report on PDF Metrics such as Metadata, Bookmarks, and Page Labels
#  +Update PDF Metadata
#  +Attach Files to PDF Pages or the PDF Document
#  +Unpack PDF Attachments
#  *Burst a PDF Document into Single Pages
#  +Uncompress and Re-Compress Page Streams
#   Repair Corrupted PDF (Where Possible)






import math
from pyPdf import PdfFileWriter, PdfFileReader
import sys
import re
import os.path
from StringIO import StringIO

#######################################################################################################
# bufferstdin and bufferstdout are needed to add functions to stdin and stdout to be usable for pyPdf #
# sys.stdout.write(self.getvalue()) can be used in close() instead of sys.stdout.write in write()     #
#######################################################################################################
class bufferstdin(StringIO):
    name = "<stdin>"
    def __init__(self):
        StringIO.__init__(self, sys.stdin.read())
class bufferstdout(StringIO):
    name = "<stdout>"
    def write(self,buffer):
        sys.stdout.write(buffer)
        StringIO.write(self,buffer)
    def close(self):
#        sys.stdout.write(self.getvalue())
        sys.stdout.flush()
        sys.stdout.close()
        StringIO.close(self)
###### end class definitions ######
###################################


##################################################
# Handle all command line arguments and dispatch #
##################################################
verbose = False
ownerpassword=""
userpassword=""


def parse_args(argv):
    global verbose
    global ownerpassword
    global userpassword

    mode = ""
    # Possible modes are:
    # cat,sel     - concatenate multiple PDFs
    # del         - delete single pages or ranges (opposite of sel)
    # col,zip     - collate PDFs
    # split,burst - split a pdf into single pages
    modes = ["cat", "col", "burst", "split", "del", "sel", "zip"]

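    # check for the help flag first, otherwise "pydftk -h" is rejected as too short
    if len(argv) > 1 and argv[1] in ["-h", "--help"]:
        halp()
        sys.exit(0)

    if (len(argv) < 3):
        sys.stderr.write( "too few arguments\n" )
        halp()
        sys.exit(1)
    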
    

    mode = argv[1]
    files = argv[2:]


    if (not mode in modes):
        sys.stderr.write( "please input a valid mode\n" )
        halp()
        sys.exit(1)

    if ("-v" in files):
        verbose = True
        files.remove("-v")
    if ("-q" in files):
        verbose = False
        files.remove("-q") 

    ###### disable verbose if outputting to <stdout> ######
    if (mode not in ["burst", "split"] and files[-1] == "-"):
        verbose = False
        
    if verbose: print "mode: "+mode

    #burst/split does not require an output argument, if none is given then add the default
    if (mode in ["burst", "split"]) and not "%" in files[-1]:
        files += ["%np%0p.%e"]

    infilelist=[]
    pages=[]
    for argument in files[:-1]:

        if re.match('.*?\.pdf', argument) or argument =="-":
            infilelist.append(openpdf(argument,"in"))
            pages.append([])
        elif(re.match("^-p=.*", argument)):
            password=argument.replace("-p=","")
            if verbose: print "Decrypting " + infilelist[-1].stream.name + " with password \"" + password + "\""
            infilelist[-1].decrypt(password)
        elif(re.match("^-u=.*", argument)):
            password=argument.replace("-u=","")
            if verbose: print "!!Setting USER password to: " + "\"" + password + "\""
            userpassword=password
        elif(re.match("^-o=.*", argument)):
            password=argument.replace("-o=","")
            if verbose: print "!!Setting OWNER password to: " + "\"" + password + "\""
            ownerpassword=password            
        else:
            pages[len(infilelist)-1] += [argument]

    if (mode in ["burst", "split"]):
        burst(zip(infilelist, pages), files[-1])
    elif (mode in ["cat","sel"]):
        concatenate(zip(infilelist, pages), files[-1], inverse=False)
    elif (mode == "del"):
        concatenate(zip(infilelist, pages), files[-1], inverse=True)
    elif (mode in ["col","zip"]):
        collate(zip(infilelist, pages), files[-1])
###### end function parse_args ######
#####################################

#################################
# functions to process input    #
#################################
###### getpagequeue accepts a list of string arguments, and a list of tuple replacements
###### returns a list of tuples (page number, rotation angle) 1 per page
def getpagequeue(arguments, replacements):
    pagequeue = []
    if not arguments: arguments = ["1-end"]
    for argument in arguments:
        for i in range(len(replacements)):
            argument=argument.replace(replacements[i][0], replacements[i][1])

        pagerange=re.search('^[0-9]+(-[0-9]+)?', argument).group()
        modifiers=argument.replace(pagerange, "")

        select = (lambda x: True)
        selectarg=re.search('[eo]|(x[0-9]*)',modifiers)
        if selectarg:
            selectarg=selectarg.group()
            modifiers=modifiers.replace(selectarg,"")
            selecttests=dict([("e",(lambda x: not int(x)%2)), ("o",(lambda x: int(x)%2) )])
            if selectarg in selecttests.keys():
                select = selecttests[selectarg]
            else:
                skipcount=int(selectarg.replace("x",""))
                select = (lambda x, skipcount=skipcount: not (x-begin)%skipcount)

        angle = 0
        rotateflag=':'
        rotatearg=re.search(rotateflag+'[rlud]',modifiers)
        if rotatearg:
            if verbose: print rotatearg.group() # debug output only; an unconditional print would corrupt a PDF written to <stdout>
            rotatearg=rotatearg.group().replace(rotateflag,"")
            rotateangle=dict([("u",0), ("d",180), ("l",-90), ("r",90)])
            if rotatearg in rotateangle.keys():
                angle=rotateangle[rotatearg]
                modifiers=modifiers.replace(rotateflag+rotatearg,"")

        rotatearg=re.search(rotateflag+'(c?cw[0-9]*)',modifiers)
        if rotatearg:
            rotatearg=rotatearg.group().replace(rotateflag,"")
            if "ccw" in rotatearg:
                angle=-int(rotatearg.replace("ccw",""))
                modifiers=modifiers.replace(rotateflag+rotatearg,"") #we know that rotatearg is all of ccw command
            elif "cw" in rotatearg:                
                angle=int(rotatearg.replace("cw",""))
                modifiers=modifiers.replace(rotateflag+rotatearg,"") #we know that rotatearg is all of cw command
        
        if re.search('^[0-9]+-[0-9]+$',pagerange):
            ###set up tests for even, odd, or none

            (begin,sep,end) = pagerange.partition("-")
            begin=int(begin)
            end=int(end)
            if begin < end:
                pagelist = filter(select,range(begin, end+1))
            else:
                pagelist = filter(select,range(end, begin+1))
                pagelist.reverse()
            for pageout in pagelist:
                if (select(pageout)):
                    pagequeue += [(pageout,angle)]
        elif re.search('^[0-9]+$',pagerange):
            pagequeue += [(int(pagerange),angle)]
        else:
            sys.stderr.write( "error: misformat for pagelist at \"" + argument + "\"... exiting now\n" )
            sys.exit()
    return pagequeue
###### end function getpagequeue ######

###### openpdf accepts a filename string (or '-'), a string "in", "r", "rb" or "out", "w", "wb"
###### returns a PdfFileReader or file, directs to <stdin> or <stdout> for '-'
def openpdf(infilename, inout):
    global verbose
    pdf = []
    buf = []
    if inout in ["in","r","rb"]:
        if infilename == "-":
            buf=bufferstdin()
        elif os.path.exists(infilename):
            buf=file(infilename, "rb")
        else:
            sys.stderr.write( "error: " + infilename + " does not exist... exiting now" )
            sys.exit(2)
        if verbose: print "**Opening: " + infilename
        try:
            pdf = PdfFileReader(buf)
        except:
            sys.stderr.write( "Error opening " + buf.name)
            sys.exit(2) # pdf file is no pdf file...

    elif inout in ["out","w","wb"]:
        if infilename == "-":
            pdf = bufferstdout()
        else:
            if not os.path.exists(infilename):
                pdf = file(infilename, "wb")
            else:
                sys.stderr.write( "error: "+ infilename +" already exists... exiting now\n")
                sys.exit(2)
    return pdf
###### end function openpdf ######
##################################

##################################################
# functions which process input and output files #
##################################################

###### burst accepts a list of (PdfFileReader, [string pageselect]), string outfileformat
###### returns nothing, splits pdf into selected pages and writes to pdfs
def burst(inpdfargs, outfileformat):
    global verbose
    global ownerpassword
    global userpassword

    i=0
    j=0
    for pdf, args in inpdfargs:

        numpages=pdf.getNumPages()
        replacements=[["end",str(numpages)]]
        pagequeue = getpagequeue( args, replacements)
        if verbose: print "**Processing: " + pdf.stream.name

        log10pages = len(str(numpages)) # digits needed for the largest page number (ceil(log10) is one short at 10, 100, ...)
        (name, dot, ext) = pdf.stream.name.rpartition(".")

        for pageitem in pagequeue:
            page, angle = pageitem
            if ( (page <= numpages) and (page > 0)):
                output = PdfFileWriter()
                if userpassword or ownerpassword:
                    output.encrypt(userpassword,ownerpassword)
                output.addPage(pdf.getPage(page-1).rotateClockwise(angle))


                ###### filename formatting ######
                outfilename=outfileformat
                replacementlist=[("%0p","%0" + str(log10pages) + "p") , ("%n",name) , ("%e",ext)]
                applyreplacement=(lambda str, pair: str.replace(pair[0],pair[1]))
                applyreplacementlist=(lambda str, list=replacementlist, func=applyreplacement: reduce(func, [str]+list))

                outfilename=applyreplacementlist(outfilename, replacementlist)
                
                namearg = re.search('%[0-9]+n',outfilename)
                if namearg:
                    namearg=namearg.group()
                    filenum=int(namearg.strip("%n"))
                    (name, dot, ext) = inpdfargs[filenum-1][0].stream.name.rpartition(".")
                    outfilename = outfilename.replace(namearg, name)

                extarg = re.search('%[0-9]+e',outfilename)
                if extarg:
                    extarg= extarg.group()
                    filenum=int(extarg.strip("%e"))
                    (name, dot, ext) = inpdfargs[filenum-1][0].stream.name.rpartition(".")
                    outfilename = outfilename.replace(extarg, ext)

                if re.search('%[0-9]*i',outfilename):
                    outfilename = outfilename % (j+1)

                pmatch = re.search('%[0-9]*p',outfilename);
                if pmatch:
                    outfilename = re.sub('%[0-9]*p', (lambda smatch: smatch.group().replace("p", "i")), outfilename) % page

                while os.path.exists(outfilename):
                    if verbose: print ('error: '+outfilename+' already exists... appending number')
                    (oname, odot, oext) = outfilename.rpartition(".")
                    outfilename=oname+"-1"+odot+oext
                ###### end filename formatting ######

                outputStream = openpdf(outfilename, "out")
                if verbose: print "--write p" + str(page) + ">>" + outputStream.name
                output.write(outputStream)
                outputStream.close()
            else:
                sys.stderr.write( "Page " + str(page) + " is not found in file " + pdf.stream.name + "\n" )
                sys.exit(3) #wrong pages or ranges            
            j=j+1
        i=i+1
    if verbose: print (str(j)+" pages in "+str(i)+" files processed")
###### end function burst ######

###### collate accepts a list of (PdfFileReader, [string pageselect]), string outfileformat
###### returns nothing, collates selected pages and writers to PdfFileWriter
def collate(inpdfargs, outfileformat):
    global verbose
    global ownerpassword
    global userpassword

    output = PdfFileWriter()
    if userpassword or ownerpassword:
        output.encrypt(userpassword,ownerpassword)
    if outfileformat == '-':
        verbose = False

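    pagequeues = []
    maxpagecount=0
    for pdf, args in inpdfargs:
        numpages=pdf.getNumPages()
        replacements=[["end",str(numpages)]]
        pagequeues += [getpagequeue( args, replacements)]
        # loop as far as the longest queue; a queue can hold more entries than the file has pages
        if len(pagequeues[-1]) > maxpagecount: maxpagecount=len(pagequeues[-1])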


    for pagei in range(maxpagecount):
        for pdfarg in inpdfargs:
            pdf, args = pdfarg
            pdfi = inpdfargs.index(pdfarg)
            numpages=pdf.getNumPages()
            try:
                page, angle = pagequeues[pdfi][pagei]
            except:
                page = -1
            else:
                if ( (page <= numpages) and (page > 0)):
                    if verbose: print "---add " + pdf.stream.name + ":p" + str(page)
                    output.addPage(pdf.getPage(page-1).rotateClockwise(angle))
    
    ###### filename formatting ######
    outfilename=outfileformat

    replacementlist=[("%n","%1n") , ("%e","%1e")]
    applyreplacement=(lambda str, pair: str.replace(pair[0],pair[1]))
    applyreplacementlist=(lambda str, list=replacementlist, func=applyreplacement: reduce(func, [str]+list))

    outfilename=applyreplacementlist(outfilename, replacementlist)

    namearg = re.search('%[0-9]+n',outfilename)
    if namearg:
        namearg=namearg.group()
        filenum=int(namearg.strip("%n"))
        (name, dot, ext) = inpdfargs[filenum-1][0].stream.name.rpartition(".")
        outfilename = outfilename.replace(namearg, name)
    
    extarg = re.search('%[0-9]+e',outfilename)
    if extarg:
        extarg= extarg.group()
        filenum=int(extarg.strip("%e"))
        (name, dot, ext) = inpdfargs[filenum-1][0].stream.name.rpartition(".")
        outfilename = outfilename.replace(extarg, ext)
    
    while os.path.exists(outfilename):
        if verbose: print ('error: '+outfilename+' already exists... appending number')
        (oname, odot, oext) = outfilename.rpartition(".")
        outfilename=oname+"-1"+odot+oext
    ###### end filename formatting ######

    outputStream = openpdf(outfilename, "out")
    if verbose: print "**Writing to " + outputStream.name
    output.write(outputStream)
    outputStream.close()
###### end function collate ######

###### concatenate accepts a list of (PdfFileReader, [string pageselect]), string outfileformat, bool inverse
###### returns nothing, joins selected (unselected if inverse is true) pages from PdfFileReaders and writers to PdfFileWriter
def concatenate(inpdfargs, outfileformat, inverse=False):
    global verbose
    global ownerpassword
    global userpassword

    output = PdfFileWriter()
    if userpassword or ownerpassword:
        output.encrypt(userpassword,ownerpassword)
    if outfileformat == '-':
        verbose = False

    for pdf, args in inpdfargs:
        numpages=pdf.getNumPages()
        replacements=[["end",str(numpages)]]
        pagequeue = getpagequeue( args, replacements)

        feedback = "**Processing: " + pdf.stream.name + " : "

        if inverse == False:
            if verbose:    print (feedback + "Sel" + str(pagequeue))
            for pageitem in pagequeue:
                page, angle = pageitem
                if ( (page <= numpages) and (page > 0)):
                    if verbose: print "---add p" + str(page)
                    output.addPage(pdf.getPage(page-1).rotateClockwise(angle))
                else:
                    sys.stderr.write( "Page " + str(page) + " is not found in file " + pdf.stream.name + "\n" )
                    sys.exit(3) #wrong pages or ranges
        else:
            pagelist = []
            for pageitem in pagequeue:
                pagelist += [pageitem[0]]
            if verbose:    print (feedback + "Rem" + str(pagelist))
            for page in range(1,numpages+1):
                if (page not in pagelist):
                    if verbose: print "--add p" + str(page)
                    output.addPage(pdf.getPage(page-1))


    ###### filename formatting ######
    outfilename=outfileformat

    replacementlist=[("%n","%1n") , ("%e","%1e")]
    applyreplacement=(lambda str, pair: str.replace(pair[0],pair[1]))
    applyreplacementlist=(lambda str, list=replacementlist, func=applyreplacement: reduce(func, [str]+list))

    outfilename=applyreplacementlist(outfilename, replacementlist)

    namearg = re.search('%[0-9]*n',outfilename)
    if namearg:
        namearg=namearg.group()
        filenum=int(namearg.strip("%n"))
        (name, dot, ext) = inpdfargs[filenum-1][0].stream.name.rpartition(".")
        outfilename = outfilename.replace(namearg, name)
    
    extarg = re.search('%[0-9]*e',outfilename)
    if extarg:
        extarg= extarg.group()
        filenum=int(extarg.strip("%e"))
        (name, dot, ext) = inpdfargs[filenum-1][0].stream.name.rpartition(".")
        outfilename = outfilename.replace(extarg, ext)
    
    while os.path.exists(outfilename):
        if verbose: print ('error: ' + outfilename + ' already exists... appending number')
        (oname, odot, oext) = outfilename.rpartition(".")
        outfilename=oname+"-1"+odot+oext
    ###### end filename formatting ######

    outputStream = openpdf(outfilename, "out")
    if verbose: print "**Writing to " + outputStream.name
    output.write(outputStream)
    outputStream.close()
###### end function concatenate ######

def halp():
    print ("""\nUsage: pydftk MODE [-u=userpw] [-o=ownerpw] ARGUMENTS [OUTPUT]
  modes, and required arguments:
    sel,cat     inputfile [pagerange] [if2 [pr2]].. outputname
    del         inputfile [pagerange] [if2 [pr2]].. outputname
    col,zip     inputfile [pagerange] [if2 [pr2]].. outputname
    split,burst inputfile [pagerange] [if2 [pr2]]..[outputname]
    -h,--help   this help
    -v          verbose
    -q          quiet, disable verbose (forced when using <stdout>)
    **note:     - is replaced by <stdin> or <stdout> (determined by context)
                <stdout> is disabled for split/burst
                outputname can use formatting (see below)

  encrypt/decrypt:
    follow filename with -p=(pw) to decrypt using (pw)
    -u=(pw) in arguments will set user password to (pw) for output
    -o=(pw) in arguments will set owner password to (pw) for output

  pagerange formatting:
    range[SELECTION][:ROTATION]
    selection   can be e for even, o for odd, x# for every #th page
    rotation    can be r for right, l for left, u for up, d for down (180 deg)
                can also be cw# for clockwise, ccw# for counterclockwise
                cw and ccw # must be a multiple of 90 degrees
  pagerange examples:
    3           process page 3
    1-5         process pages 1,2,3,4,5
    5-1         process pages 5,4,3,2,1
    end-1       process all pages in reverse order (end replaced by last page)
    1-5o:cw90   process odd pages 1,3,5 and rotate each 90 degrees right
    1-7x3:d     process every 3rd page 1,4,7 and rotate each 180 degrees

  output formatting:
    sel/cat/del/col/zip:
      %n        %n or %1n is first file's path/name, %2n is second...
      %e        %e or %1e is first file's extension, %2e is second...
      -         used alone, will write output to <stdout>
    split/burst only:
      %n        is replaced by current file's path/name
      %e        is replaced by current file's extension
      %p        is replaced by page number, can be printf formatted
                %0p auto selects # of digits based on # pages in current file
      %i        is replaced by index number, can be printf formatted
  output examples:
    sel/cat/del/col/zip:
      $pydftk cat f1.e1 f2.e2 f3.e3 %3n-o.%e
      --would return a file named f3-o.e1
      $pydftk cat input.pdf -
      --would write all of input.pdf to <stdout>
    split/burst only:
      $pydftk burst input.pdf 5 %n.p%04p.%e
      --would return a file named input.p0005.pdf
      $pydftk burst input.pdf 5-6 %n.%03i.%e
      --would return input.001.pdf for page 5, input.002.pdf for page 6""")

###### end function halp ######


parse_args(sys.argv)

Keep in mind I've been using it for several days now but it's still VERY experimental, meaning I may not have even tested some sections since some major changes.
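
For anyone who wants to play with the output formatting on its own, here is a minimal sketch of the %n / %e expansion for sel/cat/del/col/zip. It is written for Python 3 and is not the code from the script above: `expand_output_name` is a hypothetical helper, and it takes plain path strings instead of opened PDF streams. Unlike the script, it expands every placeholder in the template in one pass.

```python
import os
import re

def expand_output_name(template, input_paths):
    """Expand %n / %Nn (path without extension) and %e / %Ne (extension)."""
    def substitute(match):
        # A missing number means the first input file, as in the help text.
        index = int(match.group(1) or 1) - 1
        name, ext = os.path.splitext(input_paths[index])
        return name if match.group(2) == "n" else ext.lstrip(".")
    return re.sub(r"%([0-9]*)([ne])", substitute, template)

# Same case as the help text: pydftk cat f1.e1 f2.e2 f3.e3 %3n-o.%e
print(expand_output_name("%3n-o.%e", ["f1.e1", "f2.e2", "f3.e3"]))  # f3-o.e1
```

Passing a function to `re.sub` avoids the separate search-then-replace steps, at the cost of being a little less obvious to read.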

Last edited by kyle_white (2009-10-24 05:51:38)

ninian · 2009-09-22 08:13:06

kyle_white wrote:

I don't know how to get it onto git. I could just paste the whole (still a bit messy) source here...

Many thanks - will try out!

Heller_Barde · 2009-09-25 10:36:57

@kyle_white

I looked at your source. It looks very reasonable. If you're okay with it, I would upload it to my GitHub project and give you commit rights too. I would just like to avoid suddenly having two very similar but different projects floating around, especially when we both have similar goals and we're both okay with merging. (I am phrasing this very carefully, because I don't want you to think that I just want the glory of your source code.)

Send me an email or something like that. I look forward to hearing what you think.

cheers Barde

PS: I just noticed that merging isn't quite the right word, because by now the code is practically completely yours.

kyle_white · 2009-09-27 22:00:15

Feel free to do whatever you want with the code. I'm not very consistent with maintenance or updating anyway.

markoxx · 2009-10-15 02:23:21

Hey, thanks for this.
It was very useful.

I needed the "cat" function to merge some 81 PDFs into one.
Thanks!

Marco.-

achecht · 2009-10-22 12:14:32

@Kyle,

First, kudos on pydftk. It is a much simpler implementation of the PDFTK.org libraries.

My question is about cat/sel. I seem to only be able to specify one page range (i.e. cat my.pdf 1-4:d output.pdf). But what if I simply want to reorder the existing pages in a PDF (e.g. cat my.pdf 1 8 2 7 3 6 4 5 output.pdf) or skip one or more pages in the doc (e.g. cat my.pdf 1-4 6-8 output.pdf)? Pydftk seems to give an error that the parameters are not as expected.

Is there a way to specify multiple page ranges? I know I could do it in two passes (first output each page range as a single output-#.pdf and then merge all the "output-*.pdf" files together into a new one), but I am trying to do this as cleanly as possible.

Thanks. And if you reply via pm, include an address for a PayPal donation.

~Alan Hecht

kyle_white · 2009-10-24 05:34:06

Alan -

Well, that's what the disclaimer is for. I've never had a use for single pages, and it looks like it was simply broken when passed single page numbers. I think I fixed it, and have updated the code in the post above. As for several page selections, that has always worked for me, so let me know if there are any more problems. I tested it with pydftk cat in.pdf 1-3:e:d 5 9 10 3 out.pdf and got the expected result.
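
To make the range notation from the help text concrete (3, 5-1, end-1, 1-5o, 1-7x3), here is a rough Python 3 sketch of how such a spec could be expanded into page numbers. It is an illustration only, not the parser from the script: `expand_page_range` is a hypothetical name, the :ROTATION suffix is left out, and it assumes that e/o filter on the page number itself.

```python
import re

def expand_page_range(spec, last_page):
    """Expand '3', '1-5', '5-1', 'end-1', '1-5o' or '1-7x3' into page numbers."""
    match = re.fullmatch(r"(\d+|end)(?:-(\d+|end))?(e|o|x[1-9]\d*)?", spec)
    if match is None:
        raise ValueError("bad page range: %r" % spec)

    def to_number(token):
        # "end" stands for the last page of the current file.
        return last_page if token == "end" else int(token)

    start = to_number(match.group(1))
    stop = to_number(match.group(2) or match.group(1))
    step = 1 if start <= stop else -1          # 5-1 counts downwards
    pages = list(range(start, stop + step, step))

    selection = match.group(3)
    if selection == "e":
        pages = [p for p in pages if p % 2 == 0]
    elif selection == "o":
        pages = [p for p in pages if p % 2 == 1]
    elif selection:                            # x#: every #th page of the range
        pages = pages[::int(selection[1:])]
    return pages

print(expand_page_range("1-7x3", 10))  # [1, 4, 7]
```

Several selections on one command line would then just be the concatenation of the lists returned for each argument.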

Something I noticed, which hopefully won't be a problem for most people, is that once a page is rotated it stays rotated. In the example above I had actually used 1-3:e:d 2 9 10 3, which outputs page 2 twice at the beginning, but it is rotated both times (I know it's confusing, but I don't know how else to say it). This is an issue with the Python library's representation of the PDF pages before they are written to file, and I'd have to look into creating a copy of each page that is rotated. For now, anyone who runs into that problem will have to force a second copy of the page by passing the input PDF a second time. E.g.

pydftk cat in.pdf 1:cw90 in.pdf 1 out.pdf (this works for me)

Rotating the page back by hand (pydftk cat in.pdf 1:cw90 1:ccw90 out.pdf) will not work, which is why the program can't un-rotate pages either.
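
The effect can be reproduced without any PDF library. In the Python 3 sketch below a plain dict stands in for a page object and `rotate_in_place` / `rotate_copy` are hypothetical helpers, so this only illustrates the aliasing problem and the "copy before rotating" idea. Whether a shallow copy is enough for real pyPdf page objects is an assumption I have not verified.

```python
import copy

def rotate_in_place(page, angle):
    """Mimic a rotation that mutates the page object it is given."""
    page["/Rotate"] = (page.get("/Rotate", 0) + angle) % 360
    return page

def rotate_copy(page, angle):
    """Rotate a shallow copy so the original page keeps its orientation."""
    return rotate_in_place(copy.copy(page), angle)

# Selecting the same page twice, rotated only the first time.
shared = {"/Contents": "page 2"}
aliased = [rotate_in_place(shared, 180), shared]      # both entries are one object

original = {"/Contents": "page 2"}
independent = [rotate_copy(original, 180), original]  # two separate objects

print([p.get("/Rotate", 0) for p in aliased])      # [180, 180]
print([p.get("/Rotate", 0) for p in independent])  # [180, 0]
```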

Last edited by kyle_white (2009-10-24 05:50:13)

ninian · 2009-10-24 09:56:25

Seems to work fine with single pages now, thanks.

achecht · 2009-10-24 13:24:43

This works as needed for me now. Thanks for the extra effort!

smartboyathome · 2009-10-29 19:25:53

Hi,

What I am trying to do with this is change the font size in a PDF document. The reason is that I am trying to create a PDF of a mindmap (so that others without the mindmap software can read it) by using the Print to File function of CUPS. When I do this and open it up in Adobe Reader, though, it cuts off part of the last letter of each topic/subtopic. I came to the conclusion that all I'd need to do is change the font size and it should be fine. How would I do this with your tool, or is it not possible?

achecht · 2009-10-30 10:42:47

From what I can tell, this tool is just for modifying a PDF in terms of "structure" not in terms of "content". So rotating pages, reordering pages, collating pages, etc. is possible with PYDFTK, but changing the font size for a document is not.

FWIW, PYDFTK is simply a nice front end for many of the features that could be achieved with the python library PYPDF (http://pybrary.net/pyPdf). And I don't see font adjustment in that library either.

newcar · 2010-03-14 17:07:28

Thx a lot for this tool.

I can't compile pdftk here, so this came at a really good time!

Thx!

brianhanna · 2010-03-15 02:37:39

This is a great idea. A good replacement for Acrobat Pro would be a killer app for Linux I think. This is an excellent start.

Sara · 2010-03-28 02:33:56

Thank you for stapler! Allowed me to promptly uninstall pdfshuffler (I prefer the cli). It would be terrific if you can add a bookmarking feature to it. I can easily open the PDF file in a reader to see at what pages I wish to make my bookmarks, so if a bookmarking utility could be incorporated into stapler that accepts page numbers (and allows nice names for the bookmarks) I would deeply appreciate it. Thank you.

viniosity · 2010-04-09 17:06:12

Does anyone know which PDF versions this works with? PDF-TK and several others I've found work well with versions 1.2 and 1.3 but are spotty with 1.4 and 1.5.

psrivats · 2010-05-10 19:13:09

Stapler is awesome - thank you!

Not sure if this is the right place, but I have a slightly different question regarding a warning message. I need to reinstall Arch on my main computer and I upgraded to Lucid (ubuntu) on my 2nd computer. While stapler works as it should, I get this warning:

/usr/lib/pymodules/python2.6/pyPdf/pdf.py:52: DeprecationWarning: the sets module is deprecated
  from sets import ImmutableSet
too few arguments

Should I worry about this?
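
For what it's worth, the first two lines are only a DeprecationWarning raised when pyPdf imports the old sets module, and a warning of that category can be filtered with Python's standard warnings module. The Python 3 sketch below uses a hypothetical stand-in function instead of the real import; in practice the filter would have to be installed before pyPdf is imported. The "too few arguments" line is not part of the warning and is not covered here.

```python
import warnings

def import_like_old_pypdf():
    # Stand-in for pyPdf's "from sets import ImmutableSet" on Python 2.6.
    warnings.warn("the sets module is deprecated", DeprecationWarning)
    return "imported"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # start from a state where nothing is hidden
    # The actual fix: ignore DeprecationWarning before the import happens.
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    result = import_like_old_pypdf()

print(result, len(caught))  # imported 0
```

The same filter is available without touching any code via the interpreter option -W ignore::DeprecationWarning.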
