pdfrw: the other Python PDF library

Introduction

As Tim Arnold explains in Manipulating PDFs with Python, even beautiful PDFs are often unspeakably ugly inside, and if you can avoid having to manipulate them, you should. Nonetheless, you've decided to ignore Tim's advice, and that's why you're here. (Or maybe you haven't actually seen Tim's tutorial, in which case you should go read it now, because it's chock full of good advice, and this article builds on it.)

Tim's article does a good job of describing pdfminer and PyPDF2, but it doesn't go into detail about pdfrw, and for good reason: pdfrw hadn't been updated since 2012. But now that's changed.

Since Google Code shut down, I finally moved the project to GitHub. During the transition I've fixed bugs, incorporated some tests, added support for Python 3, and merged some code that someone contributed for parsing PDF 1.5 stream objects. Now pdfrw is at version 0.2, and I hope not to get so far behind in the future.

Since I've started cleaning it up, I figured I might as well also put some effort into telling people about it. In this tutorial, I'll provide a primer on pdfrw, complete with an overview of its features and some examples.

What good is it?

As you may have gathered from either the introduction or the name of the library, pdfrw can read and write PDF files. It also has no dependencies except Python, and the current version (0.2) is available on PyPI for both Python 2 and Python 3 (2.6, 2.7, 3.3, and 3.4). As discussed in Tim's tutorial, the two most popular pure Python PDF libraries are pdfminer and PyPDF2. In terms of focus, pdfrw is much closer to PyPDF2 than it is to pdfminer, so the rest of this article discusses pdfrw in relation to PyPDF2. (I'm not an expert with PyPDF2 by any means, so please let me know in the comments if I have made any egregious errors.)

PyPDF2 supports more PDF features than pdfrw, including decryption and more types of decompression. It also has specialized functions for several things such as merging bookmarks from two different PDFs. I am actively working on bookmark support for pdfrw, but it has none at present.

One area where pdfrw shines is in reusing PDFs in conjunction with reportlab. Due to pdfrw's form XObject support, I believe that it is the only package, aside from reportlab's proprietary pagecatcher software, that supports reuse of elements from preexisting PDF files in reportlab output.

pdfrw has (I believe) a faster parser than the other libraries. Also, rather than trying to create full-featured objects that provide attributes for every single thing you could do with a document, pdfrw has a simpler model that is built on modelling low-level PDF objects, and then adding some domain-specific procedural code on top of that for a few different tasks. It also looks and feels a bit different, because of this focus on lower-level PDF container objects.

OK, show me the code!

There are several examples at the pdfrw home page, including examples that use pdfrw in conjunction with reportlab. They need a bit more documentation, and the library needs more documentation, but I'm slowly working on that. For the purposes of this article, I'm simply going to take the PyPDF2 examples from Tim's tutorial, and rework them to use pdfrw.

Merge (layer)

The layer merge example from Tim's tutorial applies a watermark to a PDF by opening a source PDF and a watermark PDF, and modifying each page object by drawing the first page of the watermark PDF on top of every source PDF page. Since pdfrw gives you low-level access to PDF objects, you could mimic this behavior with pdfrw and a small bit of graphics code, but the canonical pdfrw version of this example uses a form XObject to represent the watermark. This is often easier to get right, because there are fewer possible resource dictionary conflicts between the watermark page and the page it is applied to. It doesn't work well when the contents of the watermark page are in an array of compressed objects that pdfrw doesn't yet know how to decompress. This is problematic because the contents of the form XObject have to be a PDF dictionary, not an array. If pdfrw rejects your watermark, you can probably fix it by running the watermark PDF through pdftk or some other package that can decompress it, perhaps even PyPDF2.

Unlike many of the pdfrw examples, this one leaves things like bookmarks intact, because it watermarks the pages in-place, and then writes out the pre-existing PDF document. (Most other pdfrw examples construct new pages, and construct a new PDF document incorporating those.)

from pdfrw import PdfReader, PdfWriter, PageMerge

Open both the PDF files

ipdf = PdfReader('sample2e.pdf')
wpdf = PdfReader('wmark.pdf')

Watermarks are a special case, because we want to create a form XObject of the watermark page, and then reuse the same XObject on every page, so we don't increase the size of the output file too much.

Rather than complicate our loop by creating a watermark on the first page, and then pulling it out to use on subsequent pages, we simply create, and watermark, a completely blank page. We immediately discard the blank page itself, after extracting the watermark XObject from it. (The watermark will be the first and only XObject in the PageMerge list.)

wmark = PageMerge().add(wpdf.pages[0])[0]

For each page in the input PDF, create a PageMerge object that will operate on the page, add in the watermark object, and render the changes (merge the watermark) back to the page.

for page in ipdf.pages:
  PageMerge(page).add(wmark).render()

Create a PdfWriter object, and use it to output the modified PDF.

PdfWriter().write('newfile.pdf', ipdf)

Tim mentioned that you should be careful to place non-transparent watermark PDFs so they don't mask any underlying text. Another option is to reverse the merge order and put the watermark under the page that is being watermarked. With pdfrw you can achieve this by passing 'prepend=True' to the PageMerge add() function in the loop.

Merge (append)

Tim's next snippet is a program that concatenates all the PDFs in the mypdfs subdirectory to create a new PDF named output.pdf. The pdfrw version of this is below. Note that the current version of pdfrw will not preserve bookmarks when doing this, because it simply copies the page display information into the new PDF.

import os
from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()

files = [x for x in os.listdir('mypdfs') if x.endswith('.pdf')]
for fname in sorted(files):
  writer.addpages(PdfReader(os.path.join('mypdfs', fname)).pages)

writer.write("output.pdf")
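
One thing to watch in the listing above: sorted() compares file names as plain strings, so a name like chapter10.pdf lands before chapter2.pdf. If your files carry unpadded numbers, a natural-sort key is one way around that. The following is only a sketch; natural_key is a hypothetical helper of my own (it is not part of pdfrw), and the file names are made up for illustration.

```python
import re

def natural_key(fname):
    # Hypothetical helper, not part of pdfrw.
    # Split the name into alternating non-digit and digit runs, so that
    # the digit runs are compared as integers instead of as text.
    parts = re.split(r'(\d+)', fname)
    return [int(p) if p.isdigit() else p.lower() for p in parts]

# Made-up file names, standing in for the contents of mypdfs
names = ['chapter10.pdf', 'chapter2.pdf', 'chapter1.pdf']

plain_order = sorted(names)                     # plain string comparison
natural_order = sorted(names, key=natural_key)  # numbers compared as numbers
```

To use it in the merge loop, you would pass the key to sorted(), as in sorted(files, key=natural_key).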

Removing Blank Pages

Tim gives an example of getting rid of blank pages by not writing out the ones that don't have associated Contents dictionaries. The next example shows pdfrw doing the same thing. Since I don't want to overpromise what is going on here, I will note that PDFs are complicated enough that it's certainly possible to have a blank PDF page that has a Contents dictionary, so we'll conveniently ignore that possibility for this example. Again (and probably unlike PyPDF2), pdfrw will not preserve bookmarks, because we are adding data to the PdfWriter object a page at a time, and letting it build a new document rather than re-use an existing one.

from pdfrw import PdfReader, PdfWriter

output = PdfWriter()

for p in PdfReader('source.pdf').pages:
  if p.Contents is not None:
    output.addpage(p)

output.write('newfile.pdf')

Splitting

Tim's next example splits one PDF into separate one-page PDFs. Note that you might want to increment i if you don't like numbering pages from 0.

from pdfrw import PdfReader, PdfWriter

infile = PdfReader('source.pdf')

for i, p in enumerate(infile.pages):
  PdfWriter().addpage(p).write('page-%02d.pdf' % i)
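
If you would rather number the output files from 1, you don't have to increment i by hand, because Python's built-in enumerate accepts a starting index. Here is a small sketch of just the naming part; the list of strings is a stand-in I made up for infile.pages, so it runs without any PDF at all.

```python
# Stand-ins for infile.pages, so the naming logic can be seen in isolation
pages = ['first', 'second', 'third']

# The second argument to enumerate sets the starting index
names = ['page-%02d.pdf' % i for i, p in enumerate(pages, 1)]
```

In the splitting loop itself, that amounts to writing enumerate(infile.pages, 1) instead of enumerate(infile.pages).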

Stitching

Tim's penultimate example is an eminently practical one. You have two PDFs, resulting from separate scans of odd and even pages, and you want to stitch them together. Tim's example starts off and ends with a blank page, so we'll emulate that, as well. pdfrw doesn't have a specific addBlankPage function, but it can be emulated by creating a PageMerge object, and setting a media box on that.

Note that this code assumes that the odd and even source files have exactly the same number of pages. (The length of the sequence returned by Python's built-in zip function is the same as the length of the shortest input sequence.)

from pdfrw import PdfReader, PdfWriter, PageMerge

even = PdfReader('even.pdf')
odd = PdfReader('odd.pdf')
merged = PdfWriter()  # named 'merged' to avoid shadowing the built-in all()
blank = PageMerge()
blank.mbox = [0, 0, 612, 792] # 8.5 x 11
blank = blank.render()

merged.addpage(blank)

for x, y in zip(odd.pages, even.pages):
  merged.addpage(x)
  merged.addpage(y)

while len(merged.pagearray) % 2:
  merged.addpage(blank)

merged.write('all.pdf')

As coded, the while statement above will execute exactly once.
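
To see both the zip behavior and the page-count arithmetic without touching any PDFs, the same logic can be mimicked with plain lists. This is just a sketch: stitch is a hypothetical function of my own, and the strings are stand-ins for page objects.

```python
def stitch(odd, even, blank='blank'):
    """Mimic the page order of the stitching code, using plain lists."""
    out = [blank]                  # leading blank page
    for x, y in zip(odd, even):    # zip stops at the shorter input
        out.append(x)
        out.append(y)
    padded = 0
    while len(out) % 2:            # 1 + 2n entries is always odd here
        out.append(blank)
        padded += 1
    return out, padded

# Deliberately mismatched inputs: three odd pages, two even pages
result, padded = stitch(['p1', 'p3', 'p5'], ['p2', 'p4'])
```

With these inputs, p5 never makes it into the result, and nothing complains about it. If that worries you, one option is to compare len(odd.pages) and len(even.pages) before the loop and stop if they differ.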

Metadata

Tim's final example adds some metadata into a PDF file. Like watermarking, this is something that pdfrw can do quite easily without losing non-display information such as bookmarks. Also, note that, unlike PyPDF2, pdfrw is smart enough to automagically put the slashes in front of the dictionary keys -- because that's a requirement for all PDF dictionaries, it's something that pdfrw has easy syntax for.

from pdfrw import PdfReader, PdfWriter, PdfDict

def add_metadata(name, data):
  trailer = PdfReader('%s.pdf' % name)
  trailer.Info.update(data)
  PdfWriter().write('%s_update.pdf' % name, trailer)

metadata = PdfDict(hey='there', la='deedah')
add_metadata('myfile', metadata)

Elements of PdfDict instances can also be accessed via attributes instead of via indexing, so, for example, the second-to-last line could be replaced with:

metadata = PdfDict()
metadata.hey = 'there'
metadata.la = 'deedah'
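
If you are curious how attribute names can end up as slash-prefixed keys, the idea can be sketched in a few lines of plain Python. To be clear, this is a toy model and not pdfrw's actual implementation: SlashDict is a name I made up, and the real PdfDict does considerably more.

```python
class SlashDict(dict):
    """Toy stand-in for the idea behind PdfDict; not pdfrw's implementation."""

    def __init__(self, **kw):
        dict.__init__(self)
        for name, value in kw.items():
            setattr(self, name, value)

    def __setattr__(self, name, value):
        # Attribute names become '/'-prefixed dictionary keys
        self['/' + name] = value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails;
        # in this toy model, missing keys read as None
        return self.get('/' + name)

meta = SlashDict(hey='there')
meta.la = 'deedah'
keys = sorted(meta.keys())
```

Having missing entries read as None is the same convention the blank-page example relies on when it tests p.Contents.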

Summary

pdfrw and PyPDF2 occupy similar but distinct niches. I hope that the provided discussion and code examples help you decide which might be best for your application. Currently, I'm working on code that will allow preservation (and merging) of bookmarks, and also on a more general-purpose command-line utility. If you have questions or comments about pdfrw, please feel free to leave feedback here or in the GitHub issue tracker.
