PDF manipulation tips, Part 2

The first part of my PDF manipulation tutorial covered PDF operations you can do with Ghostscript. This part covers various PDF processing tasks using other programs, such as Pdftk, Xpdf and a few others. These tips came in handy while preparing and submitting research papers written in LaTeX. The programs used here are all free, open source tools. You will need to use the command line for these programs; I usually run them from a Cygwin terminal.

Split PDF into constituent pages with Pdftk

You can use Pdftk to split a PDF into individual pages, each saved as a separate PDF. Pdftk is a free toolkit for manipulating PDF files. I have only used it for splitting PDF files so far. Download Pdftk and install it somewhere on your system path. Then run the following command:

pdftk in.pdf burst

This will take each page in in.pdf and produce a separate file for each page that is named pg_####.pdf, where the #### is a page number padded by leading zeros.
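
The naming scheme is a printf-style pattern, and burst also accepts an output option if you want different names. The sketch below only previews the names; burst_name is a hypothetical helper written for this illustration, and page_%03d.pdf is an arbitrary example pattern.

```shell
#!/bin/sh
# Preview the file name that burst produces for a given page number.
# Pdftk's default output pattern is pg_%04d.pdf (printf-style).
burst_name() {
    # $1 = page number, $2 = optional printf-style pattern
    printf "${2:-pg_%04d.pdf}\n" "$1"
}

burst_name 7                  # pg_0007.pdf
burst_name 7 'page_%03d.pdf'  # page_007.pdf

# The same kind of pattern can be handed to Pdftk itself (not run here):
#   pdftk in.pdf burst output page_%03d.pdf
```

Note that burst also writes a small metadata file called doc_data.txt next to the page files.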

The Ghostscript way of extracting pages from a PDF is less convenient, as each run can only extract a single page or a range of consecutive pages (e.g. pages 2 to 5 of your original PDF) into one new PDF.

Burst only a range of pages with Pdftk

Pdftk's burst operation cannot be limited to a range of pages, so the Ghostscript page extraction command is still useful. You can use Ghostscript to extract the desired range of pages to a new PDF containing only those pages:

gswin32c.exe -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dFirstPage=a -dLastPage=b -sOutputFile="out.pdf" "in.pdf"

Here a and b are the first and last page numbers of the desired range. Ghostscript first creates a new PDF, out.pdf, that contains only pages a to b. You can then run the Pdftk burst command on out.pdf to save each of those pages to an individual PDF.
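
The two steps can be chained in a small shell function. This is only a sketch under a few assumptions: burst_range is a made-up name, the Ghostscript executable is assumed to be called gs (substitute gswin32c.exe on Windows), the intermediate file name is arbitrary, and the commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# burst_range IN.pdf FIRST LAST
# Extract pages FIRST..LAST with Ghostscript, then burst them with Pdftk.
# By default the commands are only printed; set RUN=1 to execute them.
burst_range() {
    src=$1 first=$2 last=$3
    tmp="range_${first}_${last}.pdf"   # intermediate file (arbitrary name)

    # Step 1: copy the page range into a new PDF.
    set -- gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH \
        "-dFirstPage=$first" "-dLastPage=$last" \
        "-sOutputFile=$tmp" "$src"
    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@" || return 1; fi

    # Step 2: split the intermediate PDF into one file per page.
    set -- pdftk "$tmp" burst
    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@" || return 1; fi
}

burst_range in.pdf 2 5
```

The burst output is numbered from 1 again, so pg_0001.pdf holds page 2 of the original in this example. Pdftk's cat operation can also select a page range (pdftk in.pdf cat 2-5 output range.pdf), which may be a simpler first step if you would rather not involve Ghostscript.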

Extract images directly from a PDF

The images inside a PDF can be extracted or ripped directly to image files. PDF stores images internally as image data streams that can be copied and saved to image files. To do this, you need Xpdf, another set of free tools that can handle the PDF format. Once installed, you can run the various tools that come with Xpdf from the command line. One of these tools is called pdfimages.exe, which performs image extraction from a PDF file.

Here's the basic command:

pdfimages.exe infile.pdf BASENAME

This command will rip all the images from the PDF file to a sequence of image files named BASENAME-###.pbm (monochrome images), BASENAME-###.pgm (grayscale images) or BASENAME-###.ppm (color images), where ### is a number with leading zeroes. Each image gets a unique number corresponding to the order in which it appears in the PDF file. PBM, PGM and PPM are simple uncompressed bitmap formats from the Netpbm family; which one you get depends on the color depth of the image that is actually stored in the PDF file. You can open all three formats with the GIMP image editor and save them to whatever format you want.

The above will rip all the images from a PDF, including those that are stored with DCT compression. I believe that the program decompresses the image and saves it as an uncompressed PPM file. There is an option to rip the DCT compressed images directly to JPEG. Since JPEG uses DCT compression, the program is able to do this without recompressing anything. The command is as follows:

pdfimages.exe -j infile.pdf BASENAME

Monochrome, Flate-compressed and uncompressed images are still saved as PBM, PGM or PPM, but all DCT-compressed images are saved as JPEGs.
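
If you only need the images from a few pages, pdfimages also takes -f and -l options for the first and last page to scan. The function below is a sketch for illustration: rip_images and the fig base name are invented here, the executable is called pdfimages.exe on Windows, and the command is only printed unless RUN=1 is set.

```shell
#!/bin/sh
# rip_images IN.pdf BASENAME [FIRST LAST]
# Build a pdfimages command that keeps DCT images as JPEG (-j) and
# optionally restricts the scan to pages FIRST..LAST (-f / -l).
# The command is printed; set RUN=1 to execute it as well.
rip_images() {
    src=$1 base=$2 first=${3:-} last=${4:-}

    set -- pdfimages -j
    if [ -n "$first" ]; then set -- "$@" -f "$first"; fi
    if [ -n "$last" ];  then set -- "$@" -l "$last";  fi
    set -- "$@" "$src" "$base"

    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

rip_images infile.pdf fig          # all pages
rip_images infile.pdf fig 3 4      # only images on pages 3 to 4
```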

Extracting images from a PDF file can be helpful if you have lost the original source images that you used to create the PDF and need to get them back. This happened with one of my papers and I am glad that Xpdf has such a useful tool.

Convert PDF pages to image files

The pages of a PDF file can be converted to individual image files, such as individual PNGs. The free ImageMagick software suite is a set of tools for converting and processing images that happens to support PDF files. So if you have a PDF and you want to turn each page into a PNG, you can use this command after installing ImageMagick:

convert infile.pdf out.png

This will save each page as out-#.png, where # is the page index counted from 0, so the first page becomes out-0.png.

The default settings produce low-resolution PNGs (ImageMagick rasterizes at 72 dpi unless told otherwise), but the resolution of the saved images can be controlled with ImageMagick options:

convert -density 600x600 -resize 850x1100 infile.pdf out.png

The -density option rasterizes the PDF file at 600 x 600 dpi, which produces an image of 5100 x 6600 pixels for an 8.5” x 11” page. The -resize option then scales it down to 850 x 1100 pixels before it is saved to a PNG file.
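
The pixel size is simply the page size in inches multiplied by the density. A quick way to check the numbers for other page sizes or densities, assuming awk is available (page_pixels is a hypothetical helper, and it rounds to the nearest pixel, which may differ slightly from what the renderer does on odd page sizes):

```shell
#!/bin/sh
# page_pixels WIDTH_IN HEIGHT_IN DPI
# Print the pixel size of a page rasterized at the given density,
# rounded to the nearest whole pixel.
page_pixels() {
    awk -v w="$1" -v h="$2" -v d="$3" \
        'BEGIN { printf "%dx%d\n", w * d + 0.5, h * d + 0.5 }'
}

page_pixels 8.5 11 600   # 5100x6600
page_pixels 8.5 11 120   # 1020x1320
```

Rendering at a high density and then shrinking the result generally gives smoother edges than rendering directly at the target size. ImageMagick's documentation recommends giving settings that affect reading, such as -density, before the input file and image operators such as -resize after it, for example: convert -density 600x600 infile.pdf -resize 850x1100 out.png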

ImageMagick supports output formats other than PNG and can handle a variety of input formats as well. It also has a ton of other image manipulation capabilities. The large number of options and operations that are possible can make it hard to use and things sometimes do not make sense, but for a simple task like converting a PDF to a PNG, there is usually no problem.

I have used this tip for producing the example images in my other LaTeX tutorials. While it is possible to render LaTeX code on the web server for the LaTeX examples, that depends on my web host allowing it, so I decided to just produce the PDFs and then convert them to images myself. Using ImageMagick saved me from having to take screenshots for all the images. I just produce a PDF file containing all the examples with pdfTeX and then run:

pdfcrop.pl --margins 1 latex_tips.pdf

convert -density 120x120 latex_tips-crop.pdf out.png

The first command produces a cropped PDF file with all white space removed (except for a margin of 1 bp), which is then processed with ImageMagick.
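
For more than one file, the two steps can be wrapped in a function. This is a sketch only: crop_and_render is a made-up name, the zero-padded PNG naming pattern is my own choice rather than what was used for the tutorials, and the commands are printed instead of executed unless RUN=1 is set.

```shell
#!/bin/sh
# crop_and_render IN.pdf [DPI]
# Print (and with RUN=1 execute) the crop and rasterize steps for one PDF.
crop_and_render() {
    src=$1 dpi=${2:-120}
    stem=${src%.pdf}                 # latex_tips.pdf -> latex_tips
    cropped="${stem}-crop.pdf"       # pdfcrop's default output name

    # Step 1: remove white space around the page content, keep a 1 bp margin.
    set -- pdfcrop.pl --margins 1 "$src"
    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@" || return 1; fi

    # Step 2: rasterize every page of the cropped PDF to a numbered PNG.
    set -- convert -density "${dpi}x${dpi}" "$cropped" "${stem}-%02d.png"
    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@" || return 1; fi
}

crop_and_render latex_tips.pdf
```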

ImageMagick's PDF renderer is not perfect. This can be problematic if your PDF contains complex vector graphics or line art. Also, ImageMagick does not resample raster images inside a PDF very well. The PDF renderer in the GIMP is much better but is not as convenient to use on the command line. However, you can still open PDFs and tell GIMP to render each page to a new image layer or image.

Discussion

Bruce, 2012/09/13 16:54
Most of the above tools, at least xpdf and pdftk fail with pdf sizes > 2GB.
Any recommendations for handling larger pdfs?

Peter Yu, 2012/09/13 21:15
I've never worked with such large PDFs before. Try to see if there are any 64-bit or x64 builds of xpdf and pdftk, which may be able to work with big files.

kumaran, 2015/02/17 07:43
One of your articles mentions that one would be able to open PDFs and tell GIMP to render each page to a new image layer or image. Does this imply that if the pdf has 5 pages, GIMP would be able to extract each page as a image and enhance each image's quality. Please advise if the above is possible and if there are any sample scripts to achieve this.

JonyGreen, 2015/10/08 04:27
I find a free online pdf to image converter(http://www.online-code.net/pdf-to-image.html), you can convert pdf to jpg online free.

About Peter Yu

I am a research and development professional with expertise in the areas of image processing, remote sensing and computer vision. I received BASc and MASc degrees in Systems Design Engineering at the University of Waterloo. My working experience covers industries ranging from district energy to medical imaging to cinematic visual effects. I like to dabble in 3D artwork, I enjoy cycling recreationally and I am interested in sustainable technology.

Feel free to contact me with any questions about this site at [user]@[host] where [user]=web and [host]=peteryu.ca

Copyright © 1997 - 2017 Peter Yu