PDF manipulation tips, Part 1

Publishing academic papers with LaTeX / pdfTeX often requires working with PDFs. During grad school at the University of Waterloo, I published mainly in the IEEE Geoscience and Remote Sensing journals, which accepted PDF format submissions with LaTeX source files. On this page I will document some tips and tricks I learned as I handled (or, on some days, physically dueled with) PDF files. This will hopefully save you some headaches.

Be sure to check out the other parts of my PDF manipulation tips:

Part 1: GhostScript

Part 1 of this document focuses on things you can do with GhostScript. Part 2 focuses on things you can do with other programs like Pdftk.

For these tips, you will need to install at least GPL GhostScript, an open source software package that can manipulate PostScript and PDF files. Once installed, add the bin directory of the Ghostscript installation to your system path. The examples I give below use the Windows version of Ghostscript, gswin32c.exe; on other systems, replace it with gs.

If you use GhostScript as described here to process a PDF which has images with alpha channels (transparency), the images may not show up in the resulting PDF (there seems to be an error in the image). If these are PDF files you created yourself with pdfTeX, then most likely the original images that you included with the \includegraphics command had alpha channels. Just remove the alpha channels from the original images and recreate the PDF with pdfTeX before using GhostScript.

Compress a large PDF for distribution

Oftentimes the PDF of your manuscript produced straight from pdfTeX is very large. This is especially the case when you are preparing for submission to a journal, which will require that color graphics be a certain dots per inch (dpi). IEEE Geoscience and Remote Sensing journals, for example, require color images to be 400 dpi and grayscale graphics be 300 dpi. This means that for a color graphic that you want to be 3 inches square, the image has to be 1200 pixels x 1200 pixels in size. If you have a lot of images, which is often the case in image processing research, then the PDF file can be quite large. This is fine for final submission to the journal, but what if you needed to distribute a draft to your co-authors, or wanted to put an online version on your website (e.g. my Research page) without straining your bandwidth?
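The pixel arithmetic above (inches times dpi) is easy to script if you have many figures to check. Here is a minimal sketch for a Bash-like shell such as Cygwin Bash; the function name pixels_needed is made up for this illustration.

```shell
# pixels_needed INCHES DPI
# Prints the pixel length of one side of an image that must span INCHES
# at DPI dots per inch. Whole numbers only (shell integer arithmetic).
pixels_needed() {
    echo $(( $1 * $2 ))
}

pixels_needed 3 400   # colour graphic, 3 inches at 400 dpi: prints 1200
pixels_needed 3 300   # grayscale graphic, 3 inches at 300 dpi: prints 900
```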

Fortunately Ghostscript offers an easy way to recompress a PDF. This will use lossy compression (unless you tell it otherwise) and resample the images to a lower dpi, so the image quality will suffer but the file will be much smaller. Just run the following command:

gswin32c.exe -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"

The above will compress “infile.pdf” to “outfile.pdf” using the /ebook preset. You can specify several different presets for -dPDFSETTINGS, which affect the dpi to which the images are resampled:

Preset      dpi
/screen     72 dpi
/ebook      150 dpi
/printer    300 dpi
/prepress   300 dpi
/default    Default dpi setting

For the purposes of compressing a PDF made by pdfTeX, use /screen or /ebook. I like /ebook because it is higher quality but still smaller than the original file. The /prepress and /printer presets keep images at fairly high resolution (300 dpi), which is not what you want for making a file smaller. I am unsure of what the difference is between /printer and /prepress. I also do not know how to specify an arbitrary dpi setting. I will post it here when I find out.
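On the arbitrary dpi question, one approach that should work is to set the pdfwrite image resolution parameters directly: -dDownsampleColorImages, -dColorImageResolution, -dDownsampleGrayImages and -dGrayImageResolution are documented Ghostscript distiller parameters. The sketch below only builds the option string. The helper name resample_opts is hypothetical, and placing the options after the preset on the command line is my assumption about how to make them take precedence, so check the result on your own files.

```shell
# resample_opts DPI
# Prints Ghostscript options that ask pdfwrite to downsample colour and
# grayscale images to DPI. It only prints text; it does not run Ghostscript.
resample_opts() {
    dpi=$1
    printf '%s ' \
        "-dDownsampleColorImages=true" "-dColorImageResolution=${dpi}" \
        "-dDownsampleGrayImages=true" "-dGrayImageResolution=${dpi}"
    echo
}

# Intended use (commented out so pasting this block does not run Ghostscript;
# use gswin32c.exe instead of gs on Windows):
# gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook $(resample_opts 120) \
#    -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"
```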

As an example of the file size difference you can achieve with compression, consider a few papers that I have on this website:

File Name         Original Size   Compressed Size
amsrqs_main.pdf   1.6 MB          594 KB
icesynth_ii.pdf   2.8 MB          418 KB
magic_cjrs.pdf    41 MB           1 MB

The magic_cjrs.pdf file was originally very large; it has a lot of high resolution SAR imagery that was included at nearly full resolution in the original file. My co-author on that paper had a lot of trouble sending it back and forth to our advisor for feedback due to the size. If I had known back then what I know now, this problem could have been avoided.

Windows users can also try using PDFCreator (see footnote 1 before installing) to reprint a PDF generated by pdfTeX. While this makes the file small, it also completely garbles the text: you can read it on screen, but the underlying text is replaced by gibberish, so you cannot search or copy it properly. Search engines like Google cannot read it properly either, so no one can find your file.

There are some simple alternatives to re-compressing that I should mention:

  1. Use low resolution or placeholder images while preparing your draft and/or while producing your online version. This requires that you produce and keep track of two versions of your images.
  2. When using pdfTeX, you can include JPEG images, which you can compress to be smaller beforehand. They stay small in the final PDF file. The quality obviously goes down, which is why I personally always include PNG files with lossless compression, which are then converted to whatever format the journal needs at the end. I have noticed that with my submissions to IEEE, my original lossless colour images are recompressed by them; so it is best to give them lossless images in the first place to avoid possible further losses of quality.

Make GhostScript use lossless compression

If you are using GhostScript to process your PDF, you sometimes might want it to use lossless compression instead of lossy compression. In this case, you will need to specify some additional options: two each for colour and grayscale images.

For colour images:

-dAutoFilterColorImages=false
-dColorImageFilter=/FlateEncode

For grayscale images:

-dAutoFilterGrayImages=false
-dGrayImageFilter=/FlateEncode

Finally, for mono (black and white) images, specify:

-dMonoImageFilter=/FlateEncode

These options make GhostScript use lossless compression, and they let you decide separately how each type of image is compressed. The -dAutoFilter[…]Images options tell Ghostscript not to choose a compression method automatically, while the -d[…]ImageFilter options tell GhostScript which compression method you want. Flate is the PDF name for the Deflate algorithm, which is lossless and is also used in PNG images.

So if you have a file with gray scale and colour images, and you want to, for example, resample to 150 dpi with lossless compression, then use the following command:

gswin32c.exe -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"

Concatenating, merging, splitting and extracting pages from PDF files

It is fairly simple to concatenate (join / merge) several PDF files together and extract certain pages from a PDF file into a separate file with GhostScript. It is also quite simple to split (or “burst”) a PDF file into a separate file for each page using a program called Pdftk. These operations are described in this section.

To join several PDF files together, you can use the following command:

gswin32c -sDEVICE=pdfwrite -sOutputFile="out.pdf" -dNOPAUSE -dBATCH "in1.pdf" "in2.pdf" "in3.pdf" "in4.pdf"

This will join the in*.pdf files into one out.pdf file. You can specify as many input files as you want.

To extract pages from a PDF file and put them into a separate file, you can use the following:

gswin32c.exe -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dFirstPage=a -dLastPage=b -sOutputFile="out.pdf" "in.pdf"

Specify the range of pages to extract by entering page numbers for a and b. You can extract just one page by having a equal to b.

GhostScript itself does not have the ability to split a PDF into separate files for each page. You can either write a bash script that runs the above command for each page in the file or you can use Pdftk to "burst" a PDF into separate pages. Follow the link to part 2 of my PDF manipulation tips for instructions.
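The bash script approach mentioned above could look something like this. It is a minimal sketch, not a polished tool: the function name split_pages and the output naming page_N.pdf are made up for this example, and it assumes you already know the number of pages and pass it in yourself. The GS variable lets you substitute gswin32c.exe (or another command) for gs.

```shell
# split_pages IN.pdf NPAGES
# Runs the page-extraction command once per page, writing page_1.pdf,
# page_2.pdf, and so on. NPAGES must be supplied by the caller.
# Set GS to override the Ghostscript binary (default: gs).
split_pages() {
    infile=$1
    npages=$2
    i=1
    while [ "$i" -le "$npages" ]; do
        "${GS:-gs}" -sDEVICE=pdfwrite -dNOPAUSE -dBATCH \
            -dFirstPage="$i" -dLastPage="$i" \
            -sOutputFile="page_${i}.pdf" "$infile"
        i=$((i + 1))
    done
}

# Example (commented out so pasting this block does not run Ghostscript):
# split_pages in.pdf 12
```

Because each page means a separate GhostScript run over the whole input file, this will be slow for long documents, which is one reason to prefer Pdftk for this job.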

Autocrop PDF white space

If you have a PDF file where the content only takes up a small part of the page, you can crop out this white space automatically. You might want to do this if you wanted to include the PDF content as a graphic in a LaTeX document.

Software prerequisites: You will need the pdfcrop.pl Perl script in order to crop the PDF automatically. To use this on a Windows system, you need to install Perl, GPL GhostScript and MiKTeX. MiKTeX comes with pdfcrop.exe, which seems to be a compiled version of the Perl script, so you do not necessarily need to have pdfcrop.pl. On Windows, I was only able to run pdfcrop.exe from a Windows command prompt; it does not work with the Cygwin Bash shell. Conversely, pdfcrop.pl does not work from the Windows command prompt but works from a Cygwin Bash shell.

Whether you use pdfcrop.exe or pdfcrop.pl, you will need the bin directories of Perl, GhostScript and MiKTeX on your system path. Users on Unix or Linux systems will need pdfcrop.pl, Perl, GPL Ghostscript and pdfTeX or XeTeX installed.

Open up a command prompt or Cygwin Bash shell and run the following, substituting pdfcrop.exe (if using command prompt) or pdfcrop.pl (if using Cygwin Bash shell) for pdfcrop.

pdfcrop in.pdf cropped.pdf

This will automatically crop in.pdf to remove all white space and save it as cropped.pdf.

Sometimes you do not want to crop the white space completely; you might want to leave some margins around the PDF content. You can add margins with the --margins option:

pdfcrop --margins "1" in.pdf cropped.pdf

This will include a small margin of 1 bp (a "big point", which is 1/72 inch) around each side.

Include or restore PDFmark DOCINFO after GhostScript resets it

When you use GhostScript to process PDF files for any of the operations on this page, it seems to remove the PDFmark DOCINFO information you originally had and replace it with its own. Most annoying is the fact that the PDF creator field is assigned to GhostScript's author, which can confuse your readers about the authorship of your PDF. (For what it is worth, the PDF creator field seems to refer to the software used to create the PDF file, but this can still be confusing for those not aware of it.) You can get GhostScript to include your own custom PDFmark DOCINFO (such as Author, Title, Keywords, Creator and Producer).

Create a file named docinfo.txt and paste the following into it:

[ /Author (string)
/Creator (string)
/Producer (string)
/Title (string)
/Subject (string)
/Keywords (string)
/DOCINFO pdfmark

Replace the word string in parentheses with your own values.

Once you have made the file, then run GhostScript:

gswin32c.exe -sDEVICE=pdfwrite -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf" docinfo.txt

You should be able to combine this with any of the other operations on this page; just put docinfo.txt as the last input file. Ghostscript merely concatenates the DOCINFO into the final file.
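If you process many files, the docinfo.txt file described above can be generated from a script instead of edited by hand. The following is a minimal sketch for a Bash-like shell; the function name write_docinfo and the example values are placeholders I made up, and only the Author and Title keys are filled in (the other keys from the template can be added the same way).

```shell
# write_docinfo FILE AUTHOR TITLE
# Writes a pdfmark DOCINFO block to FILE.
# Caveat: the values must not contain unbalanced parentheses or
# backslashes, which are special inside PostScript strings.
write_docinfo() {
    cat > "$1" <<EOF
[ /Author ($2)
/Title ($3)
/DOCINFO pdfmark
EOF
}

# Creates docinfo.txt in the current directory with placeholder values.
write_docinfo docinfo.txt "A. Author" "My Manuscript Title"
```

The resulting file is then passed to GhostScript as the last input file, exactly as in the command above.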

1) Be careful when installing PDFCreator; although it is open source, the installer deceptively installs unwanted browser toolbars. You have to use the custom install options, deselect the toolbar option and then on a subsequent screen, deselect another toolbar option.

Discussion

Frog, 2010/07/21 07:07
A couple of notes:

1. I found (at least for my pdf) that GS converted a light yellow shaded background to a light blue shaded background when using -dPDFSETTINGS=/printer but with -dPDFSETTINGS=/prepress it remained the original colour... so the differences between the two might be something to do with colour profiles? Not sure how/why it'd convert yellow to blue though!

2. I tried setting -dAutoFilter...Images=false but leaving out a specific image filter to see what happens (in the hope that it'd maintain whatever compression algorithm I was using for each image, i.e. images inserted as PNG would be resized as PNG and images inserted as JPEG resized as JPEG) but it seems to have applied a fairly ugly lossy filter in all cases this way.

For the record, the command would be something like: gswin32c.exe -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook -dAutoFilterGrayImages=false -dAutoFilterColorImages=false -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"

So in terms of using lossless compression, it may be that auto is actually re-compressing them in their original compression algorithm anyway? (I don't think I'm seeing any JPEG artefacts in my PNG images using this method) So if images are *already* in a lossless compression format (or at least PNG), it *might* be keeping that format, which is nice... not sure how to confirm this though :/

Anyway, thanks heaps for the info, very useful (:
Peter Yu, 2010/07/24 02:28
Thanks for the additional info. I tried what you did, leaving out the -d[…]ImageFilter options and tried it with a file that contained only PNG images. On my version of Ghostscript, at least, the PNGs looked like they are compressed.

It looks to me that as long as GhostScript has to recompress, it will use a uniform type of compression for all images of a certain type (colour, gray or mono). That said, there might be a bug or a poorly documented feature that can allow this type of functionality, but I don't know what it is at this time.
David, 2010/08/11 19:09
This is one of the most useful posts I have seen in a long time.

Thanks for this.
Peter Yu, 2010/08/12 17:46
I am glad that it is helpful. When I have more time I will expand the page a bit with a few other useful processes that can be done with GhostScript.
Lucian Popescu, 2010/12/13 11:11
For compression purposes, I have played around with

-dDisplayResolution=DESIRED_DPI_RESOLUTION

I've tried with a value of 50, without altering the quality too much.

Good luck!
Michael Watts, 2013/06/09 12:00
Great article; just what I was looking for.

I was able to reduce the size of my 88-page paper, where most of each page is taken up with a high-res photo of an apple, by 90%.

Coming out of Xetex, the file was 26MB; running with -dPDFSETTINGS=/ebook got it down to 12MB; using /screen gets it down to 2.5MB.

Thanks again. Btw, great css on this site
Shivashankar D, 2013/06/25 07:12
Hi,

My name is Shivashankar. I saw your website, which is very informative and has lots of documentation.

I have a PDF which contains two pages on an A4 sheet and I am trying to separate them by cutting it in the middle. I am able to extract the left part, but for the right part I can see only empty pages after extraction. The commands which I used to extract the pages are as follows. The input file is not a scanned one.

gs -o left-sections.pdf -sDEVICE=pdfwrite -g4210x5950 -c "<</PageOffset [0 0]>> setpagedevice" -f bali-k.pdf --> success

gs -o right-sections.pdf -sDEVICE=pdfwrite -g4210x5950 -c "<</PageOffset [421 0]>> setpagedevice" -f bali-k.pdf --> failure

Regards
Shivashankar
Yves, 2013/11/02 14:23
Your post is really excellent. I have found that using the gs compression on PDFs generated from LaTeX source created a shift in the color rendition of pictures. (They are "darker" after compression than before.) However, the PDF file made with LaTeX provides exactly the same rendition as the original pictures.
Is there an option that forces the use of the color profiles (or forces using it)?

About Peter Yu

I am a research and development professional with expertise in the areas of image processing, remote sensing and computer vision. I received BASc and MASc degrees in Systems Design Engineering at the University of Waterloo. My working experience covers industries ranging from district energy to medical imaging to cinematic visual effects. I like to dabble in 3D artwork, I enjoy cycling recreationally and I am interested in sustainable technology.

Feel free to contact me with any questions about this site at [user]@[host] where [user]=web and [host]=peteryu.ca

Copyright © 1997 - 2014 Peter Yu