PDF manipulation tips, Part 1

Publishing academic papers with LaTeX / pdfTeX often requires working with PDFs. During grad school at the University of Waterloo, I published mainly in the IEEE Geoscience and Remote Sensing journals, which accepted PDF format submissions with LaTeX source files. On this page I will document some tips and tricks I learned as I handled (or, on some days, physically dueled with) PDF files. This will hopefully save you some headaches.

Be sure to check out the other parts of my PDF manipulation tips:

Part 1: GhostScript

Part 1 of this document focuses on things you can do with GhostScript. Part 2 focuses on things you can do with other programs like Pdftk.

For these tips, you will need to install at least GPL GhostScript, an open source software package that can manipulate PostScript and PDF files. Once installed, add the bin directory of the Ghostscript installation to your system path. The examples I give below are for the Windows version of Ghostscript, gswin32c.exe; on other systems, replace it with gs.

If you use GhostScript as described here to process a PDF which has images with alpha channels (transparency), the images may not show up in the resulting PDF (there seems to be an error in the image). If these are PDF files you created yourself with pdfTeX, then most likely the original images that you included with the \includegraphics command had alpha channels. Just remove the alpha channels from the original images and recreate the PDF with pdfTeX before using GhostScript.

Compress a large PDF for distribution

Oftentimes the PDF of your manuscript produced straight from pdfTeX is very large. This is especially the case when you are preparing for submission to a journal, which will require that color graphics be a certain dots per inch (dpi). IEEE Geoscience and Remote Sensing journals, for example, require color images to be 400 dpi and grayscale graphics be 300 dpi. This means that for a color graphic that you want to be 3 inches square, the image has to be 1200 pixels x 1200 pixels in size. If you have a lot of images, which is often the case in image processing research, then the PDF file can be quite large. This is fine for final submission to the journal, but what if you needed to distribute a draft to your co-authors, or wanted to put an online version on your website (e.g. my Research page) without straining your bandwidth?
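The pixel requirement above is simply the print size multiplied by the resolution. Here is a minimal shell sketch of that arithmetic; the function name pixels_needed is my own invention for illustration, and it only handles whole-number inches.

```shell
# pixels_needed INCHES DPI
# Prints the number of pixels needed along one edge of a graphic that is
# INCHES wide when printed at DPI dots per inch.
# (Hypothetical helper; shell integer arithmetic, so whole inches only.)
pixels_needed() {
    echo $(( $1 * $2 ))
}
```

Running pixels_needed 3 400 gives the 1200 pixels quoted above for a 3 inch colour graphic.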

Fortunately, Ghostscript offers an easy way to recompress a PDF. It uses lossy compression (unless you tell it otherwise) and resamples images to a lower dpi, so the image quality will suffer but the file will be much smaller. Just run the following command:

gswin32c.exe -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"

The above will compress “infile.pdf” to “outfile.pdf” using the /ebook preset. You can specify several different presets for -dPDFSETTINGS, which affect the dpi to which the images are resampled:

Preset       Resolution
/screen      72 dpi
/ebook       150 dpi
/printer     300 dpi
/prepress    300 dpi
/default     Default dpi setting

For the purposes of compressing a PDF made by pdfTeX, use /screen or /ebook. I like /ebook because it is higher quality but still smaller than the original file. The /prepress and /printer presets keep images at fairly high resolution (300 dpi), which is not what you want for making a file smaller. I am unsure of what the difference is between /printer and /prepress. I also do not know how to specify an arbitrary dpi setting. I will post it here when I find out.
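One possible answer to the arbitrary dpi question: Ghostscript's pdfwrite device accepts the distiller parameters ColorImageResolution and GrayImageResolution (together with DownsampleColorImages and DownsampleGrayImages), which can be set after the preset to override its resolution. The sketch below wraps this in a shell function. The function name compress_pdf_dpi and the GS environment variable are my own assumptions, not part of Ghostscript. As far as I understand the documentation, images are only downsampled when their resolution exceeds the target by a threshold factor, so treat the result as approximate.

```shell
# compress_pdf_dpi INFILE OUTFILE DPI
# Recompress INFILE into OUTFILE, asking Ghostscript to downsample colour
# and grayscale images to roughly DPI dots per inch.
# Assumptions: the Ghostscript binary is "gs" unless the GS environment
# variable names another one (e.g. GS=gswin32c.exe on Windows); the
# function name is hypothetical.
compress_pdf_dpi() {
    "${GS:-gs}" -sDEVICE=pdfwrite -dMaxSubsetPct=100 \
        -dPDFSETTINGS=/ebook \
        -dDownsampleColorImages=true -dColorImageResolution="$3" \
        -dDownsampleGrayImages=true -dGrayImageResolution="$3" \
        -sOutputFile="$2" -dNOPAUSE -dBATCH "$1"
}
```

For example, the following would aim for about 100 dpi:

compress_pdf_dpi infile.pdf outfile.pdf 100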

As an example of the file size difference you can achieve with compression, consider a few papers that I have on this website:

File Name          Original Size    Compressed Size
amsrqs_main.pdf    1.6 MB           594 KB
icesynth_ii.pdf    2.8 MB           418 KB
magic_cjrs.pdf     41 MB            1 MB

The magic_cjrs.pdf file was originally very large; it has a lot of high resolution SAR imagery that was included at nearly full resolution in the original file. My co-author on that paper had a lot of trouble sending it back and forth to our advisor for feedback due to the size. If I had known back then what I know now, this problem could have been avoided.
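To check how much a file shrank, you can compare the byte counts before and after. The following is a small sketch using only wc, tr and shell arithmetic; the function name size_report is hypothetical.

```shell
# size_report ORIGINAL COMPRESSED
# Prints both file sizes in bytes and the compressed size as a percentage
# of the original (integer arithmetic, rounded down).
# (Hypothetical helper name.)
size_report() {
    # tr strips the padding some wc implementations put around the number.
    orig_bytes=$(wc -c < "$1" | tr -d ' ')
    comp_bytes=$(wc -c < "$2" | tr -d ' ')
    if [ "$orig_bytes" -eq 0 ]; then
        echo "$1 is empty" >&2
        return 1
    fi
    echo "$1: $orig_bytes bytes"
    echo "$2: $comp_bytes bytes ($(( comp_bytes * 100 / orig_bytes ))% of original)"
}
```

Call it with the original file first and the compressed file second:

size_report infile.pdf outfile.pdf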

Windows users can also try using PDFCreator 1) (see footnote before installing) to reprint a PDF generated by pdfTeX. While this makes the file small, it also completely garbles the text: you can read it, but the underlying text is replaced by gibberish, so you cannot search or copy it properly. Search engines like Google cannot read it properly either, so no one can find your file.

There are some simple alternatives to re-compressing that I should mention:

  1. Use low resolution or placeholder images while preparing your draft and/or while producing your online version. This requires that you produce and keep track of two versions of each image.
  2. When using pdfTeX, you can include JPEG images, which you can compress to be smaller beforehand. They stay small in the final PDF file. The quality obviously goes down, which is why I personally always include PNG files with lossless compression, which are then converted to whatever format the journal needs at the end. I have noticed that with my submissions to IEEE, my original lossless colour images are recompressed by them; so it is best to give them lossless images in the first place to avoid possible further losses of quality.

Make GhostScript use lossless compression

If you are using GhostScript to process your PDF, you sometimes might want it to use lossless compression instead of lossy compression. In this case, you will need to specify some additional options. There are two options you need to specify for colour and gray scale images.

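For colour images:

-dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode

For grayscale images:

-dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode

Finally, for mono (black and white) images, specify:

-dMonoImageFilter=/FlateEncode
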
These options make GhostScript use lossless compression, and they let you choose the compression method separately for each type of image. The -dAutoFilter[…]Images options tell Ghostscript not to choose a compression method automatically, while the -d[…]ImageFilter options tell GhostScript which compression method you want. Flate is PDF's name for the Deflate (zlib) algorithm, which is lossless and is also used in PNG images.

So if you have a file with gray scale and colour images, and you want to, for example, resample to 150 dpi with lossless compression, then use the following command:

gswin32c.exe -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"

Concatenating, merging, splitting and extracting pages from PDF files

It is fairly simple to concatenate (join / merge) several PDF files together and extract certain pages from a PDF file into a separate file with GhostScript. It is also quite simple to split (or “burst”) a PDF file into a separate file for each page using a program called Pdftk. These operations are described in this section.

To join several PDF files together, you can use the following command:

gswin32c -sDEVICE=pdfwrite -sOutputFile="out.pdf" -dNOPAUSE -dBATCH "in1.pdf" "in2.pdf" "in3.pdf" "in4.pdf"

This will join the in*.pdf files into one out.pdf file. You can specify as many input files as you want.

To extract pages from a PDF file and put them into a separate file, you can use the following:

gswin32c.exe -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dFirstPage=a -dLastPage=b -sOutputFile="out.pdf" "in.pdf"

Specify the range of pages to extract by entering page numbers for a and b. You can extract just one page by having a equal to b.

GhostScript itself does not have the ability to split a PDF into separate files for each page. You can either write a bash script that runs the above command for each page in the file or you can use Pdftk to "burst" a PDF into separate pages. Follow the link to part 2 of my PDF manipulation tips for instructions.
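The per-page script mentioned above could look something like the following sketch. It assumes you already know the number of pages and pass it in yourself; the function name burst_pdf, the output naming scheme and the GS environment variable are all my own choices for illustration.

```shell
# burst_pdf INFILE NPAGES [PREFIX]
# Writes each of the first NPAGES pages of INFILE to its own file:
# PREFIX_001.pdf, PREFIX_002.pdf, ... (PREFIX defaults to "page").
# Assumptions: you supply the page count; the Ghostscript binary is "gs"
# unless the GS environment variable says otherwise (e.g. gswin32c.exe);
# the function name is hypothetical.
burst_pdf() {
    infile=$1
    npages=$2
    prefix=${3:-page}
    page=1
    while [ "$page" -le "$npages" ]; do
        outfile=$(printf '%s_%03d.pdf' "$prefix" "$page")
        # Same extraction command as above, with a = b = current page.
        "${GS:-gs}" -sDEVICE=pdfwrite -dNOPAUSE -dBATCH \
            -dFirstPage="$page" -dLastPage="$page" \
            -sOutputFile="$outfile" "$infile" || return 1
        page=$(( page + 1 ))
    done
}
```

For a 12 page file, you would run:

burst_pdf in.pdf 12

This runs GhostScript once per page, so it will be slow on long documents.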

Autocrop PDF white space

If you have a PDF file where the content only takes up a small part of the page, you can crop out this white space automatically. You might want to do this if you wanted to include the PDF content as a graphic in a LaTeX document.

Software prerequisites: You will need the pdfcrop.pl Perl script to crop the PDF automatically. To use this on a Windows system, you need to install Perl, GPL GhostScript and MiKTeX. MiKTeX comes with pdfcrop.exe, which seems to be a compiled version of the Perl script, so you do not necessarily need pdfcrop.pl. On Windows, I was only able to run pdfcrop.exe from a Windows command prompt; it does not work with the Cygwin Bash shell. Conversely, pdfcrop.pl does not work from the Windows command prompt but works from a Cygwin Bash shell.

Whether you use pdfcrop.exe or pdfcrop.pl, you will need the bin directories of Perl, GhostScript and MiKTeX on your system path. Users on Unix or Linux systems will need pdfcrop.pl, Perl, GPL Ghostscript and pdfTeX or XeTeX installed.

Open up a command prompt or Cygwin Bash shell and run the following, substituting pdfcrop.exe (if using command prompt) or pdfcrop.pl (if using Cygwin Bash shell) for pdfcrop.

pdfcrop in.pdf cropped.pdf

This will automatically crop in.pdf to remove all white space and save it as cropped.pdf.

Sometimes you do not want to crop the white space completely; you might want to leave some margins around the PDF content. You can add margins with the --margins option:

pdfcrop --margins "1" in.pdf cropped.pdf

This will include a small margin of 1 bp (a "big point", which is exactly 1/72 inch) around each side.
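If you think of margins in millimetres rather than bp, the conversion follows from 1 inch = 25.4 mm = 72 bp. A minimal sketch, using awk for the floating-point math; the function name mm_to_bp is hypothetical.

```shell
# mm_to_bp MILLIMETRES
# Converts a length in millimetres to big points (bp), the unit used by
# the pdfcrop --margins option: 1 inch = 25.4 mm = 72 bp.
# (Hypothetical helper name; prints the result to two decimal places.)
mm_to_bp() {
    awk -v mm="$1" 'BEGIN { printf "%.2f\n", mm * 72 / 25.4 }'
}
```

A 5 mm margin comes out at about 14.17 bp, so, assuming your version of pdfcrop accepts fractional margins, you could run:

pdfcrop --margins "14.17" in.pdf cropped.pdf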

Include or restore PDFmark DOCINFO after GhostScript resets it

When you use GhostScript to process PDF files for any of the operations on this page, it seems to remove the PDFmark DOCINFO information you originally had and replace it with its own. Most annoying is that the PDF creator field is assigned to GhostScript's author, which can confuse your readers about the authorship of your PDF. (For what it is worth, the PDF creator field seems to refer to the software used to create the PDF file, but this can still be confusing for those not aware of it.) Fortunately, you can get GhostScript to include your own custom PDFmark DOCINFO (such as Author, Title, Keywords, Creator and Producer).

Create a file named docinfo.txt and paste the following into it:

[ /Author (string)
/Creator (string)
/Producer (string)
/Title (string)
/Subject (string)
/Keywords (string)
/DOCINFO pdfmark

Replace the word string in parentheses with your own values.

Once you have made the file, then run GhostScript:

gswin32c.exe -sDEVICE=pdfwrite -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf" docinfo.txt

You should be able to combine this with any of the other operations on this page; just put docinfo.txt as the last input file. Ghostscript merely concatenates the DOCINFO into the final file.
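If you process many files, you may prefer to generate docinfo.txt rather than edit it by hand. The sketch below prints a DOCINFO pdfmark with just the Author and Title fields and escapes the characters that are special inside a PostScript string (backslashes and parentheses). The function names ps_escape and write_docinfo are hypothetical, and I have only considered plain ASCII values.

```shell
# ps_escape STRING
# Escapes backslashes and parentheses so STRING can be placed inside a
# PostScript (string) literal. Plain ASCII input is assumed.
ps_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/(/\\(/g' -e 's/)/\\)/g'
}

# write_docinfo AUTHOR TITLE
# Prints a DOCINFO pdfmark for the given author and title on standard
# output. (Hypothetical helper; add other fields the same way.)
write_docinfo() {
    printf '[ /Author (%s)\n' "$(ps_escape "$1")"
    printf '/Title (%s)\n' "$(ps_escape "$2")"
    printf '/DOCINFO pdfmark\n'
}
```

Redirect the output to a file and pass that file to GhostScript as the last input, as described above:

write_docinfo "Your Name" "Your Title" > docinfo.txt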

1) Be careful when installing PDFCreator; although it is open source, the installer deceptively installs unwanted browser toolbars. You have to use the custom install options, deselect the toolbar option and then on a subsequent screen, deselect another toolbar option.

About Peter Yu

I am a research and development professional with expertise in the areas of image processing, remote sensing and computer vision. I received BASc and MASc degrees in Systems Design Engineering at the University of Waterloo. My working experience covers industries ranging from district energy to medical imaging to cinematic visual effects. I like to dabble in 3D artwork, I enjoy cycling recreationally and I am interested in sustainable technology.

Feel free to contact me with any questions about this site at [user]@[host] where [user]=web and [host]=peteryu.ca

Copyright © 1997 - 2021 Peter Yu