Publishing academic papers with LaTeX / pdfTeX often requires working with PDFs. During grad school at the University of Waterloo, I published mainly in the IEEE Geoscience and Remote Sensing journals, which accepted PDF format submissions with LaTeX source files. On this page I will document some tips and tricks I learned as I handled (or, on some days, physically dueled with) PDF files. This will hopefully save you some headaches.
Part 1 of this document focuses on things you can do with GhostScript. Part 2 focuses on things you can do with other programs like Pdftk.
For these tips, you will need to install at least GPL Ghostscript, an open source software package that can manipulate PostScript and PDF files. Once it is installed, add the bin directory of the Ghostscript installation to your system path. The examples below use the Windows version of Ghostscript, gswin32c.exe; on other systems, replace it with gs.
If you use GhostScript as described here to process a PDF which has images with alpha channels (transparency), the images may not show up in the resulting PDF (there seems to be an error in the image). If these are PDF files you created yourself with pdfTeX, then most likely the original images that you included with the \includegraphics command had alpha channels. Just remove the alpha channels from the original images and recreate the PDF with pdfTeX before using GhostScript.
Oftentimes the PDF of your manuscript produced straight from pdfTeX is very large. This is especially the case when you are preparing for submission to a journal, which will require that color graphics be a certain dots per inch (dpi). IEEE Geoscience and Remote Sensing journals, for example, require color images to be 400 dpi and grayscale graphics be 300 dpi. This means that for a color graphic that you want to be 3 inches square, the image has to be 1200 pixels x 1200 pixels in size. If you have a lot of images, which is often the case in image processing research, then the PDF file can be quite large. This is fine for final submission to the journal, but what if you needed to distribute a draft to your co-authors, or wanted to put an online version on your website (e.g. my Research page) without straining your bandwidth?
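The pixel arithmetic above is simply the size in inches multiplied by the dpi. As a minimal sketch for a POSIX shell, where pixels_for is a hypothetical helper name and whole-number inches are assumed:

```shell
# pixels_for INCHES DPI
# Prints the pixel length needed for a graphic INCHES long at DPI dots per inch.
# Hypothetical helper; integer arithmetic only, so INCHES must be a whole number.
pixels_for() {
  echo $(( $1 * $2 ))
}

pixels_for 3 400   # colour graphic, 3 inches at 400 dpi
pixels_for 3 300   # grayscale graphic, 3 inches at 300 dpi
```

The two calls print 1200 and 900, so the same 3 inch figure needs fewer pixels per side when it is grayscale.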
Fortunately, Ghostscript offers an easy way to recompress a PDF. It uses lossy compression (unless you tell it otherwise) and resamples the images to a lower dpi, so the image quality will suffer but the file will be much smaller. Just run the following command:
gswin32c.exe -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"
The above will compress “infile.pdf” to “outfile.pdf” using the /ebook preset. You can specify several different presets for -dPDFSETTINGS, which affect the dpi to which the images are resampled:
Preset | dpi |
---|---|
/screen | 72 dpi |
/ebook | 150 dpi |
/printer | 300 dpi |
/prepress | 300 dpi |
/default | Default dpi setting |
For the purposes of compressing a PDF made by pdfTeX, use /screen or /ebook. I like /ebook because it is higher quality but still smaller than the original file. The /prepress and /printer presets keep images at fairly high resolution (300 dpi), which is not what you want for making a file smaller. I am unsure of what the difference is between /printer and /prepress. I also do not know how to specify an arbitrary dpi setting. I will post it here when I find out.
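A possible way to request an arbitrary dpi is to set the image resolution parameters of pdfwrite directly instead of relying on a preset. The following is a minimal sketch, assuming the standard distiller parameters -dDownsampleColorImages, -dColorImageResolution and their Gray counterparts behave as documented in your Ghostscript version. The names GS and resample_pdf are hypothetical and only there for illustration. As far as I understand, pdfwrite only downsamples an image whose resolution exceeds the target by a threshold factor, so images already close to the target may be left alone.

```shell
# GS names the Ghostscript executable: gswin32c.exe on Windows, gs elsewhere.
GS="${GS:-gs}"

# resample_pdf DPI INFILE OUTFILE   (hypothetical helper name)
# Asks pdfwrite to downsample colour and grayscale images to DPI.
# Assumption: the distiller parameters below behave as documented in your
# Ghostscript version; inspect the output before relying on it.
resample_pdf() {
  "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dBATCH \
    -dDownsampleColorImages=true -dColorImageResolution="$1" \
    -dDownsampleGrayImages=true -dGrayImageResolution="$1" \
    -sOutputFile="$3" "$2"
}

# Example (not run here): resample_pdf 200 infile.pdf outfile.pdf
```

Mono images have analogous parameters (-dDownsampleMonoImages and -dMonoImageResolution); they are left out here because black and white line art usually wants a much higher resolution than photographs.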
As an example of the file size difference you can achieve with compression, consider a few papers that I have on this website:
File Name | Original Size | Compressed Size |
---|---|---|
amsrqs_main.pdf | 1.6 MB | 594 KB |
icesynth_ii.pdf | 2.8 MB | 418 KB |
magic_cjrs.pdf | 41 MB | 1 MB |
The magic_cjrs.pdf file was originally very large; it has a lot of high resolution SAR imagery that was included at nearly full resolution in the original file. My co-author on that paper had a lot of trouble sending it back and forth to our advisor for feedback due to the size. If I had known back then what I know now, this problem could have been avoided.
Windows users can also try using PDFCreator 1) (see footnote before installing) to reprint a PDF generated by pdfTeX. While this makes the file small, it also completely garbles the text: you can read it on screen, but the underlying text has been replaced by gibberish, so you cannot search or copy it properly. Search engines like Google cannot read it properly either, so no one can find your file.
There are some simple alternatives to re-compressing that I should mention:
If you are using GhostScript to process your PDF, you might sometimes want it to use lossless compression instead of lossy compression. In this case, you will need to specify some additional options: two each for colour and grayscale images, and one for mono images.
For colour images:
-dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode
For grayscale images:
-dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode
Finally, for mono (black and white) images, specify:
-dMonoImageFilter=/FlateEncode
These options will make it possible to use the lossless compression from GhostScript. They let you decide how you want to compress different types of images separately. The -dAutoFilter[…]Images options tell Ghostscript not to choose a compression method automatically, while the -d[…]ImageFilter options tell GhostScript which compression method you want. I believe Flate is the same as the Deflate algorithm, which is lossless and used in PNG images.
So if you have a file with gray scale and colour images, and you want to, for example, resample to 150 dpi with lossless compression, then use the following command:
gswin32c.exe -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"
It is fairly simple to concatenate (join / merge) several PDF files together and extract certain pages from a PDF file into a separate file with GhostScript. It is also quite simple to split (or “burst”) a PDF file into a separate file for each page using a program called Pdftk. These operations are described in this section.
To join several PDF files together, you can use the following command:
gswin32c -sDEVICE=pdfwrite -sOutputFile="out.pdf" -dNOPAUSE -dBATCH "in1.pdf" "in2.pdf" "in3.pdf" "in4.pdf"
This will join the in*.pdf files into one out.pdf file. You can specify as many input files as you want.
To extract pages from a PDF file and put them into a separate file, you can use the following:
gswin32c.exe -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dFirstPage=a -dLastPage=b -sOutputFile="out.pdf" "in.pdf"
Specify the range of pages to extract by entering page numbers for a and b. You can extract just one page by having a equal to b.
GhostScript itself does not have the ability to split a PDF into separate files for each page. You can either write a bash script that runs the above command for each page in the file or you can use Pdftk to "burst" a PDF into separate pages. Follow the link to part 2 of my PDF manipulation tips for instructions.
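The bash script idea above can be sketched as a loop that runs the page extraction command once per page. This is a minimal sketch, assuming you already know the number of pages and pass it in yourself; split_pages, GS and the page_N.pdf output names are hypothetical choices made for illustration.

```shell
# GS names the Ghostscript executable: gswin32c.exe on Windows, gs elsewhere.
GS="${GS:-gs}"

# split_pages INFILE NPAGES   (hypothetical helper name)
# Runs the page extraction command once per page, writing page_1.pdf,
# page_2.pdf, ... Assumption: NPAGES is supplied by the caller; this
# sketch does not try to detect the page count itself.
split_pages() {
  page=1
  while [ "$page" -le "$2" ]; do
    "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dBATCH \
      -dFirstPage="$page" -dLastPage="$page" \
      -sOutputFile="page_${page}.pdf" "$1"
    page=$((page + 1))
  done
}

# Example (not run here): split_pages in.pdf 12
```

Each pass starts Ghostscript again and reads the whole input, so this is slow for long documents.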
If you have a PDF file where the content only takes up a small part of the page, you can crop out this white space automatically. You might want to do this if you wanted to include the PDF content as a graphic in a LaTeX document.
Software pre-requisites: You will need the pdfcrop.pl Perl script in order to crop the PDF automatically. To use this on a Windows system, you need to install Perl, GPL GhostScript and MiKTeX. MiKTeX comes with pdfcrop.exe, which seems to be a compiled version of the Perl script, so you do not necessarily need to have pdfcrop.pl. On Windows, I was only able to run pdfcrop.exe from a Windows command prompt; it does not work with the Cygwin Bash shell. Conversely, pdfcrop.pl does not work from the Windows command prompt but works from a Cygwin Bash shell.
Whether you use pdfcrop.exe or pdfcrop.pl, you will need the bin directories of Perl, GhostScript and MiKTeX on your system path. Users on Unix or Linux systems will need pdfcrop.pl, Perl, GPL Ghostscript and pdfTeX or XeTeX installed.
Open up a command prompt or Cygwin Bash shell and run the following, substituting pdfcrop.exe (if using command prompt) or pdfcrop.pl (if using Cygwin Bash shell) for pdfcrop.
pdfcrop in.pdf cropped.pdf
This will automatically crop in.pdf to remove all white space and save it as cropped.pdf.
Sometimes you do not want to crop the white space completely; you might want to leave some margins around the PDF content. You can add margins with the --margins option:
pdfcrop --margins "1" in.pdf cropped.pdf
This will add a small margin of 1 bp (a "big point", which is 1/72 inch) on each side.
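Since the margin is given in bp, a margin you think of in millimetres has to be converted first. A minimal sketch, assuming awk is available; mm_to_bp is a hypothetical helper name, and it relies only on 1 bp = 1/72 inch and 1 inch = 25.4 mm:

```shell
# mm_to_bp MM
# Converts millimetres to big points (1 bp = 1/72 inch, 1 inch = 25.4 mm),
# rounded to the nearest whole bp. Hypothetical helper name.
mm_to_bp() {
  awk -v mm="$1" 'BEGIN { printf "%.0f\n", mm * 72 / 25.4 }'
}

# Example (not run here): a margin of about 5 mm on every side.
# pdfcrop --margins "$(mm_to_bp 5)" in.pdf cropped.pdf
```

Rounding to a whole bp keeps the sketch simple; the error is at most half a point, which is well under a fifth of a millimetre.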
When you use GhostScript to process PDF files for any of the operations on this page, it seems to remove the PDFmark DOCINFO information you originally had and replace it with its own. Most annoying is the fact that the PDF creator field is assigned to GhostScript's author, which can confuse your readers about the authorship of your PDF. (For what it is worth, the PDF creator field seems to refer to the software used to create the PDF file, but this can still be confusing for those not aware of it.) Fortunately, you can get GhostScript to include your own custom PDFmark DOCINFO (such as Author, Title, Keywords, Creator and Producer).
Create a file named docinfo.txt and paste the following into it:
[ /Author (string) /Creator (string) /Producer (string) /Title (string) /Subject (string) /Keywords (string) /DOCINFO pdfmark
Replace the word string in parentheses with your own values.
Once you have made the file, then run GhostScript:
gswin32c.exe -sDEVICE=pdfwrite -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf" docinfo.txt
You should be able to combine this with any of the other operations on this page; just put docinfo.txt as the last input file. Ghostscript merely concatenates the DOCINFO into the final file.
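As a minimal sketch of combining the two steps, the following writes a small docinfo file and then passes it as the last input file to the /ebook compression command from earlier on this page. The helper names write_docinfo and compress_with_docinfo and the variable GS are hypothetical, and only the Author and Title fields are filled in to keep it short.

```shell
# GS names the Ghostscript executable: gswin32c.exe on Windows, gs elsewhere.
GS="${GS:-gs}"

# write_docinfo FILE AUTHOR TITLE   (hypothetical helper name)
# Writes a pdfmark DOCINFO block to FILE. Assumption: AUTHOR and TITLE
# contain no unbalanced parentheses or backslashes, which would need
# escaping inside a PostScript string.
write_docinfo() {
  cat > "$1" <<EOF
[ /Author ($2)
  /Title ($3)
  /DOCINFO pdfmark
EOF
}

# compress_with_docinfo INFILE OUTFILE DOCINFO   (hypothetical helper name)
# Recompresses INFILE with the /ebook preset; DOCINFO goes last.
compress_with_docinfo() {
  "$GS" -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook \
    -sOutputFile="$2" -dNOPAUSE -dBATCH "$1" "$3"
}

# Example (not run here):
# write_docinfo docinfo.txt "Jane Doe" "My Paper"
# compress_with_docinfo infile.pdf outfile.pdf docinfo.txt
```

The author and title in the example are placeholders; substitute your own values.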
Discussion
1. I found (at least for my pdf) that GS converted a light yellow shaded background to a light blue shaded background when using -dPDFSETTINGS=/printer but with -dPDFSETTINGS=/prepress it remained the original colour... so the differences between the two might be something to do with colour profiles? Not sure how/why it'd convert yellow to blue though!
2. I tried setting -dAutoFilter...Images=false but leaving out a specific image filter to see what happens (in the hope that it'd maintain whatever compression algorithm I was using for each image, i.e. images inserted as PNG would be resized as PNG and images inserted as JPEG resized as JPEG) but it seems to have applied a fairly ugly lossy filter in all cases this way.
For the record, the command would be something like: gswin32c.exe -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dPDFSETTINGS=/ebook -dAutoFilterGrayImages=false -dAutoFilterColorImages=false -sOutputFile="outfile.pdf" -dNOPAUSE -dBATCH "infile.pdf"
So in terms of using lossless compression, it may be that auto is actually re-compressing them in their original compression algorithm anyway? (I don't think I'm seeing any JPEG artefacts in my PNG images using this method) So if images are *already* in a lossless compression format (or at least PNG), it *might* be keeping that format, which is nice... not sure how to confirm this though :/
Anyway, thanks heaps for the info, very useful (:
It looks to me that as long as GhostScript has to recompress, it will use a uniform type of compression for all images of a certain type (colour, gray or mono). That said, there might be a bug or a poorly documented feature that allows this type of functionality, but I don't know what it is at this time.
Thanks for this.
-dDisplayResolution=DESIRED_DPI_RESOLUTION
I've tried with a value of 50, without altering the quality too much.
Good luck!
I was able to reduce the size of my 88-page paper, where most of each page is taken up with a high-res photo of an apple, by 90%.
Coming out of Xetex, the file was 26MB; running with -dPDFSETTINGS=/ebook got it down to 12MB; using /screen gets it down to 2.5MB.
Thanks again. Btw, great css on this site
My name is Shivashankar. I saw your website, which is very informative and has lots of documentation.
I have a PDF which contains two pages on an A4 sheet and I am trying to separate them by cutting it in the middle. I am able to extract the left part, but for the right part I can see only empty pages after extraction. The input file is not a scanned one. The commands I used to extract the pages are as follows.
gs -o left-sections.pdf -sDEVICE=pdfwrite -g4210x5950 -c "<</PageOffset [0 0]>> setpagedevice" -f bali-k.pdf --> success
gs -o right-sections.pdf -sDEVICE=pdfwrite -g4210x5950 -c "<</PageOffset [421 0]>> setpagedevice" -f bali-k.pdf --> failure
Regards
Shivashankar
Is there an option that forces the use of the color profiles (or forces using it)?
http://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=1681c7ebb5a338002d5f7dd8da9bffda675f0656