
Relatively sane conversion of PDFs to web-ready JPGs using ImageMagick.

Thursday, September 15th, 2011

Some people, when confronted with a problem, think “I know,
I’ll use ImageMagick.” Now they have two problems.
*

For one of the sites I’m maintaining, a lot of content is generated directly from (more or less print-ready) PDFs. The only free tool I’ve been able to find that can convert PDFs to decent-quality JPGs or PNGs is ImageMagick.

But even when you’ve got ImageMagick’s convert and mogrify commands installed, conversion of PDFs still requires some careful tuning, that is, careful selection of arguments to convert. Also: a sacrificial chicken and lots of patience. Anyway, here’s what I ended up with. Most of this is also available in my clj-imajine Clojure library.

Color space.

Many web browsers do not support any color space other than RGB/sRGB. If your PDFs are in the CMYK color space (usual for print) or any other color space, the resulting JPGs will look “weird” in many applications and web browsers: some viewers just show a blank image and others completely mess up the colors. To make sure the end result is in sRGB, use the option “-colorspace sRGB”.

Color depth.

For much the same reasons, you want to force the output color depth to 8 bits per channel for JPGs. To do that, use the option “-depth 8”.
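To confirm that the two options above actually took effect, the result can be inspected with ImageMagick’s identify command. The following is a minimal sketch, not a definitive check: the helper names is_web_ready and check_jpg are made up for illustration, and accepting Gray alongside sRGB is an assumption about what counts as web-safe.

```shell
#!/bin/sh
# is_web_ready is a hypothetical helper (not part of ImageMagick).
# It succeeds only if the given colorspace and depth match what the
# text above recommends for web-ready JPGs.
is_web_ready() {
    case "$1" in
        sRGB|Gray) ;;      # Gray is accepted here as an assumption
        *) return 1 ;;
    esac
    [ "$2" = "8" ]
}

# check_jpg is also hypothetical. It needs ImageMagick's identify on the PATH.
# %[colorspace] prints the image colorspace and %z prints its bit depth,
# so a typical result looks like "sRGB 8".
check_jpg() {
    info=$(identify -format '%[colorspace] %z' "$1") || return 1
    is_web_ready "${info% *}" "${info#* }"
}
```

After a conversion, something like check_jpg pages.jpg || echo "pages.jpg needs another look" would flag a file that is still CMYK or 16-bit.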

Crop boxes.

PDFs are pretty complex documents, and one potential pitfall is that a page can carry several different indicators of its “boundaries”: the PDF format defines five page boxes (MediaBox, CropBox, BleedBox, TrimBox and ArtBox). I’ve run into a few PDFs where the “right” boundaries were provided by the crop box instead of the media box. A post by Joseph Scott provided the solution: use “-define pdf:use-cropbox=true”.

The final line becomes:

convert -define pdf:use-cropbox=true -colorspace sRGB -depth 8 pages.pdf pages.jpg
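If you run this often, the command above can be wrapped in a small shell function. This is only a sketch: the function name pdf_to_jpg and the DRY_RUN variable are invented for this example, and the options are exactly the ones from the line above, nothing more.

```shell
#!/bin/sh
# pdf_to_jpg is a hypothetical wrapper around the convert command above.
# Usage: pdf_to_jpg input.pdf output.jpg
# Set DRY_RUN to any non-empty value to print the command instead of running it.
pdf_to_jpg() {
    if [ "$#" -ne 2 ]; then
        echo "usage: pdf_to_jpg input.pdf output.jpg" >&2
        return 2
    fi
    # Rebuild the argument list as the full command, options in the same order
    # as in the text.
    set -- convert -define pdf:use-cropbox=true -colorspace sRGB -depth 8 "$1" "$2"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

With DRY_RUN=1 set, pdf_to_jpg pages.pdf pages.jpg only prints the command it would run, which is handy for checking the arguments before letting it loose on a directory of PDFs.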

Note that if your PDF contains more than one page, this will generate a JPG for each one, named pages-0.jpg, pages-1.jpg, etc. To select a single page you can use

convert -define pdf:use-cropbox=true -colorspace sRGB -depth 8 "pages.pdf[X]" pages.jpg

where X is the page number minus 1. The quotes keep the shell from treating the square brackets as a glob pattern. You can list the (zero-based) page indexes in a PDF using ImageMagick’s identify command like this:

identify -density 2 -format "%p," pages.pdf
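The “page number minus 1” rule is easy to get wrong by hand, so here is a minimal sketch of a helper that does the subtraction. The name page_spec is hypothetical and not something ImageMagick provides; it only builds the file-plus-index argument and leaves the actual conversion to convert.

```shell
#!/bin/sh
# page_spec is a hypothetical helper. It turns a PDF file name and a 1-based
# page number into the zero-based "file[index]" form described above.
# Usage: page_spec pages.pdf 3   (prints pages.pdf[2])
page_spec() {
    case "$2" in
        # Reject empty input, non-digits, zero and leading zeros.
        ''|*[!0-9]*|0*)
            echo "page must be a positive integer" >&2
            return 2
            ;;
    esac
    printf '%s[%d]\n' "$1" "$(($2 - 1))"
}

# Example use (requires ImageMagick), converting the third page only:
#   convert -define pdf:use-cropbox=true -colorspace sRGB -depth 8 \
#       "$(page_spec pages.pdf 3)" page3.jpg
```

Quoting the command substitution matters for the same reason as quoting the literal argument: the brackets would otherwise be open to glob expansion.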

*) paraphrased from Jamie Zawinski’s remark on regular expressions.