Extract Images from a PDF Document

Accusoft

Sep 1, 2012

CPOL

The PDF format is a very popular medium for document exchange around the world. PDF files are well suited to saving and exchanging documents across platforms and on the internet. This whitepaper focuses on how you can use PDF Xpress to extract images from PDF documents. You may need to extract an image from a PDF file to use it in Web pages, word processing documents, PowerPoint presentations, and so on. For example, you might need to reuse an image from a PDF in another document when the source image file is no longer available.

The pages of a PDF document are composed of many different types of objects, such as clipping paths, attachments, videos, audio files, and images. An image in the PDF realm is defined by a sequence of samples obtained by scanning the image array in row or column order. One popular feature of PDF software is the ability to extract these images, which are typically saved to an image file format on disk or held in memory.

There are many PDF image extraction applications available in today’s marketplace; however, quality is not their strong suit. Many are written using open-source PDF libraries, which can typically complete the task, but often at the cost of accuracy. With PDF Xpress, developers can add reliable, fast, and thorough image extraction to their products. In combination with other Accusoft products such as ImagXpress, users can also save the extracted images to a variety of formats, including JPEG-XR, JPEG 2000, GIF, TIFF, PNG, and more.

It’s Not That Simple

A common misperception online (in blogs, for example) is that an image file is inserted into a PDF in its entirety, so that extraction is simply a matter of finding that file and copying it out. That is true only for attachments; it is not the case for images placed on a page.

For example, a popular filter used in PDF is DCTDecode (the discrete cosine transform is the basis of JPEG compression). A common mistake is to extract the DCTDecode stream data as a whole and save it to a file with the .JPG extension. Sometimes the result happens to match the image as rendered in the PDF, but only when the Image XObject’s BitsPerComponent value is 8, its ColorSpace value is DeviceRGB, and no non-default Decode array is specified. If another ColorSpace is in use, for example, simply ripping the image data out produces an incorrect image.
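The conditions above can be sketched as a small check. This is only an illustration, not part of PDF Xpress: the `ImageXObjectInfo` type and `CanSaveStreamAsJpeg` method are hypothetical names, and the sketch assumes the safe case is an absent Decode entry or the DeviceRGB default of [0 1 0 1 0 1], since a non-default array remaps sample values in a way a standalone JPEG file cannot express. It covers only the conditions named in the text, so a real extractor would likely need further checks.

```csharp
using System;
using System.Linq;

// Hypothetical holder for the Image XObject dictionary entries discussed above.
public sealed class ImageXObjectInfo
{
    public string Filter { get; set; }         // e.g. "DCTDecode"
    public string ColorSpace { get; set; }     // e.g. "DeviceRGB"
    public int BitsPerComponent { get; set; }
    public double[] Decode { get; set; }       // null when the entry is absent
}

public static class DctStreamCheck
{
    // Default Decode array for a three-component DeviceRGB image.
    private static readonly double[] DefaultRgbDecode = { 0, 1, 0, 1, 0, 1 };

    // Returns true only when dumping the stream to a .jpg file is plausible
    // under the conditions named in the text (not an exhaustive test).
    public static bool CanSaveStreamAsJpeg(ImageXObjectInfo info)
    {
        if (info == null) throw new ArgumentNullException(nameof(info));

        bool defaultDecode =
            info.Decode == null || info.Decode.SequenceEqual(DefaultRgbDecode);

        return info.Filter == "DCTDecode"
            && info.ColorSpace == "DeviceRGB"
            && info.BitsPerComponent == 8
            && defaultDecode;
    }
}
```

An image that fails this check, such as a DeviceCMYK JPEG, needs to be decoded and color converted rather than copied out byte for byte.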

PDF Image Data

A PDF contains only raw image data: there’s no file format, no header, and so on. In addition, the data is expressed in the color space specified for the image, so it’s up to the software to assemble that data in a way that makes sense for the output format. For example, if the image data is specified in the Lab color space, you can’t just save it directly to a TIFF. PDF supports a variety of color spaces, such as Lab, CalRGB, ICCBased, DeviceCMYK, and Separation. Such image data isn’t easily expressed in most image formats, so it needs to be converted to another color space that the format understands. PDF Xpress hands you the data as a Bitmap (BMP), which is a lossless format, so there’s no loss of information and you have the flexibility to save the image data to any format you choose.
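To make the "no format, no header" point concrete, here is a minimal sketch of what software has to add before raw samples become a viewable file. It assumes the simplest possible case, already decoded 8-bit DeviceRGB samples in row order, and wraps them in a binary PPM (P6) header, chosen only because that header is three short text lines. The `RawSamples.ToPpm` helper is a hypothetical name for illustration and has nothing to do with the PDF Xpress API.

```csharp
using System;
using System.Text;

public static class RawSamples
{
    // Wraps decoded 8-bit RGB samples (row order, 3 bytes per pixel)
    // in a binary PPM (P6) header so that an image viewer can open them.
    public static byte[] ToPpm(byte[] samples, int width, int height)
    {
        if (samples == null) throw new ArgumentNullException(nameof(samples));
        if (width <= 0 || height <= 0)
            throw new ArgumentOutOfRangeException(nameof(width), "Dimensions must be positive.");
        if (samples.Length != (long)width * height * 3)
            throw new ArgumentException("Expected width * height * 3 sample bytes.", nameof(samples));

        // The width, height, and maximum sample value live in the header,
        // not in the sample data itself.
        byte[] header = Encoding.ASCII.GetBytes($"P6\n{width} {height}\n255\n");

        byte[] file = new byte[header.Length + samples.Length];
        Buffer.BlockCopy(header, 0, file, 0, header.Length);
        Buffer.BlockCopy(samples, 0, file, header.Length, samples.Length);
        return file;
    }
}
```

The same samples without the header are just a run of bytes: nothing in them says how wide the image is or how many components each pixel has. Samples in Lab, DeviceCMYK, or Separation would additionally need a color conversion before a step like this makes sense.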

Deficiencies in Other Solutions

One common weakness of PDF image extraction software shows up with images that use the DeviceCMYK ColorSpace. Typically such images are not handled correctly, and many times they are not handled at all. The software often uses a naïve numeric conversion to transform the data to the RGB color space, which frequently yields washed-out colors that are only a crude approximation of the original image. PDF Xpress uses color management to convert such images properly.
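For reference, the kind of naïve numeric conversion described above is commonly written as R = 255 × (1 − C) × (1 − K), and likewise for G with M and B with Y. The sketch below shows that formula only to illustrate the shortcut being criticized; `NaiveCmyk.ToRgb` is a hypothetical helper, it takes ink amounts in the range 0 to 1, and it is not how PDF Xpress converts colors.

```csharp
using System;

public static class NaiveCmyk
{
    // Naive device-independent-free conversion: treats each ink as a simple
    // attenuation of its complementary channel, ignoring ICC profiles entirely.
    // Inputs are ink amounts in [0, 1]; the result is { R, G, B } as bytes.
    public static byte[] ToRgb(double c, double m, double y, double k)
    {
        return new[] { Channel(c, k), Channel(m, k), Channel(y, k) };
    }

    private static byte Channel(double ink, double black)
    {
        // Clamp first so out-of-range input cannot overflow the byte cast.
        ink = Math.Min(1.0, Math.Max(0.0, ink));
        black = Math.Min(1.0, Math.Max(0.0, black));
        return (byte)Math.Round(255.0 * (1.0 - ink) * (1.0 - black));
    }
}
```

Because the formula knows nothing about how real inks behave on paper, it maps pure cyan to RGB (0, 255, 255), a color far more saturated than printed cyan. A color-managed conversion uses profiles for the source and destination spaces instead.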

Some PDF software claims to extract image data in its native format, and one format often claimed is TIFF. The problem is that there’s no such thing as a TIFF in a PDF. CCITTFaxDecode is a filter used in PDF for compressing image streams, and the same type of compression is commonly used in bi-level TIFF images, but in a PDF it’s just a block of data, not an actual TIFF file. A TIFF file and a CCITT-compressed block of data aren’t the same thing.

Many extractor tools produce corrupt output or incorrect colors, skip images on a page altogether, can’t handle secure documents, or return only partial images. PDF Xpress extracts all Image XObjects correctly, and because the extraction is done in unmanaged code, it is fast. In fact, some PDF documents that produce the dreaded “Insufficient image data” error when opened in Acrobat can be extracted without issues using PDF Xpress.

Code

In less than 2 minutes you can write intuitive code to extract images from a page of a PDF document. The C# below extracts the images on the first page and saves the first one as a JPEG-XR image:

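// Namespaces this snippet relies on (declare them at the top of the file):
//   System.Drawing, Accusoft.PdfXpressSdk, Accusoft.ImagXpressSdk
using (PdfXpress pdf = new PdfXpress())
{
    using (ImagXpress ix = new ImagXpress())
    {
        // The PDF Xpress component must be initialized before a document is opened.
        pdf.Initialize();

        using (Document doc = new Document(pdf, "document.pdf"))
        {
            ExtractImageOptions options = new ExtractImageOptions();

            // Extract the images on page index 0; the return value is the image count.
            int imageCount = doc.ExtractImages(0, options);

            // Guard against pages that contain no images.
            if (imageCount > 0)
            {
                PDFImage pdfImage = doc.GetExtractedImage(0, 0);

                using (Bitmap pdfBitmap = pdfImage.GetBitmap())
                {
                    // Hand the bitmap to ImagXpress so it can be saved in another format.
                    using (ImageX img = ImageX.FromBitmap(ix, pdfBitmap))
                    {
                        Accusoft.ImagXpressSdk.SaveOptions so =
                            new Accusoft.ImagXpressSdk.SaveOptions();
                        so.Format = ImageXFormat.JpegXR;

                        img.Save("image.jxr", so);
                    }
                }
            }
        }
    }
}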

Summary

PDF image extraction is a widely desired feature in a PDF workflow, but it is often misunderstood and mishandled. PDF Xpress is an intuitive library for working with PDF documents. As of version 5, it handles image extraction from PDF documents quickly and correctly. Get a PDF Xpress trial or download our demo today.

About the Author

Joseph Argento is the Technical Lead for PDF Xpress and ImagXpress at Accusoft. He joined the company in 2007 as a Support Engineer. Joseph contributes to the Native product team as a Software Engineer and has an MS in Electrical Engineering.