GIOS PDF Splitter and Merger

28 Nov 2006 | LGPL3 | 4 min read
The first open source PDF splitter and merger tool written in C#.

Image 1

This is a screenshot of the GIOS PDF splitter and merger v1.0, the first open source PDF splitter and merger tool written in C# .NET.

Introduction

After the success of the GIOS PDF .NET library, released in April 2005, I decided to invest more of my time in the community. Extending and improving the PDF library was one of the things I could do, but which new features should be added?

Well, I have to thank my friend Charles. Last month we were discussing the new features to be added to the PDF library. He said: "If you need another challenge, how about developing a PDF merger program?" His words rocked me: as far as I knew, there was no free Windows application that did this. Moreover, there was no open source project written in C#. So, I picked up Adobe's giant PDF reference to evaluate whether it could be done.

Background

Reading Adobe's Portable Document Format (PDF) Specification, Third Edition, Version 1.4, Section 3.4, you will find that a PDF is made of:

  • Header, which gives information about the kind of file it is (typically %PDF-1.4).
  • Body, which contains the data of the objects. It accounts for roughly 99 percent of the file size.
  • Cross reference table, which lets the reader locate any object without parsing the entire file. This is the secret behind fast navigation in a heavy document. A corrupted cross reference table does not prevent the document from being read, but Acrobat takes a long time to rebuild it on the fly.
  • Trailer, which contains the necessary information for opening the document, like the ID of the root object named Catalog.

The Body is a tree of generic objects. The Root, or Catalog, is a container of containers of pages (Pages objects), which in turn hold the individual Page objects.
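To make the four parts above more concrete, here is a minimal sketch (not taken from the project sources) that reads the two easiest pieces of a PDF held in memory: the version in the header and the cross reference table offset stored after the `startxref` keyword near the end of the file. The class and method names are hypothetical, and the sketch assumes a well-formed file whose last kilobyte contains `startxref`.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfLayoutSketch
{
    // Returns the version from the "%PDF-x.y" header, or "" if it is missing.
    public static string ReadVersion(byte[] pdf)
    {
        string head = Encoding.ASCII.GetString(pdf, 0, Math.Min(pdf.Length, 16));
        Match m = Regex.Match(head, @"^%PDF-(\d+\.\d+)");
        return m.Success ? m.Groups[1].Value : "";
    }

    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if none is found in the final kilobyte of the file.
    public static long ReadStartXref(byte[] pdf)
    {
        int tailLength = Math.Min(pdf.Length, 1024);
        string tail = Encoding.ASCII.GetString(pdf, pdf.Length - tailLength, tailLength);
        // RightToLeft makes the search start from the end of the tail.
        Match m = Regex.Match(tail, @"startxref\s+(\d+)\s+%%EOF", RegexOptions.RightToLeft);
        return m.Success ? long.Parse(m.Groups[1].Value) : -1;
    }
}
```

The offset returned by `ReadStartXref` is where a reader would jump to find the cross reference table, which is why a valid table makes navigation fast.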

How to do it (basic concepts)

We have to point out what we need to change in order to split or merge a PDF:

  • The header remains the same, and it is identical for almost all PDFs.
  • We have to reorganize the body, discarding the objects that are not needed by the new document.
  • We have to rebuild the cross reference table. During the testing phase Acrobat will rebuild it for us on the fly, so this is not a big problem.
  • We have to override the settings in the trailer, but this object is so simple that it takes very little time to rewrite it entirely.
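The last two points can be sketched as follows. This is an illustrative helper, not the code from PdfSplitterMerger.cs: it assumes objects are numbered consecutively from 1, that every object has generation 0, and that the caller already knows each object's byte offset. The names `XrefSketch` and `BuildXrefAndTrailer` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class XrefSketch
{
    // offsets[i] is the byte offset of object number i + 1.
    // rootNumber is the object number of the Catalog.
    // xrefOffset is the byte offset at which this table will be written.
    public static string BuildXrefAndTrailer(IList<long> offsets, int rootNumber, long xrefOffset)
    {
        var sb = new StringBuilder();
        sb.Append("xref\n");
        // One subsection starting at object 0 and covering all objects.
        sb.Append("0 ").Append(offsets.Count + 1).Append('\n');
        // Object 0 is always the head of the free list.
        sb.Append("0000000000 65535 f \n");
        foreach (long offset in offsets)
        {
            // Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation, flag, EOL.
            sb.Append(offset.ToString("D10", CultureInfo.InvariantCulture)).Append(" 00000 n \n");
        }
        sb.Append("trailer\n");
        sb.Append("<< /Size ").Append(offsets.Count + 1)
          .Append(" /Root ").Append(rootNumber).Append(" 0 R >>\n");
        sb.Append("startxref\n").Append(xrefOffset).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

Because the trailer only needs the object count and the Catalog reference here, rewriting it from scratch is simpler than patching the original one.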

This is the schema of splitting a document of three pages into a new PDF made (in order) from the third and the first page of the original document:

Image 2

  1. Objects 1, 2, 4 and 5 will be discarded because they are the descriptors of the old document structure.
  2. Objects 7, 12, 13 and 14 will be discarded because they are the parent and the children of the pages we want to discard.
  3. Objects 17 and 18 will be created in order to describe the new structure.
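As a rough illustration of step 3, the two new objects could be generated like this. The sketch assumes that one of them is the new Pages node and the other the new Catalog; which of 17 and 18 plays which role, and the page object numbers used in the usage note, are assumptions made only for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StructureSketch
{
    // Builds a Pages node whose /Kids lists the kept pages in the new order.
    public static string BuildPages(int pagesNumber, IEnumerable<int> pageNumbers)
    {
        List<string> kids = pageNumbers.Select(n => n + " 0 R").ToList();
        return pagesNumber + " 0 obj\n"
             + "<< /Type /Pages /Kids [ " + string.Join(" ", kids) + " ] /Count " + kids.Count + " >>\n"
             + "endobj\n";
    }

    // Builds a Catalog that points to the new Pages node.
    public static string BuildCatalog(int catalogNumber, int pagesNumber)
    {
        return catalogNumber + " 0 obj\n"
             + "<< /Type /Catalog /Pages " + pagesNumber + " 0 R >>\n"
             + "endobj\n";
    }
}
```

For example, `StructureSketch.BuildPages(17, new[] { 11, 3 })` lists a hypothetical page object 11 before page object 3, which is how the order of the pages is changed. Each kept Page object would also need its /Parent entry pointed at the new Pages node.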

How to do it (through coding)

The application works with these engines:

  1. The objects parser for the original documents (PdfFile.cs and PdfFileObject.cs)
  2. The splitter (PdfSplitter.cs)
  3. The merger (PdfSplitterMerger.cs)

The objects parser

The objects parser parses the lines of the PDF and stores the objects in memory recognizing their types.

I'm really not proud of my object parser. It's not the best, but it works. Here is an extract of my code in which the object itself searches for some matches inside its content in order to determine its own type. I've seen some better parsers here, for example in the article A pdf Forms parser. If you are a purist coder, don't look inside! ;-)

The use of Regex here is not necessary, but it's surely a more elegant way of searching string matches:

C#
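// Heuristic: the object inspects its own text to infer its type.
// A Page mentions "/Page" but never "/Pages" (the page tree node).
if (Regex.IsMatch(s, @"/Page") && !Regex.IsMatch(s, @"/Pages"))
{
    this.type = PdfObjectType.Page;
    return this.type;
}
// Any object containing the "stream" keyword carries stream data.
if (Regex.IsMatch(s, @"stream"))
{
    this.type = PdfObjectType.Stream;
    return this.type;
}
// Typical keys of the document information dictionary.
if (Regex.IsMatch(s, @"(/Creator)|(/Author)|(/Producer)"))
{
    this.type = PdfObjectType.Info;
    return this.type;
}
// Everything else is treated as a generic object.
this.type = PdfObjectType.Other;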
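For comparison, here is a self-contained variant of the same idea that anchors on the /Type entry and uses word boundaries, so that /Page and /Pages can be told apart without a second negative test. It is only a sketch: the enum `PdfObjectKind` and the class `TypeSketch` are hypothetical names, not part of the project, and the detection is still a text heuristic rather than a real PDF parser.

```csharp
using System;
using System.Text.RegularExpressions;

public enum PdfObjectKind { Page, Pages, Stream, Info, Other }

public static class TypeSketch
{
    public static PdfObjectKind Classify(string s)
    {
        // \b stops "/Page" from matching inside "/Pages".
        if (Regex.IsMatch(s, @"/Type\s*/Pages\b")) return PdfObjectKind.Pages;
        if (Regex.IsMatch(s, @"/Type\s*/Page\b")) return PdfObjectKind.Page;
        // Matches the "stream" keyword but not the tail of "endstream" alone.
        if (Regex.IsMatch(s, @"\bstream\b")) return PdfObjectKind.Stream;
        // The information dictionary usually has no /Type, so look at its keys.
        if (Regex.IsMatch(s, @"/(Creator|Author|Producer)\b")) return PdfObjectKind.Info;
        return PdfObjectKind.Other;
    }
}
```

Calling `TypeSketch.Classify("3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj")` returns `PdfObjectKind.Page`, while the same text with `/Type /Pages` returns `PdfObjectKind.Pages`.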

The splitter

The splitter takes a collection of objects (input) and returns a collection of objects (output).

The input is provided by the objects parser, and the output is basically a filtered list of the original objects. This is how it works:

  1. Takes the original objects of the document (provided by the objects parser).
  2. Takes the indexes of the selected pages.
  3. Uses a sort of spider to populate a list of the objects needed by the selected pages.
  4. Erases from the original collection the objects not visited by the spider.
  5. Renumbers the remaining objects (a feature needed by the merger).
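The last step, renumbering, can be sketched as below. This is an assumption about how such a step might look, not the project's actual code: it takes the surviving objects as a map from object number to object text, assigns consecutive numbers starting at `firstNumber`, and rewrites both the `N 0 obj` headers and the `N 0 R` references. It assumes generation 0 everywhere and that the text passed in contains no binary stream data that could accidentally match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RenumberSketch
{
    public static Dictionary<int, string> Renumber(IDictionary<int, string> objects, int firstNumber)
    {
        // Assign new consecutive numbers in ascending order of the old ones.
        var map = new Dictionary<int, int>();
        int next = firstNumber;
        foreach (int oldNumber in objects.Keys.OrderBy(k => k))
        {
            map[oldNumber] = next++;
        }

        var result = new Dictionary<int, string>();
        foreach (KeyValuePair<int, string> pair in objects)
        {
            // \b after R keeps the "RG" operator from being read as a reference.
            string text = Regex.Replace(pair.Value, @"\b(\d+) 0 (obj|R)\b", m =>
            {
                int n = int.Parse(m.Groups[1].Value);
                // References to objects outside the map are left untouched.
                return map.TryGetValue(n, out int mapped)
                    ? mapped + " 0 " + m.Groups[2].Value
                    : m.Value;
            });
            result[map[pair.Key]] = text;
        }
        return result;
    }
}
```

Passing a different `firstNumber` for each input document gives every splitter output its own range of object numbers, so the outputs can be appended without collisions.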

This is a recursive method in PdfFileObject.cs used for exploring its children:

C#
internal void PopulateRelatedObjects(PdfFile PdfFile,
                                    Hashtable container)
{
    // Find every indirect reference "N 0 R"; [^G] presumably skips the "RG" operator.
    Match m = Regex.Match(this.OriginalText, @"\d+ 0 R[^G]");
    while (m.Success)
    {
        int num = int.Parse(
                  m.Value.Substring(0, m.Value.IndexOf(" ")));
        // Never follow /Parent, or the spider would climb back up the page tree.
        bool notparent = !Regex.IsMatch(this.OriginalText,
                                   @"/Parent\s+" + num + " 0 R");
        if (notparent && !container.Contains(num))
        {
            PdfFileObject pfo = PdfFile.LoadObject(num);
            // && short-circuits, so pfo.number is not read when pfo is null.
            if (pfo != null && !container.Contains(pfo.number))
            {
                container.Add(num, null);
                pfo.PopulateRelatedObjects(PdfFile, container);
            }
        }
        m = m.NextMatch();
    }
}

The merger

The merger is a simple class that appends the output of each splitter and writes the necessary new objects (in our example, objects 17 and 18). It also writes the header, the cross reference table and the trailer. Take a look at PdfSplitterMerger.cs; it's very simple.

Conclusion

I hope this project is useful for non-coders too. Splitting and merging documents should be free. Let's hope that projects like this one, which demystify the PDF format, will bear fruit in the near future.

History

  • 21st December, 2005 - v1.0 release.
  • 4th January, 2006 - v1.1
    • Minor bug fixes.
    • Noticeable performance gain from Regex optimization.
  • 24th November, 2006 - v1.12
    • Regex fix for supporting SQL Reporting Services.

License

This article, along with any associated source code and files, is licensed under The GNU Lesser General Public License (LGPLv3).


Written By
Web Developer
Italy
Freelance ASP.NET / C# software developer

I live in Torino, Italy

my homepage is: http://www.paologios.com
