
Extract Tables from PDFs

15 Oct 2018 | CPOL | 4 min read
A parser event listener that extracts PDF tables into a .NET DataSet object

Introduction

We know that PDF files store their data in a tree structure, so retrieving data from them should be easy. In practice it is not, and untagged tables are a good example. Extracting an untagged table (one that is not stored as marked content) from a PDF is rarely supported and very hard to do. Online tools and software such as docParser, ABBYY FineReader, and Tabula offer this kind of extraction, but they are not always free or reliable.

Requirements or dependencies: iText.kernel.dll, itext.io.dll, BouncyCastle.Crypto.dll, Common.Logging.Core.dll, Common.Logging.dll. All are included in the packages folder of the zip project.

** iText is an open-source tool. For this project, I have modified the iText.kernel library, so the original iText.Kernel will not work here. To run the project, add a reference to my modified iText.kernel.dll.

** Visual Studio 2015 or later is needed to run this project.

Background

iText and iTextSharp are popular open-source tools for reading, writing, parsing, and otherwise manipulating PDF files. This project uses iText. If you need to extract untagged table data from PDF files, this article may help you understand why that is complex, and it shows a simple way to do it with the help of iText.

* This article covers tables that have borders.

PDF file structure: the four main parts of a PDF file are:

  • Header: contains the PDF file signature
  • Body: direct/indirect objects
  • Cross-reference table: maps the objects so they can be found
  • Trailer: holds the cross-reference table size, the root object (/Catalog), and encryption information
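
To make the header and the end of the file a little more concrete, here is a minimal sketch that is not part of the project. It reads the version from the "%PDF-x.y" signature and the byte offset that follows the last "startxref" keyword, which tells a reader where the cross-reference data begins. The class and method names (PdfLayoutSketch, ReadHeaderVersion, ReadStartXref) are hypothetical, and the 1024-byte tail window is an assumption chosen for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of iText or of this project.
public static class PdfLayoutSketch
{
    // Returns the version from the "%PDF-x.y" signature, or null if it is missing.
    public static string ReadHeaderVersion(byte[] pdf)
    {
        string head = Encoding.ASCII.GetString(pdf, 0, Math.Min(pdf.Length, 16));
        if (!head.StartsWith("%PDF-", StringComparison.Ordinal)) return null;
        int end = 5;
        while (end < head.Length && (char.IsDigit(head[end]) || head[end] == '.')) end++;
        return head.Substring(5, end - 5);
    }

    // Returns the offset written after the last "startxref" keyword, or -1 if not found.
    // Assumption: the keyword sits within the last 1024 bytes of the file.
    public static long ReadStartXref(byte[] pdf)
    {
        int tailLen = Math.Min(pdf.Length, 1024);
        string tail = Encoding.ASCII.GetString(pdf, pdf.Length - tailLen, tailLen);
        int idx = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (idx < 0) return -1;
        string rest = tail.Substring(idx + "startxref".Length).TrimStart();
        int end = 0;
        while (end < rest.Length && char.IsDigit(rest[end])) end++;
        return end == 0 ? -1 : long.Parse(rest.Substring(0, end));
    }
}
```

In practice you would pass in the result of File.ReadAllBytes and let a library such as iText do the real parsing; the sketch only shows where the signature and the cross-reference pointer live.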

Basic data types:

  • boolean
  • number
  • string
  • name
  • array
  • dictionary
  • stream
  • null

I am not going to discuss the PDF file structure in detail here; instead, we focus on our specific problem.

From the PDF data types, we see that there is nothing like a table, cell, row, or column. So how is table data stored in a PDF? Simply as text (stream/string) and paths. A path is not a data type, but most PDF parsers report page content as text, paths, images, and glyphs. So in a PDF file, a table is nothing but some lines, rectangles, text, and shapes. Let us look at a simple PDF with a table and its corresponding internal structure.

Image 1

Image 2

The pictures above show a PDF file and the code behind it. In the real world, you will not see the code like this; you will see a compressed, unreadable stream instead. For better understanding, we will work here with readable, easy-to-follow code. In that code, there is no table, cell, or row. There are only instructions that draw lines and instructions that render text.

Text Render Code

BT /F1 10 Tf 80 700 Td (Cell 1)Tj ET
BT /F1 10 Tf 180 700 Td (Cell 2)Tj ET
BT /F1 10 Tf 280 700 Td (Cell 3)Tj ET

Line Draw Code

75 720 m 375 720 l 375 650 l 75 650 l h S
175 720 m 175 650 l 275 720 m 275 650 l S

The operators used here are: BT (begin text), ET (end text), Tf (set font and size), Td (move the text position), Tj (show text), m (move to), l (line to), h (close the path), and S (stroke).
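
To show how these operators turn into positioned text and line segments, here is a minimal sketch of an interpreter for an uncompressed content stream like the one above. It is not how iText or this project parses PDFs. All names (PageContent, ContentStreamSketch, Tokenize, Interpret) are hypothetical, and the sketch makes several loud simplifications: it ignores transformation matrices, it does not handle nested parentheses or escape sequences in strings, it skips every operator other than BT, Td, Tj, m, l, and h, and it records segments as soon as they are constructed instead of waiting for S.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical container for illustration only.
public sealed class PageContent
{
    public List<(double X, double Y, string Text)> Texts { get; } =
        new List<(double X, double Y, string Text)>();
    public List<(double X1, double Y1, double X2, double Y2)> Segments { get; } =
        new List<(double X1, double Y1, double X2, double Y2)>();
}

public static class ContentStreamSketch
{
    // Splits a content stream into tokens; a "(...)" string stays one token.
    public static List<string> Tokenize(string content)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < content.Length)
        {
            if (char.IsWhiteSpace(content[i])) { i++; continue; }
            int start = i;
            if (content[i] == '(')
            {
                // Simplification: no nested parentheses or escape sequences.
                while (i < content.Length && content[i] != ')') i++;
                i = Math.Min(i + 1, content.Length); // consume ')'
            }
            else
            {
                while (i < content.Length && !char.IsWhiteSpace(content[i]) && content[i] != '(') i++;
            }
            tokens.Add(content.Substring(start, i - start));
        }
        return tokens;
    }

    static double Num(List<string> operands, int index) =>
        double.Parse(operands[index], CultureInfo.InvariantCulture);

    // Operands are names (/F1), strings ((...)), and numbers; everything else is an operator.
    static bool IsOperand(string tok) =>
        tok[0] == '/' || tok[0] == '(' || tok[0] == '-' || tok[0] == '+' ||
        tok[0] == '.' || char.IsDigit(tok[0]);

    public static PageContent Interpret(string content)
    {
        var page = new PageContent();
        var operands = new List<string>();
        double textX = 0, textY = 0;                       // current text position
        double startX = 0, startY = 0, curX = 0, curY = 0; // current subpath state

        foreach (string tok in Tokenize(content))
        {
            if (IsOperand(tok)) { operands.Add(tok); continue; }
            switch (tok)
            {
                case "BT": textX = 0; textY = 0; break;
                case "Td": textX += Num(operands, 0); textY += Num(operands, 1); break;
                case "Tj":
                    string shown = operands[0].TrimStart('(').TrimEnd(')');
                    page.Texts.Add((textX, textY, shown));
                    break;
                case "m":
                    startX = curX = Num(operands, 0);
                    startY = curY = Num(operands, 1);
                    break;
                case "l":
                    double x = Num(operands, 0), y = Num(operands, 1);
                    page.Segments.Add((curX, curY, x, y));
                    curX = x; curY = y;
                    break;
                case "h": // close the subpath back to its starting point
                    page.Segments.Add((curX, curY, startX, startY));
                    curX = startX; curY = startY;
                    break;
            }
            operands.Clear(); // operands belong to the operator that follows them
        }
        return page;
    }
}
```

Feeding the text render code and line draw code shown above into Interpret yields three text items (at x = 80, 180, and 280, all at y = 700) and six line segments: the four sides of the outer box and the two inner vertical lines.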

A table is saved in a PDF as nothing more than pieces of lines and text. This makes it clear why it is hard to retrieve table data from a PDF. What we can do is retrieve the data by its location. In this article, I only show how to retrieve data from tables with borders. Extraction of tables without borders will be covered in my next article or in the next part of this article.
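
As a rough illustration of what "retrieve the data by its location" means, the following sketch places each piece of text into a grid cell according to the border coordinates that surround it. It is a simplified stand-in, not the project's actual algorithm. The names (CellLocatorSketch, FindInterval, BuildGrid) are hypothetical, and the sketch assumes the column and row border positions are already known and sorted in ascending order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of iText or of this project.
public static class CellLocatorSketch
{
    // Index i such that bounds[i] <= v < bounds[i + 1], or -1 if v is outside.
    public static int FindInterval(double[] bounds, double v)
    {
        for (int i = 0; i + 1 < bounds.Length; i++)
            if (v >= bounds[i] && v < bounds[i + 1]) return i;
        return -1;
    }

    // xs: vertical border positions, ascending (left to right).
    // ys: horizontal border positions, ascending (bottom to top, as in PDF coordinates).
    public static string[,] BuildGrid(
        double[] xs, double[] ys, IEnumerable<(double X, double Y, string Text)> texts)
    {
        if (xs.Length < 2 || ys.Length < 2)
            throw new ArgumentException("At least two borders are needed in each direction.");

        int rows = ys.Length - 1, cols = xs.Length - 1;
        var grid = new string[rows, cols];
        foreach (var t in texts)
        {
            int col = FindInterval(xs, t.X);
            int fromBottom = FindInterval(ys, t.Y);
            if (col < 0 || fromBottom < 0) continue; // text lies outside the table

            // PDF y grows upward, so flip the index to make row 0 the top row.
            int row = rows - 1 - fromBottom;
            grid[row, col] = grid[row, col] == null ? t.Text : grid[row, col] + " " + t.Text;
        }
        return grid;
    }
}
```

For the sample table above, the vertical borders are at x = 75, 175, 275, and 375 and the horizontal borders at y = 650 and 720, so the text at (80, 700) lands in row 0, column 0, the text at (180, 700) in column 1, and the text at (280, 700) in column 2.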

Using the Code

Add the required DLL references, then add either FilterTableEventListener.cs or TableExtractionFromPDFDLL.dll to your project. The following example reads a PDF file, parses a given page, and extracts the table or tables (with full borders) from that page into a DataSet.

C#
using System;
using System.Data;
using iText.Kernel.Pdf;
using TableExtractionFromPDFDLL;

class Program
{
    static void Main(string[] args)
    {
        PdfReader reader = new PdfReader(@"your pdf file path\TableTest01.pdf");
        PdfDocument document = new PdfDocument(reader);

        // Extract the tables from a single page (page numbers start at 1).
        PdfPage page = document.GetPage(1);

        // The second argument tells the listener that the tables have borders.
        FilterTableEventListener renderListener = new FilterTableEventListener(page, true);
        DataSet ds = renderListener.GetTables();

        // Extract the tables from a range of pages (here, at most the first 9).
        int startPage = 1;
        int endPage = Math.Min(9, document.GetNumberOfPages());
        DataSet[] dsList = new DataSet[endPage - startPage + 1];

        for (int i = startPage; i <= endPage; i++)
        {
            PdfPage temPage = document.GetPage(i);
            renderListener = new FilterTableEventListener(temPage, true);
            dsList[i - startPage] = renderListener.GetTables();
        }

        document.Close();
        reader.Close();
    }
}

Here, the FilterTableEventListener constructor takes two arguments. The first is the page to extract tables from; the second is a Boolean value that indicates whether the tables have borders.
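
Once GetTables has returned a DataSet, you can walk it with the standard System.Data API. The sketch below is one way to print the contents for inspection. It assumes, and this is only an assumption since it is not spelled out above, that each DataTable in the DataSet corresponds to one extracted table and each DataRow to one table row. The class name DataSetDumpSketch and its Format method are hypothetical.

```csharp
using System;
using System.Data;
using System.Text;

// Hypothetical helper for illustration only; it uses only standard System.Data types.
public static class DataSetDumpSketch
{
    // Returns one header line per table, followed by one line per row,
    // with the cell values separated by " | ".
    public static string Format(DataSet ds)
    {
        var sb = new StringBuilder();
        foreach (DataTable table in ds.Tables)
        {
            sb.Append("Table: ").Append(table.TableName).Append('\n');
            foreach (DataRow row in table.Rows)
            {
                sb.Append(string.Join(" | ", row.ItemArray)).Append('\n');
            }
        }
        return sb.ToString();
    }
}
```

Calling Console.Write(DataSetDumpSketch.Format(ds)) after the extraction gives a quick view of what was found on the page.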

Points of Interest

The same table, with the same data and cells, can be stored differently in different PDFs. Studying various PDF files shows that a line can be drawn with the LineTo (l) command and also with the Rectangle (re) command. In the second case, the line is saved as a rectangle with a very small width (a vertical line) or a very small height (a horizontal line).

The same data can be stored in many different ways in a PDF: the same text, shape, structure, or path can each be written differently. That is why table extraction from PDFs mainly depends on extracting positions and then parsing data and objects based on those positions.

This project mainly parses the table borders as paths to map the position of the table and of its cells, discards any paths that are not table borders, and extracts the cell data by border position.
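
A possible way to treat such thin rectangles as lines is sketched below. The "re" operator takes x, y, width, and height; the sketch classifies a rectangle as a horizontal or vertical line when one side is at most a small thickness, and converts it to its centre line. The names (RectAsLineSketch, Classify, ToSegment) are hypothetical, and the default threshold of 1.0 unit is an assumption for illustration, not a value taken from the project.

```csharp
using System;

// Hypothetical helper for illustration only; not part of iText or of this project.
public static class RectAsLineSketch
{
    public enum Kind { Horizontal, Vertical, NotALine }

    // A rectangle counts as a line when exactly one side is at most maxThickness.
    public static Kind Classify(double width, double height, double maxThickness = 1.0)
    {
        double w = Math.Abs(width), h = Math.Abs(height);
        if (h <= maxThickness && w > maxThickness) return Kind.Horizontal;
        if (w <= maxThickness && h > maxThickness) return Kind.Vertical;
        return Kind.NotALine;
    }

    // Converts the operands of "x y w h re" to the rectangle's centre line,
    // or returns null when the rectangle is not thin enough to be a line.
    public static (double X1, double Y1, double X2, double Y2)? ToSegment(
        double x, double y, double w, double h, double maxThickness = 1.0)
    {
        switch (Classify(w, h, maxThickness))
        {
            case Kind.Horizontal: return (x, y + h / 2, x + w, y + h / 2);
            case Kind.Vertical: return (x + w / 2, y, x + w / 2, y + h);
            default: return null;
        }
    }
}
```

For example, "75 720 300 0.5 re" would be read as a horizontal line from x = 75 to x = 375, while a 100 by 50 rectangle would be left alone.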
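
The border-mapping step can be pictured with the following sketch. It keeps only horizontal and vertical segments, discards everything else, and collapses nearly identical coordinates into a sorted list of column and row boundaries. This is a simplified illustration under my own assumptions, not the project's implementation: the names (BorderGridSketch, Cluster, Boundaries) are hypothetical, the tolerance is a free parameter, and a real extractor would also need to decide which straight lines actually belong to a table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of iText or of this project.
public static class BorderGridSketch
{
    // Sorts the values and drops any value within `tolerance` of the last one kept.
    public static List<double> Cluster(IEnumerable<double> values, double tolerance)
    {
        var result = new List<double>();
        foreach (double v in values.OrderBy(v => v))
        {
            if (result.Count == 0 || v - result[result.Count - 1] > tolerance)
                result.Add(v);
        }
        return result;
    }

    // Derives column boundaries (xs) from vertical segments and
    // row boundaries (ys) from horizontal segments.
    public static void Boundaries(
        IEnumerable<(double X1, double Y1, double X2, double Y2)> segments,
        double tolerance,
        out List<double> xs,
        out List<double> ys)
    {
        var verticalXs = new List<double>();
        var horizontalYs = new List<double>();
        foreach (var s in segments)
        {
            if (Math.Abs(s.X1 - s.X2) <= tolerance) verticalXs.Add(s.X1);        // vertical line
            else if (Math.Abs(s.Y1 - s.Y2) <= tolerance) horizontalYs.Add(s.Y1); // horizontal line
            // Diagonal segments cannot be table borders here and are discarded.
        }
        xs = Cluster(verticalXs, tolerance);
        ys = Cluster(horizontalYs, tolerance);
    }
}
```

For the sample table in this article (an outer box from (75, 650) to (375, 720) with inner vertical lines at x = 175 and x = 275), this yields column boundaries 75, 175, 275, 375 and row boundaries 650, 720.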

Discussion about iText.kernel Modification

Some new features were added to iText.kernel.dll. The following files were modified:

Image 3

** To help readers, and in response to some requests, the modified iText.kernel development code has been added in the itext.kernel.zip attachment.

History

  • 11th October, 2018: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
Bangladesh
