How to detect hidden text in a PDF using iTextSharp in C#

Question

We have a PDF document in which some text is hidden (covered with a white box). Is it possible to use iTextSharp to detect whether a PDF contains hidden text?

I have an example PDF with hidden text. Below are the two ways we currently extract text from a PDF.

Thank you all in advance for your help

C#

// Requires: using iTextSharp.text.pdf; (and using System.Text; wherever Encoding is used)
public string ReadFile(string Filename)
{
    PdfReader reader = new PdfReader(Filename);

    string pdfText = string.Empty;
    string OCRErrorPages = string.Empty;

    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        iTextSharp.text.pdf.parser.ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();

        String extractText = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, i, its);

        // GetTextFromPage already returns a .NET string, so no re-encoding is needed here.
        // Round-tripping it through Encoding.Default would turn characters outside the ANSI code page into '?'.

        if (extractText != "")
        {
            pdfText = pdfText + extractText;
        }
        else
        {
            // extractText is empty in this branch, so only the page number is recorded.
            OCRErrorPages = OCRErrorPages + i + "<br>";
        }
    }
    reader.Close();
    if (OCRErrorPages != "")
    {
        return OCRErrorPages + " This page contains no text";
    }
    else
    {
        return pdfText;
    }

}

public string ExtractText(string inFileName)
{
    string line = string.Empty;
    // Create a reader for the given PDF file
    PdfReader reader = new PdfReader(inFileName);

    int totalLen = 68;
    float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
    int totalWritten = 0;
    float curUnit = 0;

    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // ExtractTextFromPDFBytes is a helper defined elsewhere (not shown in this post).
        line += ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ";

        // Write the progress.
        if (charUnit >= 1.0f)
        {
            for (int i = 0; i < (int)charUnit; i++)
            {
                Console.Write("#");
                totalWritten++;
            }
        }
        else
        {
            curUnit += charUnit;
            if (curUnit >= 1.0f)
            {
                for (int i = 0; i < (int)curUnit; i++)
                {
                    Console.Write("#");
                    totalWritten++;
                }
                curUnit = 0;
            }
        }
    }

    if (totalWritten < totalLen)
    {
        for (int i = 0; i < (totalLen - totalWritten); i++)
        {
            Console.Write("#");
        }
    }
    // Release the file handle, as ReadFile does.
    reader.Close();
    return line;
}

What I have tried:

I tried using the PdfContentByte.TEXT_RENDER_MODE_INVISIBLE option, but I am not sure how to apply it when reading a PDF with PdfReader.
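
One possible way to check for that render mode while reading is a custom render listener. The sketch below assumes an iTextSharp 5.x build in which TextRenderInfo exposes GetTextRenderMode(); the class names InvisibleTextListener and HiddenTextCheck and the method FindInvisibleText are hypothetical names made up for this illustration. It only finds text drawn with render mode 3 (invisible). It does not detect text that is drawn normally and then covered by a white box, which is the case described in the question.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: collects text chunks whose render mode is "invisible".
public class InvisibleTextListener : IRenderListener
{
    public readonly List<string> InvisibleChunks = new List<string>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Assumption: GetTextRenderMode() is available in the iTextSharp version in use.
        if (renderInfo.GetTextRenderMode() == PdfContentByte.TEXT_RENDER_MODE_INVISIBLE)
        {
            InvisibleChunks.Add(renderInfo.GetText());
        }
    }
}

public static class HiddenTextCheck
{
    // Returns every text chunk in the file that is drawn with the invisible render mode.
    public static List<string> FindInvisibleText(string fileName)
    {
        PdfReader reader = new PdfReader(fileName);
        try
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            List<string> found = new List<string>();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // ProcessContent returns the listener instance that was passed in.
                InvisibleTextListener listener =
                    parser.ProcessContent(page, new InvisibleTextListener());
                found.AddRange(listener.InvisibleChunks);
            }
            return found;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A non-empty result from HiddenTextCheck.FindInvisibleText(path) would suggest the file contains invisible-mode text, which is common in scanned PDFs that carry an OCR text layer.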

Posted 11-Mar-20 7:55am

Member 13304618

Updated 11-Mar-20 9:47am

Richard Deeming

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Dave Kreskowiak · Answer 1 · 2020-03-11T09:48:00

Solution 1

Not really possible, since there are many different ways to "hide" text or to mistakenly cover text with other elements.

Your code would have to "understand" exactly how the document is going to be rendered at whatever page size it's being viewed or printed at. It would also have to "understand", in very precise terms, what you mean by "hidden".

There is no property anywhere on a text stream that says the text is covered, fully or partially, by another object.

Posted 11-Mar-20 9:48am

Dave Kreskowiak

Comments

Member 13304618 11-Mar-20 17:13pm

Dave, thank you very much for your answer. However, there is a 'Hidden Text' property in Adobe Acrobat Reader.

Is there a way to access this property with iTextSharp?

Dave Kreskowiak 11-Mar-20 17:48pm

That is a bit flag that prevents Reader from attempting to render the object at all, even if it's not covered up by another object. There is also another flag, called NoView, that does almost the same thing, except the tagged object is allowed to be printed.

It is NOT a property that describes text (or any other object) that should be visible but is covered up by another object.

This is a LOT more complicated than you realize.

If you really want to dig into this, you can read up on the entire PDF specification here:
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
...all ~1,000 pages of it.

I don't know or use iTextSharp, so I couldn't tell you how to get at the Hidden field.
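
For readers who do use iTextSharp, here is a rough sketch of reading those bits. It assumes the Hidden and NoView fields discussed above are the annotation flags stored in the /F entry of each annotation dictionary, and that iTextSharp 5.x is in use; the class name AnnotationFlagCheck and the method ReportHiddenAnnotations are hypothetical names for this illustration. As noted above, these flags apply to annotations and say nothing about page text that is covered by another object.

```csharp
using System;
using iTextSharp.text.pdf;

public static class AnnotationFlagCheck
{
    // Prints the Hidden and NoView state of every annotation in the file.
    public static void ReportHiddenAnnotations(string fileName)
    {
        PdfReader reader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                PdfDictionary pageDict = reader.GetPageN(page);
                PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
                if (annots == null) continue; // page has no annotations

                for (int i = 0; i < annots.Size; i++)
                {
                    PdfDictionary annot = annots.GetAsDict(i);
                    if (annot == null) continue;

                    // A missing /F entry means no flags are set.
                    PdfNumber f = annot.GetAsNumber(PdfName.F);
                    int flags = f == null ? 0 : f.IntValue;

                    bool hidden = (flags & PdfAnnotation.FLAGS_HIDDEN) != 0;
                    bool noView = (flags & PdfAnnotation.FLAGS_NOVIEW) != 0;

                    Console.WriteLine("Page {0}, annotation {1}: Hidden={2}, NoView={3}",
                        page, i, hidden, noView);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```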