We have a PDF document in which some text is hidden (covered with a white box). Is it possible to use iTextSharp to detect whether a PDF contains hidden text?

I have an example PDF with hidden text. Below are the two ways we currently extract text from a PDF.

Thank you all in advance for your help
C#
public string ReadFile(string Filename)
{
    PdfReader reader = new PdfReader(Filename);

    string pdfText = string.Empty;
    string OCRErrorPages = string.Empty;

    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        iTextSharp.text.pdf.parser.ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();

        String extractText = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, i, its);

        // GetTextFromPage already returns a .NET string, so no re-encoding is needed here.
        // Round-tripping through Encoding.Default can replace characters outside the system code page with '?'.

        if (extractText != "")
        {
            pdfText = pdfText + extractText;
        }
        else
        {
            OCRErrorPages = OCRErrorPages + i + extractText + "<br>";
        }
    }
    reader.Close();
    if (OCRErrorPages != "")
    {
        return OCRErrorPages + " This page contains no text";
    }
    else
    {
        return pdfText;
    }

}

public string ExtractText(string inFileName)
{
    string line = string.Empty;
    // Create a reader for the given PDF file
    PdfReader reader = new PdfReader(inFileName);

    int totalLen = 68;
    float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
    int totalWritten = 0;
    float curUnit = 0;

    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // ExtractTextFromPDFBytes is a helper that is not shown in this post.
        line += ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ";

        // Write the progress.
        if (charUnit >= 1.0f)
        {
            for (int i = 0; i < (int)charUnit; i++)
            {
                Console.Write("#");
                totalWritten++;
            }
        }
        else
        {
            curUnit += charUnit;
            if (curUnit >= 1.0f)
            {
                for (int i = 0; i < (int)curUnit; i++)
                {
                    Console.Write("#");
                    totalWritten++;
                }
                curUnit = 0;
            }
        }
    }

    if (totalWritten < totalLen)
    {
        for (int i = 0; i < (totalLen - totalWritten); i++)
        {
            Console.Write("#");
        }
    }
    return line;
}


What I have tried:

I tried using the PdfContentByte.TEXT_RENDER_MODE_INVISIBLE option, but I am not sure how to apply it when reading a PDF.
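
For reference, TEXT_RENDER_MODE_INVISIBLE corresponds to text rendering mode 3 in the PDF specification, which a content stream selects with the operator sequence "3 Tr". Here is a minimal sketch of checking a page's decoded content stream for that operator. It assumes the bytes come from something like reader.GetPageContent(page). The class and method names are hypothetical, not part of iTextSharp. It only catches text drawn in invisible mode, not text covered by a white box, and it can report a false positive if the literal characters "3 Tr" appear inside a text string.

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class InvisibleTextScan
{
    // Matches the operand 3 followed by the Tr (set text rendering mode) operator.
    // The lookbehind rejects operands such as 13 or 1.3.
    private static readonly Regex InvisibleMode =
        new Regex(@"(?<![\d.])3\s+Tr\b", RegexOptions.Compiled);

    // Returns true if the decoded content stream selects text rendering mode 3
    // (neither fill nor stroke, i.e. invisible text).
    public static bool SetsInvisibleRenderMode(byte[] decodedContentStream)
    {
        // Operators are plain ASCII; non-ASCII bytes inside strings become '?',
        // which does not matter for this scan.
        string text = Encoding.ASCII.GetString(decodedContentStream);
        return InvisibleMode.IsMatch(text);
    }
}
```

Calling InvisibleTextScan.SetsInvisibleRenderMode(reader.GetPageContent(page)) for each page would flag pages that switch to invisible text at least once. A regex scan is a shortcut; a real content-stream parser would be more reliable.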
Updated 11-Mar-20 9:47am

1 solution

Not really possible, since there are a lot of different ways you can "hide" text or mistakenly cover text up with other elements.

Your code would have to "understand" exactly how the document is going to be rendered at whatever page size it's being viewed or printed at. It would also have to "understand", in very precise terms, what you mean by "hidden".

There is no property anywhere on a text stream that says it's not visible or is partially covered.
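
To illustrate why "hidden" needs a precise definition, here is a rough sketch of the simplest possible case: comparing the bounding box of a piece of text with the box of a rectangle that is painted after it. Everything here is an assumption for illustration. The Box type and CoverCheck class are hypothetical, the coordinates would have to come from your own parsing of the page, and the sketch ignores transformations, clipping, fill color, transparency, and anything that is not an axis-aligned rectangle.

```csharp
// Hypothetical axis-aligned box in page coordinates (origin at bottom-left).
public struct Box
{
    public float Left, Bottom, Right, Top;

    public Box(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    // True if this box completely encloses the other box.
    public bool Contains(Box o) =>
        Left <= o.Left && Bottom <= o.Bottom && Right >= o.Right && Top >= o.Top;

    // True if the two boxes share any interior area.
    public bool Overlaps(Box o) =>
        Left < o.Right && o.Left < Right && Bottom < o.Top && o.Bottom < Top;
}

public static class CoverCheck
{
    // Classifies text against a shape that is painted later in the content stream.
    public static string Classify(Box text, Box paintedLater)
    {
        if (paintedLater.Contains(text)) return "fully covered";
        if (paintedLater.Overlaps(text)) return "partially covered";
        return "not covered";
    }
}
```

Even in this toy version you have to decide whether "partially covered" counts as hidden, which is the definitional problem described above.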
 
 
Comments
Member 13304618 11-Mar-20 17:13pm    
Dave, thank you very much for your answer. However, there is a 'Hidden Text' property in Adobe Acrobat Reader.

Is there a way to access this property with iTextSharp?
Dave Kreskowiak 11-Mar-20 17:48pm    
That is a bit field to prevent Reader from attempting to render the object at all, even if it's not covered up by another object. There is also another field, called NoView, that does almost the same thing, except the tagged object is allowed to be printed.

It is NOT a property that describes text (or any other object) that should be visible but is covered up by another object.

This is a LOT more complicated than you realize.

If you really want to dig into this, you can read up on the entire PDF specification here:
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
...all ~1,000 pages of it.

I don't know or use iTextSharp, so I couldn't tell you how to get at the Hidden field.
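
As a rough illustration of the bit field described in this comment, and assuming it refers to the annotation flags (the /F entry) in the PDF specification, the Hidden and NoView bits can be decoded as shown below. The enum and helper names are hypothetical. How you read the /F value for a given annotation depends on your library and is not shown. These flags apply to annotations such as form fields and comments, not to ordinary page text.

```csharp
using System;

// Bit values follow the annotation flags table in the PDF specification
// (bit 1 = 1, bit 2 = 2, bit 3 = 4, bit 6 = 32). Only the relevant bits are listed.
[Flags]
public enum AnnotationFlags
{
    None = 0,
    Invisible = 1 << 0, // applies only to non-standard annotation types
    Hidden = 1 << 1,    // neither displayed nor printed
    Print = 1 << 2,     // printed when the page is printed
    NoView = 1 << 5     // not displayed on screen, but may still be printed
}

// Hypothetical helper for interpreting a raw /F value.
public static class AnnotationVisibility
{
    // On-screen display is suppressed by either Hidden or NoView.
    public static bool IsShownOnScreen(int rawFlags)
    {
        var flags = (AnnotationFlags)rawFlags;
        return (flags & (AnnotationFlags.Hidden | AnnotationFlags.NoView)) == 0;
    }

    // Printing requires the Print bit and is suppressed by Hidden.
    public static bool IsPrinted(int rawFlags)
    {
        var flags = (AnnotationFlags)rawFlags;
        return (flags & AnnotationFlags.Hidden) == 0
            && (flags & AnnotationFlags.Print) != 0;
    }
}
```

For example, a raw value of 36 (NoView plus Print) describes an annotation that is not shown on screen but is printed, which matches the difference between Hidden and NoView described above.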

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


