Hi,

I work in the publishing department of a website, on PDFs with a lot of tax and accounting content. Every PDF I receive usually has an INDEX like the one below.

12230 Change in Accountants
12240 Change in Fiscal Year
12250 Auditor Issues

Each INDEX entry appears again in the content as the heading of a paragraph, like below.

12230 Change in Accountants
(Last updated: 6/30/2009)
12230.1 Unless the same accountant reported on the most recent financial statements of
both the registrant and the accounting acquirer, a reverse acquisition always


I convert the PDF to text, then separate the INDEX part and place it in textbox1, with the rest of the content in textbox2.

When pasted into textbox1, my INDEX looks like this:

<p>12230 Change in Accountants</p>
<p>12240 Change in Fiscal Year</p>
<p>12250 Auditor Issues</p>


And the content in textbox2 looks like this:


12230 Change in Accountants
(Last updated: 6/30/2009)
12230.1 Unless the same accountant reported on the most recent financial statements

12240 Change in Fiscal Year
12240.1 A Form 8-K filed in connection with a reverse acquisition should disclose under
Item 5.03 of the Form 8-K any intended change in fiscal year from the fiscal
year end used by the registrant prior to the acquisition.

12250 Auditor Issues
(Last updated: 6/30/2009)
12250.1 Reverse Recapitalization with a Public Shell Company


So whenever a line from the INDEX (textbox1) appears in the content (textbox2, the bold content), I need to add the heading-level tags (prefix and suffix) around it.

E.g. "<hd1><name>12230 Change in Accountants</name>" with a button click.

What I am actually looking for is a find-and-replace option: find each individual line of textbox1 (Index) in textbox2 (Content) and replace it there with the tagged version shown above.

The challenge is to take each individual line from textbox1 and replace the same text in textbox2, where it sits somewhere in the middle of the document.
Updated 10-Sep-12 1:40am
Comments
Ashraff Ali Wahab 8-Sep-12 20:08pm    
Hi,

Could you please post the contents of textbox1 and textbox2 before and after the button click?
vamshivarma 10-Sep-12 7:42am    
Sorry Ashraff, I didn't get that.

Anyway, I have updated my question. Please have a look and let me know if you need further info, and thanks for responding.
Sandeep Mewara 9-Sep-12 1:04am    
Which part exactly are you stuck on? PDF to text? Getting data out of the index? Getting data out of the textbox?
vamshivarma 10-Sep-12 7:35am    
Sandeep, I have updated my question. Please have a look and let me know if you need further clarification, and thank you so much for responding.
Gun Gun Febrianza 11-Sep-12 1:20am    
Do you mean like this?

yourtextboxtex = "blablablab";

1 solution

Won't a simple find and replace work, or are you having some other issue with find and replace?

C#
using System;
using System.Text.RegularExpressions;

namespace Test
{
    class RegexProcessing
    {
        public void process()
        {
            // Index lines, as they would appear in textbox1.
            string value = "12230 Change in Accountants\r\n" +
                           "12240 Change in Fiscal Year\r\n" +
                           "12250 Auditor Issues";

            // Document content, as it would appear in textbox2.
            string modifiedValue =
                "12230 Change in Accountants\r\n" +
                "(Last updated: 6/30/2009)\r\n" +
                "12230.1 Unless the same accountant reported on the most recent financial statements of\r\n" +
                "both the registrant and the accounting acquirer, a reverse acquisition always\r\n" +
                "12240 Change in Fiscal Year\r\n" +
                "12240.1 A Form 8-K filed in connection with a reverse acquisition should disclose under\r\n" +
                "Item 5.03 of the Form 8-K any intended change in fiscal year from the fiscal\r\n" +
                "year end used by the registrant prior to the acquisition\r\n" +
                "12250 Auditor Issues\r\n" +
                "(Last updated: 6/30/2009)\r\n" +
                "12250.1 Reverse Recapitalization with a Public Shell Company\r\n";

            // One array element per index line.
            string[] lines = Regex.Split(value, "\r\n");

            foreach (string line in lines)
            {
                // String.Replace throws on an empty search string, so skip blank index lines.
                if (line.Length == 0) continue;

                // Wrap every occurrence of the index line in properly nested heading tags.
                modifiedValue = modifiedValue.Replace(line, "<hd1><name>" + line + "</name></hd1>");
            }
            Console.WriteLine(modifiedValue);
        }
    }
}


Output:

XML
<hd1><name>12230 Change in Accountants</name></hd1>
(Last updated: 6/30/2009)
12230.1 Unless the same accountant reported on the most recent financial statements of
both the registrant and the accounting acquirer, a reverse acquisition always
<hd1><name>12240 Change in Fiscal Year</name></hd1>
12240.1 A Form 8-K filed in connection with a reverse acquisition should disclose under
Item 5.03 of the Form 8-K any intended change in fiscal year from the fiscal
year end used by the registrant prior to the acquisition
<hd1><name>12250 Auditor Issues</name></hd1>
(Last updated: 6/30/2009)
12250.1 Reverse Recapitalization with a Public Shell Company
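
One caveat with a plain find and replace: String.Replace matches the index text anywhere it occurs, so a body sentence that happens to quote a heading would be tagged too. The following is a minimal sketch of a whole-line variant, not part of the posted solution. The class and method names (HeadingTagger, TagWholeLines) are made up for illustration, and it assumes the headings in the content sit alone on their own lines.

```csharp
using System;
using System.Text.RegularExpressions;

static class HeadingTagger
{
    // Hypothetical helper: tags a content line only when the entire line
    // equals an index entry, leaving mid-sentence mentions untouched.
    public static string TagWholeLines(string index, string content)
    {
        string[] entries = index.Split(new[] { "\r\n", "\n" },
                                       StringSplitOptions.RemoveEmptyEntries);
        foreach (string raw in entries)
        {
            string entry = raw.Trim();
            if (entry.Length == 0) continue;

            // With RegexOptions.Multiline, ^ matches at the start of each line.
            // The lookahead (?=\r?$) requires the line to end right after the
            // entry, for both "\r\n" and "\n" line endings.
            string pattern = "^" + Regex.Escape(entry) + @"(?=\r?$)";

            // A MatchEvaluator is used so the matched text is inserted literally.
            content = Regex.Replace(
                content,
                pattern,
                m => "<hd1><name>" + m.Value + "</name></hd1>",
                RegexOptions.Multiline);
        }
        return content;
    }
}
```

Calling HeadingTagger.TagWholeLines(indexText, contentText) returns the tagged content and leaves both inputs unchanged. Regex.Escape matters here because index entries may contain characters such as "." or "(" that have special meaning in a pattern.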
 
Comments
vamshivarma 10-Sep-12 15:26pm    
Error: Regex does not exist in the current context.

And I think this will modify the content that is in tb1.
Ashraff Ali Wahab 10-Sep-12 16:23pm    
Updated the solution with complete code and output.
vamshivarma 11-Sep-12 15:12pm    
Hi Ashraff,

Thank you so much for the code, I really appreciate it.
But the problem is that the index changes from PDF to PDF, so it is not practical to change the string value in the code every time.

The string value should come from textbox1, and the button click should modify the content in textbox2 so that it matches the output you posted.
Ashraff Ali Wahab 11-Sep-12 16:01pm    
Let me ask you this: can't you modify the process method above so that it takes two arguments, the content of textbox1 and the content of textbox2, and returns the replaced string? You can then assign the returned string to textbox2.

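public string process(string value, string modifiedValue)
{
    // value: index lines from textbox1; modifiedValue: content from textbox2.
    string[] lines = System.Text.RegularExpressions.Regex.Split(value, "\r\n");

    foreach (string line in lines)
    {
        // String.Replace throws on an empty search string, so skip blank lines.
        if (line.Length == 0) continue;
        modifiedValue = modifiedValue.Replace(line, "<hd1><name>" + line + "</name></hd1>");
    }
    return modifiedValue;
}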
vamshivarma 13-Sep-12 17:25pm    
Thanks Ashraff, I will try this and let you know if I face any challenges.
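
One detail from the question is worth handling before the replace: the index pasted into textbox1 is wrapped in <p>...</p>, while the headings in textbox2 are not, so the raw index lines would never match. The following is a rough sketch of stripping those tags first. IndexTagger, ExtractEntries and Apply are hypothetical names, and the control names in the usage note are assumptions, since the thread does not say which UI framework is in use.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class IndexTagger
{
    // Hypothetical helper: turns lines such as
    // "<p>12230 Change in Accountants</p>" into "12230 Change in Accountants".
    public static List<string> ExtractEntries(string indexText)
    {
        var entries = new List<string>();
        string[] rawLines = indexText.Split(new[] { "\r\n", "\n" },
                                            StringSplitOptions.RemoveEmptyEntries);
        foreach (string raw in rawLines)
        {
            // Strip an optional leading <p> and trailing </p>, then whitespace.
            string entry = Regex.Replace(raw.Trim(), @"^<p>|</p>$", "").Trim();
            if (entry.Length > 0) entries.Add(entry);
        }
        return entries;
    }

    // Applies the same find and replace as the process method, using the
    // cleaned index entries.
    public static string Apply(string indexText, string content)
    {
        foreach (string entry in ExtractEntries(indexText))
        {
            content = content.Replace(entry, "<hd1><name>" + entry + "</name></hd1>");
        }
        return content;
    }
}
```

In the button's click handler this would come down to a single line such as textBox2.Text = IndexTagger.Apply(textBox1.Text, textBox2.Text); where textBox1 and textBox2 are assumed control names.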

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


