Text TakeOut

13 Jan 2009 | CPOL | 3 min read
Extract data from text documents, HTML, etc., and easily convert it to CSV.

Introduction

Text TakeOut pulls data out of any messy ASCII document and, once the user has defined the fields to capture, exports it to a Comma Separated Values (CSV) file. If you need something quick and easy to get useful data from a source document (HTML, plain text (.txt), etc.), Text TakeOut may work for you. The ExtractSet and CSV classes may also be useful in any application that needs an object to store Start, End, Value, and FieldName strings. The CSV class creates a CSV file from a System.Collections.Generic.List<ExtractSet> object.

Background

I decided to create this project after being emailed an HTML document containing over 100 external trouble tickets for one of our business applications. I bet I am not the only one to receive such a file, and I needed to make use of the data contained within it. I was looking for a way to define the fields I needed and export them into a common format. Text TakeOut can be a powerful solution in some situations.

Using the Code and Using the Utility Application

The ExtractSet and CSV classes are the wheels of this car. ExtractSet stores the data, and it works well in a System.Collections.Generic.List or any other collection that is easy to iterate through. startvalue holds the string tag that marks the beginning of the data you wish to 'TakeOut'/extract; endvalue holds the string tag that marks its end. For example, the sample file included in the project's main source folder, 'Example ExtractFile.txt', has a repeating value we wish to grab, in the format "<HR>FGH123</HR>", "<HR>IJK156</HR>", etc. The first ExtractSet will be startvalue="<HR>", endvalue="</HR>", and we then define a field name that summarizes the data between these two points in the string: fieldnamevalue="Trouble Ticket#". Next in the raw document comes the description of the trouble ticket, which is delimited as follows: startvalue="</b>", endvalue="<br>", so we define fieldnamevalue="Description". In the utility app, the "Add->" button is not enabled until you have filled in all of the fields needed to create the ExtractSet.
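As a rough illustration of the start/end tag idea described above, here is a minimal sketch. It is not the code from the download: the ExtractSet shown here is a simplified stand-in built only from the four strings the article names (Start, End, Value, FieldName), and the Extractor.ExtractAll helper is a hypothetical name introduced for this example.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for the article's ExtractSet class (assumption:
// the real class in the download may use different member names).
public class ExtractSet
{
    public string Start { get; set; }      // tag that precedes the data
    public string End { get; set; }        // tag that follows the data
    public string FieldName { get; set; }  // label for the extracted data
    public string Value { get; set; }      // the extracted data itself
}

// Hypothetical helper, not part of the original project.
public static class Extractor
{
    // Returns every substring found between start and end, in document order.
    public static List<string> ExtractAll(string text, string start, string end)
    {
        if (string.IsNullOrEmpty(start) || string.IsNullOrEmpty(end))
            throw new ArgumentException("Start and end tags must not be empty.");

        var results = new List<string>();
        int pos = 0;
        while (true)
        {
            int s = text.IndexOf(start, pos, StringComparison.Ordinal);
            if (s < 0) break;               // no more start tags
            s += start.Length;              // skip past the start tag
            int e = text.IndexOf(end, s, StringComparison.Ordinal);
            if (e < 0) break;               // unterminated tag: stop
            results.Add(text.Substring(s, e - s));
            pos = e + end.Length;           // continue after the end tag
        }
        return results;
    }
}
```

With the sample data above, calling Extractor.ExtractAll(text, "<HR>", "</HR>") on a string containing "<HR>FGH123</HR>" and "<HR>IJK156</HR>" would yield the two ticket numbers, each of which could then be stored in an ExtractSet with FieldName set to "Trouble Ticket#".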

The Find and Replace With features in the utility application prepare the full string before the application iterates through your ExtractSet definitions. The application actually uses a List<ExtractSet> to store the user-defined Find and Replace With sets. I found this useful for broken HTML tag sets that I needed to keep together so that no data would be missed once the iteration began. For example, on one raw document I was working with, I had to prep the main string like this:

C#
tempcopy = tempcopy.Replace("Date \r\n      Entered:", "Date Entered:");
tempcopy = tempcopy.Replace("Date \r\n     Entered:", "Date Entered:");
tempcopy = tempcopy.Replace("Date \r\n    Entered:", "Date Entered:");
tempcopy = tempcopy.Replace("Date \r\n   Entered:", "Date Entered:");

Text TakeOut can automate this when the user creates sets as follows:

Find:"Date \r\n      Entered:"
Replace With:"Date Entered:"
Find:"Date \r\n     Entered:"
Replace With:"Date Entered:"
Find:"Date \r\n    Entered:"
Replace With:"Date Entered:"
Find:"Date \r\n   Entered:"
Replace With:"Date Entered:"
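The prep pass can be pictured as a loop over the user's sets, applied in order before any extraction starts. The sketch below is only an illustration of that idea: the article says the application stores these sets in a List<ExtractSet>, but does not show which members hold the Find and Replace With strings, so this example uses plain key/value pairs instead, and the Prep.Apply name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original project.
public static class Prep
{
    // Applies each (find, replaceWith) pair to the text, in list order.
    public static string Apply(string text, IList<KeyValuePair<string, string>> sets)
    {
        foreach (KeyValuePair<string, string> set in sets)
        {
            // string.Replace throws on an empty search string, so skip those.
            if (string.IsNullOrEmpty(set.Key)) continue;
            text = text.Replace(set.Key, set.Value);
        }
        return text;
    }
}
```

Because the sets are applied in order, a later set sees the output of the earlier ones, which matters if one replacement can create text that another set is looking for.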

The basic steps to use the application are as follows:

  1. Click on the 'Browse to raw data file...' button, and select the file you wish to extract your data from.
  2. Create your Find Replace Sets to prep the main string.
  3. Create your main Extract Sets to extract and define the data you need from the main document.
  4. Click the 'Extract!' button.
  5. Click the 'Create CSV File' button to create the .csv file in the same location as your main document.
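For the final step, the sketch below shows one way extracted values could be turned into CSV text. It is an assumption about the layout (field names as the header row, one extracted record per line), not the article's CSV class, and the CsvSketch name and its methods are hypothetical. It quotes fields that contain commas, quotes, or line breaks, which is the usual CSV convention.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not the CSV class from the download.
public static class CsvSketch
{
    // Quotes a field when it contains a comma, quote, or line break,
    // doubling any embedded quotes.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        string escaped = field.Replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Builds CSV text: a header row of field names, then one line per record.
    public static string Build(string[] fieldNames, IEnumerable<string[]> rows)
    {
        var sb = new StringBuilder();
        AppendLine(sb, fieldNames);
        foreach (string[] row in rows) AppendLine(sb, row);
        return sb.ToString();
    }

    private static void AppendLine(StringBuilder sb, string[] fields)
    {
        for (int i = 0; i < fields.Length; i++)
        {
            if (i > 0) sb.Append(',');
            sb.Append(Escape(fields[i]));
        }
        sb.Append("\r\n");
    }
}
```

The resulting string could then be written next to the source file with System.IO.File.WriteAllText; the output file name the utility actually uses is not stated in the article.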

Conclusion

Text TakeOut still needs some fine tuning, but it can get a lot of useful, sometimes critical, data from some pretty ugly sources into a very common format, and you can do almost anything you need to do with CSV. The ExtractSet and CSV classes may also be useful in any application that needs an object to store Start, End, Value, and FieldName strings. The CSV class creates a CSV file from a System.Collections.Generic.List<ExtractSet> object.

Updates

2/11/07

  • Article and Text TakeOut version 1.0.0.0 posted.

2/13/07

  • Updated Text TakeOut version 1.1.0.0 uploaded.
  • Automatically removes "\r\n" and "\n" from the extracted value.
  • Added exception handling and an explanation for the user if an error occurs during extraction.

1/13/09

  • Fixed download link text.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
Chief Technology Officer Earthbotics.com
United States
Born in Pennsylvania (USA), just north of Philadelphia, Joe has been programming since he was ten (he is now much older). He is an entirely self-taught programmer and is currently working as an IT Manager in Seattle, WA. He was previously a U.S. Navy Active Reservist for SPAWAR. In '98 he was honorably discharged from the USN. He served onboard the USS Carl Vinson (94-98). He was lucky enough to drink President Clinton's leftover wine, be promoted by his Captain, and fly in a plane off the flight deck, though not all at the same time. His interests, when time allows, are developing miscellaneous apps and Artificial Intelligence proof-of-concept demos that specifically exhibit human behavior. He is a true sports-a-holic, needs plenty of caffeine, and is a coding junkie. He also enjoys alternative music and is a big fan of Pearl Jam, Nirvana, new alternative music, and Alison Wonderland.
He is currently working on earthboticsai.net, which he says is fun and cool.

Joe is an INTP personality type. Joe "sees everything in terms of how it could be improved, or what it could be turned into. INTP's live primarily inside their own minds." INTPs also can have the "greatest precision in thought and language. Can readily discern contradictions and inconsistencies. The world exists primarily to be understood. 1% of the total population."
