Hi,
I have a large number of files, and I want to find the particular sentences or paragraphs in them that contain specific information.
I know that this is not a small problem; however, I would like some references to material that can help me proceed in the right direction.

For example: I have a large number of text files containing press releases from many companies announcing their new products. I would like to extract the sentences that give the release date and the name of the product.

I have tried using regular expressions, but the patterns become very complex and do not work well at all. I know that this is a complex problem, but any help in the form of references to websites or video lectures would be very useful. Sample projects would also be much appreciated.
Updated 11-Sep-13 8:09am
Comments
Sergey Alexandrovich Kryukov 11-Sep-13 11:21am    
NLP? Natural language processing? Nonlinear programming? Neuro-linguistic programming?
If any of the above is the case, forget regular expressions; the problem is far more complex. Or, if you find regular expressions suitable, don't call it "NLP".
—SA
rohith naik 11-Sep-13 14:06pm    
I wanted to know if Natural language processing could solve my problems. Regexes were just my first (not very good) effort at the problem. I wanted to know other solutions to this. Thanks.
Sergey Alexandrovich Kryukov 11-Sep-13 14:47pm    
Then I would advise you to remove the mention of regular expressions from your question, or just note it below the question.
Natural language processing is too serious a topic for the Quick Questions and Answers forum, though.
—SA

As I said, natural language processing is too serious a topic for this forum. Despite apparent progress (automatic translation, for example), the overall state of computer science in this field means that the available results can only be called experimental. Even though some technologies have been commercialized, more advanced applications, or even attempts to apply existing technology to more complex languages, produce ridiculous results.

First of all, you need a more sober estimate of what is feasible. In my opinion, it could easily take your whole lifetime, and it could require working in a team with world-class scientists (in computer science and linguistics) and engineers.

To get some ideas, start here, then follow the links:
http://en.wikipedia.org/wiki/Natural_language_processing

—SA
 
 
You should go with Regex. You just need to find a pattern.

For example, see: C# - find a line (regex) in a file and get the complete block of text according to another regex

There, the asker is searching for a title block.
So the pattern should be something like title[^!]*, as suggested in the answers there.
Quote:

Your Regex should be changed to contain the unknown characters as well, like



  • first title
  • then [^!]* ([^ ] means something not in this set, so [^!]* is everything except ! in any number)


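    C#
    // Requires: using System.Text.RegularExpressions;
    // Singleline only changes how '.' behaves; [^!]* already matches across line breaks.
    Regex regex = new Regex("title[^!]*", RegexOptions.Singleline);
    MatchCollection matches = regex.Matches(text);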

So, if you can create a pattern that fits your requirement, you can easily extract that portion of the text. You just need to identify the start and end characters/strings.
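
As a rough illustration of the start/end idea, the sketch below collects every block that begins at a start marker and stops just before an end marker. The class name BlockExtractor, the method ExtractBlocks, and the sample text are hypothetical and do not come from the linked thread; it assumes both markers are plain literal strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BlockExtractor
{
    // Returns every block that begins with `start` and runs up to
    // (but does not include) the next occurrence of `end`.
    public static List<string> ExtractBlocks(string text, string start, string end)
    {
        // Regex.Escape treats both markers as literal text.
        // ".*?" is lazy, so each match stops at the first end marker it reaches.
        // The lookahead (?=...) checks for the end marker without consuming it.
        string pattern = Regex.Escape(start) + ".*?(?=" + Regex.Escape(end) + ")";

        var blocks = new List<string>();
        // Singleline lets '.' match line breaks, so a block may span several lines.
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.Singleline))
        {
            blocks.Add(m.Value);
        }
        return blocks;
    }

    public static void Main()
    {
        // Hypothetical sample input; '!' plays the role of the end marker.
        string text = "intro! title: Widget 2\nreleased in May! footer";
        foreach (string block in ExtractBlocks(text, "title", "!"))
        {
            Console.WriteLine(block);
        }
    }
}
```

This only works when the documents really do contain stable start and end markers; a block with no end marker after it is skipped.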
 
 
Comments
rohith naik 11-Sep-13 14:11pm    
Hi Tadit, this was my very first attempt at the problem. However, I have more than 100k documents to process, and writing regular expressions that match even 1k documents is very hard. I think I need something more sophisticated. I would like to know what the various approaches to this problem are, how people have solved it, and which references they used. Thanks
Oh, fine. But I don't know any other way.

Can you explain what the problem is with using Regex on this many documents?
rohith naik 12-Sep-13 3:42am    
The problem is creating regexes for the same thing expressed in different ways (there are hundreds of ways documents can express the same information). It would be VERY hard to use regexes for this. There are no 'regular' patterns in the documents, as they are written in free-form language. That is why using regular expressions is so hard.
Oh, I get it now... If you don't have a regular pattern, then you can't use Regex. Correct.

Let me think for some time. I will reply to you if I can think of a way to do it.
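
The difficulty described in the comments above can be made concrete with a small sketch. It runs one narrow pattern and one looser, cue-based check over a few made-up sentences. Everything here is an assumption for illustration: the class name PhrasingDemo, the sample sentences, and both word lists are invented, and the cue-based check is only a crude candidate filter, not an NLP solution and not something proposed in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhrasingDemo
{
    // Narrow pattern: "<Product> will be released on <Month> <day>, <year>".
    static readonly Regex Narrow = new Regex(
        @"(?<product>[A-Z][\w ]*?) will be released on (?<date>[A-Z][a-z]+ \d{1,2}, \d{4})");

    // Looser cues (hypothetical lists): a release-like word and a date-like token.
    static readonly Regex ReleaseCue = new Regex(
        @"\b(?:release[sd]?|launch(?:es|ed)?|ships?|available|arrives?)\b",
        RegexOptions.IgnoreCase);
    static readonly Regex DateCue = new Regex(
        @"\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\b" +
        @"|\b\d{1,2}/\d{1,2}/\d{2,4}\b");

    public static bool MatchesNarrow(string sentence)
    {
        return Narrow.IsMatch(sentence);
    }

    // True when the sentence contains both a release cue and a date cue.
    public static bool LooksLikeReleaseSentence(string sentence)
    {
        return ReleaseCue.IsMatch(sentence) && DateCue.IsMatch(sentence);
    }

    public static void Main()
    {
        // Made-up sentences expressing similar information in different ways.
        string[] samples =
        {
            "Acme Phone X will be released on March 3, 2014.",
            "The Acme Phone X ships in March.",
            "Acme launches its new phone on 03/03/2014.",
            "Starting March 3, customers can buy the Acme Phone X."
        };
        foreach (string s in samples)
        {
            Console.WriteLine("narrow={0} loose={1} | {2}",
                MatchesNarrow(s), LooksLikeReleaseSentence(s), s);
        }
    }
}
```

The narrow pattern accepts only the first sentence. The cue-based check also accepts the second and third, but it still misses the last one because "buy" is not in the cue list, and it would accept unrelated sentences that happen to mention a month and a word such as "available". Each new phrasing needs another rule, which is the scaling problem with 100k free-form documents.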

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


