Text Processing Help - Regex/NLP/Content Search?

Question

Hi,
I have a large number of files, and I want to find the particular sentences or paragraphs in them that contain specific information.
I know this is not a small problem, but I would like some references to material that can help me proceed in the right direction.

For example: I have a large number of text files containing press releases from many companies announcing their new products. I would like to extract the sentences that state the release date and the name of the product.

I have tried using regular expressions, but the patterns become very complex and do not work well at all. I know this is a complex problem, but any references to websites or video lectures would be very helpful. Sample projects would also be much appreciated.

Posted 11-Sep-13 3:49am

rohith naik

Updated 11-Sep-13 8:09am

v2

Comments

Sergey Alexandrovich Kryukov 11-Sep-13 11:21am

NLP? Natural language processing? Nonlinear programming? Neuro-linguistic programming?
If any of the above is the case, forget regular expressions; the problem is way more complex. Or, if you find regular expressions suitable, don't call it "NLP".
—SA

rohith naik 11-Sep-13 14:06pm

I wanted to know if Natural language processing could solve my problems. Regexes were just my first (not very good) effort at the problem. I wanted to know other solutions to this. Thanks.

Sergey Alexandrovich Kryukov 11-Sep-13 14:47pm

Then I would advise removing the mention of Regular Expressions from your question, or just noting it under the question.
Natural language processing is too serious a topic for the Quick Questions and Answers forum, though.
—SA

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Answer 1 · 2013-09-11T08:53:00

As I said, Natural Language Processing is too serious a topic for this forum. Despite apparent progress (automatic translation, for example), the overall state of computer science in this field means the available results can only be called experimental. Even though some technologies have been commercialized, more advanced applications, or even attempts to apply existing technology to more complex languages, produce ridiculous results.

First of all, you need a more sober estimate of what you can achieve. In my opinion, this could easily take your whole lifetime, and it could require working in a team with world-class scientists (in computer science and linguistics) and engineers.

To get some ideas, start here, then follow the links:
http://en.wikipedia.org/wiki/Natural_language_processing

—SA

Tadit Dash (ତଡିତ୍ କୁମାର ଦାଶ) · Answer 2 · 2013-09-11T04:54:00

Solution 1

You should go with Regex. You just need to find a pattern.

For example: "C# - find a line (regex) in a file and get the complete block of text according to another regex"

There, the asker is searching for a title block.
So the pattern should be something like title[^!]* as suggested in the answers there.

Quote:
Your Regex be changed to contain the unknown characters as well, like
first title
then [^!]* ([^ ] means something not in this set, so [^!]* is everything except ! in any number)

C#
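using System.Text.RegularExpressions;
// Match "title" followed by everything up to (but not including) the next '!'.
Regex regex = new Regex("title[^!]*", RegexOptions.Singleline);
MatchCollection matches = regex.Matches(text);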

So, if you can create a pattern that fits your requirement, you can easily get that portion of the text. You just need to identify the start and end characters/strings.
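As a minimal sketch of how the quoted pattern behaves, assuming an input where each block starts with the word title and ends at a '!' character: the sample text, the class name TitleBlockDemo and the method FindTitleBlocks are made up for illustration and do not come from the linked question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TitleBlockDemo
{
    // Returns every block that starts with "title" and runs up to,
    // but not including, the next '!'.
    public static List<string> FindTitleBlocks(string text)
    {
        // [^!]* is a negated character class, so it matches newlines too.
        // RegexOptions.Singleline only changes the meaning of '.', which this
        // pattern does not use, so it is optional here.
        var regex = new Regex("title[^!]*");
        var blocks = new List<string>();
        foreach (Match match in regex.Matches(text))
        {
            blocks.Add(match.Value);
        }
        return blocks;
    }

    public static void Main()
    {
        // Hypothetical sample input: two blocks, each terminated by '!'.
        string text = "title One\nfirst body!\nignored line\ntitle Two\nsecond body!";
        foreach (string block in FindTitleBlocks(text))
        {
            Console.WriteLine("[" + block + "]");
        }
    }
}
```

This only works because the start marker (title) and the end marker (!) are fixed. If either one varies from document to document, the pattern has to grow with every variation.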

Posted 11-Sep-13 4:54am

Tadit Dash (ତଡିତ୍ କୁମାର ଦାଶ)

Comments

rohith naik 11-Sep-13 14:11pm

Hi Tadit, this was my very first go at the problem. However, I have more than 100k documents to process, and writing regular expressions that match even 1k of them is very hard. I think I need to use something more sophisticated. I wanted to know what the various approaches to this problem are, how people have solved it, and which references they used. Thanks

Tadit Dash (ତଡିତ୍ କୁମାର ଦାଶ) 12-Sep-13 3:23am

Oh, fine. But I don't know any other way.

Can you explain what problem you run into when you use Regex on that many documents?

rohith naik 12-Sep-13 3:42am

The problem is creating regexes for the same thing expressed in different ways (there are hundreds of ways documents can express the same information). It would be VERY hard to use regexes for this. There are no 'regular' patterns in the documents, as they are written in free-form language. That is why using regular expressions is so hard.
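To illustrate the point with a small sketch: the three sentences below are invented, and the pattern is just one plausible date regex, not one taken from the actual documents. The names DatePhrasingDemo and MentionsDate are hypothetical. All three sentences announce a release date, but the pattern recognizes only the first phrasing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DatePhrasingDemo
{
    // One plausible pattern: "<Month> <day>, <year>", e.g. "October 1, 2013".
    private static readonly Regex MonthDayYear = new Regex(
        @"\b(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s+\d{4}\b");

    // True if the sentence contains a date written in the one supported format.
    public static bool MentionsDate(string sentence)
    {
        return MonthDayYear.IsMatch(sentence);
    }

    public static void Main()
    {
        // Invented sentences that all announce a release date.
        string[] sentences =
        {
            "The Foo 3000 will be available on October 1, 2013.",
            "Foo 3000 ships 1st Oct next year.",
            "Customers can buy the new model from the first week of the fourth quarter."
        };
        foreach (string s in sentences)
        {
            Console.WriteLine(MentionsDate(s) + ": " + s);
        }
    }
}
```

Covering the second and third sentences would need further patterns, and each new document can introduce yet another phrasing.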

Tadit Dash (ତଡିତ୍ କୁମାର ଦାଶ) 12-Sep-13 4:31am

Oh I got it now... If you don't have a regular pattern, then you can't use Regex. Correct.

Let me think for some time. I will reply if I can think of a way to do it.