Hello,

I am supposed to take data from a Wikipedia dump, a Freebase dump, or DBpedia.
I am then supposed to write code that outputs what each datum in that database is, e.g. the name of a person or a business, an address, ... It does not matter which language I write the code in, but I'm only familiar with C, C++, Java and Python. Java is my preferred language.

Those databases contain all types of data: title, person name, address, social security number, phone number...

I have three questions:

1) Since I have used machine learning a lot, I have decided to use a machine learning approach.
I have started looking into WEKA, a Java machine learning toolbox. However, it is only available under a GPL license. Is there another toolbox that I can use in a commercial product?

2) The problem I am facing with a machine learning approach is that I don't know what features to use. All I can think of right now is: the length of the datum, the number of letters it contains, and the number of digits it contains.
That is very little given all the types of data those databases contain. Regular expressions do not seem to be a solution for this type of project.
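
As a rough illustration of the features mentioned above, here is a minimal Java sketch that turns one datum into a small feature map. The class name DatumFeatures, the method extract, and every feature beyond length, letter count and digit count (upper-case count, whitespace, punctuation, token count, digit ratio) are assumptions added for illustration, not part of the original question. The resulting map could be fed to whatever learning library is eventually chosen.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps one raw datum to simple character-level features.
final class DatumFeatures {

    static Map<String, Double> extract(String datum) {
        int letters = 0, digits = 0, upper = 0, spaces = 0, punct = 0;
        for (int i = 0; i < datum.length(); i++) {
            char c = datum.charAt(i);
            if (Character.isLetter(c)) {
                letters++;
                if (Character.isUpperCase(c)) {
                    upper++;
                }
            } else if (Character.isDigit(c)) {
                digits++;
            } else if (Character.isWhitespace(c)) {
                spaces++;
            } else {
                punct++; // anything that is not a letter, digit or whitespace
            }
        }

        // Number of whitespace-separated tokens (0 for an empty or blank datum).
        String trimmed = datum.trim();
        int tokens = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;

        Map<String, Double> f = new LinkedHashMap<>();
        f.put("length", (double) datum.length());   // from the question
        f.put("letters", (double) letters);         // from the question
        f.put("digits", (double) digits);           // from the question
        f.put("upper", (double) upper);             // assumption: extra feature
        f.put("spaces", (double) spaces);           // assumption: extra feature
        f.put("punct", (double) punct);             // assumption: extra feature
        f.put("tokens", (double) tokens);           // assumption: extra feature
        f.put("digitRatio",
              datum.isEmpty() ? 0.0 : (double) digits / datum.length());
        return f;
    }

    public static void main(String[] args) {
        System.out.println(extract("John Smith"));
        System.out.println(extract("+1 (416) 849-8900"));
    }
}
```

A phone number and a person name already differ clearly on digitRatio and punct, which suggests that a handful of cheap character-level features may separate the rigidly formatted types, while names and titles would probably need more than this.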

3) Is there another approach I can use? I mean, is machine learning the only approach?

Thank you for your help.

Regards,

Herve

1 solution

This stuff's way beyond me, but to get a discussion started, here's how I'd approach it. . .

* Build up a dictionary of basic words (you can pull a list from project gutenberg to get you started). Classify these as verbs, nouns, adjectives, etc.

* Read up on the syntax of sentences (e.g. sentence diagrams).

* Use this knowledge along with your dictionary to create a classification routine which can take a sentence and guess at a classification (verb, noun, etc) based on a word's position within the sentence.

* The nouns are the bits you're interested in (i.e. names, addresses, etc). Have another routine which you pass the sentence to if it contains an unknown noun and a keyword (named, called, he, she, lives at, etc). This can then add it to your list of likely candidates if the location of the keywords compared to the new noun is deemed as suggesting that the noun is a name/address.

* Break the data from your source down into sentences, pass them to the routine, and pull back the results.
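
To make the steps above a bit more concrete, here is a very rough Java sketch of the last two bullets: split text into sentences, look for a keyword, and collect the unknown words that follow it as a candidate. Everything in it is an assumption for illustration: the class name CandidateFinder, the tiny hard-coded KNOWN_WORDS set (a stand-in for a real dictionary), the cue lists, and the naive sentence splitter. It does not attempt the part-of-speech classification step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of keyword-driven candidate extraction.
final class CandidateFinder {

    // Stand-in dictionary; a real one would be loaded from a word list.
    private static final Set<String> KNOWN_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "is", "he", "she", "good", "boy", "dog",
            "lives", "at", "named", "called"));

    // Assumed cue phrases, taken from the keywords suggested in the answer.
    private static final List<String> NAME_CUES = Arrays.asList("named", "called");
    private static final List<String> ADDRESS_CUES = Arrays.asList("lives at");

    // Naive splitter: breaks after '.', '!' or '?' followed by whitespace.
    static List<String> splitSentences(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.trim().isEmpty()) {
                out.add(s.trim());
            }
        }
        return out;
    }

    static Map<String, List<String>> findCandidates(String text) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        found.put("name", new ArrayList<>());
        found.put("address", new ArrayList<>());
        for (String sentence : splitSentences(text)) {
            collect(sentence, NAME_CUES, found.get("name"));
            collect(sentence, ADDRESS_CUES, found.get("address"));
        }
        return found;
    }

    // Takes the run of unknown words that directly follows a cue phrase.
    // The cue is matched as a plain substring, so this is deliberately crude.
    private static void collect(String sentence, List<String> cues, List<String> out) {
        String lower = sentence.toLowerCase(Locale.ROOT);
        for (String cue : cues) {
            int at = lower.indexOf(cue + " ");
            if (at < 0) {
                continue;
            }
            String rest = sentence.substring(at + cue.length()).trim();
            StringBuilder candidate = new StringBuilder();
            for (String token : rest.split("\\s+")) {
                // Strip punctuation so "Herve." becomes "Herve".
                String word = token.replaceAll("[^\\p{L}\\p{N}]", "");
                if (word.isEmpty()
                        || KNOWN_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                    break; // a dictionary word ends the candidate
                }
                if (candidate.length() > 0) {
                    candidate.append(' ');
                }
                candidate.append(word);
            }
            if (candidate.length() > 0) {
                out.add(candidate.toString());
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(findCandidates(
                "He is named Herve. She lives at 20 Bay Street. "
                + "The dog is called a good boy."));
    }
}
```

With a full dictionary, ordinary words such as "street" would count as known, so the address rule in particular would need something smarter than "unknown words after the cue". The sketch only shows the overall shape of the routine.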

This will still be a very rough approach, but with a bit of tweaking I reckon it'll be OK for starters.

Alternatively, check the web for videos and docs about the Wanderlust Natural Language project - I think they attempted something similar, but more advanced.
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


