I'm considering word clustering as my graduation project, but I don't know what knowledge I should have before starting.
I'm currently taking a pattern recognition and classification course at college.
Any help?
Thanks in advance.

Why do so many people ask about "starting"? There are many ways to start, and many of them are good enough. It is not critical which one you choose.
For example, you can start by learning the relevant part of mathematics: https://en.wikipedia.org/wiki/Cluster_analysis.

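First, a very elementary mathematical point: to do clustering, you need a space containing the elements of the set you want to cluster, and that space needs a correctly defined distance between any two elements. A norm on a vector space is one common way to get such a distance: https://en.wikipedia.org/wiki/Normed_vector_space. Strictly speaking, what clustering requires is a distance (a metric), not a norm as such, since words do not form a vector space by themselves; I use "norm" loosely for this distance.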

In simple words, you should have (and supply to the clustering algorithm) a function which calculates the distance between two objects (words, in your case). Different functions are possible, but the function is not fully arbitrary: it has to satisfy the axioms of a metric space. Only then can you apply cluster analysis to your set. The article referenced above is really just the starting point.
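To make the idea of a distance function concrete, here is a minimal sketch in Python. It assumes the Levenshtein (edit) distance as the distance between words; that choice is only an illustration of a function that meets the metric axioms, and it measures spelling similarity, not meaning. The helper name `satisfies_metric_axioms` is made up for this example.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def satisfies_metric_axioms(items, dist) -> bool:
    """Brute-force check of the metric axioms on a finite sample.
    Passing on a sample is evidence, not a proof."""
    for x, y, z in product(items, repeat=3):
        if dist(x, y) < 0:                        # non-negativity
            return False
        if (dist(x, y) == 0) != (x == y):         # zero only for identical items
            return False
        if dist(x, y) != dist(y, x):              # symmetry
            return False
        if dist(x, z) > dist(x, y) + dist(y, z):  # triangle inequality
            return False
    return True


words = ["cluster", "clutter", "custard", "word", "ward"]
print(levenshtein("cluster", "clutter"))
print(satisfies_metric_axioms(words, levenshtein))
```

A function such as `lambda a, b: len(a) - len(b)` fails this check (it can be negative and is not symmetric), so it would not be a valid distance for clustering.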

—SA
 
Comments
Mujeeba Haj Najeeb 2-Nov-14 9:40am    
I need to understand the concept of word/text clustering.
I haven't grasped it well.
E.g., in OCR we process a text to extract letters.
My question is: what do we do in word/text clustering?
Sergey Alexandrovich Kryukov 2-Nov-14 15:59pm    
As I said, you define the distance between words (what I loosely called the "norm" of the space). It does not matter what exactly it is, as long as it meets the axioms of a metric: the distance is never negative, it is zero only between identical elements, it is symmetric, and it satisfies the triangle inequality, d(A, C) ≤ d(A, B) + d(B, C). Just read up on the topic. Then follow the general theory of clustering, which from this point on is abstracted from the nature of the elements you want to cluster, as long as the distance is correctly defined and works.
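The point that clustering is abstracted from the nature of the elements can be sketched in a few lines of Python. The function below is an assumption made for illustration (a simple threshold-based, single-linkage style grouping, not any particular published algorithm); the name `threshold_clusters` and the parameter `max_dist` are made up here. It never inspects the items themselves and only calls the distance function it is given.

```python
import math


def threshold_clusters(items, dist, max_dist):
    """Group items so that two items share a cluster when a chain of
    neighbours, each within max_dist of the next, connects them."""
    clusters = []
    unvisited = list(range(len(items)))
    while unvisited:
        frontier = [unvisited.pop(0)]  # seed a new cluster
        members = []
        while frontier:
            i = frontier.pop()
            members.append(i)
            # Pull in every unvisited item close enough to item i.
            near = [j for j in unvisited if dist(items[i], items[j]) <= max_dist]
            unvisited = [j for j in unvisited if j not in near]
            frontier.extend(near)
        clusters.append([items[i] for i in sorted(members)])
    return clusters


# The same code clusters numbers and 2D points; only the distance changes.
numbers = [1, 2, 3, 10, 11, 25]
print(threshold_clusters(numbers, lambda a, b: abs(a - b), 1.5))

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(threshold_clusters(points, math.dist, 1.0))
```

Clustering words would use exactly the same call, with a word distance passed as `dist`.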
—SA
Mujeeba Haj Najeeb 5-Nov-14 14:36pm    
I'll put my question another way:
which applications need word/text clustering?
I need ideas to start.
Sergey Alexandrovich Kryukov 5-Nov-14 16:12pm    
Do you want to implement clustering in some application or other software product, or do you need to understand why one would go for it at all?
If by "application" you mean using some available application, you are at the wrong forum. We usually advise on how to create software.
—SA
Mujeeba Haj Najeeb 6-Nov-14 14:56pm    
First of all, I need to understand why one would go for it at all.
I also need to know some real-life applications, to understand it more concretely.
I need to know where I'm going; I need a general perspective to decide whether to enter such a field or not.
Mujeeba Haj Najeeb asked:

First of all, I need to understand why one would go for it at all.
I also need to know some real-life applications, to understand it more concretely.
I need to know where I'm going; I need a general perspective to decide whether to enter such a field or not.
Thank you for your clarification; fair enough.

It's hard to give a comprehensive list of applications of this kind of analysis, but generally it is used in some fields of linguistics, in particular natural language processing:
http://en.wikipedia.org/wiki/Natural_language_processing.

It can be considered one of the many parts of computational linguistics:
http://en.wikipedia.org/wiki/Computational_linguistics.

See also:
http://en.wikipedia.org/wiki/Word-sense_induction,
http://en.wikipedia.org/wiki/Ambiguity.

Note that in the examples mentioned above, the norm (distance) itself (I discussed its role in Solution 1) is extremely complex: it should reflect semantic similarity, a very complex notion which is itself very hard to formalize. The distance values for some word set (thesaurus) can come from extensive statistical analysis, expert systems, and the like. This has nothing to do with the string comparison algorithms implemented in most libraries.
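As a rough illustration of a distance that comes from statistics rather than from string comparison, here is a toy sketch in Python. Everything in it is an assumption made for the example: the six-sentence corpus, the window size, and the function names `context_vectors` and `context_distance`. It counts which words appear near each word and compares those counts, which is far simpler than anything that would capture real semantic similarity.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def context_distance(vectors, a, b):
    """Euclidean distance between the unit-length context vectors of a and b."""
    for w in (a, b):
        if w not in vectors:
            raise KeyError(f"no context statistics for {w!r}")
    va, vb = vectors[a], vectors[b]
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    keys = set(va) | set(vb)
    return math.sqrt(sum((va[k] / norm_a - vb[k] / norm_b) ** 2 for k in keys))


corpus = [  # toy data, invented for this sketch
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cars",
    "stocks fell sharply today",
    "bonds fell sharply today",
]
vectors = context_vectors(corpus)
print(context_distance(vectors, "cat", "dog"))
print(context_distance(vectors, "cat", "stocks"))
```

In this toy corpus "cat" comes out closer to "dog" than to "stocks", although the spellings are unrelated. Note that two different words with identical contexts ("stocks" and "bonds" here) get distance zero, so this is strictly a pseudometric; a real system would need far more data and a more careful definition.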

See also: http://en.wikipedia.org/wiki/Word-sense_disambiguation.

There may be other applications which I have never heard of and could only speculate about. For example, one of my friends specialized in computational linguistics and defended his dissertation on inferring individual characteristics of a writer based exclusively on the statistics of the words found in text samples.

I must say that computational linguistics is a developing branch of science which is not yet really close to, say, serious commercial use. I feel that the major breakthrough works lie in the future, which might look attractive to newcomers. I believe every serious work in this field is on the cutting edge of both linguistics and applied mathematics (and maybe even "fundamental" mathematics). If you want to go for it (and it's good that you asked this question), you need a solid mathematical background and have to go seriously into linguistics, which is really hard to do. I don't think that being "just a programmer", a member of the technical staff, can be practical or reasonable here. Getting into real science is the only thing which makes sense, but this is not for everyone, so try to be realistic. I hope you won't take my words as discouragement. I would be more than happy to know that you take this route and are successful.

—SA
 
Comments
Mujeeba Haj Najeeb 7-Nov-14 2:41am    
Thank you very much. I have a general understanding of this now, but I have to think seriously about it.
Maybe I'll take it, maybe not.
I'll tell you my decision anyway.

Regards
Sergey Alexandrovich Kryukov 7-Nov-14 3:25am    
To think about it seriously? Good idea.
Thank you for your promise to notify me; I'll be glad to hear from you.
In any case, consider accepting this answer formally as well.
—SA
Mujeeba Haj Najeeb 15-Nov-14 3:40am    
I've thought about your last solution (Solution 2) and read it more than once.
I've found that I don't have the experience to enter such a field, but on the other hand I don't know what to do for my graduation project.
I want to stay in pattern classification or clustering, but with something less complex than this.
I don't know what to do, but for sure I won't work on word clustering.
What do you think about implementing the DBSCAN algorithm in C#?
Sergey Alexandrovich Kryukov 15-Nov-14 18:32pm    
All right, at least it's good that your decision is motivated.
As to DBSCAN: sorry, before your comment, I did not know this algorithm, so I don't want to confuse you with what I would think about it.
—SA
Mujeeba Haj Najeeb 16-Nov-14 2:22am    
It's OK, nothing to be sorry about.
At least you didn't give me an answer you are not sure about. Can I post it as a new question? Maybe someone knows.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


