
Finding Document Similarity using Cosine Theorem

7 Dec 2006
Finding Similarity in Docs

 


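In college we learned that, given two points in Euclidean space, we can draw a line from the origin to each point and then compute the cosine of the angle between the two lines. In data mining the same technique measures the similarity of two documents: each document becomes a point whose coordinates are its word frequencies, and the closer the cosine is to 1, the more similar the documents are. But how? Take these two sentences, for example: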

i have to go to school.
i have to go to toilet.

The words of the first sentence are i, have, to, go, to, school. Every word has a frequency of 1 except "to", which occurs twice and so has a frequency of 2.

The words of the second sentence are i, have, to, go, to, toilet. Again, every word has a frequency of 1 except "to", which has a frequency of 2.
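The counting step can be sketched roughly as below. This is only an illustration: `CountWords` is a hypothetical helper name that does not come from the article's download, and it assumes that words are separated by spaces, periods, or commas and that case does not matter.

```csharp
using System;
using System.Collections.Generic;

static class WordCountSketch
{
    // Hypothetical helper (not from the article's source):
    // counts how often each word occurs in a sentence.
    public static Dictionary<string, double> CountWords(string sentence)
    {
        var table = new Dictionary<string, double>();
        char[] separators = { ' ', '.', ',' };   // assumed word separators

        foreach (string word in sentence.ToLowerInvariant()
                     .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double count;
            table.TryGetValue(word, out count);  // count is 0 when the word is new
            table[word] = count + 1;
        }
        return table;
    }

    static void Main()
    {
        Dictionary<string, double> table = CountWords("i have to go to school.");
        Console.WriteLine(table.Count);   // 5 distinct words
        Console.WriteLine(table["to"]);   // 2, because "to" occurs twice
    }
}
```

A `Dictionary<string, double>` is used here only to match the table type that the article's own methods take.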

If we treat each distinct word as one dimension of an n-dimensional space, the two sentences become these points:

1   [i, have, to, go, school, toilet] = [1, 1, 2, 1, 1, 0]
2   [i, have, to, go, school, toilet] = [1, 1, 2, 1, 0, 1]

cos = (1*1 + 1*1 + 2*2 + 1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2) * sqrt(1^2 + 1^2 + 2^2 + 1^2 + 0^2 + 1^2))
    = 7 / (sqrt(8) * sqrt(8)) = 7 / 8 = 0.875
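The cosine is the dot product of the two vectors divided by the product of their lengths. A minimal sketch of that calculation is shown below. The `Cosine` method name is an assumption made for this illustration, it requires both tables to hold exactly the same keys, and returning 0 for an all-zero vector is a convention chosen here, not something the article specifies.

```csharp
using System;
using System.Collections.Generic;

static class CosineSketch
{
    // Hypothetical helper: cosine of the angle between two word-frequency vectors.
    // Assumes both tables contain exactly the same keys.
    public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = 0, normA = 0, normB = 0;
        foreach (KeyValuePair<string, double> kv in a)
        {
            double other = b[kv.Key];       // throws if the tables are not aligned
            dot += kv.Value * other;
            normA += kv.Value * kv.Value;   // squared length of a
            normB += other * other;         // squared length of b
        }
        if (normA == 0 || normB == 0)
            return 0;                       // convention chosen for this sketch
        // sqrt(normA * normB) equals sqrt(normA) * sqrt(normB)
        return dot / Math.Sqrt(normA * normB);
    }

    static void Main()
    {
        var first = new Dictionary<string, double>
            { { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "school", 1 }, { "toilet", 0 } };
        var second = new Dictionary<string, double>
            { { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "school", 0 }, { "toilet", 1 } };

        Console.WriteLine(Cosine(first, second));   // 7 / 8 = 0.875 for these vectors
    }
}
```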

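The interesting part of the code is handling words that do not exist in the other document: every word missing from one table is added to it with a frequency of 0, so that both tables end up with the same keys.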

    private static void PrepareTwoHashTable(Dictionary<string, double> table1, Dictionary<string, double> table2)
    {
        // For table1: add every word that table2 is missing, with frequency 0.
        foreach (KeyValuePair<string, double> kv in table1)
        {
            if (!table2.ContainsKey(kv.Key))
                table2.Add(kv.Key, 0);
        }
        // For table2: add every word that table1 is missing, with frequency 0.
        foreach (KeyValuePair<string, double> kv in table2)
        {
            if (!table1.ContainsKey(kv.Key))
                table1.Add(kv.Key, 0);
        }
    }
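As a usage illustration, the sketch below runs the method on the word counts of the two example sentences. The method body is copied from the article and made public only so the example compiles on its own; the class name `AlignSketch` and the hand-written tables are assumptions for this illustration.

```csharp
using System;
using System.Collections.Generic;

static class AlignSketch
{
    // Copy of the article's method, made public so this sketch is self-contained.
    public static void PrepareTwoHashTable(Dictionary<string, double> table1, Dictionary<string, double> table2)
    {
        foreach (KeyValuePair<string, double> kv in table1)
        {
            if (!table2.ContainsKey(kv.Key))
                table2.Add(kv.Key, 0);
        }
        foreach (KeyValuePair<string, double> kv in table2)
        {
            if (!table1.ContainsKey(kv.Key))
                table1.Add(kv.Key, 0);
        }
    }

    static void Main()
    {
        // Word counts of the two example sentences, before alignment.
        var first = new Dictionary<string, double>
            { { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "school", 1 } };
        var second = new Dictionary<string, double>
            { { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "toilet", 1 } };

        PrepareTwoHashTable(first, second);

        Console.WriteLine(first.Count);       // 6
        Console.WriteLine(second.Count);      // 6
        Console.WriteLine(first["toilet"]);   // 0, added because it only occurs in the second sentence
        Console.WriteLine(second["school"]);  // 0, added because it only occurs in the first sentence
    }
}
```

Each loop enumerates one table while adding to the other, so neither dictionary is modified while it is being enumerated.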


 

Term frequency normalizes the word counts by dividing each count by the total number of words in the document, so that every frequency falls in the [0,1] interval. We implement this in our project as follows:

    private static Dictionary<string, double> TfFactorized(Dictionary<string, double> table)
    {
        // Total number of words in the document.
        double sum = 0;
        foreach (KeyValuePair<string, double> kv in table)
        {
            sum += kv.Value;
        }

        // Divide every count by the total to get its term frequency.
        Dictionary<string, double> tfTable = new Dictionary<string, double>();
        foreach (KeyValuePair<string, double> kv in table)
        {
            tfTable.Add(kv.Key, kv.Value / sum);
        }
        return tfTable;
    }
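For the first example sentence the total is six words, so "to" gets 2/6 and every other word gets 1/6. The sketch below checks this. The method body is copied from the article and made public only so the example compiles on its own; the class name `TfSketch` and the hand-written table are assumptions for this illustration.

```csharp
using System;
using System.Collections.Generic;

static class TfSketch
{
    // Copy of the article's method, made public so this sketch is self-contained.
    public static Dictionary<string, double> TfFactorized(Dictionary<string, double> table)
    {
        double sum = 0;
        foreach (KeyValuePair<string, double> kv in table)
        {
            sum += kv.Value;
        }

        Dictionary<string, double> tfTable = new Dictionary<string, double>();
        foreach (KeyValuePair<string, double> kv in table)
        {
            tfTable.Add(kv.Key, kv.Value / sum);
        }
        return tfTable;
    }

    static void Main()
    {
        // Word counts for "i have to go to school": six words in total.
        var counts = new Dictionary<string, double>
            { { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "school", 1 } };

        Dictionary<string, double> tf = TfFactorized(counts);

        Console.WriteLine(tf["to"]);      // 2 / 6, about 0.333
        Console.WriteLine(tf["school"]);  // 1 / 6, about 0.167
    }
}
```

Zero entries added by PrepareTwoHashTable do not change the sum, so the result for the real words is the same whether the tables are aligned before or after this step. A table whose counts sum to 0 would produce NaN values, so a caller may want to guard against that case.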

 




Written By
Web Developer
Turkey

