In college we learned that, given two points in Euclidean space, we can draw a line from the origin to each point and then compute the cosine of the angle between the two lines. In data mining we can use the same technique to measure the similarity of two documents. But how? For example, take these two sentences:
i have to go to school.
i have to go to toilet.
The words of the first sentence are i, have, to, go, to, school. Every word has a frequency of 1 except "to", which occurs twice.
The words of the second sentence are i, have, to, go, to, toilet. Again every word has a frequency of 1 except "to", which occurs twice.
If we treat each distinct word as one dimension of an n-dimensional space, the two sentences become the following points:
1 [i, have, to, go, school, toilet] = [1, 1, 2, 1, 1, 0]
2 [i, have, to, go, school, toilet] = [1, 1, 2, 1, 0, 1]
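As a rough illustration of how the two vectors above can be produced, here is a minimal sketch. The class and method names (`VectorSketch`, `CountWords`, `ToVector`) are hypothetical and not taken from the article's source code, and the tokenizer simply lower-cases the text and splits on spaces, periods and commas, which is an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class VectorSketch
{
    // Count how many times each word occurs in a sentence.
    public static Dictionary<string, double> CountWords(string sentence)
    {
        var counts = new Dictionary<string, double>();
        char[] separators = { ' ', '.', ',' };
        string[] words = sentence.ToLowerInvariant()
            .Split(separators, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            // current is 0 when the word has not been seen yet
            counts.TryGetValue(word, out double current);
            counts[word] = current + 1;
        }
        return counts;
    }

    // Map the counts onto a fixed vocabulary; words that do not occur get 0.
    public static double[] ToVector(Dictionary<string, double> counts, string[] vocabulary)
    {
        var vector = new double[vocabulary.Length];
        for (int i = 0; i < vocabulary.Length; i++)
        {
            vector[i] = counts.TryGetValue(vocabulary[i], out double count) ? count : 0;
        }
        return vector;
    }

    public static void Main()
    {
        string[] vocabulary = { "i", "have", "to", "go", "school", "toilet" };
        double[] first = ToVector(CountWords("i have to go to school."), vocabulary);
        double[] second = ToVector(CountWords("i have to go to toilet."), vocabulary);
        Console.WriteLine(string.Join(", ", first));
        Console.WriteLine(string.Join(", ", second));
    }
}
```

The vocabulary is passed in explicitly so that both sentences are projected onto the same dimensions in the same order.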
cos = (1*1 + 1*1 + 2*2 + 1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2) * sqrt(1^2 + 1^2 + 2^2 + 1^2 + 0^2 + 1^2)) = 7 / (sqrt(8) * sqrt(8)) = 7/8 = 0.875
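The same calculation can be written as a small function. This is only a sketch: `CosineSketch` and `Cosine` are hypothetical names, and returning 0 for an all-zero vector is a convention chosen for this example, not something the article specifies.

```csharp
using System;

public static class CosineSketch
{
    // Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).
    public static double Cosine(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];     // numerator: sum of pairwise products
            normA += a[i] * a[i];   // squared length of a
            normB += b[i] * b[i];   // squared length of b
        }

        // Assumption for this sketch: an all-zero vector is treated as not similar to anything.
        if (normA == 0 || normB == 0) return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    public static void Main()
    {
        double[] first = { 1, 1, 2, 1, 1, 0 };
        double[] second = { 1, 1, 2, 1, 0, 1 };
        // Prints a value very close to 0.875 (floating-point rounding may affect the last digits).
        Console.WriteLine(Cosine(first, second));
    }
}
```

A result of 1 means the two vectors point in the same direction, and 0 means the documents share no words.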
The interesting part of the code is finding the words that exist in one document but not in the other.
    private static void PrepareTwoHashTable(Dictionary<string, double> table1, Dictionary<string, double> table2)
    {

The aim of term frequency (TF) is to normalize the word frequencies into the [0, 1] interval by dividing each count by the total number of words in the document, so we implement it in our project.

    private static Dictionary<string, double> TfFactorized(Dictionary<string, double> table)
    {
        // Total number of word occurrences in the document
        double sum = 0;
        foreach (KeyValuePair<string, double> kv in table)
        {
            sum += kv.Value;
        }

        // Divide every count by the total so each value falls in [0, 1]
        Dictionary<string, double> tfTable = new Dictionary<string, double>();
        foreach (KeyValuePair<string, double> kv in table)
        {
            tfTable.Add(kv.Key, kv.Value / sum);
        }
        return tfTable;
    }
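The body of PrepareTwoHashTable is not shown here, so the following is only a guess at what such a step might look like, not the author's implementation. The helper `AlignVocabularies` and the class `AlignSketch` are hypothetical names. The idea is to give each table a zero entry for every word that occurs only in the other table, so that both tables describe the same dimensions before the cosine is computed.

```csharp
using System;
using System.Collections.Generic;

public static class AlignSketch
{
    // Hypothetical helper: add every word found in only one table
    // to the other table with frequency 0.
    public static void AlignVocabularies(Dictionary<string, double> table1,
                                         Dictionary<string, double> table2)
    {
        foreach (string word in table1.Keys)
        {
            if (!table2.ContainsKey(word)) table2.Add(word, 0);
        }
        foreach (string word in table2.Keys)
        {
            if (!table1.ContainsKey(word)) table1.Add(word, 0);
        }
    }

    // Cosine similarity of two tables that already share the same keys.
    public static double Cosine(Dictionary<string, double> table1,
                                Dictionary<string, double> table2)
    {
        double dot = 0, norm1 = 0, norm2 = 0;
        foreach (KeyValuePair<string, double> kv in table1)
        {
            double other = table2[kv.Key]; // present because the tables were aligned
            dot += kv.Value * other;
            norm1 += kv.Value * kv.Value;
            norm2 += other * other;
        }
        if (norm1 == 0 || norm2 == 0) return 0; // convention chosen for this sketch
        return dot / (Math.Sqrt(norm1) * Math.Sqrt(norm2));
    }

    public static void Main()
    {
        var first = new Dictionary<string, double>
            { { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "school", 1 } };
        var second = new Dictionary<string, double>
            { { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "toilet", 1 } };

        AlignVocabularies(first, second);
        // Prints a value very close to 0.875 for the two example sentences.
        Console.WriteLine(Cosine(first, second));
    }
}
```

Dividing every value in a table by the same positive sum, as TfFactorized does, scales the whole vector without changing its direction, so the cosine of two tables is the same before and after TF normalization.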