
Finding Document Similarity using Cosine Theorem



Dec 7, 2006


Finding Similarity in Docs

 


In college we learned that, given two points in Euclidean space, we can draw a line from the origin to each point and then find the cosine of the angle between the two lines. In data mining, the same technique can be used to measure how similar two documents are. But how? Consider, for example, these two sentences:

i have to go to school.
i have to go to toilet.

The distinct words of the first sentence are i, have, to, go, school. Every word has a frequency of 1 except "to", which occurs twice.

The distinct words of the second sentence are i, have, to, go, toilet. Again, every word has a frequency of 1 except "to", which occurs twice.

If we treat each distinct word as one dimension of an n-dimensional space, the two sentences become the following points:

Sentence 1: [i, have, to, go, school, toilet] = [1, 1, 2, 1, 1, 0]
Sentence 2: [i, have, to, go, school, toilet] = [1, 1, 2, 1, 0, 1]
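The counts above can be produced mechanically. The following is a minimal sketch, not the code from the article's download: the class name WordCountSketch, the method CountWords, and the choice of separators and lower-casing are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountSketch
{
    // Hypothetical helper: splits on whitespace and basic punctuation,
    // lower-cases each word, and counts how often it occurs.
    public static Dictionary<string, double> CountWords(string text)
    {
        var counts = new Dictionary<string, double>();
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

        foreach (string raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            string word = raw.ToLowerInvariant();
            double current;
            counts.TryGetValue(word, out current); // current is 0 when the word is new
            counts[word] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        // Prints one "word = count" line per distinct word of the first sentence.
        foreach (KeyValuePair<string, double> kv in CountWords("i have to go to school."))
            Console.WriteLine(kv.Key + " = " + kv.Value);
    }
}
```

The values are stored as double rather than int so the resulting table has the same type as the tables used by the functions in this article.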

cos = (1*1 + 1*1 + 2*2 + 1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2) * sqrt(1^2 + 1^2 + 2^2 + 1^2 + 0^2 + 1^2))
    = 7 / (sqrt(8) * sqrt(8))
    = 7 / 8
    = 0.875
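To check the arithmetic, here is a small sketch that computes the cosine of the two example vectors: the dot product divided by the product of the two vector lengths, which comes to about 0.875 (7/8). The class name CosineSketch and the method Cosine are hypothetical names, and returning 0 for an all-zero vector is an assumed convention.

```csharp
using System;

public static class CosineSketch
{
    // Hypothetical helper: cosine of the angle between two equal-length vectors.
    public static double Cosine(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];   // numerator: sum of pairwise products
            normA += a[i] * a[i]; // squared length of a
            normB += b[i] * b[i]; // squared length of b
        }

        // Assumed convention: an all-zero vector has no direction, so report 0.
        if (normA == 0 || normB == 0)
            return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    public static void Main()
    {
        double[] first = { 1, 1, 2, 1, 1, 0 };  // "i have to go to school"
        double[] second = { 1, 1, 2, 1, 0, 1 }; // "i have to go to toilet"
        Console.WriteLine(Cosine(first, second));
    }
}
```

A result of 1 would mean the two documents use the same words in the same proportions, and 0 would mean they share no words at all.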

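The interesting part of the code is handling the words that exist in only one of the two documents. Each such word is added to the other document's table with a frequency of 0, so that both tables end up with the same set of words (the same dimensions).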

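    // Pads both tables so that they share the same set of words:
    // a word missing from one table is added there with a frequency of 0.
    private static void PrepareTwoHashTable(Dictionary<string, double> table1, Dictionary<string, double> table2)
    {
        // Add the words that appear only in table1 to table2.
        foreach (KeyValuePair<string, double> kv in table1)
        {
            if (!table2.ContainsKey(kv.Key))
                table2.Add(kv.Key, 0);
        }

        // Add the words that appear only in table2 to table1.
        foreach (KeyValuePair<string, double> kv in table2)
        {
            if (!table1.ContainsKey(kv.Key))
                table1.Add(kv.Key, 0);
        }
    }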
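Once both tables hold the same words, the cosine can be computed directly from them. The sketch below is one possible way to do that and is not taken from the article's download: TableCosineSketch and CosineOfTables are hypothetical names, and the method assumes the padding step has already run, so every key of the first table also exists in the second.

```csharp
using System;
using System.Collections.Generic;

public static class TableCosineSketch
{
    // Hypothetical helper: assumes both tables already contain the same keys.
    // Throws KeyNotFoundException if table2 lacks a word found in table1.
    public static double CosineOfTables(Dictionary<string, double> table1,
                                        Dictionary<string, double> table2)
    {
        double dot = 0, norm1 = 0, norm2 = 0;
        foreach (KeyValuePair<string, double> kv in table1)
        {
            // Look the word up by key: the two dictionaries are not
            // guaranteed to enumerate their entries in the same order.
            double other = table2[kv.Key];
            dot += kv.Value * other;
            norm1 += kv.Value * kv.Value;
            norm2 += other * other;
        }

        // Assumed convention: a table with only zero counts gives 0.
        if (norm1 == 0 || norm2 == 0)
            return 0;

        return dot / (Math.Sqrt(norm1) * Math.Sqrt(norm2));
    }

    public static void Main()
    {
        // The two example sentences, already padded with zero entries.
        var first = new Dictionary<string, double>
        {
            { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "school", 1 }, { "toilet", 0 }
        };
        var second = new Dictionary<string, double>
        {
            { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "toilet", 1 }, { "school", 0 }
        };
        Console.WriteLine(CosineOfTables(first, second));
    }
}
```

Matching entries by key rather than by position is the reason the zero padding is enough: the order in which the words were inserted into each table does not matter.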


 

Term frequency (TF) normalizes the raw counts: each word's count is divided by the total number of words in the document, so every frequency falls in the [0, 1] interval. We implement it in our project like this:

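    // Returns a new table in which each count is divided by the sum of all counts.
    private static Dictionary<string, double> TfFactorized(Dictionary<string, double> table)
    {
        // Total number of word occurrences in the document.
        double sum = 0;
        foreach (KeyValuePair<string, double> kv in table)
        {
            sum += kv.Value;
        }

        // Scale every count into the [0, 1] interval.
        Dictionary<string, double> tfTable = new Dictionary<string, double>();
        foreach (KeyValuePair<string, double> kv in table)
        {
            tfTable.Add(kv.Key, kv.Value / sum);
        }
        return tfTable;
    }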
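As a quick illustration of the normalization, the first example sentence has 6 word occurrences in total, so "to" gets 2/6 and every other word gets 1/6. The sketch below shows the same calculation written with LINQ. TfSketch and Normalize are hypothetical names, and returning an unscaled copy when the counts sum to 0 is an assumption that the function above does not make.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfSketch
{
    // Hypothetical LINQ variant: divides each count by the sum of all counts.
    public static Dictionary<string, double> Normalize(Dictionary<string, double> table)
    {
        double sum = table.Values.Sum();

        // Assumed convention: with no occurrences there is nothing to scale,
        // so return a copy instead of dividing by zero.
        if (sum == 0)
            return new Dictionary<string, double>(table);

        return table.ToDictionary(kv => kv.Key, kv => kv.Value / sum);
    }

    public static void Main()
    {
        var counts = new Dictionary<string, double>
        {
            { "i", 1 }, { "have", 1 }, { "to", 2 }, { "go", 1 }, { "school", 1 }
        };
        Dictionary<string, double> tf = Normalize(counts);

        Console.WriteLine(tf["to"]);        // 2 / 6
        Console.WriteLine(tf.Values.Sum()); // the normalized values add up to about 1
    }
}
```

Note that the cosine depends only on the direction of each vector, so dividing a document's counts by its own total leaves the cosine between two documents unchanged. The normalized values are mainly useful when frequencies are compared directly across documents of different lengths.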