Finding Document Similarity using Cosine Theorem






Dec 7, 2006
Finding Similarity in Docs
In college we learned that, given two points in Euclidean space, we can draw a line from the origin to each point and then compute the cosine of the angle between the two lines. In data mining, the same technique can be used to measure the similarity of two documents. But how? Consider two example sentences:
i have to go to school.
i have to go to toilet.
The words of the first sentence are i, have, to, go, to, school; every word has a frequency of 1 except "to", which occurs twice.
The words of the second sentence are i, have, to, go, to, toilet; again every word has a frequency of 1 except "to", which occurs twice.
If we treat each distinct word as one dimension of an n-dimensional space, the two sentences become the following points:
Sentence 1: [i, have, to, go, school, toilet] = [1, 1, 2, 1, 1, 0]
Sentence 2: [i, have, to, go, school, toilet] = [1, 1, 2, 1, 0, 1]
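The word counts above can be produced with a small helper. The following is only a minimal sketch: the class name `WordCounter`, the method name `CountWords`, and the choice of separators are assumptions made for illustration and are not taken from the article's source code.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Hypothetical helper (not from the article's source): counts how often
    // each word occurs in a sentence. Words are lower-cased and split on
    // whitespace and a few common punctuation marks.
    public static Dictionary<string, double> CountWords(string sentence)
    {
        var counts = new Dictionary<string, double>();
        char[] separators = { ' ', '\t', '\n', '\r', '.', ',', ';', '!', '?' };

        foreach (string word in sentence.ToLowerInvariant()
                     .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

Calling `WordCounter.CountWords("i have to go to school.")` should give a table with five entries in which "to" maps to 2 and every other word maps to 1.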
cos = (1*1 + 1*1 + 2*2 + 1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2) * sqrt(1^2 + 1^2 + 2^2 + 1^2 + 0^2 + 1^2)) = 7 / (sqrt(8) * sqrt(8)) = 7/8 = 0.875
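The cosine computation can be sketched in C# roughly as follows. The names `Cosine` and `CosineSimilarity` are hypothetical and not from the article's source. Two design choices here are assumptions: a word missing from either table is treated as having frequency 0, and an empty document yields a similarity of 0.

```csharp
using System;
using System.Collections.Generic;

public static class Cosine
{
    // Hypothetical helper (not from the article's source): cosine of the angle
    // between two word-frequency vectors stored as dictionaries.
    public static double CosineSimilarity(Dictionary<string, double> a,
                                          Dictionary<string, double> b)
    {
        double dot = 0, normA = 0, normB = 0;

        foreach (KeyValuePair<string, double> kv in a)
        {
            normA += kv.Value * kv.Value;
            double other;
            // A word absent from b contributes 0 to the dot product.
            if (b.TryGetValue(kv.Key, out other))
            {
                dot += kv.Value * other;
            }
        }
        foreach (double v in b.Values)
        {
            normB += v * v;
        }

        // Assumption: a document with no words is not similar to anything.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

For the two example sentences the dot product is 7 and both squared lengths are 8, so the result should be approximately 0.875.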
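The interesting part of the code is handling the words that appear in only one of the two documents. Each such word is added to the other document's table with a frequency of 0, so that both vectors have the same dimensions.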
private static void PrepareTwoHashTable(Dictionary<string, double> table1, Dictionary<string, double> table2)
{
    // Add every word that occurs only in table1 to table2 with frequency 0.
    foreach (KeyValuePair<string, double> kv in table1)
    {
        if (!table2.ContainsKey(kv.Key)) table2.Add(kv.Key, 0);
    }
    // Add every word that occurs only in table2 to table1 with frequency 0.
    foreach (KeyValuePair<string, double> kv in table2)
    {
        if (!table1.ContainsKey(kv.Key)) table1.Add(kv.Key, 0);
    }
}
Term frequency (TF) normalizes the word counts by dividing each count by the total number of words in the document, so that every value falls in the interval [0, 1]. We implement this in our project as follows.
private static Dictionary<string, double> TfFactorized(Dictionary<string, double> table)
{
    // Total number of word occurrences in the document.
    double sum = 0;
    foreach (KeyValuePair<string, double> kv in table)
    {
        sum += kv.Value;
    }

    // Divide each count by the total so that every value lies in [0, 1].
    Dictionary<string, double> tfTable = new Dictionary<string, double>();
    foreach (KeyValuePair<string, double> kv in table)
    {
        tfTable.Add(kv.Key, kv.Value / sum);
    }
    return tfTable;
}
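As a worked example of this normalization, take the table for the first sentence, assuming the word missing from it ("toilet") has already been added with a count of 0. The symbol f(w), the raw count of word w, is notation introduced here for illustration only.

```latex
\[
\mathrm{tf}(w) = \frac{f(w)}{\sum_{w'} f(w')}, \qquad
\sum_{w'} f(w') = 1 + 1 + 2 + 1 + 1 + 0 = 6
\]
\[
[\text{i},\ \text{have},\ \text{to},\ \text{go},\ \text{school},\ \text{toilet}]
= \left[\tfrac{1}{6},\ \tfrac{1}{6},\ \tfrac{2}{6},\ \tfrac{1}{6},\ \tfrac{1}{6},\ 0\right]
\]
```

The normalized values sum to 1, and each one lies in [0, 1].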