Finding Document Similarity using Cosine Theorem

m0nt0y4

Dec 7, 2006

Finding Similarity in Docs

In college we learned that, given two points in Euclidean space, we can draw a line from the origin to each point and then find the cosine of the angle between the two lines. In data mining we can use the same technique to measure the similarity of two documents. But how? For example, take these two sentences:

i have to go to school.
i have to go to toilet.

The words of the first sentence are i, have, to, go, school. Every word has a frequency of 1 except to, which appears twice.

The words of the second sentence are i, have, to, go, toilet. Again every word has a frequency of 1 except to, which appears twice.
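The counting step can be sketched as follows. This is a minimal illustration rather than the code from the article's download: the class name `WordCounter`, the method name `CountWords`, and the choice of separators are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Hypothetical helper (not from the article's source): counts how often
    // each word occurs in a piece of text.
    public static Dictionary<string, double> CountWords(string text)
    {
        var counts = new Dictionary<string, double>();
        // Assumed separators: whitespace plus basic punctuation.
        char[] separators = { ' ', '.', ',', '\t', '\n', '\r' };

        foreach (string word in text.ToLowerInvariant()
                     .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

Calling `WordCounter.CountWords("i have to go to school.")` should give a table with five entries, where to maps to 2 and every other word maps to 1.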

If we treat each distinct word as one axis of an n-dimensional space, each sentence becomes a point whose coordinates are its word frequencies:

Sentence 1: [i, have, to, go, school, toilet] = [1, 1, 2, 1, 1, 0]
Sentence 2: [i, have, to, go, school, toilet] = [1, 1, 2, 1, 0, 1]

cos = (1*1 + 1*1 + 2*2 + 1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2) * sqrt(1^2 + 1^2 + 2^2 + 1^2 + 0^2 + 1^2)) = 7 / (sqrt(8) * sqrt(8)) = 7 / 8 = 0.875
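The cosine of the angle between two frequency vectors is their dot product divided by the product of their lengths. A minimal sketch of that computation over two word tables might look like this. The names `CosineSketch` and `Cosine` are hypothetical, and returning 0 for an empty table is an assumption of this example, not something taken from the article's source.

```csharp
using System;
using System.Collections.Generic;

public static class CosineSketch
{
    // Hypothetical helper: cosine of the angle between two frequency tables.
    // A word missing from one table is treated as frequency 0 in that table.
    public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = 0, normA = 0, normB = 0;

        foreach (KeyValuePair<string, double> kv in a)
        {
            normA += kv.Value * kv.Value;
            double other;
            if (b.TryGetValue(kv.Key, out other))
                dot += kv.Value * other;
        }
        foreach (KeyValuePair<string, double> kv in b)
            normB += kv.Value * kv.Value;

        // Assumption: a table with no non-zero counts has similarity 0.
        if (normA == 0 || normB == 0)
            return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

For the two example sentences the dot product is 7 and each squared length is 8, so the result should be approximately 0.875.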

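The interesting part of the code is handling the words that exist in one document but not in the other. Each missing word is added to the other table with a frequency of 0, so that both tables end up with the same set of keys.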

private static void PrepareTwoHashTable(Dictionary<string, double> table1, Dictionary<string, double> table2)
{
	// Add every word that appears only in table1 to table2 with frequency 0.
	foreach (KeyValuePair<string, double> kv in table1)
	{
		if (!table2.ContainsKey(kv.Key))
			table2.Add(kv.Key, 0);
	}
	// Add every word that appears only in table2 to table1 with frequency 0.
	foreach (KeyValuePair<string, double> kv in table2)
	{
		if (!table1.ContainsKey(kv.Key))
			table1.Add(kv.Key, 0);
	}
}

Term frequency (TF) normalizes the word counts into the [0,1] interval by dividing each word's count by the total number of words in the document, so we implement it in our project.

private static Dictionary<string, double> TfFactorized(Dictionary<string, double> table)
{
	// Total number of words in the document.
	double sum = 0;
	foreach (KeyValuePair<string, double> kv in table)
	{
		sum += kv.Value;
	}

	// Divide each count by the total; guard against 0/0 when every count is 0.
	Dictionary<string, double> tfTable = new Dictionary<string, double>();
	foreach (KeyValuePair<string, double> kv in table)
	{
		tfTable.Add(kv.Key, sum == 0 ? 0 : kv.Value / sum);
	}
	return tfTable;
}
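As a worked example of the normalization, take the first sentence, whose counts [1, 1, 2, 1, 1, 0] add up to 6. The symbol tf below is shorthand introduced here for the values that TfFactorized returns.

```latex
\[
\mathrm{tf}(\text{i}) = \mathrm{tf}(\text{have}) = \mathrm{tf}(\text{go}) = \mathrm{tf}(\text{school}) = \frac{1}{6} \approx 0.167,
\qquad
\mathrm{tf}(\text{to}) = \frac{2}{6} \approx 0.333,
\qquad
\mathrm{tf}(\text{toilet}) = \frac{0}{6} = 0
\]
```

The normalized values add up to 1, and each one lies in the [0,1] interval.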