I'm working on a Java project where I use min-hashing to compute the Jaccard similarity between two documents. Each document is a text given as an unsorted integer array, e.g. arr[0] is the first word (encoded as an int), and so on. I compute the min-hashing similarity between the two sets and then use it to compute the Jaccard coefficient.
The problem is that when I divide the min-hashing similarity by the number of elements in the union of the two arrays, the number I get does not match the result of that division.
Example: arr1 = {4,5,6,7}, arr2 = {6,7}
min-hashing similarity: 0.5
union array = {0,1,2,3,6,7}, length = 6
Jaccard coefficient = min-hashing similarity / length = 0.5 / 6 = 0.0833333333333
but I get 0.096 when I compute the Jaccard coefficient.
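For cross-checking on small inputs like this one, the exact Jaccard coefficient (size of the intersection divided by size of the union) can be computed directly with sets. The following is only a minimal sketch, assuming each document is a plain int array and that repeated words count once; the class name ExactJaccard and its method are hypothetical and not part of my project.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for cross-checking; not part of the code in this post.
final class ExactJaccard {

    // Exact Jaccard coefficient: |A intersect B| / |A union B| over distinct values.
    static double jaccard(int[] a, int[] b) {
        Set<Integer> setA = new HashSet<>();
        for (int x : a) setA.add(x);
        Set<Integer> setB = new HashSet<>();
        for (int x : b) setB.add(x);

        // Assumption: two empty inputs are treated as having coefficient 0.
        if (setA.isEmpty() && setB.isEmpty()) return 0.0;

        Set<Integer> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);          // keep only values present in both

        Set<Integer> union = new HashSet<>(setA);
        union.addAll(setB);                    // all distinct values from either input

        return (double) intersection.size() / union.size();
    }
}
```

Calling ExactJaccard.jaccard(arr1, arr2) gives an exact value to compare the min-hash based result against.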
My code is below.
Thank you for your time.
What I have tried:
@SuppressWarnings("static-access")
public double jaccard(Document doc)
{
    // The (double) cast keeps the division in floating point.
    return this.minhash(doc) / (double) this.unionArrays(a, b, a.length, b.length);
}
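public static int unionArrays(int[] a, int[] b, int m, int n)
{
    int counter = 0;
    // Make a the shorter array so that the binary search runs over the smaller one.
    if (m > n)
    {
        int[] tempp = a;
        a = b;
        b = tempp;
        int temp = m;
        m = n;
        n = temp;
    }
    // Requires java.util.Arrays; note that this sorts the caller's array in place.
    Arrays.sort(a);
    // Every element of a is counted as part of the union.
    for (int i = 0; i < m; i++)
    {
        counter++;
    }
    // Add each element of b that does not occur in a.
    for (int i = 0; i < n; i++)
    {
        // Search the whole sorted range [0, m-1]; starting at 1 would skip a[0].
        if (binarySearch(a, 0, m - 1, b[i]) == -1)
        {
            counter++;
        }
    }
    return counter;
}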
private static int binarySearch(int[] arr, int l, int r, int x)
{
    if (r >= l)
    {
        int mid = l + (r - l) / 2;
        if (arr[mid] == x)
            return mid;
        if (arr[mid] > x)
            return binarySearch(arr, l, mid - 1, x);
        return binarySearch(arr, mid + 1, r, x);
    }
    return -1;
}
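The union size can also be counted without a hand-written binary search, by sorting a combined copy of both arrays and counting how often the value changes. This is only a sketch of that alternative, assuming duplicate values should count once; UnionCount and unionSize are hypothetical names, not part of the code above.

```java
import java.util.Arrays;

// Hypothetical alternative to unionArrays; leaves both input arrays unmodified.
final class UnionCount {

    // Returns the number of distinct values that occur in a or in b.
    static int unionSize(int[] a, int[] b) {
        // Copy both inputs into one array so the originals are not reordered.
        int[] merged = new int[a.length + b.length];
        System.arraycopy(a, 0, merged, 0, a.length);
        System.arraycopy(b, 0, merged, a.length, b.length);
        Arrays.sort(merged);

        // After sorting, equal values are adjacent: count each run once.
        int count = 0;
        for (int i = 0; i < merged.length; i++) {
            if (i == 0 || merged[i] != merged[i - 1]) {
                count++;
            }
        }
        return count;
    }
}
```

Because it works on a copy, the order of the words in each document is preserved for any later processing, at the cost of one extra array.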