Hi, can anyone tell me how to code the calculation of the distance between XML documents?

Thanks in advance
Updated 6-Aug-12 20:30pm
Comments
Reza Ahmadi 7-Aug-12 2:33am    
I'm a little bit confused! Could you please explain it in some more detail?
Andreas Gieriet 8-Aug-12 5:50am    
Hello nandhakumar.p.k
You got some suggested solutions to your question.
Any feedback from your side? Any rating? Anyone accepted?
Cheers
Andi

Hi,

You can try this link.
 
Comments
nandhakumar.p.k 7-Aug-12 3:15am    
Hi, I am new to .NET. I want to implement that paper only (XML Distance Measure), so I need some ideas.
See Document Distance Problem Definition for the algorithm.
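
For orientation, the distance that the C# code further down computes is the angle between the two word-frequency vectors. Written out, with f_A(w) and f_B(w) as my own notation (not taken from the referenced document) for how often word w occurs in text A and text B:

```latex
d(A, B) = \arccos\left(
  \frac{\sum_{w} f_A(w)\, f_B(w)}
       {\sqrt{\sum_{w} f_A(w)^2}\; \sqrt{\sum_{w} f_B(w)^2}}
\right)
```

Since all frequencies are non-negative, the result lies between 0 (same relative word frequencies) and π/2 (no words in common).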

From that, it should be easy to write a C# program that calculates the distance, e.g. in pseudo code:
C#
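using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Put these methods inside a class (or use them as C# 9+ top-level functions).

// Distance = angle (in radians) between the two word-frequency vectors:
// 0 for the same relative word frequencies, Pi/2 for no words in common.
// A text without any words has norm 0, so the result is NaN in that case.
double DocumentDistance(string textA, string textB)
{
    Dictionary<string, int> binsA = CalculateWordFrequencies(textA);
    Dictionary<string, int> binsB = CalculateWordFrequencies(textB);
    double innerProduct = CalculateInnerProduct(binsA, binsB);
    double normA = CalculateNorm(binsA);
    double normB = CalculateNorm(binsB);
    // Clamp to 1.0: rounding can push the cosine slightly above 1,
    // and Math.Acos would then return NaN.
    double cosine = Math.Min(1.0, innerProduct / (normA * normB));
    return Math.Acos(cosine);
}
// Counts how often each word occurs in the text (case-sensitive).
Dictionary<string, int> CalculateWordFrequencies(string text)
{
    Dictionary<string, int> bins = new Dictionary<string, int>();
    foreach (string word in GetWords(text))
    {
        if (bins.ContainsKey(word)) bins[word]++;
        else bins.Add(word, 1);
    }
    return bins;
}
// Splits the text into runs of word characters (letters, digits, underscore).
IEnumerable<string> GetWords(string text)
{
    return Regex.Matches(text, @"\b\w+\b").Cast<Match>().Select(m => m.Value);
}
// Sums frequencyA * frequencyB over every word that occurs in either text.
double CalculateInnerProduct(Dictionary<string, int> binsA, Dictionary<string, int> binsB)
{
    double product = 0.0;
    foreach (string word in binsA.Keys.Concat(binsB.Keys).Distinct())
    {
        int frequencyA = binsA.ContainsKey(word) ? binsA[word] : 0;
        int frequencyB = binsB.ContainsKey(word) ? binsB[word] : 0;
        // Convert before multiplying so that large counts cannot overflow int.
        product += (double)frequencyA * frequencyB;
    }
    return product;
}
// Euclidean length of the word-frequency vector.
double CalculateNorm(Dictionary<string, int> bins)
{
    double sum = 0.0;
    foreach (int frequency in bins.Values)
    {
        sum += (double)frequency * frequency;
    }
    return Math.Sqrt(sum);
}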


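To my understanding, it works for plain text files as well as for XML files: the word splitting also takes tag names, attribute names and attribute values as words. If two documents contain exactly the same words with the same frequencies, the distance will be 0. If some elements or attributes differ, the distance will be greater than 0. Note that the measure only counts words: word order and element nesting are ignored, so two documents with the same words in a different order also get distance 0.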
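
To illustrate the word splitting on XML, here is a minimal sketch that applies the same regular expression as GetWords to a small XML fragment. The helper name SplitIntoWords and the sample document are made up for this example, and it is written as a C# 9+ top-level program:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Same word-splitting pattern as GetWords, returned as a list for printing.
// SplitIntoWords is a hypothetical helper name used only in this sketch.
static List<string> SplitIntoWords(string text)
{
    return Regex.Matches(text, @"\b\w+\b").Cast<Match>().Select(m => m.Value).ToList();
}

// Hypothetical sample document, made up for this illustration.
string xml = "<book id=\"42\"><title>XML Basics</title></book>";

// Angle brackets, quotes, '=' and '/' are not word characters, so only
// tag names, attribute names, attribute values and text content remain.
Console.WriteLine(string.Join(" ", SplitIntoWords(xml)));
// prints: book id 42 title XML Basics title book
```

Each tag name is counted once for the opening and once for the closing tag, so in this measure markup can weigh as much as, or more than, the text content.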

Cheers
Andi

PS: The pseudo code above follows the description in the referenced document. Optimization is left as an exercise for you. For example, calculating the inner product can be improved by taking the intersection of both bins' keys, with no need to check for existence in each of the bins. Reason: words that occur in only one of the bins do not contribute to the product, because their term is 0.
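
One possible way to do the inner product optimization mentioned in the PS is sketched below. It is only a sketch: the method name CalculateInnerProductOfSharedWords and the sample word counts are made up for this example, and it is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Inner product over the shared words only. A word that is missing from
// one dictionary has frequency 0 there, so its term would be 0 anyway.
// The method name is hypothetical and used only in this sketch.
static double CalculateInnerProductOfSharedWords(
    Dictionary<string, int> binsA, Dictionary<string, int> binsB)
{
    double product = 0.0;
    foreach (string word in binsA.Keys.Intersect(binsB.Keys))
    {
        // Both lookups are safe: the word is a key in both dictionaries.
        product += (double)binsA[word] * binsB[word];
    }
    return product;
}

// Hypothetical word counts, made up for this illustration.
var countsA = new Dictionary<string, int> { ["book"] = 2, ["title"] = 2, ["id"] = 1 };
var countsB = new Dictionary<string, int> { ["book"] = 2, ["author"] = 2 };

// Only "book" is shared, so the result is 2 * 2.
Console.WriteLine(CalculateInnerProductOfSharedWords(countsA, countsB)); // prints: 4
```

The result is the same as with the loop over all words from both bins. The difference is that words occurring in only one document are skipped instead of being multiplied by 0.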
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


