How are TF-IDF values calculated in scikit-learn (Python), and how can I reproduce the result below?


Document 1 : ['includ', 'name', 'function', 'type', 'argument']

Document 2 : ['name', 'function', 'type', 'argument']


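I run the following code to calculate TF-IDF for the terms in both Document 1 and Document 2:

from sklearn.feature_extraction.text import TfidfVectorizer

# processData (custom tokenizer) and rawContentDict (dict of raw documents) are defined elsewhere
tfidf = TfidfVectorizer(tokenizer=processData, stop_words='english')
tfs = tfidf.fit_transform(rawContentDict.values())
tfs_Values = tfs.toarray()            # dense TF-IDF matrix, one row per document
tfs_Term = tfidf.get_feature_names()  # column labels; get_feature_names_out() in scikit-learn >= 1.0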

I get the following TF-IDF values as output:

Document 1 : [includ = 0.630099, name = 0.448320, function = 0.448320, type = 0.448320, argument = 0.448320]

Document 2 : [includ = 0, name = 0.577350, function = 0.577350, type = 0.577350, argument = 0.577350]

I don't understand how these scores are computed. I tried to work them out by hand, but my results differ from the program output. How is the TF-IDF score calculated in scikit-learn, and how can I reproduce the result above? Your help is much appreciated.
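
As a rough sketch of the calculation, assuming TfidfVectorizer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False): each raw term count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t, and each document's row is then divided by its Euclidean length. The helper name tfidf_rows is hypothetical, and the tokens are taken exactly as listed above, with no stop-word filtering.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Hypothetical helper: TF-IDF with smoothed idf and L2 row normalisation."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    rows = []
    for doc in docs:
        tf = Counter(doc)                                  # raw term counts
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Euclidean length of the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


doc1 = ['includ', 'name', 'function', 'type', 'argument']
doc2 = ['name', 'function', 'type', 'argument']

rows = tfidf_rows([doc1, doc2])
for i, row in enumerate(rows, start=1):
    print(f"Document {i}:", {t: round(w, 6) for t, w in row.items()})
```

With all five tokens kept, this gives includ ≈ 0.57496 in Document 1 and 0.5 for every term in Document 2.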

What I have tried:

I read these helpful resources [1], [2] and implemented the steps they describe, but I still don't get the same results.



The closest result I got, by following the steps in [1], is
[ includ = 0.57496 ]
and the one I want is [ includ = 0.630099 ]
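
One explanation that fits these numbers, offered as a guess rather than a confirmed diagnosis: stop_words='english' is still applied to the tokens returned by a custom tokenizer, and scikit-learn's built-in English stop list includes 'name'. If 'name' is dropped, Document 1 keeps four terms instead of five, which changes the L2 norm. The reported values support this: 0.630099² + 3 × 0.448320² ≈ 1, whereas five non-zero values would not form a unit-length row. The sketch below assumes the default smoothed idf and L2 normalisation; includ_weight is a hypothetical helper.

```python
import math

n_docs = 2
# Smoothed idf, assuming scikit-learn's default smooth_idf=True
idf_includ = math.log((1 + n_docs) / (1 + 1)) + 1  # 'includ' occurs in 1 document
idf_shared = math.log((1 + n_docs) / (1 + 2)) + 1  # terms occurring in both documents


def includ_weight(n_shared_terms):
    """L2-normalised weight of 'includ' in Document 1, given how many shared terms remain."""
    norm = math.sqrt(idf_includ ** 2 + n_shared_terms * idf_shared ** 2)
    return idf_includ / norm


with_name = includ_weight(4)     # name, function, type, argument all kept
without_name = includ_weight(3)  # 'name' removed as a stop word (assumption)

print(round(with_name, 5))     # ≈ 0.57496
print(round(without_name, 6))  # ≈ 0.630099
```

To check this against the real run, look at whether 'name' appears in tfs_Term; if it does not, the 0.448320 listed for 'name' belongs to a different column.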
Updated 24-Aug-21 0:25am
Richard MacCutchan 24-Aug-21 6:53am    
You will need to study the scikit documentation. The first link above contains an explanation of the formulas it uses.
Diyar talal 24-Aug-21 10:23am    
I did, and nothing is clear.
Richard MacCutchan 24-Aug-21 10:30am    
Sorry, but this forum is for Quick Answers, there is not space, or time, to explain some algorithm that you found on the internet.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
