Hi friends,

I need some help.

I want to know about efficient techniques for identifying the language(s) of the text in MS Office files (PPT, DOC, XLS), PDF files, etc.

I know about Unicode, but by comparing the code chart we can identify only a small number of languages.

So please help me if you know any efficient techniques.

[edit]SHOUTING removed - OriginalGriff[/edit]
Updated 23-Jan-12 0:07am
Comments
OriginalGriff 23-Jan-12 6:07am    
DON'T SHOUT. Using all capitals is considered shouting on the internet, and rude (using all lower case is considered childish). Use proper capitalisation if you want to be taken seriously.
Dagma D 23-Jan-12 6:15am    
Sorry, I didn't know that. Thank you for editing.

1 solution

Just a few notes. The problem of language recognition is very difficult, so you can hardly hope to get a complete solution through Quick Questions & Answers.

First of all, language is not an attribute of a document. In particular, a document may not have a single language: it can contain fragments written in different languages, even within the same clause or sentence. The real problem is to segment the document into fragments in different languages and identify each of them. I think this is apparent.

The problem formulated above cannot be solved with 100% fidelity even in theory, because different languages can have identical words. More than that, some pairs of languages have different words with 100% identical spelling; by "different" here I mean different sets of meanings for two words which are spelled identically in two languages. One typical example is Russian (http://en.wikipedia.org/wiki/Russian_language[^]) vs. Ukrainian (http://en.wikipedia.org/wiki/Ukrainian_language[^]). One can construct a phrase that even a human reader who knows the context would fail to classify without ambiguity: it could be interpreted as a Russian phrase quoting Ukrainian, as Russian-only, and so on, with different meanings. Even though the meaning itself is irrelevant to the problem, this makes the fragment boundaries ambiguous.

Identifying the subsets of Unicode code points that belong to different cultures cannot help either, because different languages, sometimes very different ones, use the same writing system (script). One striking example is the Perso-Arabic script. It is used by Persian (http://en.wikipedia.org/wiki/Persian_language[^]), from the so-called Indo-European language family; on the other hand, it is used by Arabic (http://en.wikipedia.org/wiki/Arabic[^]), which belongs to the Semitic family of languages, and these families are very, very different. Another such example is the Devanagari writing system, which is used by many languages in India and other countries (http://en.wikipedia.org/wiki/Devanagari[^]). Some Slavic languages use Cyrillic (http://en.wikipedia.org/wiki/Cyrillic_script[^]), while others use the Latin script (http://en.wikipedia.org/wiki/Latin[^]), as do the Germanic, Romance, and a great number of other language groups, and so on.
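To see the limitation concretely, here is a minimal Python sketch of code-point inspection (the block ranges are a small hand-picked subset for illustration, not a complete table). It can tell you the *script*, but Persian and Arabic text land in the same Arabic block, so it says nothing about which *language* you are looking at:

```python
from collections import Counter

# (start, end) code-point ranges for a few major scripts (illustrative subset)
SCRIPT_BLOCKS = [
    ((0x0041, 0x024F), "Latin"),
    ((0x0400, 0x04FF), "Cyrillic"),
    ((0x0600, 0x06FF), "Arabic"),
    ((0x0900, 0x097F), "Devanagari"),
    ((0x4E00, 0x9FFF), "Han"),
]

def dominant_script(text):
    """Return the script whose characters dominate the text, or None."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for (lo, hi), script in SCRIPT_BLOCKS:
            if lo <= cp <= hi:
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else None

# Persian and Arabic text both map to the same script:
print(dominant_script("سلام دنیا"))      # -> Arabic (the text is Persian)
print(dominant_script("مرحبا بالعالم"))  # -> Arabic (the text is Arabic)
```

So the script check is at best a coarse first pass; it cannot separate languages sharing a script, which is exactly the point above.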

[EDIT]

To make things worse, let's note that writing systems based on logograms (http://en.wikipedia.org/wiki/Logogram[^]), such as Chinese characters, make the whole concept of "a language" different. After all, what is the Chinese language, http://en.wikipedia.org/wiki/Chinese_language[^]? It is not a language in the same sense as English, Arabic or Russian. It is a family of more or less different languages united by the same written language. Those languages are essentially distinct, with varying degrees of mutual intelligibility (http://en.wikipedia.org/wiki/Mutual_intelligibility[^]). Naturally, there is no chance of telling one from another using just a text document and any kind of computer algorithm.

[END EDIT]

So, the results of language segmentation in a document can only be very approximate, and often the task is simply impossible, not just very difficult. We all know one practical example where language detection is more helpful than irritating: http://translate.google.com/[^]. But that is just a convenient way to switch the input language based on the system's "guess", with a certain acceptable uncertainty (pun unintended), not a classification of the whole document.

—SA
 
Comments
Manfred Rudolf Bihy 23-Jan-12 18:13pm    
Outstanding answer, alas I can only give you 5 even if it deserves more!
Sergey Alexandrovich Kryukov 23-Jan-12 18:30pm    
Thank you very much, Manfred.

I think I can quickly construct a sentence using German and English words in such a way that even a human reader could not confidently classify it as English or German. Let's see. For example:

Not only did the lecturer mix English and German in his speech; it looks like he intentionally tried to confuse the listeners by using words which made it difficult to understand exactly what language he was using at the moment, such as: "Forest", "Revolution", "Hand", "Bird", "Analysis", "Scientific", "Pony", "Report", "Respect" and the like.

Not only is it impossible to tell the language of each of the words in quotation marks, the semantics of the phrase do not assume that the reader should be able to determine it. :-)

--SA
thatraja 23-Jan-12 20:51pm    
Wow, Big 5!
Sergey Alexandrovich Kryukov 23-Jan-12 22:01pm    
Thank you, Raja.
--SA
Dagma D 23-Jan-12 23:33pm    
Hi SAKryukov, thank you for your beautiful answer.
May I ask one more question?
We need to do this project, so if I limit the number of languages that can be identified to a small number, say 10 or 15, and then apply the N-gram technique, will I be able to get an accuracy of about 90%?
Thanks in advance.
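For readers wondering what the N-gram technique mentioned above looks like, here is a minimal Python sketch of the classic character-N-gram approach (rank-order trigram profiles in the spirit of Cavnar and Trenkle's method). The one-line training samples are placeholders; a real system needs sizeable per-language corpora, and accuracy depends heavily on the corpus and the length of the input:

```python
from collections import Counter

def trigram_profile(text, top=50):
    """Rank the most frequent character trigrams in the text."""
    text = f"  {text.lower()}  "  # pad so word boundaries form trigrams
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

def rank_distance(profile, reference):
    """'Out-of-place' measure: sum of rank differences between profiles."""
    pos = {g: i for i, g in enumerate(reference)}
    return sum(abs(i - pos.get(g, len(reference)))
               for i, g in enumerate(profile))

# Toy training data -- placeholders, not a real corpus.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "german":  "der schnelle braune fuchs springt ueber den faulen hund",
}
PROFILES = {lang: trigram_profile(t) for lang, t in SAMPLES.items()}

def guess_language(text):
    """Pick the trained language whose profile is closest to the text."""
    p = trigram_profile(text)
    return min(PROFILES, key=lambda lang: rank_distance(p, PROFILES[lang]))

print(guess_language("the dog jumps"))     # -> english
print(guess_language("der hund springt"))  # -> german
```

With a fixed, small set of candidate languages and reasonably long input fragments, this simple scheme is known to work well in practice; for very short or mixed-language fragments it runs into exactly the ambiguity problems described in the answer above.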

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
