Just a few notes. The problem of language recognition is very difficult, so you can hardly hope to get a solution through Quick Questions & Answers.
First of all, language is not an attribute of a document. In particular, a given document may not have a single language at all: it can contain fragments written in different languages, even within the same clause or sentence. The real problem would be to cluster the document into fragments, each in one language, and identify the language of each fragment. I think this is apparent.
The problem formulated above cannot be solved with 100% fidelity even theoretically. This is because some different languages share identical words. More than that, some different languages have different words with 100% identical spelling; by "different" here I mean two words with different sets of meanings which happen to be spelled identically in the two languages. One typical example is Russian (http://en.wikipedia.org/wiki/Russian_language) vs. Ukrainian (http://en.wikipedia.org/wiki/Ukranian_language). One can construct a phrase that even a human reader who knows the context would fail to interpret without ambiguity: it could be read as a Russian phrase quoting Ukrainian, as Russian only, and so on, with different meanings. Even though the meaning itself is irrelevant to the problem, this can make fragment boundaries ambiguous.
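The ambiguity can be illustrated with a trivial dictionary lookup. The mini-lexicons below are hypothetical, just a few words picked for illustration; real vocabularies overlap on a much larger scale:

```python
# Tiny illustrative lexicons (hypothetical; real dictionaries are vastly larger).
# "мир" and "рука" are spelled identically in Russian and Ukrainian.
LEXICONS = {
    "Russian":   {"мир", "он", "рука"},
    "Ukrainian": {"мир", "він", "рука"},
}

def candidate_languages(word):
    """Return every language whose lexicon contains the given word."""
    return sorted(lang for lang, words in LEXICONS.items() if word in words)

print(candidate_languages("мир"))  # ['Russian', 'Ukrainian'] -- ambiguous
print(candidate_languages("він"))  # ['Ukrainian'] -- unambiguous
```

No amount of dictionary data removes the ambiguity; it only makes the overlapping set larger.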
Identification of Unicode code point subsets belonging to different cultures cannot help either. This is because different languages use the same writing system (script), and sometimes those languages are very different. One striking example is the Perso-Arabic script: it is used by Persian (http://en.wikipedia.org/wiki/Persian_language), from the Indo-European language family; on the other hand, it is used by Arabic (http://en.wikipedia.org/wiki/Arabic), which belongs to the Semitic family of languages, and these families are very, very different. Another such example is the Devanagari writing system, which is used in India and other countries by many languages (http://en.wikipedia.org/wiki/Devanagari). Some Slavic languages use Cyrillic (http://en.wikipedia.org/wiki/Cyrillic_script), while others use Latin (http://en.wikipedia.org/wiki/Latin), as do the Germanic and Romance groups and a great number of other language groups, and so on.
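To see why script identification cannot identify a language, here is a minimal sketch using Python's standard unicodedata module. It recovers the script by majority vote over Unicode character names (relying on the convention that the name starts with the script, e.g. "ARABIC LETTER SEEN"; good enough for a demo):

```python
import unicodedata

def dominant_script(text):
    """Guess the writing script by majority vote over Unicode character names.
    Note: this identifies the *script*, not the *language* -- the whole point."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. "CYRILLIC SMALL LETTER EM"
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"

# Persian and Arabic share one script, so this cannot tell them apart:
print(dominant_script("سلام دنیا"))      # Persian "hello world" -> ARABIC
print(dominant_script("مرحبا بالعالم"))  # Arabic "hello world"  -> ARABIC
# Russian and Ukrainian likewise both report CYRILLIC:
print(dominant_script("мир"))            # -> CYRILLIC
print(dominant_script("світ"))           # -> CYRILLIC
```

The function answers "which script" perfectly well, and "which language" not at all.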
[EDIT]
To make things worse, let's note that writing systems based on logograms (http://en.wikipedia.org/wiki/Logogram), like Chinese, change the whole concept of a language. After all, what is the Chinese language (http://en.wikipedia.org/wiki/Chinese_language)? It is not a language in the same sense as English, Arabic or Russian. It is a family of more or less different languages united by the same written language. Those languages are essentially different, with varying degrees of mutual intelligibility (http://en.wikipedia.org/wiki/Mutual_intelligibility). Naturally, there is no chance to tell one from another using just a text document and any kind of computer algorithm.
[END EDIT]
So, the results of clustering a document by language can only be very approximate, and the task is often just impossible, not merely very difficult. We all know one practical example where language detection is more helpful than irritating: http://translate.google.com/. But there it is only used to provide a convenient way to switch the input language based on the system's "guess", with a certain acceptable uncertainty (pun unintended), not to classify a whole document.
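The kind of "guess" such a tool makes is statistical. Here is a minimal character-bigram sketch of the idea; the training samples are hypothetical toy strings (real detectors are trained on large corpora), and the score is a crude overlap count, not a probability:

```python
from collections import Counter

def bigrams(text):
    """Count overlapping two-character sequences."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# Toy "training" samples (hypothetical; real detectors use large corpora):
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog the end",
    "German":  "der schnelle braune fuchs springt ueber den faulen hund",
}
MODELS = {lang: bigrams(text) for lang, text in SAMPLES.items()}

def guess(text):
    """Score each model by bigram overlap and return the best-ranked language.
    A guess with uncertainty, never a certainty."""
    grams = bigrams(text)
    scores = {lang: sum(min(n, model[g]) for g, n in grams.items())
              for lang, model in MODELS.items()}
    return max(scores, key=scores.get)

print(guess("the dog"))  # -> English
```

With enough training text such a guess is often right for whole paragraphs in one language, which is exactly the "acceptable uncertainty" use case; it does nothing to solve the fragment-clustering problem above.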
—SA