Just a few notes. The problem of language recognition is very difficult, so you can hardly hope to get a solution through Quick Questions & Answers.
First of all, language is not an attribute of a document. In particular, a given document may not have a single language at all: it can contain fragments written in different languages, even within the same clause or sentence. The real problem would be to cluster the document into fragments, each in one language, and identify the language of each fragment. I think this is apparent.
The problem formulated above cannot be solved with 100% fidelity even theoretically. This is because some different languages share identical words. More than that, some different languages have different words with 100% identical spelling; by "different" here I mean two words with different sets of meanings which happen to be spelled identically in the two languages. One typical example is Russian (http://en.wikipedia.org/wiki/Russian_language) vs. Ukrainian (http://en.wikipedia.org/wiki/Ukranian_language). One can construct a phrase that even a human reader who knows the context would fail to interpret without ambiguity: it could be read as a Russian phrase quoting Ukrainian, as Russian only, and so on, with different meanings. Even though the meaning itself is irrelevant to the problem, this can make fragment boundaries ambiguous.
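The ambiguity can be illustrated with a trivial dictionary lookup. The mini-lexicons below are hypothetical, just a few words picked for illustration; real vocabularies overlap on a much larger scale:

```python
# Tiny illustrative lexicons (hypothetical; real dictionaries are vastly larger).
# "мир" and "рука" are spelled identically in Russian and Ukrainian.
LEXICONS = {
    "Russian":   {"мир", "он", "рука"},
    "Ukrainian": {"мир", "він", "рука"},
}

def candidate_languages(word):
    """Return every language whose lexicon contains the given word."""
    return sorted(lang for lang, words in LEXICONS.items() if word in words)

print(candidate_languages("мир"))  # ['Russian', 'Ukrainian'] -- ambiguous
print(candidate_languages("він"))  # ['Ukrainian'] -- unambiguous
```

No amount of dictionary data removes the ambiguity; it only makes the overlapping set larger.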
Identification of Unicode code point subsets belonging to different cultures cannot help either. This is because different languages use the same writing system (script), and sometimes those languages are very different. One striking example is the Perso-Arabic script: it is used by Persian (http://en.wikipedia.org/wiki/Persian_language), from the Indo-European language family; on the other hand, it is used by Arabic (http://en.wikipedia.org/wiki/Arabic), which belongs to the Semitic family of languages, and these families are very, very different. Another such example is the Devanagari writing system, which is used in India and other countries by many languages (http://en.wikipedia.org/wiki/Devanagari). Some Slavic languages use Cyrillic (http://en.wikipedia.org/wiki/Cyrillic_script), while others use Latin (http://en.wikipedia.org/wiki/Latin), as do the Germanic and Romance groups and a great number of other language groups, and so on.
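To see why script identification cannot identify a language, here is a minimal sketch using Python's standard unicodedata module. It recovers the script by majority vote over Unicode character names (relying on the convention that the name starts with the script, e.g. "ARABIC LETTER SEEN"; good enough for a demo):

```python
import unicodedata

def dominant_script(text):
    """Guess the writing script by majority vote over Unicode character names.
    Note: this identifies the *script*, not the *language* -- the whole point."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. "CYRILLIC SMALL LETTER EM"
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"

# Persian and Arabic share one script, so this cannot tell them apart:
print(dominant_script("سلام دنیا"))      # Persian "hello world" -> ARABIC
print(dominant_script("مرحبا بالعالم"))  # Arabic "hello world"  -> ARABIC
# Russian and Ukrainian likewise both report CYRILLIC:
print(dominant_script("мир"))            # -> CYRILLIC
print(dominant_script("світ"))           # -> CYRILLIC
```

The function answers "which script" perfectly well, and "which language" not at all.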
[EDIT]
To make things worse, let's note that writing systems based on logograms (http://en.wikipedia.org/wiki/Logogram), like Chinese, change the whole concept of a language. After all, what is the Chinese language (http://en.wikipedia.org/wiki/Chinese_language)? It is not a language in the same sense as English, Arabic or Russian. It is a family of more or less different languages united by the same written language. Those languages are essentially different, with varying degrees of mutual intelligibility (http://en.wikipedia.org/wiki/Mutual_intelligibility). Naturally, there is no chance to tell one from another using just a text document and any kind of computer algorithm.
[END EDIT]
So, the results of clustering a document by language can only be very approximate, and the task is often just impossible, not merely very difficult. We all know one practical example where language detection is more helpful than irritating: http://translate.google.com/. But there it is only used to provide a convenient way to switch the input language based on the system's "guess", with a certain acceptable uncertainty (pun unintended), not to classify a whole document.
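The kind of "guess" such a tool makes is statistical. Here is a minimal character-bigram sketch of the idea; the training samples are hypothetical toy strings (real detectors are trained on large corpora), and the score is a crude overlap count, not a probability:

```python
from collections import Counter

def bigrams(text):
    """Count overlapping two-character sequences."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# Toy "training" samples (hypothetical; real detectors use large corpora):
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog the end",
    "German":  "der schnelle braune fuchs springt ueber den faulen hund",
}
MODELS = {lang: bigrams(text) for lang, text in SAMPLES.items()}

def guess(text):
    """Score each model by bigram overlap and return the best-ranked language.
    A guess with uncertainty, never a certainty."""
    grams = bigrams(text)
    scores = {lang: sum(min(n, model[g]) for g, n in grams.items())
              for lang, model in MODELS.items()}
    return max(scores, key=scores.get)

print(guess("the dog"))  # -> English
```

With enough training text such a guess is often right for whole paragraphs in one language, which is exactly the "acceptable uncertainty" use case; it does nothing to solve the fragment-clustering problem above.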
—SA