There is no such thing as "the language of a Web site".
Any site can be in many languages. In many cases, you cannot even tell the language of a single phrase, because a phrase can mix two or more languages, and some words are valid in more than one language. Of course, you can parse the HTML and read the content of the lang attribute: http://www.w3.org/TR/html401/struct/dirlang.html.
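To show what reading that attribute looks like in practice, here is a minimal sketch in Python using only the standard library's html.parser module. The names LangCollector and collect_langs are made up for this illustration, and the sample markup is invented. It only reports what the page author declared; it does not detect the language of the text.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Hypothetical helper: records every lang attribute found in the markup."""

    def __init__(self):
        super().__init__()
        self.langs = []  # (tag, lang value) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases names
        for name, value in attrs:
            if name == "lang" and value:
                self.langs.append((tag, value))


def collect_langs(html_text):
    """Return the declared (tag, lang) pairs; an empty list if none are declared."""
    parser = LangCollector()
    parser.feed(html_text)
    parser.close()
    return parser.langs


# Invented sample page: one document-level declaration, one per-element override
sample = """<html lang="en">
<body>
<p>Hello</p>
<p lang="fr">Bonjour</p>
<p>No attribute here</p>
</body>
</html>"""

print(collect_langs(sample))
```

Note that the result can be empty, can contain several different values for one page, and can simply be wrong if the author set it carelessly.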
Even if such attributes are available, they don't give you a solution. This information is supplied by the authors of the site and is used only for technical purposes. The HTML 4.01 specification linked above describes it this way:
Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include:
- Assisting search engines
- Assisting speech synthesizers
- Helping a user agent select glyph variants for high quality typography
- Helping a user agent choose a set of quotation marks
- Helping a user agent make decisions about hyphenation, ligatures, and spacing
- Assisting spell checkers and grammar checkers
The same can be said about content encoding. By its very nature, content does not have to be in any particular language.
In the general case, the problem is not only unsolvable; it cannot even be formulated in a valid way. You can only invite users to indicate the language of the request in some way, for example, by selecting an item in a list box.
—SA