IETF language tags and IANA language subtag registry

Andrea Simonassi

Oct 31, 2016

CPOL

A simplified guide to RFC 5646: what a language tag is, how it must be parsed, and what the IANA registry is. It also shows how to implement a library, in any programming language, that looks up a user's preferred language against a list of supported languages. A C# implementation is provided.

Download bcp47.zip - 105 KB

Introduction

This article is about tagging data with language tags so that we can give users information in the best language we can offer them, through an automated lookup, without even asking them to select a language from the list of supported ones.

It all started when I was trying to figure out how to store translations for a product database. I wanted my "producer" users to insert arbitrary translations and tag them with arbitrary language tags, and then let my "consumer" users read those translations in their preferred language.

I was not able to do this with the .NET Framework alone, so I started studying the issue. What follows is what I discovered on the matter.

Some philosophy

First, let us define language. We can define it in many ways; for example, a language could be seen as a set of syntactic rules. For my purpose, though, I prefer to define a language as a relation between the set of "meanings" and the set of "phrases". This suggests that a single meaning can be expressed by different phrases, using different languages:

∀ c ∈ Concepts, ∀ l ∈ Languages, ∃ P ⊆ Phrases : P = l(c)

where P is a subset of Phrases, possibly empty.

So each concept is one and only one, but in each language zero or more phrases can express it (a language is a relation, not a function).

Imagine we have a "meaning" or "concept". We can encode it as a phrase, and the language dictates which words we have to use to encode the meaning into a message. If the language is expressive enough, the receiver should be able to reconstruct the concept quite closely, provided that whoever receives the message knows the language.

This is nice and could help us model an entity-relationship database, but the problem with human languages is that we cannot consider "British English" equal to "American English", even though the two are quite similar. We said that a language is a "concept" encoder: can we consider language A similar to language B if a person who speaks A can decode a message expressed in B and reconstruct a concept that is close enough to the intended one? I think so.

So we could partition the set of languages by grouping similar languages together, and call the result the "language families" partition.

IETF language tags application

Back to earth: we, as software developers, need our software to communicate information to users; we need our users to understand concepts like "if you press that button I will save your file". Remember: a concept is one and only one, but many languages exist to express it. It is practically impossible for most of us to implement every language spoken on planet Earth in an application, but we can take advantage of the fact that many languages are similar to reach a bigger audience.

In some cases we may want to do high-quality work by translating into every single language variant, but in many cases covering the big language families is enough. A localization system should handle both use cases.

Now let us forget, just for a minute, that similar languages exist and that it is possible to group them together. In that case our software could store, for each supported language, the text needed to express the information to the user. Take the "save the file" concept as an example.

Example:

Our concept storage could be a subset of the Cartesian product concepts × languages × phrases:

SaveFile, Italian Language, Salvare il file
SaveFile, French Language, Enregistrer le fichier
... etc

With such a storage, our UI can communicate the SaveFile concept in the right language by simply selecting the triple whose language value equals the user's preferred language, or by falling back to a default language if no match is found.
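The naive storage described above can be sketched in a few lines of C#. This is only an illustration: the NaiveStore class, the Translate method, the English row and the use of short codes such as "it" instead of "Italian Language" are assumptions of mine, not part of the attached library. It needs C# 7 or later for the tuple syntax.

```csharp
using System;
using System.Collections.Generic;

public static class NaiveStore
{
    // (concept, language) -> phrase. The rows are sample data for illustration only.
    private static readonly Dictionary<(string Concept, string Language), string> Phrases =
        new Dictionary<(string Concept, string Language), string>
        {
            [("SaveFile", "en")] = "Save the file",
            [("SaveFile", "it")] = "Salvare il file",
            [("SaveFile", "fr")] = "Enregistrer le fichier",
        };

    // Exact match on the language value, otherwise fall back to the default language.
    public static string Translate(string concept, string language, string defaultLanguage = "en")
    {
        if (Phrases.TryGetValue((concept, language), out string phrase))
            return phrase;
        return Phrases[(concept, defaultLanguage)];
    }
}
```

Because the comparison is exact, NaiveStore.Translate("SaveFile", "fr-CA") misses the fr row and returns the English default.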

This sounds simple, but we were going too fast. We are close, but we have to complicate things a little: what if we support "American English" but the user's preference is "British English"?

US English is different from British English. People living in the US would prefer en-US, but most of them would still prefer en-GB to Saharan Arabic or Cantonese Chinese. A good lookup should return a similar language instead of the default one.

Our UI can decide to support both "US English" and "British English", or "Generic English", or maybe only one of them. In the latter case we need to handle approximate lookup if we want to extend our audience and still provide an acceptable user experience.

Now the minute has passed, and we have to remember that our set of languages contains "language family" subsets: if our UI supports a language that is in the same "language family" subset as the user's, maybe we could provide that language instead of the default one.

Knowing that language families exist, we could do an approximate lookup on our "concepts" storage, but not if the storage is the naive one shown above. Either we store somewhere the information that "US English" is a good approximation of "British English", or we define the language tag in a much more expressive way. IETF's BCP 47 standard does both.

If we label each language in the set with an IETF language tag, we can infer relations between languages just by looking at the tags themselves, and we can get additional help from an external language database called the IANA Language Subtag Registry.

The following is the whole point, and a recap of what has been said so far. The entire article could be reduced to this:

How do we match the user's language with our concepts store? If we have a perfect match, great; if we don't, we may still find a relation between what the user prefers and one of the languages we support, and that would be a good approximate match. Of course we will never match "US English" with "GB English" using a string comparison; we need a way to weight similar languages in the same "language family".

Why a standard

In the good old days, to support languages we just needed a storage like the naive one I proposed in the previous section. We had no need for any fancy language tag syntax because we did not have to do approximate matching on the user's language preference: we provided a list of languages, the user picked one of them, and our software picked the translation corresponding to the selection. Everybody was happy.

Nowadays, though, users expect the software to pick the best language automatically, based on their preferences. Applications have to read the user's preference from somewhere and then decide which language to use, with the complication that the preferred language may not be supported while a similar one is (a more generic one being the first choice, a less generic one the second).

Now of course, we need our application to understand how to get the user preference. That's why we need a standard way of representing languages.

The standard I am talking about is IETF's BCP 47. It is widely adopted, so by complying with it our software can communicate with the rest of the world. All the operating systems I know of support it, and so do all web clients and servers. That is good because most of the time we do not have to ask the user which language he or she prefers; we get it by calling some API that reads it from some configuration.

In .NET, for example, we get the current thread's language by calling the following API:

string preferredUserLanguageName = Thread.CurrentThread.CurrentCulture.Name;

The idea behind the IETF language tag syntax is this: if we call our language en-US, we should consider it an almost-good match for en-GB because they share the first subtag. That is the big picture, but things are a little more complicated: IETF's BCP 47 creates a hierarchy of partitions on the set of languages.

How localization is implemented in .NET

Usually the localized resource storage is made of resx files. There are one or more sets of resource files; each set has a main resource file, <ResourceFile>.resx, plus a bunch of child files containing the localized versions, <ResourceFile>.<ietf-language-tag>.resx. The compiler burns them into the assembly's satellite DLLs, and the runtime automagically looks up the best localized resources available for the user as the UI requires them. You have to do nothing but populate the localized resx files.

To .NET people: to further refine your localized strings you could use this beautiful tool, SmartFormat by Scott Rippey.

Why did you study and implement BCP47?

The standard .NET Framework implementation is very good and I do not want to compete with it; 99% of the time it is enough. But it also has some limits, or maybe I have some limits, so I decided to implement the lookup anyway.

I was trying to find a way to store translations in an SQL database where, for example, a product description record has to be translated. I was not happy with resource files because I wanted users to be able to add language tags without recompiling and deploying satellite DLLs.

Another reason was that writing a new implementation took less effort than studying existing tools, because the concepts are quite simple.

Last, I thought that maybe in the future (which usually never comes :)) the ability to customize the lookup policies could be useful. For example, it would be nice to implement a different kind of lookup, where the user provides a list of preferred languages instead of a single one, as web browsers do.

OK, but why an article?

Because I did not find any simple resource to read on the matter except the complete RFC 5646 itself.

Background

Tags for Identifying Languages are defined by this publicly available document: https://tools.ietf.org/html/bcp47.

Section 2 of that document defines the syntax of a language tag.

A language tag is made of blocks separated by hyphens. It can be as simple as en-US or as complicated as zh-yue-CN-a-anyext-x-private-x-other. Section 2 also specifies how these sub tags can be combined to produce a valid language tag.

Section 3 introduces the IANA sub tag registry. The registry defines all the valid sub tags, except the application-specific ones.

You can download the registry from http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry; it is just a Unicode text database.

Lookup a language

Well, we know there is a standard way to encode language tags in our store, and we have the idea that those tags are composed of a hierarchy of subtags like Lang-Region or Lang-Script-Region-Variant, where everything except the Lang subtag is optional. The next question is: how do we match the user's language against the supported ones? Suppose our UI supports three languages: en (generic English), es (generic Spanish) and fr-FR (French as spoken in France).

Now suppose a user comes to our UI and prefers es-CO (Spanish as spoken in Colombia).

What the UI has to do is look up the best match for the language tag es-CO among (es, en, fr-FR). In this particular case I would love our lookup function to return es, because the language subtag matches.

This lookup is not that difficult to implement, but not that easy either, because many subtags are optional so that both simple and complex use cases are supported: BCP 47 allows complex language tags that account for dialects, orthographic variants, extensions and much more.

MSDN discourages developers from implementing language lookup themselves; the best practice is to use the framework's integrated globalization functions.

But we'll ignore MSDN. To do the lookup we need to parse the language tag string, decomposing it into its parts so that we can compare region with region and script with script, and not region with script or extlang with variant (see the syntax chapter).

After parsing, we compare the individual tokens of the tag (language, script, region, ...) in their correct order to assign a rank to each available language, and then return the best-ranking one. A naive implementation could be:

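// Pseudo-code: Lang and parse() stand for any parser that splits a tag into its subtags.
Lang user = parse("es-CO");
Lang avail1 = parse("en");
Lang avail2 = parse("es");
Lang defaultL = avail1;

// Each comparison needs its own parentheses: "+" binds tighter than "?:".
int rank1 =
      (user.Language == avail1.Language ? 32 : -32)
    + (user.Script == avail1.Script ? 16 : -16)
    //... and so on for region, variant, ...
    //decide your policy for matching languages in order to provide better service to the user
    ;
int rank2 =
      (user.Language == avail2.Language ? 32 : -32)
    + (user.Script == avail2.Script ? 16 : -16)
    //... same policy as rank1 ...
    ;

// Return the best-ranking candidate, but only if it ranks above zero.
if (rank1 > rank2 && rank1 > 0)
    return avail1;
else if (rank2 > rank1 && rank2 > 0)
    return avail2;
return defaultL;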

With es as the result, the UI can look into the store and provide the strings tagged with es. For our user, es is a good match for es-CO, and a better one than the default English.

OK, the above pseudo-code, apart from its ugliness and zero reusability, could be good enough. In fact, BCP 47 states that the syntax of the language tag is designed to allow an implementation to do a useful lookup without reading the IANA Language Subtag Registry. But the pseudo-code ignores a lot of information that we can find in the registry.
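For readers who want something they can run, here is a minimal registry-free sketch of the same ranking idea. It is not the code of the attached library: the NaiveRanking class, the weights 32 and 8, and the decision to look only at language and region are assumptions made for illustration. The region is recognized by shape alone (2 letters or 3 digits), so private use and extension subtags would confuse it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NaiveRanking
{
    // Splits a tag into its primary language and, when present, a region
    // recognized only by its shape (2 letters or 3 digits). Sketch only.
    private static (string Language, string Region) SplitTag(string tag)
    {
        string[] parts = tag.ToLowerInvariant().Split('-');
        string region = parts.Skip(1).FirstOrDefault(p =>
            (p.Length == 2 && p.All(char.IsLetter)) ||
            (p.Length == 3 && p.All(char.IsDigit)));
        return (parts[0], region);
    }

    // Arbitrary weights: the language counts more than the region.
    public static int Rank(string user, string available)
    {
        var u = SplitTag(user);
        var a = SplitTag(available);
        return (u.Language == a.Language ? 32 : -32)
             + (u.Region == a.Region ? 8 : -8);
    }

    // Returns the best positively ranked candidate, or the first
    // supported language (the default) when nothing ranks above zero.
    public static string Lookup(string user, IReadOnlyList<string> supported)
    {
        string best = supported[0];
        int bestRank = 0;
        foreach (string candidate in supported)
        {
            int rank = Rank(user, candidate);
            if (rank > bestRank)
            {
                best = candidate;
                bestRank = rank;
            }
        }
        return best;
    }
}
```

With the supported list (en, es, fr-FR), a lookup for es-CO ranks es at 24 and the other two at -40, so es wins; a lookup for ja finds no positive rank and falls back to en.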

Why is using the registry better?

The fact is that it is better to look at the registry. One advantage is that the user might have a preference for a language that has been renamed or whose tag is not canonical.

Example:

Our app supports yue-HK and the user's preference is zh-yue-HK. We cannot match those strings without knowing that they represent the same language: zh-yue is a redundant form of yue, and only the registry can tell us.

Canonicalization is possible using the registry.
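A rough sketch of what canonicalization means in practice follows. The Canonicalizer class is hypothetical, and the two mappings are copied by hand from the examples in this article; a real implementation would load every Preferred-Value from the registry file instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Canonicalizer
{
    // Hand-copied excerpt of Preferred-Value mappings (illustration only).
    private static readonly Dictionary<string, string> PreferredValue =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["zh-yue"] = "yue",
            ["i-klingon"] = "tlh",
        };

    // Replaces the longest leading run of subtags that has a preferred value.
    public static string Canonicalize(string tag)
    {
        string[] parts = tag.Split('-');
        for (int n = parts.Length; n > 0; n--)
        {
            string head = string.Join("-", parts.Take(n));
            if (PreferredValue.TryGetValue(head, out string preferred))
            {
                string tail = string.Join("-", parts.Skip(n));
                return tail.Length == 0 ? preferred : preferred + "-" + tail;
            }
        }
        return tag; // nothing to replace
    }
}
```

After canonicalizing both sides, zh-yue-HK becomes yue-HK and compares equal to the supported tag.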

Another example:

Suppose our UI has Cantonese Chinese (yue) as the default language but also supports Mesopotamian Arabic (acm). A user's preference is Saharan Arabic (aao). I do not know a single word of Mesopotamian Arabic or Saharan Arabic, but I guess the user would rather have a Mesopotamian Arabic UI than a Cantonese one.

Well, the registry creates a relation between the acm and aao languages: they both share the same macrolanguage, ar (Arabic). This is how our matching function will behave in this case: a language in the same macrolanguage relation will be preferred (in my opinion) over the default language.
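The macrolanguage preference can be sketched as follows. Again, this is an assumption-laden illustration rather than the library's code: the MacroLanguageFallback class is hypothetical, the three Macrolanguage entries are copied by hand from the registry, and only primary language subtags are compared.

```csharp
using System;
using System.Collections.Generic;

public static class MacroLanguageFallback
{
    // Hand-copied excerpt of the registry's Macrolanguage fields (illustration only).
    private static readonly Dictionary<string, string> Macro =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["acm"] = "ar",
            ["aao"] = "ar",
            ["yue"] = "zh",
        };

    // A language without a macrolanguage stands for itself.
    private static string MacroOf(string language) =>
        Macro.TryGetValue(language, out string macro) ? macro : language.ToLowerInvariant();

    // supported[0] is treated as the default language.
    public static string Lookup(string userLanguage, IReadOnlyList<string> supported)
    {
        // 1. An exact match always wins.
        foreach (string s in supported)
            if (string.Equals(s, userLanguage, StringComparison.OrdinalIgnoreCase))
                return s;

        // 2. Otherwise prefer a language sharing the user's macrolanguage.
        string wanted = MacroOf(userLanguage);
        foreach (string s in supported)
            if (MacroOf(s) == wanted)
                return s;

        // 3. Otherwise fall back to the default.
        return supported[0];
    }
}
```

With (yue, acm) as the supported list, a user asking for aao gets acm, while a user asking for en gets the default yue.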

Last but not least, using the registry eases the writing of the language tag parser.

.NET Framework Localization Reference

You can find more info on .NET's really nice implementation of BCP 47 here: https://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo(v=vs.110).aspx (I hope the link lands you on the correctly localized page :-))

BCP47 language tag syntax

In this chapter I am going to show how a well-formed language tag is composed. You can consider it a simplified introduction to what you find in section 2 of RFC 5646.

This chapter and the next one are about technical details; skip both if you wish. Reading them is beneficial if you intend to create your own implementation.

Language tags are case-insensitive strings. There are three kinds of tags: normal language tags, private use tags and grandfathered tags.

Private Tags

A private tag begins with the character “x” (case insensitive) followed by a hyphen “-” and then by one or more private sub tags, such as “x-private-more-private”. The length of each sub tag must be between 1 and 8 characters.
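The rule above fits in a single regular expression. The sketch below is an illustration with hypothetical names (PrivateUse, IsPrivateUseTag), assuming that sub tags are limited to ASCII letters and digits as RFC 5646 requires.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrivateUse
{
    // "x", then one or more "-subtag" groups of 1 to 8 ASCII letters or digits.
    private static readonly Regex Pattern = new Regex(
        @"\Ax(-[a-z0-9]{1,8})+\z",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static bool IsPrivateUseTag(string tag) => Pattern.IsMatch(tag);
}
```

For example, x-private-more-private passes, while a lone x or a sub tag longer than 8 characters is rejected.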

Grandfathered Tags

Grandfathered tags are special cases: their structure has no syntactic meaning. For example, the string i-klingon is a grandfathered tag. To validate i-klingon, the parser just has to look up the registry for a grandfathered record matching the whole string i-klingon, case insensitively. Here is a sample registry entry for a grandfathered language tag:

Type: grandfathered
Tag: i-klingon
Description: Klingon
Added: 1999-05-26
Deprecated: 2004-02-24
Preferred-Value: tlh

As you may note, the i-klingon grandfathered record is deprecated in favor of the value tlh, which also means that tlh is an exact match for i-klingon during lookups.

Normal Language Tag

A normal language tag always begins with a language sub tag followed by optional sub tags. Normal language tags are parsed into a structure having the fields language, extlang, script, region, variant, extension and private.

The extlang field can be omitted because the registry contains a primary language entry that replaces any valid language-extlang combination.

That said, to parse a normal language tag we have to know its syntax. Here it is in simplified ABNF (see RFC 5646, section 2.1, for the detailed syntax):

primarylanguage
 ["-" extlang]
 ["-" script]
 ["-" region]
*("-" variant)
*("-" extension)
 ["-" privateuse]

Primary Language Sub Tag

The first sub tag of a language tag is always the primary language sub tag, made of 2 or 3 characters representing the ISO 639 code of the language. A complication: the ABNF syntax allows longer sub tags (up to 8 characters), but these must be considered reserved. That is another good reason to use the registry, because it only contains valid primary language sub tags.

Example:

zozo is not a valid language tag: it is syntactically valid but it is not a valid ISO 639 code
zoz is syntactically valid but not a valid ISO 639 code
zoo is a valid language tag, because it begins with a valid language sub tag (Asunción Mixtepec Zapotec)

Extlang Sub Tag

Although the preferred form of a language tag does not contain the extlang sub tag, a language tag can also be composed of a primary language followed by an extended language sub tag, for compatibility reasons. Each extlang sub tag in the IANA registry has a corresponding primary language sub tag. To make this clear, consider the following example: the tag ar-aao and the tag aao represent the same language, which means that while parsing we can normalize the input string ar-aao to aao.

The extlang sub tag is always three characters long. The ABNF syntax allows up to three extlang sub tags per language tag, but their use is reserved; my implementation considers a language tag with more than one extlang sub tag invalid.

Example:

ar-aao-acm is syntactically OK according to the ABNF syntax, but it is not valid because the use of more than one extlang sub tag is reserved

Script Sub Tag

Represents the script to be used, for example Arabic, Cyrillic, Latin and many more. The registry contains the list of all valid script sub tags.

This sub tag is optional, can only be parsed after lang and extlang (if present), and must be exactly 4 characters long. The RFC recommends title case (first letter uppercase, as in Cyrl) to ease human readability but, as always, it must be treated as case insensitive.

Example:
The tag sr-Cyrl must be parsed as:

Language: sr (sr can only be the language, because the first sub tag is always the language sub tag except when it is “i” or “x”)
Script: Cyrl (it can only be the script: it is four characters long and comes after the language, so it cannot be a region)

Note that the registry can specify a Suppress-Script attribute for a given language. For example, it-Latn is discouraged because the Latn script sub tag is suppressed for Italian: it can be tolerated as an input value, but should not be produced as output.

Region Sub Tag

The region represents the geographic location where the language is spoken.

The region sub tag is optional and must follow lang, extlang and script (if present). It must be a 2-letter ISO 3166 country code or a 3-digit UN M.49 code.

Example:
The tag it-756 is well formed (though not valid according to the registry) and must be parsed as

Language: it
Region: 756 (it can only be the region sub tag because it is exactly 3 digits, so it cannot be a script)

The IANA registry does not contain all the UN M.49 codes. In this case the 756 sub tag is not included, so my software will fail to recognize it-756 as valid, even though it is syntactically OK and 756 is a valid UN M.49 value.

Variant Sub Tag

The variant represents a variation of the language, such as a dialect.

The variant sub tag cannot be confused with other sub tags while parsing, because it must be 5 to 8 characters long, or exactly 4 characters long when it begins with a digit.

Example:
The tag sl-nedis must be parsed as
Language: sl
Variant: nedis (it is 5 characters long, so it cannot be a script or a region, only a variant)

For instance, “sl-nedis” represents the dialect of Slovenian spoken in Nadiza.

Example:
The tag de-CH-1996 must be parsed as
Language: de
Region: CH (because it follows the language sub tag and is 2 characters long)
Variant: 1996 (it is a variant because it begins with a digit and is 4 characters long).

For instance, the above tag represents German as used in Switzerland and as written using the spelling reform beginning in the year 1996 C.E.
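The length rules given above for the sub tags that follow the primary language can be collected into one small classifier. This is a sketch based only on the shape of a sub tag, with no registry involved; the SubtagShape enum and the SubtagShapes class are names I made up for the example, and a shape match says nothing about whether the sub tag is actually registered.

```csharp
using System;
using System.Linq;

public enum SubtagShape { Extlang, Script, Region, Variant, Singleton, Unknown }

public static class SubtagShapes
{
    private static bool IsAsciiLetter(char c) => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    // Classifies a sub tag that comes AFTER the primary language, by shape only.
    public static SubtagShape Classify(string subtag)
    {
        if (string.IsNullOrEmpty(subtag) || subtag.Length > 8) return SubtagShape.Unknown;

        bool letters = subtag.All(IsAsciiLetter);
        bool digits = subtag.All(IsAsciiDigit);
        bool alnum = subtag.All(c => IsAsciiLetter(c) || IsAsciiDigit(c));
        int n = subtag.Length;

        if (n == 1 && alnum) return SubtagShape.Singleton;   // starts an extension or private use
        if (n == 3 && letters) return SubtagShape.Extlang;   // three letters
        if (n == 4 && letters) return SubtagShape.Script;    // exactly four letters
        if ((n == 2 && letters) || (n == 3 && digits)) return SubtagShape.Region;
        if (alnum && (n >= 5 || (n == 4 && IsAsciiDigit(subtag[0])))) return SubtagShape.Variant;
        return SubtagShape.Unknown;
    }
}
```

Applied to the examples of this chapter, Cyrl is classified as a script, CH and 756 as regions, nedis and 1996 as variants.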

Extension Sub Tag and Private Sub Tag

These are less used, but the parser presented here handles them, more or less.

Both begin with a single character: if it is x, a private use sequence follows; otherwise it is an extension. All sub tags following the singleton x have to be treated as private, even if they match some entry in the registry.

Refer to RFC 5646 or look at the source code.

The IANA Language Subtag registry

The registry file can be downloaded from http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry and then cached locally.

I suggest reading the RFC if comprehensive knowledge of it is needed.

The registry helps to parse and validate language tags, makes it possible to canonicalize obsolete and redundant tags, and records the relation between languages that belong to the same macrolanguage.

The registry itself consists of a series of records, and each record contains a list of key-value pairs. The RFC explains well how the records are structured.

All records must have a “Type” entry, except the very first record in the file, which only has the File-Date entry.

The types of records are: language, extlang, script, region, variant, grandfathered and redundant.

The parser I’ve built reads the registry file and loads all its records into a bunch of dictionaries indexed by tag value (case insensitive). Here is the list of dictionaries in the registry:

valid languages records
valid extension language records
valid scripts records
valid regions records
valid variants records
valid grandfathered records
valid redundant records

All the above dictionaries are used to help parse and validate language tags.

Using the code

The usage is quite simple. You provide a list of language tags representing the languages your UI supports. You add supported languages this way:

bcp47.LangSet s = new bcp47.LangSet();
s.Add("en-US"); //this will be the default language, the first added
s.Add("de");//German
s.Add("fr");//French
s.Add("ja");//Japanese
s.Add("es");//Spanish

The s variable now contains the list of supported languages and can be queried with a user-provided preferred language string.

Here is an example of the best-fit language lookup.

string userPreferredLanguage = "es-CO";
Lang bestMatchOrDefault = s.Lookup(userPreferredLanguage); //will return "es"
//good, now I can show the user the product description by searching my database for
//the description which has the bestMatchOrDefault.Canonical tag
string ProductDescriptionText = myTranslationDatabase.Translate("ProductDescription", bestMatchOrDefault.Canonical);

Points of Interest

About the provided implementation

What could be improved

One thing I am not happy with in this code is that, under the hood, it loads the registry from the IANA website if it does not find a locally cached file; maybe the registry file should be an injectable dependency. At the time of writing, the LangSet.Add method calls Lang.Parse, and the Lang class has a static constructor that creates a new instance of the registry by calling Registry.Load(), which in turn checks for a local file named ".iana-language-registry" and, if it is not present, downloads it from the site. I will do better in the future. Wherever I install the library I also make sure a copy of the cached file is present; the attached project shows how to do that (keep the cache file in the project and copy it to the output folder on build).

Another thing I don't like, and may change in the future, is the thread-safety implementation: at the moment it is safe but sub-optimal, because I am serializing all reads and writes.

Registry Parser

The registry parser is quite simple: read the first record from the registry file, which must be a File-Date record; then, while there is data left, read the next record into a key-value dictionary and, depending on the record type, add it to the correct index, or else ignore it.

public static Registry Load(StreamReader sr)
{
    Registry u = new Registry();
    // The first record must contain only the File-Date field.
    var d = NextRecord(sr);

    // "MM" is the month; a lowercase "mm" would be parsed as minutes.
    if (d.Count != 1 || !d.ContainsKey("File-Date") || DateTime.TryParseExact(d["File-Date"], "yyyy-MM-dd", System.Globalization.CultureInfo.InvariantCulture, System.Globalization.DateTimeStyles.AllowTrailingWhite, out u.registryUpdated) == false)
        throw new FormatException();

    while (!sr.EndOfStream)
    {
        d = NextRecord(sr);
        CheckRecord(d);

        if (d["Type"] == "language")
        {
            u.languageIndex.Add(d["Subtag"], ParseRecord(d));
        }
        else if (d["Type"] == "extlang")
        {
            u.extlangIndex.Add(d["Subtag"], ParseRecord(d));
        }
        else if (d["Type"] == "script")
        {
            u.scriptIndex.Add(d["Subtag"], ParseRecord(d));
        }
        else if (d["Type"] == "region")
        {
            u.regionIndex.Add(d["Subtag"], ParseRecord(d));
        }
        else if (d["Type"] == "variant")
        {
            u.variantIndex.Add(d["Subtag"], ParseRecord(d));
        }
        else if (d["Type"] == "grandfathered")
        {
            u.grandfatheredIndex.Add(d["Tag"], ParseRecord(d));
        }
        else if (d["Type"] == "redundant")
        {
            u.redundantIndex.Add(d["Tag"], ParseRecord(d));
        }
    }
    return u;
}

After parsing the registry file, all the registry's dictionaries are filled, indexed by sub tag. The content of each dictionary entry is the registry record.
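The NextRecord helper called by Load is not listed in this article. As a rough idea of what such a helper has to do, here is a hypothetical version, not the one in the attached project. It assumes the registry file format: records separated by a line containing only %%, one "Name: value" field per line, and long values folded onto continuation lines that begin with whitespace. Joining repeated fields (such as Description) with a newline is my own choice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RegistryRecords
{
    // Reads fields up to the next "%%" separator (or end of input).
    public static Dictionary<string, string> NextRecord(TextReader reader)
    {
        var record = new Dictionary<string, string>();
        string lastKey = null;
        string line;
        while ((line = reader.ReadLine()) != null && line != "%%")
        {
            if (line.Length == 0) continue;

            // A line starting with whitespace continues the previous value.
            if (char.IsWhiteSpace(line[0]) && lastKey != null)
            {
                record[lastKey] += " " + line.Trim();
                continue;
            }

            int colon = line.IndexOf(':');
            if (colon < 0) throw new FormatException("Malformed registry line: " + line);

            string key = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();

            // Repeated fields are kept together, separated by a newline.
            if (record.TryGetValue(key, out string existing))
                record[key] = existing + "\n" + value;
            else
                record[key] = value;
            lastKey = key;
        }
        return record;
    }
}
```

Taking a TextReader rather than a StreamReader makes the helper easy to try on an in-memory string through a StringReader.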

Registry record

A registry record is made of a few read-only fields. We can use it to validate a language tag or to find properties of a language sub tag.

public class Record
{
    public readonly string Type;
    public readonly string Subtag;
    public readonly string Tag;
    public readonly string PreferredValue;
    public readonly string Description;
    public readonly DateTime Created;
    public readonly string SuppressScript;
    public readonly string MacroLanguage;
    readonly HashSet<string> Prefix;
    //...
}

The PreferredValue field is what helps us normalize redundant forms of tags; for example, i-klingon can be normalized to tlh.

The Language Tag Parser

The purpose of the language tag parser is to generate a Language Tag record from a language tag string.

The Language Tag record has the following structure:

    public class Lang : IEquatable<Lang>
    {
        //all fields are readonly thus i'm thread safe
        public readonly Record Language;
        public readonly Record ExtLang;
        public readonly Record Script;
        public readonly Record Region;
        public readonly ReadOnlyCollection<Record> Variants;
        public readonly ReadOnlyDictionary<char, ReadOnlyCollection<string>> Extensions;
        public readonly string Private;
        public readonly string Canonical;
   }

For example, the result of parsing the string es-CO is a Lang record that contains references to the registry records for each sub tag:

Lang l = Lang.Parse("es-CO");
Console.WriteLine("Lang: {0}", l.Language?.Description.Replace('\n',','));
Console.WriteLine("ExtLang: {0}", l.ExtLang?.Description);
Console.WriteLine("Script: {0}", l.Script?.Description);
Console.WriteLine("Region: {0}", l.Region?.Description);
//Lang: Spanish,Castilian
//ExtLang:
//Script: Latin
//Region: Colombia

The parse algorithm itself is quite simple: split the input string (for example "es-CO") into tokens at the hyphens, take the first token, and begin:

1. Ensure the first token is a valid primary language sub tag in the registry, or fail.
2. Get the next token.
3. Try to find it as an extlang in the registry; if found, get the next token.
4. Try to find it as a script in the registry; if found, get the next token.
5. The same as points 3 and 4 applies to the other sub tag types ...
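The steps above can be sketched as follows. This is not the Lang.Parse method of the attached project: ParsedTag and SketchParser are hypothetical names, and the four tiny hash sets stand in for the registry dictionaries, holding just enough sub tags to run the examples of this article. The sketch stops at the region and rejects anything after it.

```csharp
using System;
using System.Collections.Generic;

public sealed class ParsedTag
{
    public string Language, ExtLang, Script, Region;
}

public static class SketchParser
{
    // Tiny stand-ins for the registry indexes (illustration only).
    private static readonly HashSet<string> Languages =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "es", "sr", "ar", "zh", "de" };
    private static readonly HashSet<string> ExtLangs =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "aao", "yue" };
    private static readonly HashSet<string> Scripts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Cyrl", "Latn" };
    private static readonly HashSet<string> Regions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "CO", "CH", "HK" };

    public static ParsedTag Parse(string tag)
    {
        var tokens = new Queue<string>(tag.Split('-'));
        var result = new ParsedTag();

        // 1. The first token must be a known primary language.
        string first = tokens.Dequeue();
        if (!Languages.Contains(first))
            throw new FormatException("Unknown primary language: " + first);
        result.Language = first;

        // 2-4. Each optional sub tag is consumed only if the next token matches its index.
        if (tokens.Count > 0 && ExtLangs.Contains(tokens.Peek())) result.ExtLang = tokens.Dequeue();
        if (tokens.Count > 0 && Scripts.Contains(tokens.Peek())) result.Script = tokens.Dequeue();
        if (tokens.Count > 0 && Regions.Contains(tokens.Peek())) result.Region = tokens.Dequeue();

        // 5. Variants, extensions and private use would continue the same way;
        //    this sketch simply rejects whatever is left.
        if (tokens.Count > 0)
            throw new FormatException("Unexpected sub tag: " + tokens.Peek());
        return result;
    }
}
```

Because the checks run in the fixed order extlang, script, region, a tag such as sr-Cyrl ends up with Cyrl in the Script field and nothing in Region.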

The lookup algorithm

The current algorithm is not yet optimized. Its logic is to start from the most important sub tag (the primary language) and remove mismatches until only one candidate is left: start with the list of candidates and, at each step, remove those that do not match. The first step matches the primary language, then the script, then the region; you get the idea. You may personalize the algorithm.

Example:

We have to match es-CO against es, en, fr.

Remove every supported Lang that does not match the es primary language: we start with (es, en, fr) and remove (en, fr), so only (es) is left. It is the only one that survived the first step, so we return it.
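A simplified take on the elimination idea is sketched below. It is not the LangSet.Lookup method of the attached project: EliminationLookup is a hypothetical name, the sub tags are recognized by shape instead of through the registry, and the policy of never narrowing the candidate list down to nothing is my own assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EliminationLookup
{
    // Returns { language, script, region }, with null for an absent field.
    // Shape-based and registry-free: a sketch, not a validating parser.
    private static string[] Fields(string tag)
    {
        string[] parts = tag.ToLowerInvariant().Split('-');
        var f = new string[3];
        f[0] = parts[0];
        foreach (string p in parts.Skip(1))
        {
            if (p.Length == 1) break; // singleton: extensions or private use follow
            if (p.Length == 4 && char.IsLetter(p[0]) && f[1] == null) f[1] = p;
            else if ((p.Length == 2 || (p.Length == 3 && char.IsDigit(p[0]))) && f[2] == null) f[2] = p;
        }
        return f;
    }

    // supported[0] is treated as the default language.
    public static string Lookup(string user, IReadOnlyList<string> supported)
    {
        string[] wanted = Fields(user);

        // Step 1: keep only the candidates with the same primary language.
        List<string> candidates = supported.Where(s => Fields(s)[0] == wanted[0]).ToList();
        if (candidates.Count == 0) return supported[0];

        // Next steps: narrow by script, then by region, while more than one candidate is left.
        for (int i = 1; i < wanted.Length && candidates.Count > 1; i++)
        {
            int field = i;
            List<string> narrowed = candidates.Where(s => Fields(s)[field] == wanted[field]).ToList();
            if (narrowed.Count > 0) candidates = narrowed;
        }
        return candidates[0];
    }
}
```

Matching es-CO against (es, en, fr) leaves only es after the first step, exactly as in the example above; matching en-GB against (fr, en-US, en-GB) needs the region step to pick en-GB.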

The code is commented, and it is worth looking at if you need to personalize your lookup algorithm.

History

1.0.0 - 2016-10-31 first version