
Parsing Wikipedia XML Dump

A parser for Wikipedia pages from the XML dump is presented. Extraction of biographical data and of categories with their parent categories is shown as an example.

Introduction

Wikipedia is a perfect object for data mining, and much research focuses on various techniques for retrieving information of interest from it. For online extraction, Rion Williams utilized in his project a powerful library designed by Petr Onderka; Wikipedia itself provides GUI tools as well. The advantage of an online approach is that it always returns up-to-date information. However, while Wikipedia changes rapidly, the information about particular subjects largely remains constant over time, so the freshness of the dumps is not very significant. At the same time, online methods have obvious limitations in the size of the returned results and in the ability to program queries of interest.

For offline research, Wikipedia periodically updates its dump files for every language; their structure may differ slightly from language to language. WikiExtractor.py is a Python script for obtaining the clean text of Italian pages, and there are many other parsers. However, no single project can fit all needs.

In this article, we:

  • present a fairly simple parser for the "pages-articles" XML dump file, capable of extracting the body texts of pages, the titles of pages together with the categories they belong to, and the names of categories with their parent categories. Body texts are not cleaned of markup, in particular because the markup may contain important information.
  • calculate, for each Wikipedia article page, the number of references to it from other pages. These numbers may be used as a kind of "citation index" measuring the significance of pages.
  • produce a list of biographical data containing more than a million persons with a known birth and/or death year, preserving the categories to which they belong. About half of them have both dates, so the age may be calculated. Such a list may be used in sociological or historical research similar to the one we did earlier with a smaller amount of data.

The output generated by the parser is enough for simple categorization: if a category is listed among a page's parents, the page matches it, as the snippet below illustrates. However, the category of interest is often not mentioned in the immediate parents; the deepest child category is (or at least should be) listed instead. To reveal multilevel parent-child relationships between categories, the whole graph of the category hierarchy should be constructed. We will present such a hierarchy, along with a proper classifier, in the next article.
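In code, the one-level check is trivial (a sketch only; Parents is the Page property described later in the "Page Class" section):

C#
// One-level categorization: a page matches a category
// if that category is listed among its immediate parents.
static bool MatchesCategory(Page page, string category)
{
    return page.Parents.Contains(category);
}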

Background

We recommend being familiar with the process of creating Wikipedia pages, with their structure, and with their categorization. Please take a look at the contents of an actual "pages-articles" XML file. We used the enwiki-20160305-pages-articles-multistream.xml.bz2 file (an archive of about 12.7 GB, containing about 52.5 GB of XML). The latest dumps may be found at Wikipedia's generic place for database downloads. At the very least, please preview the 0.05% random selection from it (only 24 MB) located in the ...\bin\Debug directory of the project. This file contains the same heading part as the huge original one.

On the C# side, it is good to have experience with the XmlReader, Dictionary<>, and HashSet<> classes, with some LINQ, and with Regex.

Main Program and Test Run

The provided main program illustrates usage of the parser:

C#
private static void Main(string[] args)
{
    WikiXmlParser parser = new WikiXmlParser(@"enwiki-pages-articles-test.xml");
    parser.ExtractPagesAndHierarchy(
        "Pages&Parents.txt",
        "Categories&Parents.txt", 
        "BiographicalPages.txt"
    );
}

The constructor of the WikiXmlParser object requires the path to an existing XML dump. In the code snippet above, the relatively small sample XML file located in the ...\bin\Debug directory is used. Assuming that the program was launched from this folder, it creates three tab-delimited files there:

  1. "Pages&Parents.txt" file with three columns: page title, citations count, and concatenated with '|' categories. A sample row:

    Alchemy 433 Alchemy|Hermeticism|Esotericism|Alchemists|

  2. "Categories&Parents.txt", two columns: category and its concatenated with '|' parents. A sample row:

    Computer science Applied sciences|Computing|

  3. "BiographicalPages.txt", six columns: name (page title), birth year, death year, age, citations count, and categories, concatenated with '|'. A sample row:

    Aristotle -380 -320 60 3958 ...Academic philosophers|Empiricists|Meteorologists|Zoologists|...

Now please open "BiographicalPages.txt" in MS Excel, MS Access, or another database and sort the table numerically by the links (citations) count. You will see that only a few rows have non-zero citation counts. That's because only a very small subset of pages from the real XML dump was used in the test. However, Michael Jackson got four citations even in this case... Please also sort the table numerically by the birth year, death year, and age columns. The results are reasonable.
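If you prefer to inspect the output programmatically rather than in a spreadsheet, a minimal sketch along these lines will do (the column layout is as described above; the age threshold of 120 is just an illustrative choice):

C#
// Needs: using System; using System.IO; using System.Linq;
// Columns (tab-separated): name, birth year, death year, age, citations count, categories.
var rows = File.ReadLines("BiographicalPages.txt")
               .Select(line => line.Split('\t'))
               .Where(cols => cols.Length >= 6)
               .ToList();

// The ten most cited persons.
foreach (var cols in rows.Where(c => int.TryParse(c[4], out _))
                         .OrderByDescending(c => int.Parse(c[4]))
                         .Take(10))
    Console.WriteLine(cols[0] + ": " + cols[4] + " citations");

// Rows with a negative or implausibly large age point to errors in Wikipedia itself.
foreach (var cols in rows.Where(c => int.TryParse(c[3], out int age)
                                     && (age < 0 || age > 120)))
    Console.WriteLine("Suspicious age: " + cols[0] + " (" + cols[3] + ")");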

When the structure of the output files and their purpose become clear, it's time to download the full archive using the links above, extract the XML file from it, and specify the path to the XML in the WikiXmlParser constructor. That's all that is needed to process the Wikipedia XML dump of articles.

Now please analyze the full results, sorting the biographical table as described above. You'll see about 25 rows with a negative or unrealistically large age. The exact count of such rows means little, because the dump you use may differ a bit from the dump mentioned above. These rows represent actual errors in Wikipedia. Pages are written by people, so it's hardly possible to eliminate all mistakes. However, the share of such obvious errors is less than 0.005%, proving the very high quality of Wikipedia data. On the other hand, the ability to discover such errors is an extra benefit of our parser.

Page Class

Page is a class providing public access to Wikipedia page properties such as:

  • string Title is the title of the current page. It is also the suffix of the corresponding online page address: https://en.wikipedia.org/wiki/[ Title ].
  • string Text is the content of the <text> tag of the dumped Wikipedia page.
  • int Namespace is the "namespace" of the page. Namespaces are briefly explained at the beginning of the XML file. The value 0 is assigned to generic articles, 14 to categories, 6 to files, 10 to templates, 118 to drafts, etc.
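If you need to filter pages by namespace yourself, the numbers above can be captured as constants (a tiny sketch; the constant names are ours, not part of the parser):

C#
const int ArticleNamespace  = 0;
const int FileNamespace     = 6;
const int TemplateNamespace = 10;
const int CategoryNamespace = 14;
const int DraftNamespace    = 118;

// Keep only generic articles and categories, as the parser itself does.
static bool IsArticleOrCategory(Page page)
{
    return page.Namespace == ArticleNamespace ||
           page.Namespace == CategoryNamespace;
}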

Other possible properties like "id", "revision", "timestamp", "contributor", etc., were not of interest to us, but they can be extracted in the same way if needed. Please look at the XML dump to learn the possible element names, their formats, and their meanings, or read the proper Wikipedia explanations.

After processing Text with:

C#
private void GetReferences()

method, two important collections become publicly available:

  • HashSet<string> Links keeps titles of links to other pages. Each link is counted once.
  • HashSet<string> Parents contains distinct titles of categories to which the page belongs.

These collections are calculated internally, just once, upon the first access to them.
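How the references are gathered is an implementation detail of the Page class; a rough sketch of the idea, using one regular expression over the wiki markup (not necessarily the parser's exact code), could look as follows:

C#
// Needs: using System.Collections.Generic; using System.Text.RegularExpressions;
// Wiki links look like [[Target]] or [[Target|display text]];
// category membership looks like [[Category:Name]] or [[Category:Name|sort key]].
static readonly Regex WikiLink =
    new Regex(@"\[\[([^\]\|]+)(\|[^\]]*)?\]\]", RegexOptions.Compiled);

static void CollectReferences(string text,
                              HashSet<string> links, HashSet<string> parents)
{
    foreach (Match m in WikiLink.Matches(text))
    {
        string target = m.Groups[1].Value.Trim();
        if (target.StartsWith("Category:"))
            parents.Add(target.Substring("Category:".Length).Trim());
        else
            links.Add(target);
    }
}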

There is a public method for parsing "Infobox":

C#
bool ParseInfobox(
    out string title,     
    out List<KeyValuePair<string, string>> keysAndValues) 

It is not executed in our test program, but may be interesting for researchers.

Let's familiarize ourselves with the "infobox". It is not a required part of the page, but it sometimes provides valuable information not contained in the parent categories. The following quote from the XML file is an example:

{{Infobox writer
| name = George MacDonald
| image = George MacDonald 1860s.jpg
| imagesize   = 225px
| caption     = George MacDonald in the 1860s
| pseudonym   =
| birth_date = {{birth date|1824|12|10|df=y}}
| birth_place = [[Huntly, Scotland|Huntly]], 
                [[Aberdeenshire (traditional)|Aberdeenshire]], Scotland
| death_date = {{death date and age|1905|9|18|1824|12|10|df=y}}
| death_place = [[Ashtead]], Surrey, England, 
                [[United Kingdom of Great Britain and Ireland]]
| occupation = [[Minister (Christianity)|Minister]], Writer (poet, novelist)
| nationality =  Scottish/British
| period      = 19th century
| genre       = Children's literature <!-- [[Fantasy literature|Fantasy]], 
                [[Christian apologetics]] -->
| subject     =
| movement    =
| signature   =
| website     =
| notableworks = ''[[Lilith (novel)|Lilith]]'', ''[[Phantastes]]'', 
                 ''[[David Elginbrod]]'', '...
| influences  = [[Novalis]], [[Friedrich de la Motte Fouqué|Fouqué]], 
                [[Edmund Spenser|Spenser]],...
| influenced  = [[C. S. Lewis]], [[J. R. R. Tolkien]], 
                [[G. K. Chesterton]], [[Mark Twain]],... 
}}

Actually, an infobox is a list of fairly arbitrary keys and values. ParseInfobox(...) extracts the title of the infobox and a List<KeyValuePair<string, string>> containing all keys and their non-empty values. A researcher may focus on specific keys and parse their values. We do not provide a "universal" parser for that: there are too many formats, and we want to keep our code as simple as possible. Fortunately, many infobox values do not need any special interpretation at all. Constructions like [[...|...|...]] are used in the same way as in the body text and represent links to Wikipedia pages: the first string, separated by '|', is the title of the referenced page, while the others serve display needs. Interpretation of dates requires more work, but it is straightforward, as sketched below.
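For example, this is how the birth year could be pulled out of the infobox quoted above (a sketch only: the regular expression and the variable names are ours, tuned to the {{birth date|1824|12|10|df=y}} form shown in the quote, and we assume ParseInfobox is called on a Page object as this section suggests):

C#
// Needs: using System; using System.Text.RegularExpressions;
if (page.ParseInfobox(out string infoboxTitle, out var keysAndValues))
{
    foreach (var pair in keysAndValues)
    {
        if (pair.Key != "birth_date")
            continue;

        // Matches templates like {{birth date|1824|12|10|df=y}}.
        Match m = Regex.Match(pair.Value,
                              @"\{\{birth date\|(\d{1,4})\|(\d{1,2})\|(\d{1,2})",
                              RegexOptions.IgnoreCase);
        if (m.Success)
            Console.WriteLine(page.Title + " was born in " + m.Groups[1].Value);
    }
}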

An object of the Page class is designed to be constructed within the WikiXmlParser class when it encounters a Wikipedia page while reading the XML dump.

Public Methods of WikiXmlParser Class

Constructor

C#
WikiXmlParser(string pathToXMLDumpFile)

simply creates a private System.Xml.XmlReader object to process the given XML file.
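In other words, something along these lines (the field name is our assumption):

C#
private readonly XmlReader reader;   // needs: using System.Xml;

public WikiXmlParser(string pathToXMLDumpFile)
{
    // Forward-only, streaming reader over the (possibly huge) dump file.
    reader = XmlReader.Create(pathToXMLDumpFile);
}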

Page GetNextPage() is the function that creates the current Page object, reading the XML file page by page, forward-only, using XmlReader. The method should be called in a loop:

C#
while ((page = GetNextPage()) != null)
{
     // Do something using properties of page.
} 

Inside GetNextPage(), we read the XML subtree related to the current page and recognize the NodeType of each element using its Name. If the page is a "redirect" (its title is a reasonable spelling or synonym, but the actual article is named differently), we skip it. Elements named "title", "ns", and "text" give us the Title, Namespace, and Text properties of the Page object, respectively. The method also skips pages that are not from the articles or categories namespaces; including other namespaces, if needed, is trivial. A condensed sketch of this logic follows.
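Under the stated assumptions (element names as in the dump; the Page constructor and the reader field are ours, not the article's exact code), the core of GetNextPage() could be sketched like this:

C#
public Page GetNextPage()
{
    // Advance to the next <page> element; ReadToFollowing returns false at end of file.
    while (reader.ReadToFollowing("page"))
    {
        using (XmlReader sub = reader.ReadSubtree())
        {
            sub.ReadToFollowing("title");
            string title = sub.ReadElementContentAsString();

            sub.ReadToFollowing("ns");
            int ns = sub.ReadElementContentAsInt();

            // Keep only generic articles (0) and categories (14).
            if (ns != 0 && ns != 14)
                continue;

            // A <redirect .../> element, when present, appears before the revision text.
            bool redirect = false;
            string text = null;
            while (sub.Read())
            {
                if (sub.NodeType != XmlNodeType.Element)
                    continue;
                if (sub.Name == "redirect") { redirect = true; break; }
                if (sub.Name == "text") { text = sub.ReadElementContentAsString(); break; }
            }

            if (!redirect)
                return new Page(title, ns, text);   // assumed constructor
        }
    }
    return null;   // no more pages in the dump
}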

C#
public void ExtractPagesAndHierarchy(
    string pathPagesAndParents, 
    string pathCategoriesAndParents,
    string pathBiographicalPages)

This method does the whole job of our test project, producing the three files mentioned in the "Main Program and Test Run" section.

First, it reads the XML dump using GetNextPage() above and creates two temporary files, tempPagesAndParents and tempBiographicalPages. The temporary files are almost the same as the final ones, except that they do not have the citations count column. During this pass, we collect all links to other pages from each page.

In the second stage, when all links are gathered, we compute the number of citations of each page in a Dictionary<string, int> called linksCount.
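The counting itself is straightforward (a sketch; linksOfEveryPage stands for the link sets gathered in the first stage and is our own name):

C#
Dictionary<string, int> linksCount = new Dictionary<string, int>();

foreach (HashSet<string> links in linksOfEveryPage)
{
    // Each page contributes at most one citation per target, since links are distinct.
    foreach (string target in links)
    {
        linksCount.TryGetValue(target, out int n);   // n is 0 if the target is new
        linksCount[target] = n + 1;
    }
}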

The memory required to collect linksCount for every page of the whole Wikipedia may exceed the 2 GB memory limit of a 32-bit process. That's why the "Prefer 32-bit" checkbox is unchecked in the project's build options.

Of course, memory usage could be optimized, in particular by using indexes instead of strings. However, we want to keep the code as simple as possible and the results easy to view. Besides, spending a lot of time on such optimization is not very reasonable, because parsing the huge dump takes only about an hour (or less) and should be performed once or just a few times.

When the linksCount dictionary is computed, we read the temporary files, add the values of the missing citations column, and write the final output, as sketched below. The job is done.
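For "Pages&Parents.txt", the final pass looks roughly like this (a sketch; the temporary file name with its extension and the exact column order follow the description above and are assumptions):

C#
// Needs: using System.IO;
using (var src = new StreamReader("tempPagesAndParents.txt"))
using (var dst = new StreamWriter("Pages&Parents.txt"))
{
    string line;
    while ((line = src.ReadLine()) != null)
    {
        // Temporary layout: title <tab> categories; insert the citations count in between.
        string[] cols = line.Split('\t');
        linksCount.TryGetValue(cols[0], out int count);   // 0 if the page is never cited
        dst.WriteLine(cols[0] + "\t" + count + "\t" + cols[1]);
    }
}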

History

  • 27th April, 2016: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


