A Brief Look at the World Health, in R





An introduction to data analytics, using R, by taking a quick look at the state of World Health.
1. Introduction
The World Health Organization (WHO) compiles data on social, economic, health and political indicators, and makes it available as a file called WHO.csv. As the name indicates, this is a Comma Separated Values (CSV) file. It has 358 columns and 202 rows, one row for each country of the world. Some of the columns, for example, are titled "Country", "Continent", "Population (in thousands) total", and "Number of confirmed poliomyelitis cases".
In this article, we try to get some meaningful information from this file by means of the R programming language. About two years ago, I attended an online course called The Analytics Edge, offered by MIT on the edX online platform. The course introduced the R language using a reduced form of the WHO CSV file mentioned above. We introduce the R language here using a different reduced version of the WHO file. Before we embark on our journey with R, let us take a look at the contents of this reduced WHO data file. It has the same 202 rows as above, but only 15 columns, which makes it simpler to understand. This reduced WHO data file is available for download as WHOReduced.csv, at the top of this page. The columns of this reduced WHO data file are:
Country: Name of the country
CountryID: Unique numerical ID for the country
Continent: Numerical ID of the continent to which this country belongs, having one of 7 values
AdultLiteracyRate: Adult literacy rate of the country, as a percentage
GNI: Gross National Income of the country, per capita
Population: Population of the country, in thousands
PopGrowth: Population growth rate, as a percentage
UrbanPop: Urban population of the country, as a percentage
BPLPop: Population of the country below the poverty line, as a percentage
MedianAge: Median age of the population, in years
Above60: Percentage of the population of the country above 60 years of age
Below15: Percentage of the population of the country below 15 years of age
FertilityRate: Fertility rate as a percentage
HospitalBeds: Number of hospital beds per 1000 people
NumberOfPhysicians: Number of physicians in the country
We use this reduced data set to understand some nuances of the data, and use the R programming language for this.
2. Introduction to R
R is a software environment for data analysis, statistical computing and graphics. It is also a programming language, which enables one to code a set of steps to achieve a statistical or machine learning outcome. R is open-source. Though there are many choices for data analysis software like SAS, Stata, SPSS, Microsoft Excel, Matlab, Minitab, pandas, we will be using R for purposes of this article.
The latest version of R can be downloaded from the Comprehensive R Archive Network (CRAN). There are some graphical user interfaces for R, for example, RStudio and Rattle. However, for purposes of this article, we will be using the command line interface for R, and running the commands through the R console. This is shown below for the version of R that I have.
In the remainder of this article, we get introduced to R by a series of questions and their corresponding answers.
3. Getting Useful Information from the WHO File, using R Commands
In this section, we pose a set of questions and get answers to these using R commands. This will serve as our introduction to R.
- How do I read the CSV data into R?
Data from a CSV file can be loaded into R by reading it into a data frame. Before getting into data frames, we need to know what a vector is. A vector is a sequence of values of the same type, such as numbers or characters, stored as a single object. For example, the R command
v = c(1, 2, 3, 4, 5)
creates a vector named v with five elements, the numbers 1, 2, 3, 4, 5. Characters and numbers should not be combined in the same vector; if they are, R silently converts every element to a character string. Two or more vectors of the same length can be combined into a data frame, which is an important data structure in R. If we consider two vectors v1 = c(1, 2, 3, 4, 5) and v2 = c(100, 200, 300, 400, 500), then these can be combined into a single data frame with five rows and two columns, the first column being the vector v1 and the second column being the vector v2. In its simplest form, a data frame can be thought of as a matrix. However, a data frame is more general than a matrix, since its columns can hold different data types, as we see below.
Since we are working with a CSV file, R has a simple command to read the entire CSV file into a single data frame. Before executing this command, use the R menu to change the working directory to the folder where the file WHOReduced.csv is located.
> who = read.csv("WHOReduced.csv")
This command loads the entire CSV file into the data frame named who. Just type this command into the R console and hit Enter to run it.
Next, we take a look at the structure of this data.
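The path from vectors to a data frame to a CSV file can also be tried without WHOReduced.csv. The following is a minimal sketch in base R; the temporary file and the names df, mixed, csv_path and roundtrip are made up for this illustration and are not part of the WHO data.

```r
# Two numeric vectors of the same length
v1 = c(1, 2, 3, 4, 5)
v2 = c(100, 200, 300, 400, 500)

# Combine them column-wise into a data frame with 5 rows and 2 columns
df = data.frame(v1, v2)
dim(df)

# Mixing numbers and characters coerces every element to character
mixed = c(1, "a", 3)
class(mixed)

# Write the data frame to a temporary CSV file (illustrative file only)
csv_path = tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Read it back in the same way the WHO file is read
roundtrip = read.csv(csv_path)
str(roundtrip)
```

Writing the file first keeps the example self-contained; with the real data, only the read.csv() call is needed.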
- How do I start understanding the structure of this data?
R has a useful command called str, which shows the structure of the data loaded into a data frame.
> str(who)
Upon running this command, the R console prints the structure of the data frame.
Looking at this output, one can see that there are 202 observations of 15 variables. This means that there are 202 rows, each holding 15 variables. The 15 variables in this data frame are Country, CountryID, Continent, AdultLiteracyRate, GNI, Population, PopGrowth, UrbanPop, BPLPop, MedianAge, Above60, Below15, FertilityRate, HospitalBeds, NumberOfPhysicians. Some of these variables are of int type, containing integer values. Others are of num type, containing floating point values. The first variable, Country, is of type Factor, which is a categorical variable. The output shows that Country has 202 categories, also known as levels, each level being a unique country name. (From R 4.0 onwards, read.csv keeps text columns as character vectors by default, so Country shows up as chr unless stringsAsFactors = TRUE is passed to read.csv.)
A small note on the continent labeling in this file, which is shown in the following table. Strictly speaking, these are not the names of the continents, but we will use them for the purpose of this article.
Continent Label    Continent Name
1                  Eastern Mediterranean
2                  Europe
3                  Africa
4                  North America
5                  South America
6                  Western Pacific
7                  Asia
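If the numeric codes are hard to keep in mind, the names from the table can be attached to them with factor(). This is only a sketch: the codes vector below is invented for illustration, and applying the same idea to who$Continent assumes that the column holds exactly the codes 1 to 7.

```r
# Continent names in the order of their numeric labels (taken from the table)
continent_names = c("Eastern Mediterranean", "Europe", "Africa",
                    "North America", "South America",
                    "Western Pacific", "Asia")

# Hypothetical continent codes for four countries
codes = c(2, 3, 7, 2)

# Map each code to its name; levels fixes the code order, labels supplies the names
labelled = factor(codes, levels = 1:7, labels = continent_names)
as.character(labelled)
```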
Next, we take a look at the summary of this data.
- How do I get a summary of this data?
R has another useful command called summary, which gives a statistical summary of the data loaded into a data frame.
> summary(who)
Upon running this command, the R console prints a summary of each variable.
Looking at this output, we find that R has summarized all 15 variables in this data frame. For variables with numerical values, R reports the minimum value, the first quartile (the value below which 25 percent of the values fall), the median (the value below which 50 percent of the values fall), the mean, the third quartile (the value below which 75 percent of the values fall), and the maximum value. For example, for the variable MedianAge, these values are Minimum = 15.00, First quartile = 20.00, Median = 25.00, Mean = 26.74, Third quartile = 35.00, and Maximum = 43. We also see an entry NA's : 23 for the variable MedianAge. This indicates that there are 23 countries for which the median age is not listed in the data set, and hence is not available in the data frame. The summaries of the other 13 integer/numerical variables can be read in the same way. For the factor variable Country, the summary lists the first six entries.
The R commands str() and summary() are very helpful for getting information on the structure of the data and a summary of the data, respectively.
Next, we pose some interesting questions on this data, and seek their answers.
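To see how summary() reports quartiles and missing values, it can be run on a single small vector. The ages vector below is made up for illustration and is not taken from the WHO file.

```r
# Hypothetical median ages for six countries, one of them missing
ages = c(15, 20, 25, 35, 43, NA)

# summary() reports min, quartiles, median, mean, max and the NA count
s = summary(ages)
s

# The mean is NA unless missing values are removed explicitly
mean(ages)
mean(ages, na.rm = TRUE)

# Count the missing entries directly
sum(is.na(ages))
```

Unlike mean(), summary() drops the missing values by itself and reports how many it dropped.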
- Which countries have the minimum and maximum percentage of population under 15 years of age?
For answering this question, we need to identify the index of this country. The R command for this is:
> which.min(who$Below15)
Upon running this command, the R console outputs the answer as 4. Now, the country name is found using the following command:
> who$Country[4]
The answer is Andorra. The above two commands can be combined into a single command as:
> who$Country[which.min(who$Below15)]
This yields the same answer, Andorra, as the country which has the minimum percentage of population under 15 years of age.
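The index-then-lookup pattern can be checked on a toy data frame. In this sketch the country names and percentages are invented; it also shows that which.min() and which.max() skip NA entries.

```r
# Toy data: hypothetical countries and under-15 percentages, one value missing
toy = data.frame(
  Country = c("Aland", "Borduria", "Cascadia", "Dorne"),
  Below15 = c(30.5, NA, 14.2, 48.9),
  stringsAsFactors = FALSE
)

# Step 1: position of the smallest value (NA entries are ignored)
idx_min = which.min(toy$Below15)
idx_min

# Step 2: use that position to look up the country name
toy$Country[idx_min]

# The same lookup for the largest value, written as a single expression
toy$Country[which.max(toy$Below15)]
```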
Similarly, the following command can be used to find the country which has the maximum of this number:
> who$Country[which.max(who$Below15)]
The answer to this is Uganda.
- Which countries have the minimum and maximum percentage of population over 60 years of age?
For answering these questions, as before, we type the command:
> who$Country[which.min(who$Above60)]
This yields United Arab Emirates as the country which has the minimum percentage of population above 60 years of age.
Similarly, the following command can be used to find the country which has the maximum of this number:
> who$Country[which.max(who$Above60)]
The answer to this is Japan.
- Is there a country whose entire population is urban?
Looking at a summary of the data, it is seen that the maximum value of the variable UrbanPop is 100. To find the country whose entire population is urban, we type the command:
> who$Country[which.max(who$UrbanPop)]
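This yields the answer Monaco. Note that which.max returns only the first position at which the maximum occurs, so any other country with an urban population of 100 percent would not be reported by this command.
Similarly, the following command is used to find the country which has the minimum value for this number: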
> who$Country[which.min(who$UrbanPop)]
The answer to this is Burundi.
- What does a plot of GNI vs Fertility Rate look like?
For answering this question, as before, we plot the data using the command:
> plot(who$GNI, who$FertilityRate)
This yields the plot shown below:
We see that this is largely a triangular plot, implying that countries with a lower GNI tend to have a higher fertility rate, and vice versa. However, there are some countries that have both a high GNI and a high fertility rate. We investigate this next.
- Which countries have a high GNI and high Fertility Rate?
For answering this question, we take a subset of the original data as follows:
> HighVals = subset(who, GNI > 10000 & FertilityRate > 2.5)
This creates a subset of the data where the GNI is greater than 10000 and Fertility Rate is greater than 2.5. To find out the number of countries which fall in this category, we use the command:
> nrow(HighVals)
This gives the output 9, indicating that there are 9 such countries. To identify the countries which fall in this category, we use the command:
> HighVals[c("Country", "GNI", "FertilityRate")]
This lists the 9 countries along with their GNI and Fertility Rate values.
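The behavior of subset() is easy to verify on a few made-up rows. The sketch below uses invented countries and values; it also shows an equivalent form with logical indexing, and that rows where the condition is NA are dropped.

```r
# Toy data: hypothetical countries, one with a missing GNI
toy = data.frame(
  Country = c("Aland", "Borduria", "Cascadia", "Dorne"),
  GNI = c(500, 20000, 15000, NA),
  FertilityRate = c(5.1, 3.0, 1.8, 4.0),
  stringsAsFactors = FALSE
)

# Keep rows where both conditions hold; rows with an NA condition are dropped
high = subset(toy, GNI > 10000 & FertilityRate > 2.5)
nrow(high)

# Show only the columns of interest
high[c("Country", "GNI", "FertilityRate")]

# Equivalent selection with logical indexing; which() discards the NA condition
keep = toy$GNI > 10000 & toy$FertilityRate > 2.5
same = toy[which(keep), ]
```

Without which(), the plain form toy[keep, ] would return an extra row of NA values for the country whose GNI is missing.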
- Which countries have the highest and lowest ratio of number of doctors per person?
For answering this question, we add a vector to the original data set using the command:
> who$DrsPop = who$NumberOfPhysicians / who$Population
Here, the ratio of the variable NumberOfPhysicians to the variable Population is taken, and stored as a separate vector DrsPop within the same data frame who. Since Population is recorded in thousands, DrsPop is the number of physicians per thousand people. To answer the above question, we use the commands:
> who$Country[which.max(who$DrsPop)]
> who$Country[which.min(who$DrsPop)]
The answers are, respectively, San Marino (highest number of physicians per person) and Malawi (lowest number of physicians per person).
A look at the structure of the data using the str() command will now show 202 observations of 16 variables, the 16th being the newly added DrsPop.
- What does the histogram of the number of hospital beds look like?
For answering this question, we plot the histogram using the command:
> hist(who$HospitalBeds)
This produces the histogram shown in the following figure:
We see that this histogram is highly skewed, with a large number of countries having a low value for the number of hospital beds.
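The counts behind a histogram can be inspected without drawing it by passing plot = FALSE to hist(). The beds vector and the bin edges below are invented for illustration; they only mimic a skewed distribution like the one described above.

```r
# Hypothetical hospital beds per 1000 people for eleven countries, one missing
beds = c(0.5, 1.2, 1.8, 2.5, 3.1, 0.9, 1.5, 7.4, 4.2, 1.1, NA)

# Compute the histogram without plotting; bins are (0,2], (2,4], ..., (8,10]
h = hist(beds, breaks = seq(0, 10, by = 2), plot = FALSE)

# Bin edges and the number of countries falling in each bin
h$breaks
h$counts
```

Most of the values land in the first bin, which is what a right-skewed histogram looks like in numbers. The missing value is left out of the counts.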
- What does a box plot of the Population Growth against continent look like?
For answering this question, we plot the box plot using the command:
> boxplot(who$PopGrowth ~ who$Continent, xlab = "Continent", ylab = "Population Growth")
This produces the box plot shown in the following figure:
From this box plot, we see that there are some continents where the population growth rate is indeed negative. There are some continents where the interquartile range (the vertical height of the box) is quite small, indicating that there is not much of a difference between the population growth rates across the continent. A point is treated as an outlier when its distance from the first or third quartile is greater than 1.5 times the interquartile range, and is shown as a circle in the above plot.
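The numbers behind a box plot, including which points R flags as outliers, can be obtained by calling boxplot() with plot = FALSE. The sketch below uses invented growth rates for two continent codes; by default R flags points lying more than 1.5 times the box height beyond the box.

```r
# Toy data: hypothetical population growth rates for two continent codes
toy = data.frame(
  Continent = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  PopGrowth = c(1.0, 1.2, 1.1, 1.3, 5.0, -0.5, 0.1, 0.2, 0.0, 0.3)
)

# Compute the box plot statistics per continent without drawing anything
bp = boxplot(PopGrowth ~ Continent, data = toy, plot = FALSE)

# One column per continent: lower whisker, lower hinge, median,
# upper hinge, upper whisker
bp$stats

# Values flagged as outliers, and the group each one belongs to
bp$out
bp$group
```

Here the value 5.0 in continent 1 and the value -0.5 in continent 2 fall outside the whiskers, so they would be drawn as circles.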
- How does the Above60 variable vary with Continent?
For answering this question, we use the table command as follows:
> table(who$Above60, who$Continent)
This produces the table shown below:
From this table, we see that there are 11 countries in Continent 2 (Europe) having 22 percent of their population above 60 years of age.
- Can we find out the average urban population on a Continent basis?
For answering this question, we use the tapply command as follows:
> tapply(who$UrbanPop, who$Continent, mean, na.rm=TRUE)
The tapply(arg1, arg2, arg3) command takes three arguments; it groups arg1 by arg2 and applies arg3 to each group. In this case, the tapply command groups the variable UrbanPop by the variable Continent and applies the mean. The parameter na.rm=TRUE tells R to exclude the NA values from the computation.
This shows the table below:
We see that the mean urban population is maximum in Continent 1 (Eastern Mediterranean), though Continent 4 (North America) is not far behind.
- Can we find out the average population growth on a Continent basis?
For answering this question, we again use the tapply command as follows:
> tapply(who$PopGrowth, who$Continent, mean, na.rm=TRUE)
This shows the table below.
We see that Continent 3 (Africa) has the highest average population growth, whereas Continent 2 (Europe) has the lowest.
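The grouping done by table() and tapply() in the last three questions can be followed by hand on a small data frame. The values below are invented for illustration; the sketch also shows what happens when na.rm = TRUE is left out.

```r
# Toy data: hypothetical continent codes and urban population percentages
toy = data.frame(
  Continent = c(1, 1, 2, 2, 2, 3),
  UrbanPop = c(80, 60, 70, NA, 90, 30)
)

# table() counts how many rows fall in each continent
counts = table(toy$Continent)
counts

# tapply() groups UrbanPop by Continent and applies mean to each group;
# na.rm = TRUE is passed on to mean() so missing values are skipped
means = tapply(toy$UrbanPop, toy$Continent, mean, na.rm = TRUE)
means

# Without na.rm = TRUE, any group containing an NA yields NA
with_na = tapply(toy$UrbanPop, toy$Continent, mean)
with_na
```

Continent 2 has three rows, but its mean is computed from the two available values only.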
4. Closure
In this article, we got introduced to looking at data in a CSV file using simple commands in R. The example file we used was WHOReduced.csv, which is a reduced version of the WHO data as of 2017. I have attempted to give an introduction to R by posing a set of simple but important questions on the data. We got introduced to the commands read.csv(), str(), summary(), which.min(), which.max(), plot(), subset(), nrow(), hist(), boxplot(), table(), tapply(). I plan to continue writing articles on this in future, and cover other important analytics tools using the R language.
Meanwhile, I urge you to load your own CSV files, try out the commands listed above, and let me know your feedback on this.
History
- 8th February, 2017: Version 1.0