R is a language and environment for statistical computing and graphics and is provided with excellent graphics and plotting capabilities. Lately I’ve been playing with R to get used at creating simple visualizations and charts, as a first step to get a quick overview when using a the dataset.

R is really powerful and with a few lines of code you can get the work done. Here I will explain how to make a thematic map (also known as choropleth map) as well as some basic charts. Here you can see the final result:

The image is showing the traffic mortality in Spain during the period 2005-2009. My initial goal was just to plot some maps to show the provinces that are the most dangerous for driving, but finally I added some graphs to show more data.

First things first, these are the steps to follow:

Prepare the data

In that case I use a set of csv files, one for each year, you can download them here. The original dataset was a collection of excel files. Each excel file contains, for each province in Spain, the total of road accidents, the number of road accidents involving fatally injuries, number of victims, number of deaths, etc…
The provinces with more deaths or road accidents are not the most dangerous because looking at the data is easy to see that the more population the province has, the more high rate in deaths/accidents has. That means nothing because what we have to look for is, for example, the percentage of fatal road accidents, or the average of involved deaths per accidents. These are good indicators of the dangerousness of a province. The original data sets did not contain this data, so I added this data (percentage of fatal road accidents and average of numbers of deaths per accident) to them and exported the excel files as csv files.

Get the map

To get the map of Spain, (or any other country), visit GADM, an impressive spatial database of the location of the world’s administrative areas. The data is available in several formats, and luckily for us, is available as an RData file. For Spain, different levels of administration areas are provided. We want the level 2, which contains all the provinces of the country.

Load the datasets

First step, load the csv files to work with:

#set the working directory, that is, the location of the files to load

#generate an array with the years to analize and load the data
#of each csv file in a list
years = seq(2005, 2009)
globalds <- list()
for(i in 1:length(years)) 
	globalds[[i]] <- read.csv(paste("dgt",years[i],".csv",sep=""))

Plot the map

When plotting the map, each province will be filled with a range color, depending on its mortality. As we will use a set of color values to indicate the different levels of mortality, we need to know the maximum value of mortality throught all the years for all the provinces.

#populate an array with the maximum mortality for each year
maxs <- vector()
for(i in 1:length(years)) 
	maxs[i] <- max(globalds[[i]]$mortalidad)

maxMortality <- max(maxs)

Here the variable maxMortality represents a maximum percentage of 18,27%. So the set of color values to represent the level of mortality will go from 0 to 20. We are going to split the set of values in four levels, so the first level will cover percentages of mortality from 0% to 5%, the second covers percentages between 5% and 10% and so on.
We need a color scheme for these 4 levels, so use the library RcolorBrewer to acomplish that.

#load the library

#create a color scheme with 5 colors. Ignore the brightest red (a color
#scheme with 4 colors could be created, but I often see the first color
#of a scheme too bright)
colorPalette <- brewer.pal(5, "Reds")[2:5]

Now let’s load the map:


After loading the Rdata file, there is a new variable called ‘gamd’. This variable is an instance of a SpatialPolygonsDataFrame, a class to hold polygons with attributes. A SpatialPolygonsDataFrame contains a lot of data, but for now, we just want to know in which order the provinces are plotted. To get an overview of the internal structure of the variable ‘gadm’, the functions ‘summary’ and ‘str’ can help us:

Looking at the structure, the main data frame contains the variable ‘NAME_2’, which contains the list of all the provinces in Spain:

The process that follows below is done only for the first year of our dataset (year 2005, the first csv of our files stored in the first position of the list ‘globalds’). This process should be done for each year.

For each province plotted in the map, we have to found within the csv file, the row that contains that province, so we can retrieve the percentage of mortality for that province. Next step is to assign to each percentage a level from our color scheme:

#province names are uppercase in the csv files, but the Rdata file
#contains province names in lowercase. Compare names always in
#uppercase. For each province name in the RData file, get its row
#position in the csv file.
globalds[[j]]$Provincias <- iconv(globalds[[j]]$Provincias, to="ASCII//TRANSLIT")

provsIndex <- vector(length = length(provsOrder))
for(i in 1:length(provsOrder))
    provOrder = iconv(toupper(provsOrder[i]), to="ASCII//TRANSLIT")
    provsIndex[i] <- which(globalds[[j]]$Provincias == provOrder)

Now get the corresponding color level for each province, based on its mortality percentage:

provsColorLevel <- vector(length = length(provsOrder))
for(i in 1:length(provsOrder))
	provsColorLevel[i] <- (globalds[[j]]$mortalidad[provsIndex[i]] %/% 5) + 1

Last but not least, save as a factor the set of color levels from all the provinces and make use of the function spplot to plot the map. For those of you wondering what is a factor, here there is a good explanation.

gadm$provsColorLevel = as.factor(provsColorLevel)
spplot(gadm, "provsColorLevel", col.regions = colorPalette, col = "white")

And that’s the result of plotting the map with R:

From R, you can save the plot as a PDF file and edit it with Adobe Illustrator or Inkscape and modify it at your own. In my case, I removed the general frame and the legend and rearranged some parts of the map. So I end up with this image:

Repeating the process for each year, we get this collection of maps:

Pretty cool, right? Using R we have created a series of maps with a few lines of code. Looking at these maps  you can easily see the provinces with highest mortality percentages. Even you can detect which provinces has increased or decreased its mortality over the time.

However, it’s hard to see if there is an improvement of the mortality over the years in the entire country. Was the last year better than the first year? Which is the year with the highest percentage of fatal accidents?

This data is pretty easy to get. In the CSV files, there is a column called ‘mortalidad’, containing the percentage of fatal accidents for each province. In R, we can use the summary() function to get result summaries:

#first year (2005) result summaries

Which displays these results:

  • Min. : 0.380
  • 1st Qu. : 3.995
  • Median : 5.400
  • Mean : 5.411
  • 3rd Qu. : 6.270
  • Max . : 18.270

Again, repeating the process for each year, we obtain these means:

  • 2005 : 5.411
  • 2006 : 4.801
  • 2007 : 4.688
  • 2008 : 4.306
  • 2009 : 4.153

These numbers mean good news because that’s probably a sign of improvement in road safety over these years.

Showing distributions

We can add more info about this improvement in road safety by showing the distribution of the mortality percentages through the mortality levels that we have defined before. We have this info in the variable provsColorLevel, so if we make an histogram of it we obtain:

 breaks = c(0.5, 1.5, 2.5, 3.5, 4.5),
 col = colorPalette,
 yaxt = "n",
 xlab="mortality percentage",
axis(2, at=seq(0, 30, by=2))

I’m not going into details, but the hist() function comes with a good set of option to configure the graph but if you want a fine control over the visual aspect, you can create custom axes using the axis() function.

Again, repeating this process for all the years and using Adobe Illustrator / Inkscape to change the visual aspect, we obtain these histograms:

We are close to finish. We can end up by showing the number of deaths per accident in each province. This one will be fast, we can obtain this data with a few steps:

First create a matrix with 2 columns, one with the names of the provinces and another one with the number of deaths. This last one columns will be populated with zeros by now. To combine objects by rows or columns we make use of the cbind() function:

deadsSumMatrix <- cbind(Provincias=globalds[[j]]$Provincias,DeadsMean=0)

Now we populate the second row of this matrix with the sum of deaths over the 5 years for each province:

for(a in 1:length(globalds))
   for (b in 1:length(deadsSumMatrix[,1]))
      deadsSumMatrix[b,2] = as.numeric(deadsSumMatrix[b,2]) 
                            + globalds[[a]]$Muertosxacc[b]

And finally, we get the mean of deaths per province and sort them from major to minor. To apply the mean over the second column of the matrix, we can use sapply() function, which does exactly this: to apply a function over a list or vector

deadsSumMatrix[,2] <- sapply(deadsSumMatrix[,2], 
                             function(x) as.numeric(x)/5)
deadsMeanOrdered <- deadsSumMatrix[order(deadsSumMatrix[,2], 
                                         decreasing = TRUE),]

With that done we already can plot the graph, using the barplot() function:


The barplot() function is provided with a large set of parameters to customize the plot, it’s worth to spend time inspecting these options. One more time we make use of Illustrator / Inkscape to obtain the desired graph:

We are done! The last thing left is to compose all the graphs in a single space, to get the image below:

As you can figure out, R is an excellent environment as is designed to analyze data and comes with a lot of R packages to make data graphics with just a few lines of code, so it’s definitively one of the best tools to visualize data. Although the software is not specially friendly with users that are not used to with programming, once you get used to it, the posibilities are almost infinite.

Have your say

About me

Data Visualization · Interactive · HCI · Open Data · Data Science · Digital Humanities

More info here and here: