## Saturday 10 June 2017

### IT IS TIME DATA JOURNALISTS LEARN TO CODE

Since the beginnings of data journalism in the 1990s, then called CARR (Computer Assisted Research and Reporting), techniques for analyzing and visualizing data have improved enormously. One of the central tools in the nineties was the spreadsheet, with Microsoft Excel as the standard. Spreadsheets are still widely used for analysis, but moving into advanced data journalism, for example using R for deeper statistical analysis or D3 for better interactive graphics, creates new challenges. You will often have to engage in different types of coding: I got stuck between R and Python on the one hand and JavaScript (for D3) on the other. Does a data journalist need to learn all these programming languages, or is there an easier and faster solution?
Looking at journalism practice, the answer is: step onto the steep learning curve and start learning how to code. Here is some help. Next year Paul Bradshaw starts an MA in Data Journalism at the Birmingham School of Media. Studying “Coding and computational thinking being applied journalistically (I cover using JavaScript, R, and Python, command line, SQL and Regex to pursue stories)” is one of the elements of this new MA, Bradshaw writes on his blog.
Looking at the job market, there is real demand for data journalists with coding skills. Here is a job listing from The Economist. The preferred qualities include a good understanding of data analytics; coding skills (JavaScript and Python), or a background in data journalism, are a plus.
In the following I will argue that a basic understanding of coding is very helpful, but new services on the web help data journalists to avoid getting stuck in coding.

#### Static

The data used for creating the graphs come from a small survey about mayors in the Netherlands and can be found here.

Excel made it possible to analyze data and visualize the results of an analysis. Here is a simple example, showing the distribution of gender for Dutch mayors in percentages.

Simple bar graphs in Excel
This is a straightforward bar graph, simply showing that 80% of the mayors are men and 20% are women. The picture helps to understand the numbers through visualization. For a simple document or a report this works fine. From a data journalism perspective there are some problems.

1. The visualization is just a picture: a small bitmap in, for example, JPG format. The resolution of such a picture is far too low to make it ready for print or for showing on a TV screen.
2. Publishing the graph on a web page or blog is no problem; the resolution is OK for the web. However, it is a static picture: hovering over it with the mouse does not reveal extra information.

Let’s start with the first problem. What to do? Should the graphics editor import the data from the journalist and create the graphic from scratch, reworking it with a program like Illustrator or Inkscape? That would be double work, wasting time and energy.
Using a spreadsheet program other than Excel, namely Calc from LibreOffice, we can export the graph as .svg (Scalable Vector Graphics). Now the graph is not a bitmap but a vector image, and it can be edited and made ready for high-resolution print or for TV screens.
Making the graph interactive is possible using, for example, Google Charts. We then have to import the data into Google Charts or work with Google Sheets.
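
To see why a vector export stays editable, it helps to remember that an .svg file is plain text. Here is a minimal sketch in JavaScript, using three of the party counts published further down in this post. The function name `barChartSvg` and the sizes are my own choices for illustration; this is not what LibreOffice writes out.

```javascript
// Illustrative only: build a tiny SVG bar chart as a plain text string.
// The counts are three of the party totals used later in this post.
const parties = [
  { partij: "CDA", aantal: 115 },
  { partij: "PVDA", aantal: 73 },
  { partij: "VVD", aantal: 94 },
];

function barChartSvg(rows, width = 300, height = 150) {
  const max = Math.max(...rows.map((r) => r.aantal));
  const barWidth = width / rows.length;
  const bars = rows.map((r, i) => {
    const h = Math.round((r.aantal / max) * height); // bar height in pixels
    const x = i * barWidth;
    const y = height - h; // the SVG y axis points down
    // A <title> child is shown as a native tooltip by most browsers.
    return `<rect x="${x}" y="${y}" width="${barWidth - 4}" height="${h}" fill="steelblue">` +
      `<title>${r.partij}: ${r.aantal}</title></rect>`;
  });
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">` +
    bars.join("") + "</svg>";
}

const svgText = barChartSvg(parties);
console.log(svgText); // save the output as a .svg file to open it in a browser or Inkscape
```

Because every bar is a line of text with coordinates, a graphics editor can restyle or rescale it without losing quality, which is exactly what a bitmap does not allow.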

#### Simple maps
When producing maps the situation is almost the same. Here is a map showing political parties of mayors for Dutch municipalities. Again this map is a bitmap. Mapping programs like QGIS offer the possibility to export the map as .svg.

However, the map is not interactive on the web. Google Fusion Tables (FT) offers a solution for this.
Here is an example of the map in Google FT. The map is OK for the web, using an embedded link to publish, although one can discuss the quality of the map. For print the resolution of the map is too low: either a screen dump or an export of the map creates a low-resolution bitmap. Should we use two software programs for mapping? I skip the possibility of creating your own web/map server using QGIS.

#### Tableau

A nice solution that brings analysis and visualization together in one application is the free tool and service Tableau Public. It is easy to use for calculations and creates nice graphs and dashboards, which can be exported as .pdf or .jpg, or published through an embed link. Another important feature is that Tableau can import statistical data (for example the results of calculations in R) and read geographical data (for example a .shp file for maps, which can be joined with data files). Tableau is in my opinion one of the best services for data journalism. Here is an example: a dashboard with the distribution of gender over political parties and a map showing the gender of the mayor per municipality. The link shows the interactivity of the charts, which is useful online.

The exported graph is a simple bitmap with low resolution. Missing here is the possibility to export to .svg.

#### Data Driven Documents

Another solution is to use D3.js. D3, or Data-Driven Documents, is a JavaScript library for creating visualizations on the web, making full use of the web standards listed below. More about D3, examples and tutorials. D3 uses the data in documents for visualization, applying:

- html: the web document;
- css: the style sheets of the web document;
- svg: graphics as a text file;
- js and json: JavaScript for manipulating and importing data;
- d3: the library itself, which builds visualizations using the above tools.

This is an important step forward in building interesting, high-quality visualizations for the web. And because of the use of .svg, the charts can be rebuilt for print. There is a small problem: creating D3 graphics presupposes some knowledge of JavaScript.
Here is the script, edited from an example by Mike Bostock, for a graph showing the distribution of political parties.

<!DOCTYPE html>
<meta charset="utf-8">
<style>

.bar {
  fill: steelblue;
}

.bar:hover {
  fill: brown;
}

.axis--x path {
  display: none;
}

</style>
<svg width="960" height="500"></svg>
<script src="https://d3js.org/d3.v4.min.js"></script>
<script>

// Drawing area: the SVG size minus the margins reserved for the axes.
var svg = d3.select("svg"),
    margin = {top: 20, right: 20, bottom: 30, left: 40},
    width = +svg.attr("width") - margin.left - margin.right,
    height = +svg.attr("height") - margin.top - margin.bottom;

// x gives every party its own band; y maps the counts to pixel heights.
var x = d3.scaleBand().rangeRound([0, width]).padding(0.1),
    y = d3.scaleLinear().rangeRound([height, 0]);

var g = svg.append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

// Load the data; the row function converts "aantal" from text to a number.
d3.tsv("partij.tsv", function(d) {
  d.aantal = +d.aantal;
  return d;
}, function(error, partij) {
  if (error) throw error;

  x.domain(partij.map(function(d) { return d.partij; }));
  y.domain([0, d3.max(partij, function(d) { return d.aantal; })]);

  // Horizontal axis with the party names.
  g.append("g")
      .attr("class", "axis axis--x")
      .attr("transform", "translate(0," + height + ")")
      .call(d3.axisBottom(x));

  // Vertical axis with the counts and a rotated label.
  g.append("g")
      .attr("class", "axis axis--y")
      .call(d3.axisLeft(y).ticks(10))
    .append("text")
      .attr("transform", "rotate(-90)")
      .attr("y", 6)
      .attr("dy", "0.71em")
      .attr("text-anchor", "end")
      .text("aantal");

  // One rectangle per party.
  g.selectAll(".bar")
    .data(partij)
    .enter().append("rect")
      .attr("class", "bar")
      .attr("x", function(d) { return x(d.partij); })
      .attr("y", function(d) { return y(d.aantal); })
      .attr("width", x.bandwidth())
      .attr("height", function(d) { return height - y(d.aantal); });
});

</script>
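
The two scales do the real work in this script: y maps a count to a pixel position, and x gives every party its own band. Here is a hand-rolled sketch of that arithmetic without D3, assuming the 960 by 500 SVG and the margins from the script above. `linearScale` and `bandScale` are hypothetical helpers of mine; unlike D3's scales they ignore rounding and padding.

```javascript
// Simplified stand-ins for d3.scaleLinear and d3.scaleBand (no rounding, no padding).
function linearScale(domain, range) {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  // Linear interpolation from the data domain to the pixel range.
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

function bandScale(names, range) {
  const [r0, r1] = range;
  const step = (r1 - r0) / names.length; // equal width for every category
  const scale = (name) => r0 + names.indexOf(name) * step;
  scale.bandwidth = () => step;
  return scale;
}

const width = 960 - 40 - 20;  // SVG width minus left and right margin = 900
const height = 500 - 20 - 30; // SVG height minus top and bottom margin = 450

const names = ["CDA", "CU", "D66", "GEEN", "GL", "OVG", "PVDA", "SGP", "VVD"];
const y = linearScale([0, 115], [height, 0]); // 115 is the largest count (CDA)
const x = bandScale(names, [0, width]);

console.log(y(115));        // 0: the tallest bar starts at the top
console.log(y(0));          // 450: a count of zero sits on the baseline
console.log(x("D66"));      // 200: third band, each band 100 pixels wide
console.log(x.bandwidth()); // 100
```

The range for y runs from height down to 0 because SVG coordinates start at the top of the drawing.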
And here are the data as tab-separated values (partij.tsv):

partij	aantal
CDA	115
CU	11
D66	19
GEEN	2
GL	7
OVG	7
PVDA	73
SGP	7
VVD	94
Serving the script together with the data from a web server creates an interactive bar graph.
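
What d3.tsv does with this file can be sketched in a few lines of plain JavaScript: split the text into rows, use the header line for the field names, and convert aantal to a number, as the row function in the script does. `parseTsv` is a hypothetical helper for illustration, not D3's actual parser, and it assumes a file without quoting or empty lines.

```javascript
// The first lines of partij.tsv as a string; "\t" is the tab character.
const tsvText = "partij\taantal\nCDA\t115\nCU\t11\nD66\t19";

// Hypothetical helper: turn tab-separated text into an array of objects.
function parseTsv(text) {
  const [headerLine, ...lines] = text.trim().split("\n");
  const fields = headerLine.split("\t");
  return lines.map((line) => {
    const cells = line.split("\t");
    const row = {};
    fields.forEach((field, i) => {
      row[field] = cells[i]; // every value is still a string here
    });
    return row;
  });
}

// Same conversion as the row function passed to d3.tsv: text to number.
const partij = parseTsv(tsvText).map((d) => {
  d.aantal = +d.aantal;
  return d;
});

console.log(partij);
```

Without the conversion step the counts would stay strings, and the maximum used for the y axis could come out wrong.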

#### R Project

From a statistical perspective spreadsheets are a bit limited for analysis. The R project, used through RStudio, is a better tool. I have given 5 compelling reasons for using R in data journalism. Here is an example taken from the same data set about municipalities and their mayors. The data contain two variables: average income and average house price per municipality. Having loaded the data set in R, we are going to make the same plot for political parties:

library(ggplot2)  # needed for ggplot() at the end

setwd("/home/ubuntu/Documents/kaarten")  # set the working dir

str(gemeente)  # structure of the data (gemeente is assumed to be loaded already)
#> 'data.frame':   335 obs. of  9 variables:
#>  $ Gemeente        : Factor w/ 335 levels "Aa en Hunze",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ Provincie       : Factor w/ 12 levels "D","F","FL","G",..: 1 7 4 2 12 12 8 9 3 12 ...
#>  $ Burgemeester    : Factor w/ 335 levels "Aartsen J.J. van",..: 205 196 25 91 213 316 54 92 320 270 ...
#>  $ Vanaf           : Factor w/ 268 levels "10-1-2012","1-10-2001",..: 99 159 3 60 144 146 131 215 268 81 ...
#>  $ Geslacht        : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 2 ...
#>  $ Geboortedatum   : Factor w/ 331 levels "10-10-1969","10-11-1955",..: 297 253 76 224 211 61 126 48 164 298 ...
#>  $ jaar            : int  1961 1956 1949 1952 1968 1963 1955 1970 1964 1966 ...
#>  $ leeftijd        : int  56 61 68 65 49 54 62 47 53 51 ...
#>  $ Politieke.Partij: Factor w/ 9 levels "CDA","CU","D66",..: 7 1 1 6 9 7 1 9 3 1 ...
p <- gemeente$Politieke.Partij  # select data for parties
p <- table(p)                   # count the mayors per party
p
#> p
#>  CDA   CU  D66 GEEN   GL  OVG PVDA  SGP  VVD
#>  115   11   19    2    7    7   73    7   94

p <- data.frame(p)
names(p)
#> [1] "p"    "Freq"
colnames(p)[1] <- "partij"  # change the names
colnames(p)[2] <- "aantal"
str(p)
#> 'data.frame':   9 obs. of  2 variables:
#>  $ partij: Factor w/ 9 levels "CDA","CU","D66",..: 1 2 3 4 5 6 7 8 9
#>  $ aantal: int  115 11 19 2 7 7 73 7 94

ggplot(p, aes(x = partij, y = aantal)) + geom_bar(stat = "identity")

By default the plot can be exported as an image or .pdf. With the following code we save the plot as .svg:

library(svglite)  # provides the svglite() graphics device
svglite("plot.svg", width = 10, height = 7)
ggplot(p, aes(x = partij, y = aantal)) + geom_bar(stat = "identity")
dev.off()   # close the device; this writes the ggplot to plot.svg

R is command-line based and is often used alongside Python, another programming language. So there is more coding to learn in advanced data journalism. There are many libraries or packages in R for different statistical approaches. RStudio is the best environment to handle the packages, do the calculations from the command line, and print or export the results.
The output of R is generally a figure or a printed chart. We can also export the result as a bitmap or as scalable vector graphics. This is fine for print media, but not for online. For the web I want interactive graphics. The output of R is better suited to scientific reports than to journalism. I skip the possibility of using a Shiny server for interactive web applications in R.
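
The counting itself is not tied to R. For readers coming from the JavaScript side, the table() step in the R session above simply counts how often each party occurs. Here is a rough equivalent in JavaScript, assuming the records arrive as an array of objects. The five sample records are made up for illustration and are not taken from the survey; `countBy` is a helper name of my own.

```javascript
// Made-up sample records, NOT the survey data: one object per municipality.
const gemeente = [
  { Gemeente: "gemeente 1", PolitiekePartij: "CDA" },
  { Gemeente: "gemeente 2", PolitiekePartij: "VVD" },
  { Gemeente: "gemeente 3", PolitiekePartij: "CDA" },
  { Gemeente: "gemeente 4", PolitiekePartij: "PVDA" },
  { Gemeente: "gemeente 5", PolitiekePartij: "VVD" },
];

// Count how often each value of `key` occurs, like table() in R,
// and return rows shaped like partij.tsv (partij, aantal).
function countBy(rows, key) {
  const counts = new Map();
  for (const row of rows) {
    counts.set(row[key], (counts.get(row[key]) || 0) + 1);
  }
  return [...counts]
    .map(([partij, aantal]) => ({ partij, aantal }))
    .sort((a, b) => a.partij.localeCompare(b.partij)); // alphabetical, like factor levels
}

const p = countBy(gemeente, "PolitiekePartij");
console.log(p);
```

The result has the same two columns as the data frame p in R, which is the shape the D3 script earlier in this post expects.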

#### Dilemma

Now we are stuck between R and D3. Can we have the best of both worlds? How do we get into D3 from R? That is, how do we move from R or Python to JavaScript?
There are various solutions:
- export the R data to Tableau and make the graph in Tableau;
- export from R directly into plot.ly and make the graph in this web service.
The first option is simply to store your calculations in .rdata format and import them into Tableau to produce the graph.
In the second option we use the plot.ly service and export the visualization produced with R directly into plot.ly, using the plotly package/library in R. This converts the graphics produced by R's plotting package ggplot into D3 visualizations. Now we don’t have to worry about JavaScript and the D3 libraries; this is all done by plot.ly. The number of different graph types that can be used is a bit limited, though.
library(plotly)  # provides plot_ly()
plot(gemeente$Politieke.Partij) # this is a simple plot
plot_ly(gemeente, x = ~Politieke.Partij) # create an interactive graph, which can be saved as an image or web page

After creating a login and an API key at plot.ly we can create the graph at plot.ly. Here is the link: https://plot.ly/~peterverweij/48/
The chart is now available in D3 format at plot.ly and can be edited and exported.
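
Behind the scenes, a Plotly chart is described by a JSON figure: a list of traces under `data` plus a `layout`. Here is a minimal sketch of such a figure for the party counts, written by hand in JavaScript. The title text is my own choice, and the JSON that the R package actually sends may well contain more fields than shown here.

```javascript
// Party counts from the R table earlier in this post.
const counts = {
  CDA: 115, CU: 11, D66: 19, GEEN: 2, GL: 7, OVG: 7, PVDA: 73, SGP: 7, VVD: 94,
};

// A hand-written figure: one bar trace plus a small layout.
const figure = {
  data: [
    {
      type: "bar",
      x: Object.keys(counts),   // party names on the x axis
      y: Object.values(counts), // number of mayors on the y axis
    },
  ],
  layout: {
    title: { text: "Mayors per political party" }, // illustrative title
    xaxis: { title: { text: "partij" } },
    yaxis: { title: { text: "aantal" } },
  },
};

console.log(JSON.stringify(figure, null, 2));

// In a web page that loads plotly.js, this figure could be drawn into
// <div id="chart"></div> with: Plotly.newPlot("chart", figure.data, figure.layout);
```

This is why some basic insight is enough: the service only needs the data and a description of the chart, and takes care of the JavaScript drawing itself.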

Various services on the web help data journalists avoid getting bogged down in different kinds of coding; the services take care of that. By exporting data to such a service, visualizations in different styles and formats can be produced. Detailed knowledge of Python or D3 is not needed; some basic insight will do to get the visualizations out.
