Friday, 21 October 2016

DYLAN'S DATA with R


The Nobel Prize in Literature has been awarded to Bob Dylan. How do you report on that? From a data journalism perspective there are interesting possibilities. NRC Handelsblad published an infographic. That is interesting, but R and Tableau offer other options. Here are a few examples. If you are interested in the how-to, follow the more tag.











Here is the recipe.
1.      Find the lyrics of Dylan's songs. I discovered the following interesting collection. The songs are in PDF, so convert them to a .txt file, for example with http://pdftotext.com/nl/. Have a look at your text document, lyrics.txt: it runs to around 270 pages and contains lyrics up to 2009, so the most important recordings are present.
2.      Next we have to clean up the document for text analysis with R. Here is an overview of the different steps for text analysis.
We need the following libraries in RStudio:

 "tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust", "cluster", "igraph", "fpc", "Rcampdf"

We load lyrics.txt into R as docs, where cname is the path of the folder that contains lyrics.txt, and start cleaning up the text to make it ready for analysis:
docs <- Corpus(DirSource(cname))
summary(docs)

Output:
           Length Class             Mode
lyrics.txt 2      PlainTextDocument list
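
For readers unsure what cname should hold: it is the path of a folder, and DirSource() reads every file in that folder as one document. A minimal sketch under that assumption, using a throwaway folder and two made-up lines instead of the real lyrics (demo_dir and demo_docs are illustrative names, not part of the recipe):

```r
library(tm)

# Throwaway folder with a single small text file (made-up content)
demo_dir <- file.path(tempdir(), "lyrics_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line of text", "second line of text"),
           file.path(demo_dir, "lyrics.txt"))

# Each file in the folder becomes one document in the corpus
demo_docs <- VCorpus(DirSource(demo_dir))

print(length(demo_docs))             # number of documents
print(meta(demo_docs[[1]], "id"))    # document id, taken from the file name
```

With the real data, cname would simply point at the folder where lyrics.txt was saved.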

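Clean up the text with the following:

docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
# tolower is a base R function, so wrap it to keep the documents intact
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
# stemDocument needs the SnowballC package
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
# With content_transformer() the documents stay plain text documents,
# so the extra tm_map(docs, PlainTextDocument) step is not needed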
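
To see what these transformations do, they can be tried on a single made-up sentence first. This is a sketch, assuming a recent tm version in which tolower has to be wrapped in content_transformer(); stemming is left out so that only tm is needed, and the sentence and the names toy and cleaned are illustrative only:

```r
library(tm)

# One made-up sentence standing in for the lyrics
toy <- VCorpus(VectorSource(
  "The Rain is falling on 3 empty Streets, and the Wind keeps blowing!"
))

toy <- tm_map(toy, removePunctuation)                   # drop , and !
toy <- tm_map(toy, removeNumbers)                       # drop the 3
toy <- tm_map(toy, content_transformer(tolower))        # lower-case everything
toy <- tm_map(toy, removeWords, stopwords("english"))   # drop the, is, on, and
toy <- tm_map(toy, stripWhitespace)                     # collapse repeated spaces

# The cleaned text of the first (and only) document
cleaned <- content(toy[[1]])
print(cleaned)
```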

We now have a cleaned plain-text document that can be staged for analysis.
We turn it into a document term matrix:

dtm <- DocumentTermMatrix(docs) 

and a term document matrix:


tdm <- TermDocumentMatrix(docs)
 
With the following we create a .csv file to use in Excel and Tableau.
First calculate the word frequencies, then drop the sparse terms, and finally save the result as a .csv file.
 
freq <- colSums(as.matrix(dtm))
dtms <- removeSparseTerms(dtm, 0.1)
freq <- colSums(as.matrix(dtms))
wf <- data.frame(word=names(freq), freq=freq) 
write.csv(wf, file="lyrics.csv")
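
The effect of removeSparseTerms() is easiest to see on a toy corpus. With a single document such as lyrics.txt every term occurs in 100% of the documents, so nothing is dropped; with several documents, a threshold of 0.1 keeps only the terms that occur in more than 90% of them. A sketch with two made-up documents (all names and texts here are illustrative):

```r
library(tm)

# Two made-up documents; "road" occurs in only one of them
toy <- VCorpus(VectorSource(c("blues train blues river",
                              "train river road blues")))
toy_dtm <- DocumentTermMatrix(toy)

# Total count per term, most frequent first
toy_freq <- sort(colSums(as.matrix(toy_dtm)), decreasing = TRUE)
toy_wf <- data.frame(word = names(toy_freq),
                     freq = as.integer(toy_freq),
                     row.names = NULL)
print(toy_wf)

# Terms missing from more than 10% of the documents are removed
toy_dtms <- removeSparseTerms(toy_dtm, 0.1)
print(colnames(toy_dtms))
```

Sorting before writing the .csv also makes the file easier to scan in Excel.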


Opened in Excel, the .csv shows a frequency table of the words in the lyrics, about 7000 words. A word cloud of the first 100 is possible but does not show much. Instead, select the city names and the blues-related words from the imported .csv and make a word cloud from them, for example in Tableau. Tableau is interesting for production because you can save your result as a .pdf (and turn it into an .svg for print), and also as an embedded link for online use.
Link to the Tableau dashboard: keywords songs:

Longer words draw more attention in a cloud, so you may prefer to turn the cloud into a bar graph, which is easy to do in Tableau.
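
The same selection and bar graph can also be sketched in R itself. The words and counts below are made up for illustration, and cities and blues_words are hypothetical hand-picked vectors; in practice the table would be the wf data frame built earlier:

```r
# Made-up frequency table standing in for wf
wf_demo <- data.frame(
  word = c("love", "memphis", "blues", "time", "mobile", "train", "lonesome"),
  freq = c(40, 6, 25, 30, 4, 12, 9),
  stringsAsFactors = FALSE
)

# Hypothetical hand-picked keyword lists
cities <- c("memphis", "mobile")
blues_words <- c("blues", "train", "lonesome")

# Keep only the selected keywords, most frequent first
keywords <- wf_demo[wf_demo$word %in% c(cities, blues_words), ]
keywords <- keywords[order(keywords$freq, decreasing = TRUE), ]

# Bar graph: bar height depends on the count only, not on word length
barplot(keywords$freq, names.arg = keywords$word, las = 2, ylab = "Frequency")

# Optional word cloud, drawn only if the wordcloud package is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(keywords$word, keywords$freq, min.freq = 1)
}
```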

Can we say something about the sentiment in the lyrics? For a sentiment analysis that counts positive and negative words we use the following recipe:
 
library("syuzhet")
 
We convert docs (from the start, still in memory) into a character vector, docs2, and calculate the sentiment:
 
docs2<-as.character(docs)
dSentiment <- get_nrc_sentiment(docs2)
dSentiment
 
 
  anger anticipation disgust fear joy sadness surprise trust negative positive
  275          246     209    358 225     314      157   326      662      541
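
Raw counts are hard to compare at a glance; shares make the balance clearer. A small sketch using only the totals printed above (the variable names are illustrative):

```r
# Totals copied from the get_nrc_sentiment output
emotions <- c(anger = 275, anticipation = 246, disgust = 209, fear = 358,
              joy = 225, sadness = 314, surprise = 157, trust = 326)
polarity <- c(negative = 662, positive = 541)

# Percentage share within the eight emotions and within the two polarities
emotion_share <- round(100 * emotions / sum(emotions), 1)
polarity_share <- round(100 * polarity / sum(polarity), 1)

print(sort(emotion_share, decreasing = TRUE))
print(polarity_share)
```

Fear and trust are the two largest emotion categories, and negative words outnumber positive ones by roughly 55% to 45%.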


Finally, let's plot these sentiment totals:
 
 
library(ggplot2)
# Sum each sentiment column and reshape into a two-column data frame
sentimentTotals <- data.frame(colSums(dSentiment))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
# Each line of the plot ends in "+" so that all layers are chained together
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
 geom_bar(aes(fill = sentiment), stat = "identity") +
 theme(legend.position = "none") +
 xlab("Sentiment") + ylab("Total Count") +
 ggtitle("Total Sentiment Score for Lyrics Bob Dylan")

