R or SAS?
I work with large data sets, which sometimes need to be unpivoted and can reach a few million rows. To get through them, I recently started (with breaks for nervous breakdowns) to learn SAS.
SAS requires a paid license, and its support documentation and tutorials are not easy to get through. My colleague suggested I try R, a programming language for statistical computing and graphics, in RStudio, a free and open source integrated development environment (IDE), as there are plenty of well-explained learning materials available online.
One example is Try R from Code School, which offers step-by-step basic R tutorials.
But R does not have the credibility that paid applications have.
I found this list on a Wikipedia page while having a conversation early one morning backstage at Vicar St about the main musical inspiration besides love. That would be drugs, obviously. So the next question was how, and which, drugs contribute to musicians’ deaths. And R, thanks to its graphical plotting capabilities, can help answer this question.
The list contains 98 musicians’ names (pop and other rubbish not included), their age at death, the country they died in and the drug responsible. What I wanted to see was which drug killed the most artists and whether the 27 Club is a thing.
RStudio can import data straight from a URL, from a folder we point it to on the drive, or from a previously set working directory. We can also create data frames directly in it.
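A minimal sketch of those loading options, using a tiny made-up data frame instead of the real list. The file name toy_drug_deaths.csv, the column names and all values are assumptions for illustration only, and the URL in the last comment is a placeholder.

```r
# Toy data, invented for illustration only (not the real Wikipedia list).
toy <- data.frame(
  Name = c("Artist A", "Artist B", "Artist C"),
  Age  = c(27, 31, 45),
  Drug = c("heroin", "cocaine", "heroin"),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV so the example has a file to read.
csv_path <- file.path(tempdir(), "toy_drug_deaths.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Option 1: read the file by its full path.
by_path <- read.csv(csv_path, as.is = TRUE)

# Option 2: set the working directory, then read by file name only.
old_wd <- setwd(tempdir())
by_name <- read.csv("toy_drug_deaths.csv", as.is = TRUE)
setwd(old_wd)  # restore the previous working directory

# Option 3: read.csv() also accepts a URL in place of a path, e.g.
#   read.csv("https://example.com/file.csv")   # placeholder address, not run
```

Both reads give the same data frame. The only difference is whether R is told the whole path or just the file name relative to the working directory.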
I set the working directory and loaded my CSV from there, naming my data frame data1:
data1 <- read.csv("drug_deaths.csv", as.is = TRUE)
I quickly checked that it worked by calling up the first 15 rows of the Age column:
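The exact command is not shown here, but head() is one common way to do this check. The sketch below builds a hypothetical stand-in for data1 with made-up ages, so the values are not from the real list.

```r
# Hypothetical stand-in for data1; the ages are made up.
data1 <- data.frame(
  Age = c(27, 34, 21, 45, 27, 38, 52, 29, 31, 24,
          63, 27, 36, 41, 30, 33, 28, 47, 25, 39)
)

# First 15 values of the Age column.
first15 <- head(data1$Age, 15)
print(first15)

# The same thing with plain indexing.
same <- data1$Age[1:15]
```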
Then, using the summary function, I can already check whether the 27 myth applies to my data sample:
Not really: the median is 31, four years above.
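As a rough sketch of that check, summary() on a numeric column reports the minimum, quartiles, median, mean and maximum in one call. The ages below are made up, so the numbers will not match the real list.

```r
# Made-up ages, for illustration only.
ages <- c(27, 34, 21, 45, 27, 38, 52, 29, 31, 24)

# Five-number summary plus the mean.
age_summary <- summary(ages)
print(age_summary)

# The median on its own, and how many died at exactly 27.
age_median <- median(ages)
n_27       <- sum(ages == 27)
share_27   <- mean(ages == 27)  # proportion of the sample aged 27
```

Comparing the median with 27, and looking at what share of the sample died at exactly 27, are two simple ways to see whether the myth shows up in the data.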
A simple boxplot function helps to visualize the numerical output:
boxplot(data1$Age)
The median sits slightly above 30, with a few values stretching above 60 and 70.
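To see which ages a boxplot would draw as separate points beyond the whiskers, boxplot.stats() returns the same numbers without drawing anything. This is only a sketch on made-up ages, and the axis label is my own choice.

```r
# Made-up ages with two unusually high values, for illustration only.
ages <- c(21, 24, 27, 27, 28, 29, 30, 31, 33, 34, 36, 38, 63, 71)

# Draw the boxplot with a labelled axis.
boxplot(ages, ylab = "Age at death")

# The numbers behind the plot: $stats holds the lower whisker, lower hinge,
# median, upper hinge and upper whisker; $out holds the points beyond the whiskers.
bs  <- boxplot.stats(ages)
out <- bs$out
print(bs$stats)
print(out)
```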
For now, my data frame looks like this:
To neatly plot and see the biggest contributors to musicians’ deaths, it is handy to convert the Drug column into a table of counts:
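The command itself is not shown here. One way to build such a table is table(), which counts how often each value occurs. The sketch below uses a made-up Drug column, and the name drugtable is taken from the barplot call in this post.

```r
# Hypothetical stand-in for data1 with a made-up Drug column.
data1 <- data.frame(
  Drug = c("heroin", "cocaine", "heroin", "alcohol", "heroin", "cocaine"),
  stringsAsFactors = FALSE
)

# Count how many times each drug appears.
drugtable <- table(data1$Drug)
print(drugtable)

# Sorting the counts puts the most frequent drug first.
sorted  <- sort(drugtable, decreasing = TRUE)
top_one <- names(sorted)[1]
```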
After telling R to draw the above data as a graph with barplot(drugtable), I got this:
Not all labels fit under the axis. To quickly fix this I used las=2:
barplot(drugtable, las=2)
Now all labels are displayed vertically and fit.
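With long labels, vertical text can still run off the bottom of the plot. A possible refinement, sketched on made-up counts, is to widen the bottom margin with par(mar = ...) before plotting. The margin values and axis label here are my own assumptions.

```r
# Made-up counts with some long labels, for illustration only.
drugtable <- table(c("heroin", "heroin", "heroin",
                     "cocaine", "cocaine",
                     "heroin and cocaine",
                     "prescription drugs"))

# Enlarge the bottom margin (order: bottom, left, top, right) so vertical labels fit.
old_par <- par(mar = c(10, 4, 2, 1))

# las = 2 draws axis labels perpendicular to the axis; sorting puts the tallest bar first.
mids <- barplot(sort(drugtable, decreasing = TRUE), las = 2,
                ylab = "Number of deaths")

# Restore the previous graphical settings.
par(old_par)
```

barplot() returns the x positions of the bars, which is handy if you later want to add text above them.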
At first glance at the graph, it is clear that heroin is the main culprit, leaving the rest of the substances behind.
What I don’t like about both R and SAS is the terminology. It is slightly different from what I am used to from working in Excel and similar apps. R and SAS developers came up with whole new naming, which I guess was needed, as these programs bring a new approach and new methods to the data analytics I have known so far. Knowing R and SAS now shows me how limited the MS Office package and similar tools are.
I find neither SAS nor R intuitive, but R just seems nicer. I am sticking with Spotfire though, as it offers a full graphical interface.