Monday 22 September 2014

WELCOME TO R SIDE OF LIFE


We have many programming languages many build with different capabilities but we can all agree that most languages are multipurpose.(e.g. you can build a simple calculator using all/almost all programming languages), but when it comes to statistical programming R is the King. What is R ? R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. I know you may be bored by this R and S language so this is my deal. I want you to follow up as I do/we do a simple statistical analysis on data obtained from WHO on TB burden estimate. To be specific we will analyze the estimated incidence of all types of TB in Kenya from 1990 to 2012.
This is what we are going to do in steps.
  1. We will download the data from WHO website.
  2. Upload it in R 
  3. Do some statistical analysis with the data 

Downloading the data

You can download data in R in different formats. This means data contained or saved using other programs such as SPSS and Excel can be easily be imported into R. In this case I would love to download my data directly into my disk and finally upload it to R. Note it’s also possible to download data directly into R (I will show you right now). My data is in CSV format (comma separated variable) which is one of the best formats to work with. Go to https://extranet.who.int/tme/generateCSV.asp?ds=estimates and download the data. (If you have installed R in Windows the default working directory (folder) is Documents so for those who have problem in specifying file path I would recommend you to move that file to Documents). You can get the working directory by typing this command getwd(). You can also set your working directory using this command setwd(“specify directory(folder)  path in here”). For those who would love to download the data directly try this. data.all = read.csv(“https://extranet.who.int/tme/generateCSV.asp?ds=estimates.csv”, header = T, sep = “,”)

Uploading data in R

Now we have our data already downloaded so let us get it loaded into R before we dive in the fun of doing statistical analysis. Before then let us look the data.(Comma separated files (.csv) can be opened using Microsoft Excel or any other spreadsheet program) You would realize it us 4905 rows and 39 columns(All the countries of the world are represented from Afghanistan to Zimbabwe  each having 23 rows).
If you have moved your data into your working directory the code is very simple.
> data.all = read.table("TB_burden_countries_2014-09-23.csv", header = T, sep = ",")

If not it’s still simple , assuming you never moved the file from downloads (in windows) you can use this code.
> data.all = read.table("C:/Users/"YOurcompname"/Downloads/TB_burden_countries_2014-09-23.csv", header = T, sep = ",")

Doing statistical analysis on the data

Good now we have our data so let me show you why I call R the king of statistical programming. Since we have downloaded the whole worlds data (TB burden estimates ) . Let us specifically get the Kenyans data from this whole data. For new comers just follow me for those who are already familiar with R I will have to mention that R imports data with many variables as a data frame. If you look carefully you will find that Kenya data is on rows 2284 to 2306. This is how I will get it using this command.
>data.kenya = data.all[c(2284:2306),]  
 
I have used a technique called indexing to put together the rows which contain Kenyan data from data.all and assigned them to the object data.kenya. We can check the dimensions and structure of our new data.kenya using this codes.
>dim(data.kenya)     AND      >str(data.kenya)
 
 
We can also view the data using this code >View(data.kenya)
Note, I am using Rstudio data view on large window.

The real work


My objective is to determine whether there is a relationship between years and Estimated number of incident cases (all forms) which is the variable (column) abbreviated as e_inc_num. With all data analysis the first step is always to explore the data. In this case, scatter plots are very useful in determining whether or not the relationships between the variables are linear.
Let us plot a simple scatter plot using this code:
>plot(data.kenya$year, data.kenya$e_inc_num)

Before proceeding further let us attach our data so that we can refer our variables without having to specify the data frame where they come from. I mean
>attach(data.kenya)
 We can plot the same scatter plot using this code:
>plot(year, e_inc_num)

The above command gives a simple scatter plot with the first variable on the horizontal axis and the second on the vertical axis.
The scatter plot needs to be beautifully represented just copy paste this code into R and enter I will explain it now.
>plot(year, e_inc_num, main = "Relationship Btn Yrs and No of TB Incidences ", col.main = "red", sub = "Kenya", col.sub = "blue" ,xlab = "Years", ylab = "No of TB Incidences",col.lab = "red", pch = 13, col = "red")

                                                                                
The main = Title of the plot, col.main = the color of main, sub = subtitle of the plot, col.sub = the color of subtitle, xlab and ylab = name of x axis and y axis respectively, col. Lab = color of  axis, pch = The shape of the dots(change it to any number), col = color of the dots.
Now that we have our plot let us try to fit a linear model to this data of ours

Fitting linear model

The function lm is used to perform linear modeling in R .Use the code below to create the linear model.
>linear.kenya = lm(e_inc_num~ year)

Displaying the model by typing 'linear.kenya' gives limited information. To get more information, one can look at the attributes of this model, its summary and attributes of its summary.
>attr(linear.kenya,"names")

[1] "coefficients"  "residuals"     "effects"     
 [4] "rank"          "fitted.values" "assign"      
 [7] "qr"            "df.residual"   "xlevels"     
[10] "call"          "terms"         "model

There are 12 attributes. Most of them can be displayed with the summary function.
lm(formula = e_inc_num ~ year)

Residuals:
   Min     1Q Median     3Q    Max
-25554  -9658  -1761  11272  21500

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -1.037e+07  8.588e+05  -12.08 6.44e-11 ***
year         5.228e+03  4.292e+02   12.18 5.51e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13650 on 21 degrees of freedom
Multiple R-squared:  0.876,    Adjusted R-squared:  0.8701
F-statistic: 148.4 on 1 and 21 DF,  p-value: 5.51e-11

The first section of summary shows the formula that was 'called'. The second section gives the distribution of residuals. The pattern is clearly not symmetric. The maximum is too far on the right (21500) compared to the minimum (-25554) and the first quartile is further left(-9658) of the median(-1761) than the third quartile(11272) is. For those who are not statisticians I know this is boring let me leave . Statisticians can go deeper and analyze this  results deeper.

Regression line, fitted values and residuals

A regression line can be added to the scatter plot with the following command:
>abline(linear.kenya)


The expected value is the no of expected TB incidences estimated from the regression line using a specific year.
>points(year, fitted(linear.kenya), pch=18, col="blue")


A residual is the difference between the observed and expected value. The residuals can be drawn by the following command.
>segments(year, e_inc_num, year, fitted(linear.kenya), c="pink")


The actual values of the residuals can be checked from the specific attribute of the defined linear model.

>kenyamodelresduals = residuals(linear.kenya)
> kenyamodelresduals

         1          2          3          4          5
  2467.391  -1760.870  -5989.130  -8217.391  -9445.652
         6          7          8          9         10
-10673.913  -9902.174  -8130.435  -4358.696   1413.043
        11         12         13         14         15
  6184.783  11956.522  16728.261  21500.000  16271.739
        16         17         18         19         20
 21043.478  15815.217  10586.957   5358.696  -9869.565
        21         22         23
-15097.826 -20326.087 -25554.348



The sum of the residuals and the sum of their squares can be checked.
> sum(kenyamodelresduals)
[1] -2.728484e-12
> sum(kenyamodelresduals^2)
[1] 3914228261

The sum of residuals is close to zero whereas the sum of their squares is the value previously displayed in the summary of the model. The distribution of residuals, if the model fits well, should be normal. A common sense approach is to look at the histogram.
>hist(kenyamodelresduals)
 Based on the number of observations it can be hard to say that the residuals are normally distributed . I would love to first leave you on this point to make your own conclusions. I would come later to tell how to interpret the results clearly and accurately.