We have many programming languages many build with different
capabilities but we can all agree that most languages are multipurpose.(e.g.
you can build a simple calculator using all/almost all programming languages),
but when it comes to statistical programming R is the King. What is R ? R is a language and environment for
statistical computing and graphics. It is a GNU project which is similar to the S language and
environment which was developed at Bell Laboratories (formerly AT&T, now
Lucent Technologies) by John Chambers and colleagues. R can be considered as a
different implementation of S. There are some important differences, but much
code written for S runs unaltered under R. I know you may be bored by this R and S
language so this is my deal. I want you to follow up as I do/we do a simple
statistical analysis on data obtained from WHO on TB burden estimate. To be
specific we will analyze the estimated incidence of all types of TB in Kenya
from 1990 to 2012.
This is what we
are going to do in steps.
- We will download the data from WHO website.
- Upload it in R
- Do some statistical analysis with the data
Downloading the data
You can download data in R in different formats. This means
data contained or saved using other programs such as SPSS and Excel can be easily
be imported into R. In this case I would love to download my data directly into
my disk and finally upload it to R. Note it’s also possible to download data
directly into R (I will show you right now). My data is in CSV format (comma separated
variable) which is one of the best formats to work with. Go to https://extranet.who.int/tme/generateCSV.asp?ds=estimates
and download the data. (If you have installed R in Windows the default working directory
(folder) is Documents so for those who have problem in specifying file path I
would recommend you to move that file to Documents). You can get the working
directory by typing this command getwd(). You can also set your
working directory using this command setwd(“specify directory(folder) path in here”). For those who would
love to download the data directly try this. > data.all = read.csv(“https://extranet.who.int/tme/generateCSV.asp?ds=estimates.csv”,
header = T, sep = “,”)
Uploading data in R
Now we have our data already downloaded so let us get it
loaded into R before we dive in the fun of doing statistical analysis. Before
then let us look the data.(Comma separated files (.csv) can be opened
using Microsoft Excel or any other spreadsheet program) You would realize
it us 4905 rows and 39 columns(All the countries of the world are represented
from Afghanistan to Zimbabwe each having
23 rows).
If you have moved your data into your working directory the
code is very simple.
> data.all =
read.table("TB_burden_countries_2014-09-23.csv", header = T, sep =
",")
If not it’s still simple , assuming you never moved the file
from downloads (in windows) you can use this code.
> data.all = read.table("C:/Users/"YOurcompname"/Downloads/TB_burden_countries_2014-09-23.csv", header = T, sep = ",")
Doing statistical analysis on the data
Good now we have our data so let me show you why I call R
the king of statistical programming. Since we have downloaded the whole worlds data
(TB burden estimates ) . Let us specifically get the Kenyans data from this
whole data. For new comers just follow me for those who are already familiar with
R I will have to mention that R imports data with many variables as a data
frame. If you look carefully you will find that Kenya data is on rows 2284 to
2306. This is how I will get it using this command.
>data.kenya = data.all[c(2284:2306),]
I have used a technique called indexing to put
together the rows which contain Kenyan data from data.all and assigned them to
the object data.kenya. We can check the dimensions and structure of our new
data.kenya using this codes.
>dim(data.kenya) AND >str(data.kenya)
We can also view the data
using this code >View(data.kenya)
![]() |
Note, I am using Rstudio data view on large window. |
The real work
My objective is to determine whether there is a relationship
between years and Estimated number of
incident cases (all forms) which is the variable (column)
abbreviated as e_inc_num. With all data analysis the first step is always to
explore the data. In this case, scatter plots are very useful in determining
whether or not the relationships between the variables are linear.
Let us plot a simple scatter plot using this code:
>plot(data.kenya$year,
data.kenya$e_inc_num)
Before proceeding further let us attach our data so that we
can refer our variables without having to specify the data frame where they
come from. I mean
>attach(data.kenya)
>plot(year, e_inc_num)
The above command gives a
simple scatter plot with the first variable on the horizontal axis and the
second on the vertical axis.
The scatter plot needs to be beautifully represented just copy
paste this code into R and enter I will explain it now.
>plot(year, e_inc_num, main =
"Relationship Btn Yrs and No of TB Incidences ", col.main =
"red", sub = "Kenya", col.sub = "blue" ,xlab =
"Years", ylab = "No of TB Incidences",col.lab =
"red", pch = 13, col = "red")
The main = Title of the plot, col.main = the color of main,
sub = subtitle of the plot, col.sub = the color of subtitle, xlab and ylab =
name of x axis and y axis respectively, col. Lab = color of axis, pch = The shape of the dots(change it
to any number), col = color of the dots.
Now that we have our plot let us try to fit a linear model
to this data of ours
Fitting linear model
The function lm is used to perform linear modeling in R .Use
the code below to create the linear model.
>linear.kenya = lm(e_inc_num~
year)
Displaying the model by typing 'linear.kenya' gives limited
information. To get more information, one can look at the attributes of this model,
its summary and attributes of its summary.
>attr(linear.kenya,"names")
[1] "coefficients" "residuals" "effects"
[4]
"rank"
"fitted.values" "assign"
[7]
"qr"
"df.residual"
"xlevels"
[10] "call" "terms" "model
There are 12 attributes.
Most of them can be displayed with the summary function.
lm(formula = e_inc_num ~ year)
Residuals:
Min 1Q Median 3Q
Max
-25554
-9658 -1761 11272
21500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.037e+07 8.588e+05
-12.08 6.44e-11 ***
year
5.228e+03 4.292e+02 12.18 5.51e-11 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13650 on 21 degrees of
freedom
Multiple R-squared: 0.876, Adjusted
R-squared: 0.8701
F-statistic: 148.4 on 1 and 21 DF, p-value: 5.51e-11
The first section of
summary shows the formula that was 'called'. The second section gives the
distribution of residuals. The pattern is clearly not symmetric. The maximum is
too far on the right (21500) compared to the minimum (-25554) and the first quartile
is further left(-9658) of the median(-1761) than the third quartile(11272) is.
For those who are not statisticians I know this is boring let me leave . Statisticians
can go deeper and analyze this results
deeper.
Regression
line, fitted values and residuals
A regression line can be added to the scatter plot with the
following command:
>abline(linear.kenya)
The expected value is the
no of expected TB incidences estimated from the regression line using a
specific year.
>points(year,
fitted(linear.kenya), pch=18, col="blue")
A residual is the
difference between the observed and expected value. The residuals can be drawn
by the following command.
>segments(year, e_inc_num,
year, fitted(linear.kenya), c="pink")
The actual values of the
residuals can be checked from the specific attribute of the defined linear
model.
>kenyamodelresduals =
residuals(linear.kenya)
> kenyamodelresduals
1 2 3 4 5
2467.391 -1760.870 -5989.130
-8217.391 -9445.652
6
7 8 9 10
-10673.913
-9902.174 -8130.435 -4358.696
1413.043
11 12 13 14 15
6184.783 11956.522 16728.261
21500.000 16271.739
16 17 18 19
20
21043.478
15815.217 10586.957 5358.696
-9869.565
21 22 23
-15097.826 -20326.087 -25554.348
The sum of the residuals and the sum of their squares can be
checked.
> sum(kenyamodelresduals)
[1] -2.728484e-12
> sum(kenyamodelresduals^2)
[1] 3914228261
The sum of residuals is close to zero whereas the sum of
their squares is the value previously displayed in the summary of the model.
The distribution of residuals, if the model fits well, should be normal. A
common sense approach is to look at the histogram.
>hist(kenyamodelresduals)
Based on the number
of observations it can be hard to say that the residuals are normally
distributed . I would love to first leave you on this point to make your own conclusions.
I would come later to tell how to interpret the results clearly and accurately.
No comments:
Post a Comment