I downloaded data on the global tuberculosis (TB) burden from the WHO (World Health Organization) website and analyzed it with Python (2.7) and Microsoft Azure Machine Learning Studio. I wanted to work only with East African data, so I selected the rows for East African countries. I used a function to read the data into Python.
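The loading function itself is not reproduced in the text, so what follows is only a minimal sketch of what such a loader could look like, assuming the WHO download is a CSV file. The function name `read_tb_data` and the file name in the usage comment are hypothetical.

```python
import pandas as pd


def read_tb_data(path_or_buffer):
    """Read the WHO TB burden CSV into a pandas DataFrame."""
    return pd.read_csv(path_or_buffer)


# Hypothetical usage; the file name is an assumption, use your own download:
# tb_world = read_tb_data("TB_burden_countries.csv")
# print(tb_world.shape)
```

Printing `shape` right after loading is a quick way to confirm the row and column counts before doing anything else.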
Here is a brief description of the data set. The initial data set contains 5337 rows and 48 columns; the picture below shows part of the data opened in Excel.
The first task is to select the rows containing data for East African
countries using a function. Note that the function has one parameter,
which accepts a data frame. In my case the data frame is the whole data set,
containing all TB burden entries for the whole world.
I used the pandas .loc indexer on the ‘country’ column together with the
.isin method, which keeps the rows whose country appears in a list
containing the names of the East African countries.
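The selection described above can be sketched as follows. This is not the original function, only one way to combine `.loc` and `.isin` under stated assumptions: the name `select_east_africa` is hypothetical, and the country spellings are an assumption that should be checked against the ‘country’ column of your own file (WHO data usually lists Tanzania as "United Republic of Tanzania").

```python
import pandas as pd

# Assumed spellings; verify them against the 'country' column of your file.
EAST_AFRICA = [
    "Kenya",
    "Uganda",
    "United Republic of Tanzania",
    "Rwanda",
    "Burundi",
]


def select_east_africa(df, countries=EAST_AFRICA):
    """Return only the rows whose 'country' value is in `countries`."""
    mask = df["country"].isin(countries)  # boolean Series, one entry per row
    return df.loc[mask].reset_index(drop=True)
```

If a country is misspelled in the list, `.isin` silently returns no rows for it, so it is worth checking `result["country"].unique()` after filtering.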
Simple Descriptive Statistics
Next I would like to visualize the data to gain deeper insight into it. After filtering, the data set has 125 rows and 48 columns. Before plotting, I would first like to show summary statistics for some of the columns that I think are key to understanding this data set:
- e_pop_num - Estimated total population number
- e_mort_exc_tbhiv_num - Estimated number of deaths from TB (all forms, excluding HIV)
- e_mort_tbhiv_num - Estimated number of deaths from TB in people who are HIV-positive
- e_inc_num - Estimated number of incident cases (all forms of TB)
- e_tbhiv_prct - Estimated HIV in incident TB (percent)
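Summary statistics for the columns listed above can be produced with pandas `describe()`. This is a minimal sketch, assuming the filtered data is held in a data frame with exactly these column names; the helper name `summarize` is hypothetical.

```python
import pandas as pd

KEY_COLUMNS = [
    "e_pop_num",
    "e_mort_exc_tbhiv_num",
    "e_mort_tbhiv_num",
    "e_inc_num",
    "e_tbhiv_prct",
]


def summarize(df, columns=KEY_COLUMNS):
    """Return count, mean, std, min, quartiles and max for the given columns."""
    return df[columns].describe()
```

The result is itself a data frame, so a single value such as the mean of one column can be read with `summarize(df).loc["mean", "e_mort_tbhiv_num"]`.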
The data covers 1990 to 2014. Note that the average number of deaths
caused by TB among HIV-positive patients for the whole East Africa region (Kenya,
Uganda, Tanzania, Rwanda and Burundi) is 15746.
The average number of deaths caused by TB among HIV-negative patients over the
same period is 9154. Also note that in both cases the standard deviation is
bigger than the mean, which implies that many points in the data are far from the
sample mean.
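One way to put a number on "standard deviation bigger than the mean" is the coefficient of variation, the standard deviation divided by the mean; a value above 1 means the spread exceeds the mean. The sketch below uses made-up numbers, not values from the WHO data, and the function name is hypothetical.

```python
import pandas as pd


def coefficient_of_variation(values):
    """Sample standard deviation (ddof=1) divided by the mean."""
    series = pd.Series(values, dtype="float64")
    return series.std() / series.mean()


# Made-up numbers for illustration only: mean is 25, sample std is 50.
example = [0, 0, 0, 100]
print(coefficient_of_variation(example))  # ratio of std to mean
```

On the real data the same function could be applied to a single column, for example `coefficient_of_variation(df["e_mort_tbhiv_num"])`.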
Understanding the distribution of the data
From the figure above we can see that e_mort_exc_tbhiv_num (Estimated number of deaths from TB (all forms, excluding HIV)) has many
outliers (extreme values) far above the upper quartile. We also note that e_mort_tbhiv_num (Estimated number of
deaths from TB in people who are HIV-positive) has a large interquartile range,
suggesting that the values are widely spread.
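The quantities read off the box plot can also be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers, which is the default whisker length in matplotlib box plots; the function name `iqr_outlier_summary` is hypothetical.

```python
import pandas as pd


def iqr_outlier_summary(values, whisker=1.5):
    """Return the quartiles, the IQR and the number of points beyond the whiskers."""
    series = pd.Series(values, dtype="float64").dropna()
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    upper_fence = q3 + whisker * iqr
    lower_fence = q1 - whisker * iqr
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "n_high_outliers": int((series > upper_fence).sum()),
        "n_low_outliers": int((series < lower_fence).sum()),
    }
```

Calling it on each mortality column, for example `iqr_outlier_summary(df["e_mort_exc_tbhiv_num"])`, gives a numeric counterpart to what the box plot shows visually.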
Next I will draw box plots and bar graphs for each of
the East African countries using data from the following columns:
- e_mort_exc_tbhiv_num - Estimated number of deaths from TB (all forms, excluding HIV)
- e_mort_tbhiv_num - Estimated number of deaths from TB in people who are HIV-positive
The box plots help us understand the distribution of the
estimated number of deaths from TB (all forms, excluding HIV) and the estimated
number of deaths from TB in people who are HIV-positive from 1990 to 2014.
Note that the middle ‘box’ represents the middle 50% of the values for the specific
group.
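A per-country box plot and bar graph of the two mortality columns could be produced along these lines. This is a sketch rather than the code behind the figures in this post: it assumes the data frame has ‘country’ and ‘year’ columns (the ‘year’ column name is an assumption), and the function name `plot_country` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

MORTALITY_COLUMNS = ["e_mort_exc_tbhiv_num", "e_mort_tbhiv_num"]


def plot_country(df, country):
    """Draw a box plot and a yearly bar graph of TB deaths for one country."""
    # Assumes 'country' and 'year' columns exist in the data frame.
    data = df.loc[df["country"] == country].sort_values("year")
    fig, (ax_box, ax_bar) = plt.subplots(1, 2, figsize=(12, 4))

    # Box plot: one box per mortality column, summarizing all years.
    data[MORTALITY_COLUMNS].plot(kind="box", ax=ax_box)
    ax_box.set_title(country + " box plot")

    # Bar graph: both mortality columns side by side for each year.
    data.plot(kind="bar", x="year", y=MORTALITY_COLUMNS, ax=ax_bar)
    ax_bar.set_title(country + " bar graph")
    ax_bar.set_ylabel("Estimated deaths")

    fig.tight_layout()
    return fig
```

Calling `plot_country(df, "Kenya")` followed by `plt.show()` would display the pair of plots for one country; looping over the list of country names repeats this for the others.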
Kenya
KENYA BOX PLOT |
KENYA BAR GRAPH |
Tanzania
TANZANIA BOX PLOT |
TANZANIA BAR GRAPH |
Uganda
UGANDA BOX PLOT |
UGANDA BAR PLOT |
Rwanda
RWANDA BOX PLOT |
RWANDA BAR PLOT |
Burundi
BURUNDI BOX PLOT |
BURUNDI BAR PLOT |
I would love to hear from you on how I can improve this analysis.