Bimajolt: Exploring WHO TB Dataset using Python

Monday, 5 September 2016

Exploring WHO TB Dataset using Python - Part 1

Introduction

I downloaded data on the global TB burden from the WHO (World Health Organization) website, and I used Python (2.7) and Microsoft Azure Machine Learning Studio for the analysis. I wanted to work only with East African data, so I went ahead and selected the rows containing East African data. I used this function to read the data into Python.
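As a rough sketch of the reading step, assuming the WHO download is a plain CSV file with a header row: the function name read_tb_data and the file name in the comment are illustrative, not taken from the original code, and the snippet is written for a current pandas rather than the exact 2016 setup.

```python
import pandas as pd


def read_tb_data(path):
    """Read a WHO TB burden CSV export into a pandas DataFrame."""
    # The first row of the file is treated as the header;
    # pandas infers the type of each column.
    return pd.read_csv(path)


# Hypothetical file name; replace it with the path of the downloaded file.
# tb = read_tb_data("TB_burden_countries.csv")
# print(tb.shape)  # number of rows and columns
```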

I will try to describe the data set here. The initial data set contains 5337 rows and 48 columns; the picture below shows part of the data opened in Excel.

The first task is to select the rows containing data for East African countries using this function. Note that the function has one parameter, which accepts a data frame. In my case the data frame is the whole data set, containing all the TB burden entries for the whole world.

I used the pandas .loc method: I set the index to the column ‘country’ and used the .isin method to match it against the elements of the preceding list, which contains the names of the East African countries.
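The selection described above could look roughly like the sketch below. The function name select_east_africa and the exact spellings in the country list are assumptions (in particular, the WHO file is assumed to list Tanzania as "United Republic of Tanzania"), and the small demo frame is invented purely to show the behaviour.

```python
import pandas as pd

# Assumed spellings of the country names as they appear in the WHO file.
EAST_AFRICA = [
    "Kenya",
    "Uganda",
    "United Republic of Tanzania",
    "Rwanda",
    "Burundi",
]


def select_east_africa(df, countries=EAST_AFRICA):
    """Return only the rows of df whose country is in `countries`."""
    indexed = df.set_index("country")
    # isin builds a boolean mask over the index; .loc keeps the True rows.
    return indexed.loc[indexed.index.isin(countries)]


# Tiny made-up frame standing in for the full data set.
demo = pd.DataFrame({
    "country": ["Kenya", "France", "Rwanda", "Kenya"],
    "year": [1990, 1990, 1990, 1991],
})
east = select_east_africa(demo)
print(east)
```

Setting the index first means the result is indexed by country, which makes it convenient to pull out one country at a time later with .loc.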

Simple Descriptive Statistics

Next I would like to visualize the data so that we can get deeper insight into it. After the selection, the data has 125 rows and 48 columns. Before we go into plotting, I would first like to look at some summary statistics for the columns which I think are key to understanding this data set.

e_pop_num - Estimated total population number
e_mort_exc_tbhiv_num - Estimated number of deaths from TB (all forms, excluding HIV)
e_mort_tbhiv_num - Estimated number of deaths from TB in people who are HIV-positive
e_inc_num - Estimated number of incident cases (all forms of TB)
e_tbhiv_prct - Estimated HIV in incident TB (percent)

The data covers 1990 to 2014. Note that the average number of deaths caused by TB in HIV-positive patients for the whole East Africa region (Kenya, Uganda, Tanzania, Rwanda and Burundi) is 15746.

The average number of deaths caused by TB in HIV-negative patients from 1990 is 9154. Also note that in both cases the standard deviation is bigger than the mean, which implies that the values are widely dispersed, with many data points far from the sample mean.
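A minimal sketch of how such summary statistics can be produced with pandas, and how the standard deviation can be compared with the mean. The helper name summarize and the demo numbers are invented for illustration; only the two column names come from the data set.

```python
import pandas as pd

KEY_COLUMNS = ["e_mort_exc_tbhiv_num", "e_mort_tbhiv_num"]


def summarize(df, columns=KEY_COLUMNS):
    """Summary statistics plus the ratio of standard deviation to mean."""
    stats = df[columns].describe()
    # A ratio above 1 means the standard deviation exceeds the mean.
    stats.loc["std_over_mean"] = stats.loc["std"] / stats.loc["mean"]
    return stats


# Made-up values, not WHO figures.
demo = pd.DataFrame({
    "e_mort_exc_tbhiv_num": [100, 200, 300, 9000],
    "e_mort_tbhiv_num": [50, 60, 70, 80],
})
summary = summarize(demo)
print(summary)
```

In the demo, the first column has one very large value, so its ratio is above 1, while the evenly spaced second column has a ratio well below 1.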

Understanding the distribution of the data

From the figure above we can see that e_mort_exc_tbhiv_num (Estimated number of deaths from TB (all forms, excluding HIV)) has many outliers (extreme values) far above the upper quartile. We also note that e_mort_tbhiv_num (Estimated number of deaths from TB in people who are HIV-positive) has a large interquartile range, suggesting that the middle half of the data is widely spread.
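To make the terms concrete, here is a small sketch that computes the interquartile range of a column and flags outliers using the common 1.5 × IQR rule. This rule is an assumption: the post does not say which convention the figure used, although it is the default for box plots in matplotlib. The function name and the demo values are illustrative.

```python
import pandas as pd


def iqr_outliers(series, whis=1.5):
    """Return the interquartile range and the values outside the whiskers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    outliers = series[(series < lower) | (series > upper)]
    return iqr, outliers


# Made-up values with one obvious extreme point.
demo = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])
iqr, outliers = iqr_outliers(demo)
print(iqr)
print(outliers.tolist())
```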

I will go ahead to plot box plots and bar graphs for each of the East African countries using data from the following columns:

e_mort_exc_tbhiv_num - Estimated number of deaths from TB (all forms, excluding HIV)
e_mort_tbhiv_num - Estimated number of deaths from TB in people who are HIV-positive

The box plot helps us to understand the distribution of the Estimated number of deaths from TB (all forms, excluding HIV) and the Estimated number of deaths from TB in people who are HIV-positive from 1990 to 2014. Note that the middle ‘box’ represents the middle 50% of the values for the specific group.
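A per-country box plot of these two columns could be drawn along the lines below. This is a sketch with matplotlib, assuming the data frame is indexed by country as in the selection step; the function name country_box_plot and the demo numbers are invented, and the original plots may have been produced differently.

```python
import matplotlib.pyplot as plt
import pandas as pd

COLUMNS = ["e_mort_exc_tbhiv_num", "e_mort_tbhiv_num"]


def country_box_plot(df, country, columns=COLUMNS):
    """Draw one box per column for a single country; df is indexed by country."""
    rows = df.loc[[country], columns]
    fig, ax = plt.subplots()
    # One box per column, using that country's yearly values.
    ax.boxplot([rows[c].dropna().values for c in columns])
    ax.set_xticklabels(columns)
    ax.set_ylabel("Estimated number of deaths")
    ax.set_title(country)
    return ax


# Made-up values, not WHO figures.
demo = pd.DataFrame(
    {
        "e_mort_exc_tbhiv_num": [900, 1000, 1100, 400],
        "e_mort_tbhiv_num": [1500, 2500, 4000, 700],
    },
    index=pd.Index(["Kenya", "Kenya", "Kenya", "Uganda"], name="country"),
)
ax = country_box_plot(demo, "Kenya")
# plt.show()  # uncomment to display the figure interactively
```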

Kenya

KENYA BOX PLOT