Monday 5 September 2016

Exploring WHO TB Dataset using Python - Part 2

 Investigating Relationship between variables

I would like to investigate the relationships between the variables, so I will compute the correlation and covariance between several pairs of them. I will emphasize correlation because, for the data we are using, covariance may not have a meaningful interpretation.
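One reason covariance is hard to interpret here is that its size depends on the units of the variables. The sketch below uses made-up numbers (not WHO figures) and a hypothetical helper named covariance to show that expressing the same case counts in thousands shrinks the covariance by a factor of 1000, even though the relationship itself has not changed.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Made-up values for illustration only, not WHO figures.
cases = [1000, 2500, 4000, 8000, 12000]
deaths = [120, 260, 500, 900, 1300]

cov_raw = covariance(cases, deaths)
# Same data, with cases expressed in thousands instead of single cases.
cov_thousands = covariance([c / 1000 for c in cases], deaths)

# The two values differ by a factor of 1000 purely because of the unit change.
print(cov_raw, cov_thousands)
```

Correlation divides out the spread of each variable, so it does not change when the units change, which makes it easier to compare across pairs of variables.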

Relationship between e_mort_exc_tbhiv_num and e_inc_num  in East Africa
  • e_mort_exc_tbhiv_num - Estimated number of deaths from TB (all forms, excluding HIV)
  • e_inc_num - Estimated number of incident cases (all forms of TB)
 
We can see that there is a high correlation (0.8875) between the estimated number of deaths from TB (excluding HIV) and the estimated number of incident TB cases. Correlation takes values between -1 and +1: -1 means a perfect negative linear relationship, 0 means no linear relationship, and +1 means a perfect positive linear relationship.
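To make the meaning of the coefficient concrete, here is a minimal sketch of the Pearson correlation written out by hand. The values are made up for illustration and are not WHO figures, and the helper name pearson is my own. Flipping the sign of one variable flips the sign of the correlation but not its strength.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(dev_x * dev_y)

# Made-up values for illustration only, not WHO figures.
e_inc_num = [1000, 2500, 4000, 8000, 12000]
e_mort_exc_tbhiv_num = [120, 260, 500, 900, 1300]

r = pearson(e_inc_num, e_mort_exc_tbhiv_num)
# Negating one variable reverses the direction of the relationship.
r_flipped = pearson(e_inc_num, [-y for y in e_mort_exc_tbhiv_num])

print(round(r, 4), round(r_flipped, 4))
```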

 








 Relationship between e_inc_num and e_tbhiv_prct in East Africa
  • e_inc_num - Estimated number of incident cases (all forms of TB)
  • e_tbhiv_prct - Estimated HIV in incident TB (percent)
 
We got a correlation of 0.32621. The number seems low, but the comparison is not like for like: e_tbhiv_prct is a percentage, while e_inc_num is a count. So I will convert the percentage into an actual figure for estimated HIV in incident TB and then recalculate the correlation.
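A minimal sketch of that conversion, assuming the actual figure is simply the incident case count multiplied by the HIV percentage. The function name and the rows are hypothetical, and the numbers are not WHO figures.

```python
def hiv_positive_incident_cases(e_inc_num, e_tbhiv_prct):
    """Convert a percentage of incident cases into an estimated count."""
    return e_inc_num * e_tbhiv_prct / 100.0

# Made-up rows for illustration only, not WHO figures.
rows = [
    {"e_inc_num": 10000, "e_tbhiv_prct": 35.0},
    {"e_inc_num": 50000, "e_tbhiv_prct": 20.0},
]

for row in rows:
    # Assumed definition of actual_fig: share of incident cases, as a count.
    row["actual_fig"] = hiv_positive_incident_cases(
        row["e_inc_num"], row["e_tbhiv_prct"]
    )
    print(row)
```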



After calculating the actual figures, the results are amazing.
  • e_inc_num - Estimated number of incident cases (all forms of TB)
  • actual_fig - Estimated HIV in incident TB (figure)

From the scatter plot above, this appears to be a strongly linear relationship. The correlation obtained is an amazing 0.94893, close to 0.95.


 
 
Relationship between e_mort_tbhiv_num and actual_fig in East Africa 
  • e_mort_tbhiv_num - Estimated number of deaths from TB in people who are HIV-positive
  • actual_fig - Estimated HIV in incident TB (figure)

From the plot above, the two variables appear to have a linear relationship, which is reflected in the high correlation (0.9325) between them, as indicated by the results below.


   

Creating Regression Model using Machine Learning

Now that we have established the relationships in our data set, we can go ahead and build a simple machine learning regression model to predict the estimated number of deaths from TB in people who are HIV-positive, given the estimated HIV in incident TB (figure). We are going to accomplish that task using Microsoft Azure Machine Learning Studio.

 


The resulting feature weights are shown below. These are the figures that form our regression line, that is:
Estimated number of deaths from TB in people who are HIV-positive = -3070.02 + 0.50218 × (Estimated HIV in incident TB)


      

After getting the model I used the formula to compute some of the predicted values in Excel as shown below.
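The same calculation can be sketched in Python. The intercept and slope are the ones reported in the regression formula above. The function name and the input values are hypothetical and only serve to show the arithmetic.

```python
INTERCEPT = -3070.02  # bias term from the regression formula in this post
SLOPE = 0.50218       # weight for Estimated HIV in incident TB (figure)

def predict_e_mort_tbhiv_num(actual_fig):
    """Predicted deaths from TB in HIV-positive people for a given actual_fig."""
    return INTERCEPT + SLOPE * actual_fig

# Hypothetical inputs, chosen only to illustrate the calculation.
for actual_fig in (10000, 50000, 100000):
    print(actual_fig, round(predict_e_mort_tbhiv_num(actual_fig), 2))
```

Because the intercept is negative, inputs below roughly 6,100 produce a negative prediction, so the line is only sensible for larger values of actual_fig.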


          


Finally, I evaluated the model to check how well it fits the data.

      


The coefficient of determination (R²) summarizes the explanatory power of the regression model. If the regression model is perfect, R² is 1. If the regression model is a total failure, R² is zero. This model is not badly off, with an R² of 0.869699.
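For reference, here is a minimal sketch of how R² can be computed from actual and predicted values, using the usual definition of one minus the ratio of residual to total sum of squares. The helper name r_squared and the numbers are made up for illustration; this is not the evaluation output from Azure Machine Learning Studio.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
perfect_predictions = list(actual)   # model reproduces every value exactly
mean_predictions = [250.0] * 4       # model always predicts the mean

r2_perfect = r_squared(actual, perfect_predictions)
r2_mean = r_squared(actual, mean_predictions)
print(r2_perfect, r2_mean)
```

A model that always predicts the mean scores zero under this definition, which is the "total failure" baseline described above.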

