Investigating Relationships Between Variables
I would like to investigate the relationships between the
variables, so I will compute the correlation and covariance between pairs of variables.
I will emphasize correlation because, for the data we are using, covariance
depends on the units of each variable and may not have a meaningful interpretation.
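To illustrate why covariance is hard to interpret here, the short sketch below computes both measures on made-up numbers (not values from the TB data set) and then rescales one variable. The function names and the sample values are assumptions chosen only for illustration.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up illustrative numbers, NOT taken from the TB data set.
cases = [1200.0, 3400.0, 5600.0, 9100.0, 15000.0]
deaths = [150.0, 390.0, 800.0, 1000.0, 2100.0]

# Same quantity expressed in thousands of cases instead of single cases.
cases_in_thousands = [c / 1000 for c in cases]

# Covariance changes by a factor of 1000 when the unit changes...
print(covariance(cases, deaths), covariance(cases_in_thousands, deaths))
# ...while correlation stays the same.
print(correlation(cases, deaths), correlation(cases_in_thousands, deaths))
```

Because covariance changes with the unit of measurement while correlation does not, correlation is the easier number to compare across pairs of variables.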
Relationship between e_mort_exc_tbhiv_num and e_inc_num in East Africa
- e_mort_exc_tbhiv_num - Estimated number of deaths from TB (all forms, excluding HIV)
- e_inc_num - Estimated number of incident cases (all forms of TB)
We can see that there is a high correlation (0.8875) between the estimated
number of deaths from TB (excluding HIV) and the estimated number of incident TB cases.
Correlation takes values between -1 and +1: -1 means a perfect negative linear
relationship, 0 means no linear relationship, and +1 means a perfect positive
linear relationship.
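As a quick check on how to read these values, here is a minimal sketch of the Pearson correlation on three small made-up series. The helper name pearson_r and the data are assumptions for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2 * v + 1 for v in x]        # exact increasing line
falling = [10 - 3 * v for v in x]      # exact decreasing line
u_shaped = [(v - 3) ** 2 for v in x]   # clearly related to x, but not linearly

print(pearson_r(x, rising))    # close to +1
print(pearson_r(x, falling))   # close to -1
print(pearson_r(x, u_shaped))  # close to 0
```

The last case shows that a correlation near 0 only rules out a linear relationship; the variables can still be related in some other way.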
Relationship between e_inc_num and e_tbhiv_prct in East Africa
- e_inc_num - Estimated number of incident cases (all forms of TB)
- e_tbhiv_prct - Estimated HIV in incident TB (percent)
We got a correlation of 0.32621. The number seems low, but
e_tbhiv_prct (Estimated HIV in incident TB) is a percentage of incident cases
rather than a count, so it is not directly comparable with e_inc_num. So I will
go ahead and calculate the actual figure for estimated HIV in incident TB and
then recalculate the correlation.
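The post does not spell out the conversion, so the sketch below assumes the actual figure is simply the percentage applied to the number of incident cases. The function name and the sample rows are made up for illustration; only the column names come from the data set.

```python
def hiv_positive_incident_cases(e_inc_num, e_tbhiv_prct):
    """Turn a percentage of incident cases into an absolute count.

    Assumption: actual_fig = e_inc_num * e_tbhiv_prct / 100.
    """
    return e_inc_num * e_tbhiv_prct / 100.0

# Made-up rows, NOT real values from the TB data set.
rows = [
    {"e_inc_num": 12000, "e_tbhiv_prct": 35.0},
    {"e_inc_num": 80000, "e_tbhiv_prct": 6.5},
]

for row in rows:
    row["actual_fig"] = hiv_positive_incident_cases(
        row["e_inc_num"], row["e_tbhiv_prct"]
    )

print(rows)
```

Note that a figure built this way contains e_inc_num as a factor, so some correlation between the two is to be expected.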
After calculating the actual figures, the results are amazing.
- e_inc_num - Estimated number of incident cases (all forms of TB)
- actual_fig - Estimated HIV in incident TB (figure)
From the scatter plot above, this appears to be a nearly linear
relationship. The correlation obtained is 0.94893, which is close
to 0.95.
Relationship between e_mort_tbhiv_num and actual_fig in East Africa
- e_mort_tbhiv_num - Estimated number of deaths from TB in people who are HIV-positive
- actual_fig - Estimated HIV in incident TB (figure)
The two variables seem to have a linear relationship in
the plot above. This is consistent with the
high correlation (0.9325) between the two variables, as indicated by the results
below.
Creating Regression Model using Machine Learning
Now that we have established the relationships in our data
set, we can go ahead and build a simple machine learning regression model to
predict the estimated number of deaths from
TB in people who are HIV-positive, given the estimated HIV in incident TB
(figure). We are going to accomplish that task using Microsoft Azure
Machine Learning Studio.
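For readers without access to the Studio, a one-variable linear regression can also be fitted by hand. The sketch below uses ordinary least squares as a stand-in; it is not the Studio's implementation, and the function name and sample points are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Made-up points lying exactly on y = 1 + 2x, NOT values from the TB data set.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With real data, xs would hold actual_fig and ys would hold e_mort_tbhiv_num. The fitted values may differ slightly from the Studio's output depending on how it trains the model.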
The resulting feature weights are shown below. These are the
figures that define our regression line, that is:
Estimated number of deaths from TB in people who are HIV-positive
= -3070.02 + 0.50218 × (Estimated HIV in incident TB)
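The same formula can be applied outside a spreadsheet. This is a minimal sketch using the intercept and weight reported above; the function name and the input value of 10000 are chosen only for illustration.

```python
# Coefficients reported for the fitted regression line in this post.
INTERCEPT = -3070.02
SLOPE = 0.50218

def predict_tbhiv_deaths(actual_fig):
    """Predicted e_mort_tbhiv_num for a given estimated HIV in incident TB (figure)."""
    return INTERCEPT + SLOPE * actual_fig

# Illustrative input, not a row from the data set.
print(predict_tbhiv_deaths(10000))
```

Because the intercept is negative, inputs below roughly 6,100 produce a negative prediction, which cannot be a real number of deaths, so the line is best used for larger figures.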
After getting the model, I used the formula to compute some of the predicted values in
Excel, as shown below.
Finally, I evaluated the model to check its performance.
The coefficient of determination (R2) summarizes
the explanatory power of the regression model. If the regression model is
perfect, R2 is 1. If the model explains none of the variation in the data, R2 is
zero. This model is not badly off, with an R2 of 0.869699, meaning it explains
about 87% of the variation in the estimated deaths.
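For reference, R2 can be computed directly from actual and predicted values as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers, and the function name is an assumption for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values, NOT taken from the TB data set.
actual = [1.0, 2.0, 3.0, 4.0]

perfect = list(actual)              # predictions match exactly
mean_only = [2.5, 2.5, 2.5, 2.5]    # always predicts the mean of actual

print(r_squared(actual, perfect))    # a perfect model
print(r_squared(actual, mean_only))  # a model no better than the mean
```

With this definition, a model that does worse than always predicting the mean gets a negative R2.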