Advice on making predictions given a collection of dimensions and corresponding probabilities

https://datascience.stackexchange.com/questions/6765

16-10-2019

Question

I am a CS graduate but am very new to data science. I could use some expert advice/insight on a problem I am trying to solve. I've been through the Titanic tutorial on kaggle.com, which I think was helpful, but my problem is a bit different.

I am trying to predict diabetes risk based upon Age, Sex...and other factors given this data: http://www.healthindicators.gov/Indicators/Diabetes-new-cases-per-1000_555/Profile/ClassicData

The data gives new cases per 1,000 people for each dimension (Age, Sex...etc). What I would like to do is devise a way to predict, given a list of dimensions (Age, Sex...etc), the probability of a new diagnosis.

So far my strategy is to load this data into R and use some package to create a decision tree, similar to what I saw in the Titanic example on kaggle.com, then feed in a dimension list. However, I am a bit overwhelmed. Any direction on what I should be studying (packages, methods, examples) would be helpful.

Answer

Aggregate Data

Since you're only given aggregate data, and not individual examples, machine learning techniques like decision trees won't help you much. Those algorithms get much of their power from looking at how features interact within a single example. For instance, the increase in risk from being both obese and over 40 might be much higher than the sum of the individual risks of being obese or over 40 (i.e. the effect is greater than the sum of its parts). Aggregate data loses this information.
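To make that last point concrete, here is a small sketch with entirely made-up numbers (the two populations, their rates, and the factor names `obese` and `over_40` are hypothetical, not taken from the linked data). Both populations would produce the same per-factor aggregate table, yet the risk for someone who is both obese and over 40 differs between them.

```python
from math import isclose

# Hypothetical incidence rates per (obese, over_40) cell.
# Assumption: each of the four cells holds 25% of the population.
pop_additive = {(0, 0): 0.010, (0, 1): 0.020, (1, 0): 0.020, (1, 1): 0.030}
pop_interaction = {(0, 0): 0.015, (0, 1): 0.015, (1, 0): 0.015, (1, 1): 0.035}


def marginal_rates(cells):
    """Return the per-factor rates an aggregate table would report."""
    rates = {}
    for factor_index, name in enumerate(("obese", "over_40")):
        for level in (0, 1):
            matching = [r for key, r in cells.items() if key[factor_index] == level]
            # Cells are equally sized, so a plain average is the marginal rate.
            rates[(name, level)] = sum(matching) / len(matching)
    return rates


agg_a = marginal_rates(pop_additive)
agg_b = marginal_rates(pop_interaction)

# Same aggregate table for both populations...
same_table = all(isclose(agg_a[k], agg_b[k]) for k in agg_a)
# ...but a different risk for the obese-and-over-40 cell.
print(same_table, pop_additive[(1, 1)], pop_interaction[(1, 1)])
```

Anything fitted to the aggregate table alone cannot tell these two populations apart.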

The Bayesian Approach

On the bright side, using aggregate data like this is fairly straightforward, though it requires some probability theory. If $D$ is whether the person has diabetes and $F_1,\ldots,F_n$ are the factors from the link you provided, and if I'm doing my math correctly, we can use the formula:

$$ \text{Prob}(D\ |\ F_1,\ldots,F_n) \propto \frac{\prod_{k=1}^n \text{Prob}(D\ |\ F_k)}{\text{Prob}(D)^{n-1}} $$

(The proof for this is an extension of the one found here). This assumes that the factors $F_1,\ldots,F_n$ are conditionally independent given $D$, which is often a reasonable approximation. To calculate the probabilities, evaluate the right-hand side once for $D=\text{Diabetes}$ and once for $\neg D=\text{No diabetes}$ (replacing each probability with its complement), then divide both results by their sum so that they add to 1.
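As a rough sketch of how this formula could be coded, the helper below takes $\text{Prob}(D)$ and the list of $\text{Prob}(D\ |\ F_k)$ values and returns the normalized estimate. The function name `combine_marginals` and its parameters are my own choices for illustration, and it assumes all inputs are strictly between 0 and 1. It is written in Python, but the same few lines translate directly to R.

```python
from math import prod


def combine_marginals(prior, conditionals):
    """Estimate Prob(D | F_1..F_n) from Prob(D) and each Prob(D | F_k).

    Assumes the factors are conditionally independent given D and that
    every probability lies strictly between 0 and 1.
    """
    n = len(conditionals)
    # Unnormalized score for D: product of Prob(D | F_k) over Prob(D)^(n-1).
    score_d = prod(conditionals) / prior ** (n - 1)
    # Same expression for not-D, using the complement of each probability.
    score_not_d = prod(1 - p for p in conditionals) / (1 - prior) ** (n - 1)
    # Normalize so the two outcomes add to 1.
    return score_d / (score_d + score_not_d)
```

With a single factor the function simply returns that factor's rate, which is a quick sanity check that the normalization behaves as expected.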

Example

Suppose we had a married, 48-year-old male. Looking at the 2010-2012 data, 0.73% of all people get diabetes ($\text{Prob}(D) = 0.73\%$), 0.77% of married people get diabetes ($\text{Prob}(D\ |\ F_1) = 0.77\%$), 1.02% of people age 45-54 get diabetes ($\text{Prob}(D\ |\ F_2) = 1.02\%$), and 0.70% of males get diabetes ($\text{Prob}(D\ |\ F_3) = 0.70\%$). This gives us the unnormalized probabilities:

$$ \begin{align*} \text{Prob}(D\ |\ F_1,F_2,F_3) &\propto \frac{(0.77\%)(1.02\%)(0.70\%)}{(0.73\%)^2} = 0.0103 \\ \text{Prob}(\neg D\ |\ F_1,F_2,F_3) &\propto \frac{(99.23\%)(98.98\%)(99.30\%)}{(99.27\%)^2} = 0.9897 \end{align*} $$

After normalizing these to add to one (which, to four decimal places, they already do in this case), we get a 1.03% chance of this person getting diabetes, and a 98.97% chance of them not getting diabetes.
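The arithmetic in this example can be checked with a few lines of code. This is only a sketch: the rates are the ones quoted above, while the dictionary keys and variable names are labels I made up for illustration.

```python
from math import prod

# Rates quoted in the example (2010-2012 data), written as fractions.
prior = 0.0073  # Prob(D)
conditionals = {
    "married": 0.0077,    # Prob(D | F_1)
    "age_45_54": 0.0102,  # Prob(D | F_2)
    "male": 0.0070,       # Prob(D | F_3)
}

n = len(conditionals)
# Unnormalized scores for D and not-D.
score_d = prod(conditionals.values()) / prior ** (n - 1)
score_not_d = prod(1 - p for p in conditionals.values()) / (1 - prior) ** (n - 1)

# Normalize so the two outcomes add to 1.
total = score_d + score_not_d
prob_d = score_d / total
prob_not_d = score_not_d / total

print(f"unnormalized: {score_d:.4f}, {score_not_d:.4f}")
print(f"normalized:   {prob_d:.4f}, {prob_not_d:.4f}")
```

The normalization step barely changes the numbers here because the unnormalized scores already sum to almost exactly 1.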

License: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange