Advice on making predictions given a collection of dimensions and corresponding probabilities

https://datascience.stackexchange.com/questions/6765

16-10-2019
Question

I'm a CS graduate, but I'm very new to data science. I could use some expert advice and insight on a problem I'm trying to solve. I've been through the Titanic tutorial on kaggle.com, which I found helpful, but my problem is a bit different.

I'm trying to predict the risk of diabetes based on age, sex, and other factors, given this data: http://www.healthindicators.gov/Indicators/Diabetes-new-cases-per-1000_555/Profile/ClassicData

The data gives new cases per 1,000 people for each dimension (age, sex, etc.). What I'd like to do is find a way to predict, given a list of dimensions (age, sex, etc.), a probability factor for a new diagnosis.

So far my strategy is to load this data into R and use a package to build a decision tree, similar to what I saw in the Titanic example on kaggle.com, then feed in a list of dimensions. However, I'm a bit overwhelmed. Any sense of direction on what I should be studying (packages, methods, examples) would be helpful.

Solution

Aggregate Data

Since you're only given aggregate data, and not individual examples, machine learning techniques like decision trees won't really help you much. Those algorithms gain a lot of traction by looking at correlations within a single example. For instance, the increase in risk from being both obese and over 40 might be much higher than the sum of the individual risks of being obese or over 40 (i.e. the effect is greater than the sum of its parts). Aggregate data loses this information.

The Bayesian Approach

On the bright side, though, using aggregate data like this is fairly straightforward, but requires some probability theory. If $D$ is whether the person has diabetes and $F_1,\ldots,F_n$ are the factors from that link you provided, and if I'm doing my math correctly, we can use the formula: $$ \text{Prob}(D\ |\ F_1,\ldots,F_n) \propto \frac{\prod_{k=1}^n \text{Prob}(D\ |\ F_k)}{\text{Prob}(D)^{n-1}} $$ (The proof for this is an extension of the one found here). This assumes that the factors $F_1,\ldots,F_n$ are conditionally independent given $D$ (and likewise given $\neg D$). That is often a reasonable approximation, but the aggregate data cannot confirm it. To calculate the probabilities, compute the right-hand side once for $D=\text{Diabetes}$ and once for $\neg D=\text{No diabetes}$ (replacing each probability $p$ with $1-p$), then divide both results by their sum so that they add to 1.
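
One way to see where the formula comes from is sketched below. This is a reconstruction under the same conditional-independence assumption, not necessarily the proof the answer links to: $$ \begin{align*} \text{Prob}(D\ |\ F_1,\ldots,F_n) &\propto \text{Prob}(D) \prod_{k=1}^n \text{Prob}(F_k\ |\ D) \\ &= \text{Prob}(D) \prod_{k=1}^n \frac{\text{Prob}(D\ |\ F_k)\,\text{Prob}(F_k)}{\text{Prob}(D)} \\ &\propto \frac{\prod_{k=1}^n \text{Prob}(D\ |\ F_k)}{\text{Prob}(D)^{n-1}} \end{align*} $$ The first line is Bayes' rule plus conditional independence, the second applies Bayes' rule to each factor separately, and the third drops $\prod_k \text{Prob}(F_k)$ because it does not depend on $D$. The dropped constants are the same for $D$ and $\neg D$, which is why normalizing the two results at the end recovers actual probabilities.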

Example

Suppose we had a married, 48-year-old male. Looking at the 2010-2012 data, 0.73% of all people get diabetes ($\text{Prob}(D) = 0.73\%$), 0.77% of married people get diabetes ($\text{Prob}(D\ |\ F_1) = 0.77\%$), 1.02% of people age 45-54 get diabetes ($\text{Prob}(D\ |\ F_2) = 1.02\%$), and 0.70% of males get diabetes ($\text{Prob}(D\ |\ F_3) = 0.70\%$). This gives us the unnormalized probabilities: $$ \begin{align*} \text{Prob}(D\ |\ F_1,F_2,F_3) &\propto \frac{(0.77\%)(1.02\%)(0.70\%)}{(0.73\%)^2} \approx 0.0103 \\ \text{Prob}(\neg D\ |\ F_1,F_2,F_3) &\propto \frac{(99.23\%)(98.98\%)(99.30\%)}{(99.27\%)^2} \approx 0.9897 \end{align*}$$ After normalizing these to add to one (which, to four decimal places, they already do in this case), we get a 1.03% chance of this person getting diabetes, and a 98.97% chance of them not getting diabetes.
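
The same calculation can be sketched in a few lines of Python. This is only an illustration: the function name `combine_marginals` and its parameters are made up here, and the numbers are the 2010-2012 figures quoted in the example, typed in by hand rather than read from the dataset.

```python
from math import prod  # available in Python 3.8+


def combine_marginals(prior, conditionals):
    """Estimate Prob(D | F_1..F_n) from Prob(D) and each Prob(D | F_k).

    Assumes the factors are conditionally independent given D and given
    not-D, as in the formula above. `prior` and every entry of
    `conditionals` are probabilities strictly between 0 and 1.
    """
    n = len(conditionals)
    # Unnormalized score for D: product of Prob(D | F_k) over Prob(D)^(n-1).
    pos = prod(conditionals) / prior ** (n - 1)
    # Same expression for not-D, using the complement of each probability.
    neg = prod(1 - p for p in conditionals) / (1 - prior) ** (n - 1)
    # Normalize so the two outcomes sum to one.
    return pos / (pos + neg)


# Married (0.77%), age 45-54 (1.02%), male (0.70%); overall rate 0.73%.
risk = combine_marginals(0.0073, [0.0077, 0.0102, 0.0070])
print(f"{risk:.2%}")
```

With a single factor the function returns that factor's rate unchanged, and with no factors it returns the overall rate, which is a quick sanity check on the formula.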

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange