Question

I am writing a piece of R code and got stuck.

Background (not necessary for solving the problem): I am calculating a joint probability by multiplying independent marginal distributions. The marginal probability vectors are generated iteratively by ProbGenerationProcess(). At each iteration it outputs one named vector, e.g.

Iteration 1:
Color =
   Blue  Green
   0.2    0.8   

Iteration 2:
Material =
   Cotton  Silk
    0.7     0.3

Iteration 3:
Country =
   China     USA
    0.6      0.4

......

Desired result: I want the resulting joint probability table to contain, for every combination of values, the product of the corresponding element from each marginal vector. The format should look like this.

Color   Material  Country   Prob
Blue    Cotton     China    0.084  (= 0.2*0.7*0.6)
Blue    Cotton     USA      0.056  (= 0.2*0.7*0.4)
Blue    Silk       China    0.036  (= 0.2*0.3*0.6)
Blue    Silk       USA      ..
Green   Cotton     China    ..
Green   Cotton     USA      ..
...     ...        ...      ...

My Implementation: Here's my code:

joint.names = NULL  # data.frame to store the marginal value names
joint.probs = NULL  # store probabilities.

for (i in iterations) {
    marginal = ProbGenerationProcess(VarUniqueToIteration) # output is numeric with names

    if ( is.null(joint.names) ) {
        # initialize the dataframes
        joint.names = names(marginal)
        joint.probs = marginal
    } else {
        # (my hope:) iteratively populate the joint.names and joint.probs

        joint.names = expand.grid(joint.names, names(marginal))

        expanded.prob = expand.grid(joint.probs, marginal)
        joint.probs = expanded.prob$Var1 * expanded.prob$Var2 # Row-by-row multiplication.
    }
}

Output: joint.probs always turns out to be correct. However, joint.names doesn't work the way I wanted. After the first two iterations everything looks fine. I get:

joint.names = 
    Var1  Var2
1   Blue  Cotton
2   Green Cotton
3   Blue  Silk
4   Green Silk 
    ...   ...

Starting from the third iteration it becomes problematic:

joint.names =
    Var1.Var1  Var1.Var2  Var1.Var1.1  Var1.Var2.1  Var2
1   Blue       Cotton     Blue         Cotton       China 
2   Green      Cotton     Green        Cotton       China
3   Blue       Silk       Blue         Silk         USA
4   Green      Silk       Green        Silk         USA

I guess my first question is: is this the most efficient way to get the result I wanted? If so, is expand.grid() the function I should be using, and how should I initialize it correctly?

Any help is appreciated!


Solution

Merge is your friend.

color <- data.frame(color=c("blue","green"),prob=c(0.2,0.8))
material <- data.frame(material=c("cotton","silk"),prob=c(0.7,0.3))
country <- data.frame(country=c("china","usa"),prob=c(0.6,0.4))

dat <- merge(merge(color[1],material[1]),country[1]) # get names first

# same combinations as: expand.grid(c("china","usa"),c("cotton","silk"),c("blue","green"))
# (column names and row order differ)

dat <- merge(dat, color, by="color")
dat <- merge(dat, material, by="material")
dat <- merge(dat, country, by="country")

dat$joint <- dat$prob.x * dat$prob.y * dat$prob # joint calc

dat <- dat[-grep("^prob",colnames(dat))] # cleanup extra probs

Result:

  country material color joint
1   china   cotton  blue 0.084
2   china   cotton green 0.336
3   china     silk  blue 0.036
4   china     silk green 0.144
5     usa   cotton  blue 0.056
6     usa   cotton green 0.224
7     usa     silk  blue 0.024
8     usa     silk green 0.096
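
If the marginals arrive one at a time, as in the question's loop, the same idea can probably be applied incrementally. Here is a minimal sketch, assuming each iteration yields a named numeric vector. The `marginals` list is a hypothetical stand-in for the output of ProbGenerationProcess(), and the column names `p` and `Prob` are my own choice. It relies on merge() returning the Cartesian product when there are no key columns (by = NULL).

```r
# Hypothetical stand-ins for what ProbGenerationProcess() returns per iteration
marginals <- list(
  Color    = c(Blue = 0.2, Green = 0.8),
  Material = c(Cotton = 0.7, Silk = 0.3),
  Country  = c(China = 0.6, USA = 0.4)
)

joint <- NULL
for (var in names(marginals)) {
  m <- marginals[[var]]

  # One marginal as a two-column data frame: value names and probabilities
  step <- data.frame(names(m), unname(m), stringsAsFactors = FALSE)
  names(step) <- c(var, "p")

  if (is.null(joint)) {
    # First iteration: the joint table is just the first marginal
    joint <- step
    names(joint)[2] <- "Prob"
  } else {
    # Cross join with the new marginal, then fold its probability into Prob
    joint <- merge(joint, step, by = NULL)
    joint$Prob <- joint$Prob * joint$p
    joint$p <- NULL
  }
}

# Put the name columns first and reset the row names left over from merge()
joint <- joint[c(names(marginals), "Prob")]
rownames(joint) <- NULL
joint
```

Because names and probabilities live in the same data frame, they cannot drift out of sync the way two separately expanded objects can.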

Other tips

How about this for simplicity (although if performance is an issue, merge may be better):

PROBS<-data.frame(Item=rep(c("Color","Material","Country"),each=2),
           Value=c("Blue","Green","Cotton","Silk","China","USA"),
           Prob=c(0.2,0.8,0.7,0.3,0.6,0.4))

rownames(PROBS)<-PROBS$Value

GRID<-expand.grid(by(PROBS,PROBS$Item,function(x)x["Value"]))

GRID$probs<-apply(GRID,1,function(x)prod(PROBS[c(x),"Prob"]))

GRID
#  Color Country Material probs
#1  Blue   China   Cotton 0.084
#2 Green   China   Cotton 0.336
#3  Blue     USA   Cotton 0.056
#4 Green     USA   Cotton 0.224
#5  Blue   China     Silk 0.036
#6 Green   China     Silk 0.144
#7  Blue     USA     Silk 0.024
#8 Green     USA     Silk 0.096
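
If all the marginals can be collected in a list first, a loop-free variant may also be worth a try. This is only a sketch under that assumption: the `marginals` list is hypothetical sample data in the shape the question describes (named numeric vectors). It calls expand.grid() once on the value names and uses Reduce(outer, ...) to build the array of products. Both vary the first variable fastest, so the flattened products line up with the grid rows.

```r
# Hypothetical sample data: one named probability vector per variable
marginals <- list(
  Color    = c(Blue = 0.2, Green = 0.8),
  Material = c(Cotton = 0.7, Silk = 0.3),
  Country  = c(China = 0.6, USA = 0.4)
)

# All combinations of value names; the first variable varies fastest
grid <- expand.grid(lapply(marginals, names), stringsAsFactors = FALSE)

# Successive outer products give a 2 x 2 x 2 array of joint probabilities;
# as.vector() flattens it in the same (first index fastest) order
grid$Prob <- as.vector(Reduce(outer, marginals))

grid
```

The design choice here is to call expand.grid() a single time on the full list rather than re-expanding an already expanded data frame inside the loop.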
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow