Question

I have a small problem in R:

Say I have a data frame with two columns, one containing frequencies and one containing scores. I suspect that the variance of the scores depends on the frequency, so I want to normalize my scores by binned frequency to have mean=0 and var=1 within each bin.

For example, let's say I want 10 bins. First every score would be assigned a bin, and then within that bin every score would be normalized by the mean and variance of all the scores in that bin.

The result should be a third column with the normalized values.

Getting the data binned is easy, using bins = cut(frequencies, b=bins, 1:bins); however, I haven't found a way to go on from there.

Thanks in advance!


Solution

scale is your friend here: it normalises to mean=0 and sd=1, and if sd=1 then var=1 as well. Note that scale returns a one-column matrix, which is why var prints a 1x1 matrix below.

> mean(scale(1:10))
[1] 0
> sd(scale(1:10))
[1] 1
> var(scale(1:10))
     [,1]
[1,]    1

Try some example data:

set.seed(42)
dat <- data.frame(freq=sample(1:100), scores=rnorm(100, mean=4, sd=2))
dat$bins <- cut(dat$freq, breaks=c(0, 1:10*10), include.lowest=TRUE)
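If you do not want to hard-code the break points, here is a rough sketch of two other ways to get 10 bins. The names n_bins, bins_width and bins_quant are made up for illustration and are not part of the example above. Note that the quantile version can fail with a "'breaks' are not unique" error if the frequencies are heavily tied.

```r
set.seed(42)
freq <- sample(1:100)
n_bins <- 10  # hypothetical name: number of bins wanted

# Equal-width bins: a single number for breaks tells cut() how many
# intervals to split the range of the data into.
bins_width <- cut(freq, breaks = n_bins)

# Roughly equal-count bins: use quantiles of the data as break points.
# include.lowest = TRUE keeps the minimum value inside the first bin.
q <- quantile(freq, probs = seq(0, 1, length.out = n_bins + 1))
bins_quant <- cut(freq, breaks = q, include.lowest = TRUE)

# Inspect how many observations landed in each bin.
table(bins_width)
table(bins_quant)
```

Equal-width bins are easier to read; quantile bins keep the number of scores per bin similar, which makes the per-bin mean and sd more stable.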

Now use ave to scale the scores within each of the bins:

dat$scaled <- with(dat,ave(scores,bins,FUN=scale))
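If you would rather see the arithmetic spelled out, the same thing can be sketched with a small helper that subtracts the bin mean and divides by the bin sd. The helper name zscore and the column name scaled_manual are made up for this illustration. It assumes every bin holds at least two scores, since the sd of a single value is undefined.

```r
set.seed(42)
dat <- data.frame(freq = sample(1:100), scores = rnorm(100, mean = 4, sd = 2))
dat$bins <- cut(dat$freq, breaks = c(0, 1:10 * 10), include.lowest = TRUE)

# Hypothetical helper: centre on the mean, divide by the sample sd
# (sd() uses the n - 1 denominator, as scale() does for centred data).
zscore <- function(x) (x - mean(x)) / sd(x)

# ave() applies the function within each level of bins and returns the
# results in the original row order.
dat$scaled_manual <- with(dat, ave(scores, bins, FUN = zscore))

# Compare against the scale() version.
dat$scaled <- with(dat, ave(scores, bins, FUN = scale))
all.equal(dat$scaled, dat$scaled_manual)
```

Both versions should agree up to floating-point error; the explicit helper is handy if you later want a different centre or spread, for example median and mad.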

You can check the results with aggregate or similar:

The mean is 0 (or very close to it, within rounding error) in each bin:

> aggregate(scaled ~ bins, data=dat, FUN=function(x) round(mean(x), 2) )
       bins scaled
1    [0,10]      0
2   (10,20]      0
3   (20,30]      0
4   (30,40]      0
5   (40,50]      0
6   (50,60]      0
7   (60,70]      0
8   (70,80]      0
9   (80,90]      0
10 (90,100]      0

The sd is 1 in each bin:

> aggregate(scaled ~ bins, data=dat, FUN=sd)
       bins scaled
1    [0,10]      1
2   (10,20]      1
3   (20,30]      1
4   (30,40]      1
5   (40,50]      1
6   (50,60]      1
7   (60,70]      1
8   (70,80]      1
9   (80,90]      1
10 (90,100]      1
Licensed under: CC-BY-SA with attribution