Question

Originally, I was using a short C# program I wrote to average some numbers. But now I want to do more extensive analysis so I converted my C# code to R. However, I really don't think that I am doing it the proper way in R or taking advantage of the language. I wrote the R in the exact same way I did the C#.

I have a CSV with two columns. The first column identifies the row's type (one of three values: C, E, or P) and the second column has a number. I want to average the numbers grouped on the type (C, E, or P).

My question is, what is the idiomatic way of doing this in R?

C# code:

        string path = "data.csv";
        string[] lines = File.ReadAllLines(path);

        int cntC = 0; int cntE = 0; int cntP = 0; //counts
        double totC = 0; double totE = 0; double totP = 0; //totals
        foreach (string line in lines)
        {
            String[] cells = line.Split(',');
            if (cells[1] == "NA") continue; //skip missing data

            if (cells[0] == "C") 
            {
                totC += Convert.ToDouble(cells[1]);
                cntC++;
            }
            else if (cells[0] == "E")
            {
                totE += Convert.ToDouble(cells[1]);
                cntE++;
            }
            else if (cells[0] == "P")
            {
                totP += Convert.ToDouble(cells[1]);
                cntP++;
            }
        }
        Console.WriteLine("C found " + cntC + " times with a total of " + totC + " and an average of " + totC / cntC);
        Console.WriteLine("E found " + cntE + " times with a total of " + totE + " and an average of " + totE / cntE);
        Console.WriteLine("P found " + cntP + " times with a total of " + totP + " and an average of " + totP / cntP);

R code:

dat = read.csv("data.csv", header = TRUE)

cntC = 0; cntE = 0; cntP = 0  # counts
totC = 0; totE = 0; totP = 0  # totals
for(i in 1:nrow(dat))
{
    if(is.na(dat[i,2])) # missing data
        next

    if(dat[i,1] == "C"){
        totC = totC + dat[i,2]
        cntC = cntC + 1
    }
    if(dat[i,1] == "E"){
        totE = totE + dat[i,2]
        cntE = cntE + 1
    }
    if(dat[i,1] == "P"){
        totP = totP + dat[i,2]
        cntP = cntP + 1
    }
}
sprintf("C found %d times with a total of %f and an average of %f", cntC, totC, (totC / cntC))
sprintf("E found %d times with a total of %f and an average of %f", cntE, totE, (totE / cntE))
sprintf("P found %d times with a total of %f and an average of %f", cntP, totP, (totP / cntP))

Solution 2

I would do something like this:

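dat <- dat[complete.cases(dat), ]        ## the R way to remove rows with missing data
dat[, 2] <- as.numeric(dat[, 2])         ## convert to numeric, as you do in C#
result <- by(dat[, 2], dat[, 1], mean)   ## compute the mean by group

Of course, to gather the result in a data.frame you can use the classic do.call(rbind, ...) idiom when each group returns a list or a data.frame. I don't think that is necessary here, since the result is just 3 values, and a by object holding one number per group is a plain numeric array rather than a list. A direct conversion is enough:

 data.frame(group = names(result), mean = as.vector(result))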
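
As a minimal end-to-end sketch of this approach, assuming a small made-up data frame in place of data.csv (the column names type and value are hypothetical, as are the numbers). Because by() yields a plain numeric array when the function returns one value per group, the sketch builds the final data frame from the array's names and values:

```r
# Hypothetical sample data standing in for data.csv
dat <- data.frame(
  type  = c("C", "E", "P", "C", "E", "P", "C"),
  value = c(1, 2, 3, 3, NA, 5, 5)
)

# Keep only the rows without missing values
clean <- dat[complete.cases(dat), ]

# Mean of the second column within each level of the first
result <- by(clean[, 2], clean[, 1], mean)

# Flatten the per-group means into a two-column data frame
means <- data.frame(type = names(result), mean = as.vector(result))
print(means)
```

The row with the missing value is dropped before averaging, which mirrors the `continue` / `next` in the original loops.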

EDIT1

Another option here is to use the elegant ave:

ave(dat[,2],dat[,1])

But the result is different here: you get a vector of the same length as your original data, in which each element is the mean of the group that row belongs to.
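
To make the shape of that result concrete, here is a small sketch using made-up data (the column names type and value are hypothetical):

```r
# Hypothetical sample data with no missing values
dat <- data.frame(
  type  = c("C", "E", "C", "P", "E", "P"),
  value = c(1, 2, 3, 4, 6, 8)
)

# ave() returns one value per input row: the mean of that row's group
group_mean <- ave(dat[, 2], dat[, 1])

# Show each row next to its group mean
print(cbind(dat, group_mean))
```

Note that ave() applies mean without na.rm by default, so a missing value in a group would make that whole group's result NA unless the missing rows are removed first.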

EDIT2

To include more results, you can elaborate your anonymous function:

by(dat[,2],dat[,1],function(x) c(min(x),max(x),mean(x),sd(x)))

Or return a data.frame, which is better suited to an rbind call and carries column names:

by(dat[,2],dat[,1],function(x) 
            data.frame(min=min(x),max=max(x),mean=mean(x),sd=sd(x)))
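
A short sketch of why the data.frame version combines well with rbind, again on made-up data (the column names type and value are hypothetical):

```r
# Hypothetical sample data with two values per group
dat <- data.frame(
  type  = c("C", "C", "E", "E", "P", "P"),
  value = c(1, 3, 2, 6, 4, 8)
)

# Each group yields a one-row data.frame of summary statistics
stats <- by(dat[, 2], dat[, 1], function(x)
  data.frame(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))

# The by object is a list here, so rbind can stack its elements into one table
summary_table <- do.call(rbind, stats)
print(summary_table)
```

Since every group returns the same named columns, the stacked table has one row per group and one column per statistic.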

Or use the elegant built-in summary function (you can also define your own):

by(dat[,2],dat[,1],summary)

OTHER TIPS

I would use the data.table package since it has group by functionality built in.

 library(data.table)
 dat <- data.table(dat)

 dat[, mean(COL_NAME_TO_TAKE_MEAN_OF), by=COL_NAME_TO_GROUP_BY]
       # no quotes for the column names

If you would like to take the mean (or apply another function) on multiple columns, still by group, use:

 dat[, lapply(.SD, mean), by=COL_NAME_TO_GROUP_BY]

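Alternatively, if you want to use base R (with dat as a plain data.frame), you could use something like

 by(dat, dat[, 1], lapply, mean)
 # non-numeric columns, such as the grouping column itself, come back as NA with a warning
 # to bind the per-group results into a single table, use
 do.call(rbind, by(dat, dat[, 1], lapply, mean))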
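
For the count, total and average that the original loops print, one further base R possibility is split() followed by sapply(). This is a different technique from by(), sketched here on made-up data (the column names type and value are hypothetical):

```r
# Hypothetical sample data, including one missing value
dat <- data.frame(
  type  = c("C", "E", "P", "C", "E", "P"),
  value = c(1, 2, NA, 3, 6, 8)
)

# Split the values by group, then compute count, total and mean per group,
# skipping missing values as the original loop does
stats <- sapply(split(dat$value, dat$type), function(x) {
  x <- x[!is.na(x)]
  c(count = length(x), total = sum(x), average = mean(x))
})

# sapply() puts the groups in columns; transpose to get one row per group
print(t(stats))
```

Dropping the NA values inside the function keeps the count consistent with the total and the average.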

One way:

library(plyr)

ddply(dat, .(columnOneName), summarize, Average = mean(columnTwoName))
Licensed under: CC-BY-SA with attribution