Question

My dataset looks like this:

genera
             Genus Location Number
    1                             NA
    2    Terriglobus       CC      1
    3    Terriglobus        N      5
    4 Acidobacterium       CC      2
    5 Acidobacterium        N     12
    6   Edaphobacter       CC      0

I want to do two things: 1) delete rows with an NA in any column, and 2) calculate the frequencies of each genus at both locations, CC and N.

I have been trying to use

AB <- genera[genera[, "Location"] == "CC", ]  # separate the rows by location
CD <- genera[genera[, "Location"] == "N", ]

I want to use table or prop.table to calculate the frequencies for each location, but I am having difficulties because I just get NA NA NA NA NA NA NA.

Any help is much appreciated.


Solution 3

Here's how I would do it, using @RichardScriven's dat:

with(na.omit(dat), aggregate(Number, list(Genus=Genus, Location=Location), sum))

#            Genus Location  x
# 1 Acidobacterium       CC  2
# 2   Edaphobacter       CC  0
# 3    Terriglobus       CC  1
# 4 Acidobacterium        N 12
# 5    Terriglobus        N  5
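
The same sums can probably also be obtained with the formula interface of aggregate, which by default drops rows containing NA. This is only a sketch: the data frame is re-typed by hand from the question, assuming the blank cells are real NA values, and the name res is my own.

```r
# Assumed input: the question's data, with the empty first row as NA
dat <- data.frame(
  Genus    = c(NA, "Terriglobus", "Terriglobus", "Acidobacterium",
               "Acidobacterium", "Edaphobacter"),
  Location = c(NA, "CC", "N", "CC", "N", "CC"),
  Number   = c(NA, 1, 5, 2, 12, 0)
)

# Sum Number for every Genus/Location pair; the default na.action
# (na.omit) removes the incomplete row before aggregating
res <- aggregate(Number ~ Genus + Location, data = dat, FUN = sum)
res
```

Here the summed column keeps the name Number instead of x.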

Edit

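Given the clarification in your comments on other solutions, I now suggest the following, which calculates, for each Genus and Location, the Number as a proportion of the sum of Number at that location. Again, this starts from @RichardScriven's dat; the output shown contains no NA row, so it presumably assumes the incomplete row has already been dropped (for example with dat <- na.omit(dat)).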

do.call(rbind, lapply(unique(dat$Location), function(x) {
  d <- subset(dat, Location==x)
  cbind(Location=x, aggregate(d$Number, list(Genus=d$Genus), 
                              function(x) sum(x)/sum(d$Number)))
}))

#   Location          Genus         x
# 1       CC Acidobacterium 0.6666667
# 2       CC   Edaphobacter 0.0000000
# 3       CC    Terriglobus 0.3333333
# 4        N Acidobacterium 0.7058824
# 5        N    Terriglobus 0.2941176

However, if each Genus only occurs once per Location, you can simplify to:

lapply(split(dat, list(dat$Location), drop=TRUE), function(x) 
  transform(x, propn=x$Number/sum(x$Number)))

# $CC
#            Genus Location Number     propn
# 2    Terriglobus       CC      1 0.3333333
# 4 Acidobacterium       CC      2 0.6666667
# 6   Edaphobacter       CC      0 0.0000000
# 
# $N
#            Genus Location Number     propn
# 3    Terriglobus        N      5 0.2941176
# 5 Acidobacterium        N     12 0.7058824

This could then be combined into a single data frame with do.call(rbind, x), where x is the list created above.
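
A minimal sketch of that last step, assuming dat holds only the five complete rows (re-typed here so the block runs on its own; the names parts and combined are mine):

```r
# Assumed input: the five complete rows from the question
dat <- data.frame(
  Genus    = c("Terriglobus", "Terriglobus", "Acidobacterium",
               "Acidobacterium", "Edaphobacter"),
  Location = c("CC", "N", "CC", "N", "CC"),
  Number   = c(1, 5, 2, 12, 0)
)

# One data frame per location, each with a within-location proportion
parts <- lapply(split(dat, dat$Location), function(x)
  transform(x, propn = Number / sum(Number)))

# Stack the pieces back into a single data frame
combined <- do.call(rbind, parts)
rownames(combined) <- NULL  # drop the prefixed row names that rbind creates
combined
```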

Finally, you could use dplyr as follows:

library(dplyr)
dat %>%
  group_by(Location) %>%
  mutate(total = sum(Number), Propn = Number/total) %>%
  select(-total)

#            Genus Location Number     Propn
# 1    Terriglobus       CC      1 0.3333333
# 2    Terriglobus        N      5 0.2941176
# 3 Acidobacterium       CC      2 0.6666667
# 4 Acidobacterium        N     12 0.7058824
# 5   Edaphobacter       CC      0 0.0000000

Other tips

prop.table needs a table object to start with:

 prop.table( table(genera$Location) )

If "Number" is a count, then you probably want tapply with the sum of Number instead. Perhaps something along these lines:

prop.table( with(genera, tapply(Number, Location, sum, na.rm = TRUE) ) )

xtabs will also do sums:

 prop.table( xtabs( Number ~ Location, data=genera) )
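
To get the share of each genus within each location, a two-way xtabs combined with the margin argument of prop.table may be closer to what the question asks for. The following is a sketch under the assumption that the NA row has already been removed; the data frame is re-typed by hand, and counts and props are illustrative names.

```r
# Assumed input: the question's data without the NA row
genera <- data.frame(
  Genus    = c("Terriglobus", "Terriglobus", "Acidobacterium",
               "Acidobacterium", "Edaphobacter"),
  Location = c("CC", "N", "CC", "N", "CC"),
  Number   = c(1, 5, 2, 12, 0)
)

# Rows are genera, columns are locations, cells are summed counts
counts <- xtabs(Number ~ Genus + Location, data = genera)

# margin = 2 divides every cell by its column (location) total
props <- prop.table(counts, margin = 2)
props
```

Each column of props then sums to 1, so the values are proportions within a location rather than of the grand total.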

I had to put two NULL placeholders in the empty first row so that read.table could parse the data.

> dat <- read.table(header = TRUE, text = 'Genus Location Number
  1           NULL     NULL     NA
  2    Terriglobus       CC      1
  3    Terriglobus        N      5
  4 Acidobacterium       CC      2
  5 Acidobacterium        N     12
  6   Edaphobacter       CC      0', row.names = 1)

With regard to your first question, you can remove the rows where Number is NA with is.na (na.omit(dat) would drop rows with an NA in any column):

> newDat <- dat[!is.na(dat$Number), ]
> newDat
           Genus Location Number
2    Terriglobus       CC      1
3    Terriglobus        N      5
4 Acidobacterium       CC      2
5 Acidobacterium        N     12
6   Edaphobacter       CC      0
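
The question asks to drop rows with an NA in any column, not only in Number. A small sketch of that, assuming the blank cells are read in as real NA values (the data frame is re-typed by hand, and the name clean is only for illustration):

```r
# Assumed input: the question's data, with the empty first row as NA
dat <- data.frame(
  Genus    = c(NA, "Terriglobus", "Terriglobus", "Acidobacterium",
               "Acidobacterium", "Edaphobacter"),
  Location = c(NA, "CC", "N", "CC", "N", "CC"),
  Number   = c(NA, 1, 5, 2, 12, 0)
)

# Keep only rows with no NA in any column
clean <- dat[complete.cases(dat), ]

# A logical index stays correct when nothing is missing ...
nrow(clean[!is.na(clean$Number), ])         # every row is kept
# ... whereas a negative which() index selects no rows when which() is empty
nrow(clean[-which(is.na(clean$Number)), ])  # zero rows
```

The last two lines show why a logical index is the safer habit: on data without any NA, the negative which() form silently returns an empty data frame.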

For your second question, I think you may have frequency and percentage (or probability) confused. Frequency can be found by

> sapply(split(newDat, as.character(newDat$Genus)), function(x){
    sum(x$Number)
    })
Acidobacterium   Edaphobacter    Terriglobus 
            14              0              6 

The percentage is a little different,

> pct <- with(newDat, Number/sum(Number))
> names(pct) <- newDat$Location

This gives, in row order, each row's share of the overall total as a proportion, labelled by its location.

> pct
  CC    N   CC    N   CC 
0.05 0.25 0.10 0.60 0.00 
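
If the goal is each genus's share within its own location rather than of the grand total, one possible base R variant uses ave to compute the per-location totals. This is a sketch, assuming newDat is the cleaned data frame from above (re-typed here so the block runs on its own; the column name propn is my own choice):

```r
# Assumed input: the cleaned data (no NA row)
newDat <- data.frame(
  Genus    = c("Terriglobus", "Terriglobus", "Acidobacterium",
               "Acidobacterium", "Edaphobacter"),
  Location = c("CC", "N", "CC", "N", "CC"),
  Number   = c(1, 5, 2, 12, 0)
)

# ave() returns, for every row, the sum of Number within that row's Location,
# so the division yields a within-location proportion
newDat$propn <- with(newDat, Number / ave(Number, Location, FUN = sum))
newDat
```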

ADDED

On second thought, you may just need

> split(newDat[,c("Location", "Number")], newDat$Genus)
$Acidobacterium
  Location Number
4       CC      2
5        N     12

$Edaphobacter
  Location Number
6       CC      0

$Terriglobus
  Location Number
2       CC      1
3        N      5
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow