Question

I am a new user of R and I do not know very well how to improve the following script. I have heard about the apply functions but I did not manage to use them. Here is my problem:

I have two data frames, the first one called data and the second one called eco. data has more than 1 million rows and eco has 90,000. They both have a common column named id. For one id, there are several rows in data, corresponding to the presence of botanical species.

I want to simplify this by assigning a value to each id in the data frame eco, depending on whether one specific species is present or missing for the same id in data. The information will appear in a column sp in eco.

My script with the for loop, which takes hours to run:

for (k in 1:nrow(data)) {
  if (data[k, "sp"] == 1) {  # sp == 1 corresponds to one specific species
    eco[which(eco$id == data[k, "id"]), "sp"] <- 1  # before this, the "sp" column is empty in eco
  }
}

How can I improve this?

Thank you very much for any help.


Solution 2

Is this what you are looking for?

Edit after comment by @Simon:

eco$sp <- 0                         #create new column `sp` initialized with 0
eco[eco$id %in% data$id[data$sp == 1], "sp"] <- 1  # set sp to 1 for every id that has a row with data$sp == 1
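
To make the vectorised approach above concrete, here is a minimal sketch on made-up toy data (the values and the extra id "d" are assumptions for illustration, not the asker's real tables). It shows that a single `%in%` test produces one TRUE/FALSE per row of eco, so no loop over the million rows of data is needed:

```r
# Hypothetical toy data, standing in for the real tables
data <- data.frame(id = c("a", "a", "b", "b", "c"),
                   sp = c(2, 4, 5, 1, 3))
eco  <- data.frame(id = c("a", "b", "c", "d"))

# ids that have at least one row with the species of interest (sp == 1)
ids_with_sp <- unique(data$id[data$sp == 1])

# %in% returns one logical per row of eco; as.integer() maps TRUE/FALSE to 1/0
eco$sp <- as.integer(eco$id %in% ids_with_sp)
eco
```

Only id "b" has a row with sp == 1, so it is the only id flagged with 1. An id such as "d" that never appears in data simply gets 0.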

OTHER TIPS

With 1,000,000 records I'd consider using data.table. You have the perfect setup: two tables with a common key. One of data.table's compound join operations does the job, which is just data[sp==1,][eco], if you don't mind NA being returned when species 1 is not present. You can easily do this like so:

# Some sample data
set.seed(123)
data <- data.frame( id = rep( letters[1:3] , each = 3 ) , sp = sample( 1:5 , 9 , TRUE ) )
eco <- data.frame( id = letters[1:3] , otherdat = rnorm(3) )
data
#   id sp
#1:  a  2
#2:  a  4
#3:  a  3
#4:  b  5
#5:  b  5
#6:  b  1 ===> species 1 is present at this id only
#7:  c  3
#8:  c  5
#9:  c  3

eco
#   id   otherdat
#1:  a -0.1089660
#2:  b -0.1172420
#3:  c  0.1830826


# All you need to do is turn your data.frames into data.tables, with a key, like so...
library(data.table)
data <- data.table( data , key = "id" )
eco <- data.table( eco , key = "id" )

# Join relevant records from data to eco by the common key
# This way you get 1 when species 1 is present and 0 otherwise
eco[ data[ , list( sp = as.integer( any( sp == 1 ) ) ) , by = id ] ]
#   id   otherdat sp
#1:  a -0.1089660  0
#2:  b -0.1172420  1
#3:  c  0.1830826  0

# A more succinct way of doing this (and faster)
# is a compound join (but you get NA instead of 0)
data[sp==1,][eco]
#   id sp   otherdat
#1:  a NA -0.1089660
#2:  b  1 -0.1172420
#3:  c NA  0.1830826
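
If the NA values from the compound join are a nuisance, two follow-ups are possible. The sketch below uses made-up toy data (the values are assumptions for illustration) and passes `on = "id"` explicitly instead of relying on keys. Option 1 overwrites the NAs by reference after the join; option 2 is an update join that adds the column to eco in place. Note that the compound join returns one row per matching row of data, so `unique()` is applied first in case an id lists species 1 more than once:

```r
library(data.table)

# Hypothetical toy data; id "b" lists species 1 twice on purpose
data <- data.table(id = c("a", "a", "b", "b", "b", "c"),
                   sp = c(2L, 4L, 5L, 1L, 1L, 3L))
eco  <- data.table(id = c("a", "b", "c"), otherdat = c(10, 20, 30))

# Option 1: compound join, then replace NA with 0 by reference
res <- unique(data[sp == 1])[eco, on = "id"]
res[is.na(sp), sp := 0L]

# Option 2: update join, modifying eco in place without copying it
eco[, sp := 0L]
eco[unique(data[sp == 1, .(id)]), on = "id", sp := 1L]
```

Both options end with sp equal to 1 for id "b" and 0 elsewhere. The update join avoids building a second table, which may matter with the table sizes mentioned in the question.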
Licensed under: CC-BY-SA with attribution