Question

I have a dataframe (data) which includes a lot of dates. I want to lop off everything from before 1970. I can create a list of indices that are before 1970:

tmp <- which(data$data < '1970-01-01')
[1]  13446 102876 141199

and I want to create a new table that drops out those three rows. Something like:

data.after.1970 <- data[!tmp, ]

I know I could create a vector of all the incidents after 1970 and match against it with:

tmp <- which(data$data > '1970-01-01')
data.after.1970 <- data[tmp, ]

But I am wondering what syntax I would use to exclude items.

UPDATE

I finally just did this:

tmp <- which(data$data > as.Date('1970-01-01'))
data.after.1970 <- data[tmp, ]

and took a closer look at it. which(data$data < as.Date('1970-01-01')) gets three results, but nrow(data) - nrow(data.after.1970) shows that I dropped 45 rows. summary(data$date) cleared that up:

summary(data$date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.         NA's 
"1933-07-01" "1989-01-25" "1992-07-09" "1992-05-03" "1996-06-10" "2006-09-14"         "42" 

Since my goal was to get a second dataset so I could compare my results if I exclude those with bad dates, I actually do want to drop those with NA values as well.

I still want to know what syntax I would use to exclude some numeric vector rather than include it.


Solution 3

Turns out it was actually pretty simple.

data.after.1970 <- data[-tmp, ]

will create a new data frame, data.after.1970, that includes all the rows from data except those whose indexes are in tmp.
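A minimal sketch of the difference, using a small made-up data frame (the names incidents, old_rows and after_1970 are hypothetical, not from the question). The original attempt data[!tmp, ] fails because ! turns every non-zero row number into FALSE, so no rows are selected; a minus sign is what drops rows by position.

```r
# Hypothetical stand-in for the asker's data frame
incidents <- data.frame(
  id   = 1:5,
  date = as.Date(c("1933-07-01", "1989-01-25", NA, "1965-03-02", "2006-09-14"))
)

# Row numbers of the dates before 1970; which() skips the NA
old_rows <- which(incidents$date < as.Date("1970-01-01"))
old_rows
# [1] 1 4

# Negative indices drop those rows and keep everything else, NA included
after_1970 <- incidents[-old_rows, ]
after_1970$id
# [1] 2 3 5
```

Note that the row with the NA date survives here, which differs from the positive-index version in the question's update.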

OTHER TIPS

which returns a numeric vector of the positions that are TRUE in the evaluation of a logical expression or in a logical vector itself. It is also possible to use negative indexing to remove items. Selecting the matching rows in your case might look like:

tmp <- data[ which(data$data < '1970-01-01') ,  ] 

I.e. return all rows of the dataframe "data" where the "data" column is less than "1970-01-01". You really should learn to use more specific names than "data". Not only will you create confusion by having the same name for an object and an element within that object, but there is also a function "data". So how is your poor audience supposed to know what you meant when you wrote that code 10 months ago?

To address those nay-sayers above who do not like the use of which I would answer that it avoids the problem that the "[" function will return all rows for which either the condition is TRUE or is NA. You can use subset which has the same advantage but it is not advised for use in programming, only for interactive use. You could do it the way that subset does it and add a clause that eliminates the NA values:

tmp <- data[ data$data < '1970-01-01' & !is.na(data$data) , ]

... and I would argue that the version using which is "cleaner" than that alternative. The downside of which shows up when no values are TRUE and you are using negative indexing: contrary to expectation, dfrm[-which(condition) , ] will not return every row but an empty result (a data frame with zero rows, or an empty vector when subsetting a vector). So the rule is: use which, but not with negative indexing.
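The pitfall can be reproduced with a small made-up data frame in which nothing matches the condition. This is only a sketch; dfrm, cutoff, idx, keep and too_old are illustrative names, and the logical-index alternative shown is one possible way to exclude rows safely.

```r
# Hypothetical data frame in which no date falls before 1970
dfrm <- data.frame(
  id   = 1:3,
  date = as.Date(c("1989-01-25", NA, "2006-09-14"))
)
cutoff <- as.Date("1970-01-01")

# which() finds nothing, so the index is integer(0) ...
idx <- which(dfrm$date < cutoff)
length(idx)
# [1] 0

# ... and -integer(0) is still integer(0): every row is dropped, not kept
nrow(dfrm[-idx, ])
# [1] 0

# A logical index with an explicit NA check keeps only known, recent dates
keep <- !is.na(dfrm$date) & dfrm$date >= cutoff
dfrm[keep, "id"]
# [1] 1 3

# To drop only the old rows and keep NA dates, negate an NA-safe test
too_old <- !is.na(dfrm$date) & dfrm$date < cutoff
nrow(dfrm[!too_old, ])
# [1] 3
```

Negating a logical vector never collapses to "select nothing" the way a negated empty index does, which is why it is the safer form for exclusion.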

To explain a little more, if you run:

data$date > '1970-01-01'

you will see it returns a logical vector of TRUE/FALSE which you can use to select the required rows. It works as in this example:

test <- 1:3
test[c(TRUE,FALSE,TRUE)]
# result
[1] 1 3

As @DWin notes in his answer, there are some caveats when you have NA values: logical indexing returns them alongside the TRUE values. As in:

test <- c(NA,1:3)
test > 2
# result
[1]    NA FALSE FALSE  TRUE
test[test>2]
# result
[1] NA  3

The which statement returns all the TRUE indices only, which avoids the issue with NA values.

test <- c(NA,1:3)

which(test>2)
# result - the fourth value of test is > 2
[1] 4

test[4]
# result - return the fourth value of test
[1] 3

test[which(test>2)]
# result - return the fourth value of test as the 
# which statement has identified it is > 2
[1] 3

As an aside, I'm gathering you would probably actually need to do something like:

data$date > as.Date('1970-01-01')

...to get your example working properly. This is assuming of course your date column is also actually a Date object and not just plain text.
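A quick way to check this is to look at class() of the column. The sketch below assumes, purely for illustration, a column that was read in as day/month/year text; the vector names raw and dates and the format string are hypothetical and would need to match the real data.

```r
# Hypothetical column that was read in as text in day/month/year order
raw <- c("01/07/1933", "25/01/1989", "14/09/2006")
class(raw)
# [1] "character"

# Compared as text, the strings are ordered character by character,
# so the third element is judged by "14" versus "19", not by its year
raw > "1970-01-01"

# Converting with an explicit format gives a real Date vector
dates <- as.Date(raw, format = "%d/%m/%Y")
class(dates)
# [1] "Date"

# Date-to-Date comparison now follows the calendar
dates > as.Date("1970-01-01")
# [1] FALSE  TRUE  TRUE
```

Strings that as.Date cannot parse with the given format become NA, which is one way NA values can end up in a date column.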

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow