Question

I have read in a large data file into R using the following command

data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))

The data set contains columns that should not be there and that contain only blanks. The issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).

Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
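For reference, the calls in question look roughly like this (a sketch of the readers mentioned above only; arguments trimmed, not the exact option combinations tried):

### The SPSS readers tried (argument combinations omitted)
foreign::read.spss(paste(path, file, sep = '/'), to.data.frame = TRUE)
memisc::as.data.set(memisc::spss.system.file(paste(path, file, sep = '/')))
Hmisc::spss.get(paste(path, file, sep = '/'))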

Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.

My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.

require(data.table)

### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))

### Create a comparable data.frame
df <- data.frame(dt)

### Expected result: the distinct values of column a
unique(dt$a)

### Expected result: the number of distinct values (here, 10)
length(unique(dt$a))

However, I wish to calculate the number of unique observations per column for a large data file, so referencing each column by name is not desirable. I am not a fan of eval(parse()).

### I want to determine the number of unique obs in each variable, for a large list of vars
lapply(names(df), function(x) {
    length(unique(df[, x]))
})

### Unexpected result
length(unique(dt[, 'a', with = F]))  # Returns 1

It seems to me the problem is that

dt[, 'a', with = F]

returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
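This is easy to verify directly on the example above:

### A data.table, like a data.frame, is a list of columns,
### so length() counts columns rather than values
class(dt[, 'a', with = F])   # "data.table" "data.frame"
length(dt[, 'a', with = F])  # 1  -- one column
length(dt[['a']])            # 10 -- the column vector itself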

Here is pseudo code for how I would remedy the problem, using the data.frame approach:

for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}

Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, a recommendation for how to drop columns that contain only one unique observation from a data.table would be even better.


Solution

Update: uniqueN

As of data.table version 1.9.6, there is a built-in (optimized) version of this solution, the uniqueN function. Now this is as simple as:

dt[ , lapply(.SD, uniqueN)]
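If the end goal is to drop the constant columns in the same pass, one possible sketch building on uniqueN (not part of the original answer) is:

### Drop every column that has exactly one unique value
drop_cols <- names(dt)[sapply(dt, uniqueN) == 1L]
if (length(drop_cols)) dt[, (drop_cols) := NULL]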

If you want to find the number of unique values in each column, something like

 dt[, lapply(.SD, function(x) length(unique(x)))]
##     a  b c
## 1: 10 10 1

To get your function to work, you need to use with = FALSE within [.data.table, or simply use [[ instead (read fortune(312) as well...).

lapply(names(dt), function(x) nrow(unique(dt[, x, with = FALSE])))  # nrow() counts the unique rows of the one-column data.table

or

lapply(names(dt), function(x) length(unique(dt[[x]])))

will work

In one step

dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) NULL else x)]


or, to avoid calling .SD:

dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]

Other suggestions

The approaches in the other answers are good. Another way to add to the mix, just for fun:

for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]

or, if there may be duplicate column names:

for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]

NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
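A minimal illustration of the difference, on a fresh toy table:

DT <- data.table(a = 1:3, c = rep(1, 3))
i <- "c"
DT[, (i) := NULL]    # deletes the column named "c" (the value stored in i)
# DT[, i := NULL]    # would instead target a column literally named "i"
names(DT)
## [1] "a"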

Here is a solution to your core problem (I hope I got it right).

require(data.table)

### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
     a b d1 c d2
 1:  1 a    1   
 2:  2 b    1   
 3:  3 c    1   
 4:  4 d    1   
 5:  5 e    1   
 6:  6 f    1   
 7:  7 g    1   
 8:  8 h    1   
 9:  9 i    1   
10: 10 j    1   

First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.

only_space <- function(x) {
  length(unique(x))==1 && x[1]==""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]

Somehow, I have the feeling that you could further simplify it...
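For instance, one possible simplification (a sketch along the same lines, not from the original answer) is to loop over the columns directly with sapply, which avoids the matrix conversion that apply performs:

### sapply() iterates over the columns of a data.table,
### so no intermediate matrix is created
dt[, which(!sapply(dt, only_space)), with = FALSE]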

Output:

     a b c
 1:  1 a 1
 2:  2 b 1
 3:  3 c 1
 4:  4 d 1
 5:  5 e 1
 6:  6 f 1
 7:  7 g 1
 8:  8 h 1
 9:  9 i 1
10: 10 j 1

There is an easy way to do that using the "dplyr" library: use the select function as follows:

library(dplyr)

newdata <- select(old_data, first_variable, second_variable)  # placeholder column names

Note that you can choose as many variables as you like.

Then you will get the type of data that you want.
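If you would rather drop the constant columns than list the ones to keep, a possible sketch (not part of the original answer) using dplyr's where() and n_distinct() helpers (available from dplyr 1.0.0; old_data is a placeholder):

library(dplyr)

### Keep only the columns with more than one distinct value
newdata <- select(old_data, where(~ n_distinct(.x) > 1))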

Many thanks,

Fadhah

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow