Question

I've got a data file A with 7 columns and no missing values, to which I've unix-joined a data file B that has 28 fields. The result file is C. If no match is found in B, the output row in C has only 7 columns; if there is a match, the output row has 35 columns. I've kicked around join's -e option to fill in the 28 missing fields, but without success.

What I'm trying to do is duplicate the behaviour of SAS's MISSOVER option in R. For example, the following code works perfectly:

 dat <- textConnection('x1,x2,x3,x4
 1,2,"present","present"
 3,4
 5,6')

 df <- read.csv(dat, sep=',' , header=T , 
     colClasses = c("numeric" , "numeric", "character", "character"))

 > df
   x1 x2      x3      x4
 1  1  2 present present
 2  3  4                
 3  5  6   

But when I try to load my C file, I get the following error (using TRUE instead of T):

 df <- read.table( 'C.tab' , header=T , sep='\t', fill=TRUE,
                   colClasses = c(rep('numeric',7),rep('character',28)))


 Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
   line 1 did not have 35 elements

The first line (the second row in C, after the header) does indeed have only those 7 fields from A. In SAS I'd use MISSOVER to set all the trailing missing fields to some missing value. How can I do that in R? Thanks.


Solution

The fill=TRUE setting in read.table (or its derivative cousin read.csv) is probably what you are looking for.

  df <- read.table(dat, sep=',' , header=TRUE , fill=TRUE,
      colClasses = c("numeric" , "numeric", "character", "character"))
 df
#
  x1 x2      x3      x4
1  1  2 present present
2  3  4                
3  5  6      

The default for fill is TRUE in read.csv, but your error suggests you used fill=T, which hints that you have an object named T in your workspace masking the usual shortcut for TRUE. The default in read.table is fill = !blank.lines.skip, and since blank.lines.skip defaults to TRUE, fill in read.table effectively defaults to FALSE.
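
If you want to confirm those defaults for yourself, a quick console inspection (purely illustrative, not part of the fix) shows where the FALSE comes from:

# not part of the fix; just inspecting the documented defaults
formals(read.csv)$fill                  # TRUE
formals(read.table)$fill                # the expression !blank.lines.skip
formals(read.table)$blank.lines.skip    # TRUE, so fill ends up FALSE in read.table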

Your edited question suggests you have other problems in your character fields. The usual suspects are unmatched quotes and octothorpes (#), which read.table treats as quote and comment characters by default and which can effectively cut a line short, so try this instead:

df <- read.table( 'C.tab' , header=T , sep='\t', fill=TRUE, 
              quote="",
              comment.char="",
              colClasses = c(rep('numeric',7),rep('character',28)))
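
If you would also like the padded trailing fields to come in as NA rather than empty strings (closer to what SAS's MISSOVER gives you), an optional tweak to the same call is to list "" in na.strings:

## optional: treat empty fields as NA so the padded character columns look like SAS missings
df <- read.table( 'C.tab' , header=TRUE , sep='\t', fill=TRUE,
              quote="",
              comment.char="",
              na.strings=c("NA", ""),
              colClasses = c(rep('numeric',7),rep('character',28)))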

If you are having difficulty with errors related to varying numbers of items per line, it can be very useful to use count.fields. It accepts parameters similar to those used by read.table, although it has no header argument, so use skip=1 to leave out a header line. If you have a large number of input lines, it can be useful to wrap the call to count.fields in a table call:

length_tbl <- table( count.fields( 'C.tab' , sep='\t', skip=1,
                                    quote="",
                                    comment.char="")
                     )

You can then experiment with different options. Once you know what you are looking for, you can also identify the line numbers that are causing problems by wrapping a which call around count.fields:

bad_lines <- which( count.fields( 'C.tab' , sep='\t', skip=1,
                                    quote="",
                                    comment.char="")
                     != 7  # or whatever the "correct" length is
                     )
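
As a follow-up sketch, once bad_lines is populated you can pull out the raw text of those rows to see what is tripping things up; the +1 re-adds the header line that skip=1 left out:

## look at the raw text of the offending rows; +1 accounts for the skipped header line
readLines('C.tab')[ bad_lines + 1 ]
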
Licensed under: CC-BY-SA with attribution