Question

I have a large .csv file separated by tabs, which has a strict structure with colClasses = c("integer", "integer", "numeric"). For some reason, there are a number of irrelevant character lines that break the pattern, which is why I get

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'an integer', got 'ExecutiveProducers'

How can I ask read.table to continue and just skip these lines? The file is large, so it is troublesome to clean it up by hand. If that is impossible, should I use scan plus a for-loop?

Right now I just read everything as character, then delete the irrelevant rows and convert the columns back to numeric, which I think is not very memory-efficient.
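Roughly what I do now, as a sketch (the file name and the first-column filter are placeholders, my real data differs):

raw <- read.table("yourfile", sep = "\t", colClasses = "character",
                  fill = TRUE)   # fill = TRUE in case junk lines have fewer fields

# keep only the rows whose first column looks like an integer
ok  <- grepl("^-?[0-9]+$", raw[[1]])
raw <- raw[ok, ]

# convert the columns back to their intended types
raw[[1]] <- as.integer(raw[[1]])
raw[[2]] <- as.integer(raw[[2]])
raw[[3]] <- as.numeric(raw[[3]])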


Solution

If your file fits into memory, you could first read it with readLines, remove the unwanted lines, and then pass the remaining lines to read.csv via a text connection:

lines <- readLines("yourfile")

# keep only the lines that contain no alphabetic characters;
# assuming the column titles are in the first line, add it back
# in front; hence the c(1, sel)
sel <- grep("[[:alpha:]]", lines, invert = TRUE)
lines <- lines[c(1, sel)]

# read the data from the selected lines; the file is tab-separated,
# so pass sep = "\t" along with the colClasses from the question
con <- textConnection(lines)
data <- read.csv(file = con, sep = "\t",
                 colClasses = c("integer", "integer", "numeric"))
close(con)

OTHER TIPS

If the character strings are always the same, or always contain the same word, you can define them as NA values using

  read.csv(..., na.strings = "ExecutiveProducers")

and then delete all of them afterwards with

na.omit(dataframe)
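Putting both tips together, a sketch (the file name is a placeholder, and it assumes every offending field is covered by na.strings, here the "ExecutiveProducers" value from the error message):

# fields matching na.strings become NA before type conversion,
# so colClasses can still be applied to the good rows
data <- read.csv("yourfile", sep = "\t",
                 colClasses = c("integer", "integer", "numeric"),
                 na.strings = "ExecutiveProducers")

# drop the rows that contained those NA markers
data <- na.omit(data)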
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow