Question

I have a file that is larger than the total RAM on my computer by a factor of roughly 10. I am trying to get it read into an R object that will let me look at it and extract more manageable chunks of it. I have tried various approaches, but have run into problems – different problems – with each. I have a copy of the file in fixed-width format and another as a CSV; I believe the files are otherwise identical. I have been able to read the first 5000 lines, and from those I have a tentative field width for each column of the fixed-width file and a tentative data class for each column of both files. At this point, I am not asking how to achieve my overall objective. Instead, I would like to rule out (or prove) malformation of the data as the source of my errors. If I had the whole file read in, I would have some idea how to do this. As it is, I do not.

So here is my question: Is there a way in R to read fixed-width or CSV data line by line, without reading the whole file into memory, and:

for the CSV, check:
• if the number of fields is always the same, and return the row numbers where it is not;
• if the data in each field is consistent with the column class, and return the row number and the column number or name where it is not;

for the fixed-width, check:
• if the number of characters is always the same, and return the row number where it is not;
• if the data in each field is consistent with the column class, and return the row number and the number of the first character in the field, or the column number, or the column name, where it is not.

Finally, for both cases I would like the method to tell me how many rows it has examined in all (to make sure it got to the end of the file), and I would like a way to extract copies of arbitrary rows by row number, so that I can look at them (again without reading the whole file into memory).

For both the fixed-width and the CSV cases, the checking of column classes has to be robust to some fields or characters being absent or malformed, i.e., it should still tell me sensible things about the row and still go on to look at the next row.
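To make this concrete, here is a rough sketch of the kind of CSV check I have in mind (entirely untested; the file name, chunk size, and column classes are just placeholders for my real ones):

## Rough sketch only: "big_file.csv", the chunk size, and col.classes are placeholders
con <- file("big_file.csv", open = "r")
col.classes <- c("integer", "character", "numeric")  # tentative classes from the first 5000 lines
expected.fields <- length(col.classes)
row.num <- 0
repeat {
  lines <- readLines(con, n = 10000)                 # one chunk of lines at a time
  if (length(lines) == 0) break
  for (ln in lines) {
    row.num <- row.num + 1
    fields <- strsplit(ln, ",", fixed = TRUE)[[1]]
    if (length(fields) != expected.fields) {
      cat("Row", row.num, "has", length(fields), "fields\n")
      next
    }
    for (j in seq_along(fields)) {                   # can each field be coerced to its class?
      ok <- switch(col.classes[j],
                   integer   = !is.na(suppressWarnings(as.integer(fields[j]))),
                   numeric   = !is.na(suppressWarnings(as.numeric(fields[j]))),
                   character = TRUE)
      if (!ok) cat("Row", row.num, "column", j, "is not", col.classes[j], "\n")
    }
  }
}
cat("Examined", row.num, "rows in all\n")
close(con)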

Maybe there is a package or function that does this? It seems like a fairly standard data-cleaning task, except for the large-file problem.

Any help would be greatly appreciated.

Sincerely, andrewH


Solution

Option 1: I have limited experience with fixed-width data in "real life" situations, but for large CSV files I have found the count.fields function to be very helpful. Try this:

table(cnts <- count.fields(paste0(path, filename),
                           sep = ",", quote = "", comment.char = ""))

Then you can search in cnts for the line numbers with outlier values. For instance, if you noticed that there were only 10-20 field counts of 47 while the rest were 48, you might print out those locations:

which(cnts == 47)
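If you then want to pull out just those rows for inspection without reading the whole file into memory, something along these lines should work (an untested sketch; bad.rows stands for the result of the which() call above):

bad.rows <- which(cnts == 47)
## scan() re-reads from the start of the file for each row, but never
## holds more than one line in memory
bad.lines <- vapply(bad.rows, function(i) {
  scan(paste0(path, filename), what = "character", sep = "\n",
       skip = i - 1, nlines = 1, quote = "", quiet = TRUE)
}, character(1))
bad.lines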

Option 2: I'm pretty sure I have seen solutions to this using sed and grep at the system level for counting field separators. I cobbled this together from some *NIX forums, and it gives me a table of the field counts in a four-line, well-structured file:

fct <- table(system("awk -F ',' '{print NF}' A.csv", intern=TRUE))
fct

#3 
#4 

And it took about 6 seconds to count the fields in a 1.2 million-record dataset, with none of the data being brought into R:

system.time( fct <- table(system("awk -F ',' '{print NF}' All.csv", intern=TRUE)) )
#   user  system elapsed 
#  6.597   0.215   6.552 

You can get the count of lines with:

sum(fct)
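For the fixed-width file, the analogous system-level check would be on line lengths rather than field counts. Something like this (not tested on your data; "All.fwf" is a placeholder file name) should show whether every record has the same number of characters, and where it does not:

## one length per line comes back into R, but none of the data itself
lens <- as.integer(system("awk '{print length($0)}' All.fwf", intern = TRUE))
lct <- table(lens)
lct
## rows whose length differs from the most common one
which(lens != as.integer(names(which.max(lct))))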