One workaround if you wish to skip erroneous rows:
First read the file with sep="\n", so that each line comes in as a single string. Then count the separators in each line, keep only the lines with the correct number of separators, collapse those lines back into one string, and read that string with the true column separator. See the example below.
Example data:
require(data.table)
# note the missing separator between 10 and FALSE in row "c"
wrong <- fread("
var1|var2|var3|var4
a|1|10|TRUE
b|2|10|FALSE
c|3|10FALSE
d|4|10|TRUE
e|5|10|TRUE",sep="\n")
Count the number of separators:
There are a number of ways to do this; see stringr's ?str_count for one:
require(stringr)
wrong[,n_seps := str_count(wrong[[1]],fixed("|"))] # see below for explanation.
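If you would rather not depend on stringr, the same count can be sketched in base R. This is only an illustration: count_sep and the sample vector x are hypothetical names, not part of the original answer.

```r
# Hypothetical helper: count literal occurrences of `sep` in each element of `x`.
# fixed = TRUE treats "|" as a plain character rather than regex alternation.
count_sep <- function(x, sep) {
  lengths(regmatches(x, gregexpr(sep, x, fixed = TRUE)))
}

x <- c("a|1|10|TRUE", "c|3|10FALSE", "no separators")
counts <- count_sep(x, "|")
print(counts)  # 3 2 0
```

On the example table this could be used as wrong[, n_seps := count_sep(wrong[[1]], "|")].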
Or, with some simplifying assumptions, via an Rcpp analogue:
If the separator is a single character (which it usually is), I have found the simple function below to be the most efficient. It is written in C++ and exported to R via the Rcpp package's sourceCpp() workhorse.
In a separate "helpers.cpp" file:
#include <Rcpp.h>
#include <algorithm>
#include <string>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
NumericVector v_str_count_cpp(CharacterVector x, char y) {
  int n = x.size();
  NumericVector out(n);
  // count occurrences of the single character y in each string of x
  for(int i = 0; i < n; ++i) {
    out[i] = std::count(x[i].begin(), x[i].end(), y);
  }
  return out;
}
New column with counts:
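We then apply the function to count the number of occurrences of | in each row and store the results in a new column called n_seps. The function already loops over the whole vector, so it can be called directly on the column without apply().
Rcpp::sourceCpp("helpers.cpp") # compile and load v_str_count_cpp()
wrong[,n_seps := v_str_count_cpp(wrong[[1]],"|")]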
Now wrong looks like:
> wrong
var1|var2|var3|var4 n_seps
1: a|1|10|TRUE 3
2: b|2|10|FALSE 3
3: c|3|10FALSE 2
4: d|4|10|TRUE 3
5: e|5|10|TRUE 3
Now filter for the well-formed rows and collapse them back into a single string:
collapsed <- paste0( wrong[n_seps == 3][[1]], collapse = "\n" )
And lastly, read it back with the proper separator:
correct <- fread(collapsed,sep="|")
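Which looks like this (the original header line was consumed as the column name in the first read, so the columns get the default names V1 to V4):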
> correct
V1 V2 V3 V4
1: a 1 10 TRUE
2: b 2 10 FALSE
3: d 4 10 TRUE
4: e 5 10 TRUE
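A variation on the steps above, as a rough sketch rather than a drop-in replacement: if the lines are read with readLines() instead, the header can be kept and the expected separator count taken from it rather than hard-coded as 3. The names raw_lines, expected, bad_idx and good are hypothetical, and the sketch assumes the header line itself is well formed and that your data.table version supports fread(text = ...).

```r
library(data.table)

# Hypothetical stand-in for readLines("yourfile.txt")
raw_lines <- c(
  "var1|var2|var3|var4",
  "a|1|10|TRUE",
  "b|2|10|FALSE",
  "c|3|10FALSE",
  "d|4|10|TRUE",
  "e|5|10|TRUE"
)

# Separator count per line: length difference after deleting every "|"
n_seps <- nchar(raw_lines) - nchar(gsub("|", "", raw_lines, fixed = TRUE))

# Assumption: the header line defines the correct number of separators
expected <- n_seps[1]

# Line numbers of malformed rows, worth inspecting before dropping them
bad_idx <- which(n_seps != expected)

# Keep header plus well-formed lines and parse with the true separator
good <- raw_lines[n_seps == expected]
correct <- fread(text = good, sep = "|")
print(correct)
```

Because the header line stays in the text handed to fread(), the result keeps the names var1 to var4, and bad_idx tells you which input lines were skipped.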
Hope this helps.