Is there a way to get the number of lines in a file without importing it?

So far this is what I am doing:

myfiles <- list.files(pattern = "*.dat")
myfilesContent <- lapply(myfiles, read.delim, header = FALSE, quote = "\"")
test <- vector("list", length(myfiles))  # pre-allocate the result list
for (i in seq_along(myfiles)) {
  test[[i]] <- length(myfilesContent[[i]]$V1)
}

but it is too time consuming since each file is quite big.

Solution 2

If you:

  • still want to avoid the system call that system2("wc", …) will cause
  • are on BSD/Linux or OS X (I didn't test the following on Windows)
  • don't mind using a full filename path
  • are comfortable using the inline package

then the following should be about as fast as you can get (it's pretty much the 'line count' portion of wc as an inline C function called from R):

library(inline)

wc.code <- "
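/* Count the newline bytes in the file named by the R character vector f,
   reading it in chunks sized to the filesystem's preferred I/O size
   (essentially the line-counting loop from BSD wc). */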
uintmax_t linect = 0; 
uintmax_t tlinect = 0;

int fd, len;
u_char *p;

struct statfs fsb;

static off_t buf_size = SMALL_BUF_SIZE;
static u_char small_buf[SMALL_BUF_SIZE];
static u_char *buf = small_buf;

PROTECT(f = AS_CHARACTER(f));

if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {

  if (fstatfs(fd, &fsb)) {
    fsb.f_iosize = SMALL_BUF_SIZE;
  }

  if (fsb.f_iosize != buf_size) {
    if (buf != small_buf) {
      free(buf);
    }
    if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
      buf = small_buf;
      buf_size = SMALL_BUF_SIZE;
    } else {
      buf_size = fsb.f_iosize;
    }
  }

  while ((len = read(fd, buf, buf_size))) {

    if (len == -1) {
      (void)close(fd);
      break;
    }

    for (p = buf; len--; ++p)
      if (*p == '\\n')
        ++linect;
  }

  tlinect += linect;

  (void)close(fd);

}
SEXP result;
PROTECT(result = NEW_INTEGER(1));
INTEGER(result)[0] = tlinect;
UNPROTECT(2);
return(result);
";

setCMethod("wc",
           signature(f="character"), 
           wc.code,
           includes=c("#include <stdlib.h>", 
                      "#include <stdio.h>",
                      "#include <sys/param.h>",
                      "#include <sys/mount.h>",
                      "#include <sys/stat.h>",
                      "#include <ctype.h>",
                      "#include <err.h>",
                      "#include <errno.h>",
                      "#include <fcntl.h>",
                      "#include <locale.h>",
                      "#include <stdint.h>",
                      "#include <string.h>",
                      "#include <unistd.h>",
                      "#include <wchar.h>",
                      "#include <wctype.h>",
                      "#define SMALL_BUF_SIZE (1024 * 8)"),
           language="C",
           convention=".Call")

wc("FULLPATHTOFILE")

It'd be better as a package since it actually has to compile the first time through. But it's here for reference if you really do need "speed". For a 189,955-line file I had lying around, I get (mean values from a bunch of runs):

   user  system elapsed 
  0.007   0.003   0.010 
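
To run it over the file list from the question, something like this should work (just a sketch; normalizePath() is only there to supply the full paths the function expects):

# line counts for every .dat file, using full paths
sapply(normalizePath(myfiles), wc)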

Other tips

You can count the number of newline characters (\n; this also works for \r\n on Windows) in a file. This will give you a correct answer iff:

  1. There is a newline character at the end of the last line (by the way, read.csv gives a warning if this doesn't hold)
  2. The table does not contain a newline character in the data, e.g. within quotes (see the small example after this list)
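
For instance, this tiny hypothetical file violates condition 2: it contains three newline bytes but only one data row (plus the header):

writeLines('x,y\n"a\nb",1', "embedded_newline.csv")  # hypothetical example file
length(readLines("embedded_newline.csv"))            # 3 physical lines
nrow(read.csv("embedded_newline.csv"))               # but only 1 data row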

It will suffice to read the file in parts. Below I set a chunk (temporary buffer) size of 65536 bytes:

f <- file("filename.csv", open="rb")
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
   nlines <- nlines + sum(chunk == as.raw(10L))
}
print(nlines)
close(f)
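
To answer the original question directly, the same loop can be wrapped in a small helper and applied to the whole file list (a sketch; count_newlines is just an illustrative name):

count_newlines <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0L
  while (length(chunk <- readBin(con, "raw", chunk_size)) > 0) {
    n <- n + sum(chunk == as.raw(10L))  # count LF bytes in this chunk
  }
  n
}
sapply(myfiles, count_newlines)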

Benchmarks on a ca. 512 MB ASCII text file with 12,101,000 text lines, on Linux:

  • readBin: ca. 2.4 s.

  • @luis_js's wc-based solution: 0.1 s.

  • read.delim: 39.6 s.

  • EDIT: reading a file line by line with readLines (f <- file("/tmp/test.txt", open="r"); nlines <- 0L; while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l); close(f)): 32.0 s.

I found this easy way using the R.utils package:

library(R.utils)
sapply(myfiles,countLines)

Here is how it works.

Maybe I am missing something, but usually I do it using length() on top of readLines():

con <- file("some_file.format") 
length(readLines(con))

This at least has worked in many cases I have had. I think it's fairly fast, and it only creates a connection to the file without importing it into a data frame.
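
Applied to the file list from the question, and making sure each connection is closed, a small wrapper does the job (a sketch; count_lines is just an illustrative name):

count_lines <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))        # close the connection when done
  length(readLines(con))
}
sapply(myfiles, count_lines)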

If you are using Linux, this might work for you:

# total lines on a file through system call to wc, and filtering with awk
target_file   <- "your_file_name_here"
total_records <- as.integer(system2("wc",
                                    args = c("-l",
                                             target_file,
                                             " | awk '{print $1}'"),
                                    stdout = TRUE))

In your case:

# the same wc call applied to each file in myfiles
lapply(myfiles, function(x) {
  as.integer(system2("wc",
                     args = c("-l",
                              x,
                              " | awk '{print $1}'"),
                     stdout = TRUE))
})
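
A variant of the same idea that returns a named integer vector instead of a list, and shell-quotes the file names for safety (a sketch, not the answer's original code):

vapply(myfiles, function(x) {
  as.integer(system2("wc",
                     args = c("-l", shQuote(x), "| awk '{print $1}'"),
                     stdout = TRUE))
}, integer(1))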

Here is another way with the CRAN package fpeek and its function peek_count_lines. This function is coded in C++ and is pretty fast.

library(fpeek)
sapply(filenames, peek_count_lines)