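Question
I work with huge data files, and sometimes I only need to know the number of lines in them. Usually I open them and read them line by line until I reach the end of the file.
I was wondering if there is a smarter way to do that.
Solution
This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.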
public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
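EDIT, 9 1/2 years later: I have practically no Java experience, but I have tried to benchmark this code against the LineNumberReader solution below anyway, since it bothered me that nobody had done it. It seems that my solution is faster, especially for large files, although it takes a few runs until the optimizer does a decent job. I played with the code a bit and produced a new version that is consistently the fastest: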
public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }
        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i = 0; i < 1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }
        // count remaining characters
        while (readChars != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }
        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}
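Here are the benchmark results for a 1.3GB text file, with the y axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers and countLinesNew has none; although it is only a bit faster, the difference is statistically significant. LineNumberReader is clearly slower.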
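For anyone who wants to repeat this kind of measurement, a minimal timing sketch might look like the following. It is not the harness used for the numbers above: the class name LineCountTiming and the helper timeRuns are hypothetical, and countNewlines is only a simplified stand-in for countLinesOld (it returns the raw '\n' count). Expect the first few runs to be slower while the JIT warms up.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class LineCountTiming {

    /** Counts '\n' bytes in the file (stand-in for the method under test). */
    static long countNewlines(Path file) throws IOException {
        try (InputStream is = new BufferedInputStream(Files.newInputStream(file))) {
            byte[] buf = new byte[1024];
            long count = 0;
            int n;
            while ((n = is.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        count++;
                    }
                }
            }
            return count;
        }
    }

    /** Runs the counter `runs` times and returns each run's duration in seconds. */
    static double[] timeRuns(Path file, int runs) throws IOException {
        double[] seconds = new double[runs];
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            countNewlines(file);
            seconds[r] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }
}
```

Calling LineCountTiming.timeRuns(path, 100) gives one duration per run, which can then be plotted or compared across implementations.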
Other tips
I have implemented another solution to the problem, and I found it more efficient at counting lines:
try (
    FileReader input = new FileReader("input.txt");
    LineNumberReader count = new LineNumberReader(input);
) {
    while (count.skip(Long.MAX_VALUE) > 0) {
        // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
    }
    result = count.getLineNumber() + 1; // +1 because line index starts at 0
}
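The accepted answer has an off-by-one error for multi-line files that do not end in a newline. A one-line file ending without a newline returns 1, but a two-line file ending without a newline returns 1 too. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted on every read except the last one, but its cost should be trivial compared to the function as a whole.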
public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if (endsWithoutNewLine) {
            ++count;
        }
        return count;
    } finally {
        is.close();
    }
}
With Java 8, you can use streams:
try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
    long numOfLines = lines.count();
    ...
}
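As a self-contained sketch of the stream approach, the snippet can be wrapped in a method like the one below. The class name StreamLineCount and the method signature are assumptions made for illustration. Note that Files.lines counts a final line even when it has no trailing line terminator, so "one\ntwo" and "one\ntwo\n" both give 2.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical wrapper class, for illustration only.
final class StreamLineCount {

    /** Counts lines by streaming them; the charset must match the file's encoding. */
    static long countLines(Path path, Charset charset) throws IOException {
        // try-with-resources closes the underlying file once the stream is consumed
        try (Stream<String> lines = Files.lines(path, charset)) {
            return lines.count();
        }
    }
}
```

Every line is decoded into a String along the way, which is convenient but does more work than counting raw newline bytes.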
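The answer above that counts newline characters gave me a wrong line count when the file has no newline at the end: it did not count the last line of the file.
This method works better for me: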
public int countLines(String filename) throws IOException {
    LineNumberReader reader = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}
    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}
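I know this is an old question, but the accepted solution did not quite match what I needed it to do. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):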
public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}
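This solution is comparable in speed to the accepted solution, about 4% slower in my tests (although timing tests in Java are notoriously unreliable).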
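I tested the above methods for counting lines. Here are my observations for the different methods as tested on my system.
File size: 1.6 GB
Methods:
- Using Scanner: approx. 35 s
- Using BufferedReader: approx. 5 s
- Using Java 8: approx. 5 s
- Using LineNumberReader: approx. 5 s
Moreover, the Java 8 approach seems quite handy: Files.lines(Paths.get(filePath), Charset.defaultCharset()).count() [return type: long]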
/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {
    try (Stream<String> lines = Files.lines(file.toPath())) {
        return lines.count();
    }
}
Tested on JDK8_u31. But it is indeed slow compared to this method:
/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {
        byte[] c = new byte[1024];
        boolean empty = true,
                lastEmpty = false;
        long count = 0;
        int read;
        while ((read = is.read(c)) != -1) {
            for (int i = 0; i < read; i++) {
                if (c[i] == '\n') {
                    count++;
                    lastEmpty = true;
                } else if (lastEmpty) {
                    lastEmpty = false;
                }
            }
            empty = false;
        }
        if (!empty) {
            if (count == 0) {
                count = 1;
            } else if (!lastEmpty) {
                count++;
            }
        }
        return count;
    }
}
Tested, and very fast.
A straightforward way using Scanner:
static void lineCounter(String path) throws IOException {
    int lineCount = 0, commentsCount = 0;
    // try-with-resources so the Scanner (and the file) is closed afterwards
    try (Scanner input = new Scanner(new File(path))) {
        while (input.hasNextLine()) {
            String data = input.nextLine();
            if (data.startsWith("//")) commentsCount++;
            lineCount++;
        }
    }
    System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
}
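My conclusion about wc -l: the method of counting newlines is fine, but it returns non-intuitive results on files where the last line does not end with a newline.
And @er.vikas's solution, based on LineNumberReader but adding one to the line count, returns non-intuitive results on files where the last line does end with a newline.
I therefore made an algorithm that behaves as follows: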
@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}
And it looks like this:
static long countLines(InputStream is) throws IOException {
    try (LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        // Read will return at least one byte, no need to buffer more
        while ((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            // No data read at all, i.e. file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                // Ending with newline, deduct one
                return ln;
            }
        }
        // normal case, return line number + 1
        return ln + 1;
    }
}
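If you want intuitive results, you can use this. If you just want wc -l compatibility, simply use @er.vikas's solution, but do not add one to the result, and retry the skip: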
try (LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while (lnr.skip(Long.MAX_VALUE) > 0) {};
    return lnr.getLineNumber();
}
How about using the Process class from within Java code, and then reading the output of the command?
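// Passing the arguments as an array keeps file names with spaces intact
Process p = Runtime.getRuntime().exec(new String[] {"wc", "-l", yourfilename});
p.waitFor();
BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
    System.out.println(line);
    // wc prints "<count> <filename>", so parse only the first token
    lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}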
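I still need to try it, though. Will post the results.
If you do not have any index structures, you will not get around reading the complete file. But you can optimize it by not reading it line by line and instead using a regex to match all line terminators.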
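A minimal sketch of the regex idea, assuming the text is already available in memory as a CharSequence (for a truly huge file you would have to feed it in chunks and take care of a "\r\n" pair split across two chunks, which this sketch does not do). The class name RegexLineCount and the method countTerminators are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class RegexLineCount {

    // "\r\n" is listed first so that a Windows line ending is matched once, not twice
    private static final Pattern LINE_END = Pattern.compile("\r\n|\r|\n");

    /** Returns the number of line terminators found in the text. */
    static long countTerminators(CharSequence text) {
        Matcher m = LINE_END.matcher(text);
        long count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

This counts terminators, not lines, so a last line without a terminator still needs to be added by the caller if it should count.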
This funny solution actually works really well!
public static int countLines(File input) throws IOException {
    try (InputStream is = new FileInputStream(input)) {
        int count = 1;
        for (int aChar = 0; aChar != -1; aChar = is.read())
            count += aChar == '\n' ? 1 : 0;
        return count;
    }
}
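On Unix-based systems, use the wc command on the command line.
The only way to know how many lines there are in a file is to count them. You can of course derive a metric from your data that gives you the average length of a line, then get the file size and divide it by that average length, but that will not be accurate.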
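To make the estimation idea concrete, here is a rough sketch that samples the beginning of the file, computes the average bytes per line in the sample, and divides the file size by it. The class name LineCountEstimate, the method estimateLines and the sampleBytes parameter are all hypothetical, and the result is only as good as the assumption that the first lines are representative of the rest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class LineCountEstimate {

    /** Estimates the line count as fileSize / (average bytes per line in the sample). */
    static long estimateLines(Path file, int sampleBytes) throws IOException {
        long size = Files.size(file);
        if (size == 0) {
            return 0;
        }
        byte[] sample = new byte[(int) Math.min(size, sampleBytes)];
        int filled = 0;
        try (InputStream is = Files.newInputStream(file)) {
            int n;
            // read() may return fewer bytes than requested, so loop until the sample is full
            while (filled < sample.length
                    && (n = is.read(sample, filled, sample.length - filled)) != -1) {
                filled += n;
            }
        }
        int newlines = 0;
        for (int i = 0; i < filled; i++) {
            if (sample[i] == '\n') {
                newlines++;
            }
        }
        if (newlines == 0) {
            return 1; // no terminator in the sample: treat the file as a single line
        }
        double avgLineBytes = (double) filled / newlines;
        return Math.round(size / avgLineBytes);
    }
}
```

For files with uniform records this can be close; for logs with widely varying line lengths it can be far off, which is the inaccuracy mentioned above.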
Best optimized code for multi-line files that have no newline ('\n') character at EOF.
/**
 *
 * @param filename
 * @return
 * @throws IOException
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    FileInputStream fis = null;
    InputStream is = null;
    try {
        fis = new FileInputStream(filename);
        is = new BufferedInputStream(fis);
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    isLine = false;
                    ++count;
                } else if (!isLine && c[i] != '\n' && c[i] != '\r') { // Case to handle line count where no New Line character present at EOF
                    isLine = true;
                }
            }
        }
        if (isLine) {
            ++count;
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (is != null) {
            is.close();
        }
        if (fis != null) {
            fis.close();
        }
    }
    LOG.info("count: " + count);
    return (count == 0 && !empty) ? 1 : count;
}
Scanner with regex:
public int getLineCount() {
    Scanner fileScanner = null;
    int lineCount = 0;
    Pattern lineEndPattern = Pattern.compile("(?m)$");
    try {
        fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
        while (fileScanner.hasNext()) {
            fileScanner.next();
            ++lineCount;
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        return lineCount;
    }
    fileScanner.close();
    return lineCount;
}
Haven't clocked it.
If you use this
public int countLines(String filename) throws IOException {
    LineNumberReader reader = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}
    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}
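you cannot handle a very large number of lines, because reader.getLineNumber() returns an int, which tops out at Integer.MAX_VALUE (about 2.1 billion). A long data type is needed to handle more lines than that.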
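Since getLineNumber() returns an int, a variant that keeps its own long counter could look like the sketch below. The class name LongLineCount is hypothetical, and decoding as ISO-8859-1 is an assumption made here so that arbitrary bytes never cause a decoding error; it works for counting because '\n' and '\r' are single bytes in ASCII-compatible encodings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class LongLineCount {

    /** Counts lines with a long counter, so the result is not limited to the int range. */
    static long countLines(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            long count = 0;
            // readLine() also returns a final line that has no trailing terminator
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```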