Number of lines in a file in Java

https://stackoverflow.com/questions/453018

19-08-2019

Question

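I use huge data files, and sometimes I only need to know the number of lines in these files. Usually I open the file and read it line by line until I reach the end.

I was wondering if there is a smarter way to do that.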

Solution

This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}

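EDIT, 9 1/2 years later: I have practically no Java experience, but since nobody else had done it, I tried to benchmark this code against the LineNumberReader solution below. Especially for large files, my solution seems to be faster, although it takes a few runs until the optimizer does a decent job. I played with the code a bit and produced a new version that is consistently the fastest: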

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i=0; i<1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i=0; i<readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}

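The benchmark results are for a 1.3GB text file, with the y-axis in seconds. I performed 100 runs on the same file and measured each run with System.nanoTime(). countLinesOld has a few outliers and countLinesNew has none; it is only a bit faster, but the difference is statistically significant. The LineNumberReader solution is clearly slower.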
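The measurement procedure described here (repeated runs on the same file, each timed with System.nanoTime(), with the first runs treated as warm-up) might look roughly like the sketch below. The class name LineCountBenchmark, the warm-up count, and the generated temp file are assumptions for illustration; this is not the harness that produced the numbers above.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical harness, not the original benchmark code.
final class LineCountBenchmark {

    // Counts '\n' bytes, the same core loop as countLinesOld.
    static long countNewlines(Path file) throws IOException {
        try (InputStream is = new BufferedInputStream(Files.newInputStream(file))) {
            byte[] buf = new byte[1024];
            long count = 0;
            int n;
            while ((n = is.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        count++;
                    }
                }
            }
            return count;
        }
    }

    // Returns the elapsed nanoseconds of each measured run, after the warm-up runs.
    static long[] time(Path file, int warmupRuns, int measuredRuns) throws IOException {
        for (int i = 0; i < warmupRuns; i++) {
            countNewlines(file); // give the JIT a chance to compile the hot loop
        }
        long[] nanos = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            countNewlines(file);
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lines", ".txt");
        try {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append("line ").append(i).append('\n');
            }
            Files.write(file, sb.toString().getBytes(StandardCharsets.UTF_8));
            for (long t : time(file, 5, 10)) {
                System.out.println(t / 1_000_000.0 + " ms");
            }
        } finally {
            Files.delete(file);
        }
    }
}
```

Timings from a loop like this still depend on the file system cache and on JIT behaviour, so they are best compared across many runs rather than read as absolute figures.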

Other tips

I have implemented another solution to this problem, and I found it more efficient at counting lines:

try
(
   FileReader       input = new FileReader("input.txt");
   LineNumberReader count = new LineNumberReader(input);
)
{
   while (count.skip(Long.MAX_VALUE) > 0)
   {
      // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
   }

   result = count.getLineNumber() + 1;                                    // +1 because line index starts at 0
}

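The accepted answer has an off-by-one error for multi-line files that don't end in a newline. A one-line file ending without a newline returns 1, but a two-line file ending without a newline also returns 1. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted on every read except the final one, but its cost should be trivial compared to the function as a whole.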

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if(endsWithoutNewLine) {
            ++count;
        } 
        return count;
    } finally {
        is.close();
    }
}
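The off-by-one described above can be reproduced on in-memory data. The following sketch reimplements both return rules over an InputStream; the class OffByOneDemo and its method names are made up for this illustration and do not appear in the answers.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: compares the two return rules on the same input.
final class OffByOneDemo {

    // Rule of the accepted answer: the newline count, or 1 for non-empty input without '\n'.
    static int acceptedRule(InputStream is) throws IOException {
        byte[] c = new byte[1024];
        int count = 0;
        boolean empty = true;
        int n;
        while ((n = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < n; i++) {
                if (c[i] == '\n') count++;
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    }

    // Corrected rule: add one when the last byte read is not '\n'.
    static int correctedRule(InputStream is) throws IOException {
        byte[] c = new byte[1024];
        int count = 0;
        boolean endsWithoutNewLine = false;
        int n;
        while ((n = is.read(c)) != -1) {
            for (int i = 0; i < n; i++) {
                if (c[i] == '\n') count++;
            }
            endsWithoutNewLine = (c[n - 1] != '\n');
        }
        return endsWithoutNewLine ? count + 1 : count;
    }

    static InputStream bytes(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String text = "one\ntwo"; // two lines, no trailing newline
        System.out.println(acceptedRule(bytes(text)));  // prints 1
        System.out.println(correctedRule(bytes(text))); // prints 2
    }
}
```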

With Java 8, you can use streams:

try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
  long numOfLines = lines.count();
  ...
}
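A self-contained version of the stream approach could look like the sketch below. The class name StreamLineCount is hypothetical, and UTF-8 is an assumption made here in place of Charset.defaultCharset(); pick whichever charset matches the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical wrapper around Files.lines(...).count().
final class StreamLineCount {

    // A final line without a trailing newline is still counted as a line.
    static long countLines(Path path) throws IOException {
        // try-with-resources closes the underlying file handle
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("stream-count", ".txt");
        try {
            Files.write(tmp, "one\ntwo".getBytes(StandardCharsets.UTF_8));
            System.out.println(countLines(tmp)); // prints 2
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Because the stream decodes every line into a String, this is convenient rather than fast; bytes that are invalid in the chosen charset surface as an UncheckedIOException while the stream is consumed.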

The answer with the count() method above gave me miscounts when the file had no newline at the end: it failed to count the last line of the file.

This method works better for me:

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}

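I know this is an old question, but the accepted solution didn't quite match what I needed it to do. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):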

public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}

This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).

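I tested the above methods for counting lines, and here are my observations for the different methods as tested on my system.

File size: 1.6 GB. Methods:

Using Scanner: approx. 35 s
Using BufferedReader: approx. 5 s
Using Java 8: approx. 5 s
Using LineNumberReader: approx. 5 s

Moreover, the Java 8 approach seems quite handy: Files.lines(Paths.get(filePath), Charset.defaultCharset()).count() [return type: long]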

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {

    try (Stream<String> lines = Files.lines(file.toPath())) {
        return lines.count();
    }
}

Tested on JDK8_u31. In practice, though, its performance is slow compared to this method:

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {

    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {

        byte[] c = new byte[1024];
        boolean empty = true,
                lastEmpty = false;
        long count = 0;
        int read;
        while ((read = is.read(c)) != -1) {
            for (int i = 0; i < read; i++) {
                if (c[i] == '\n') {
                    count++;
                    lastEmpty = true;
                } else if (lastEmpty) {
                    lastEmpty = false;
                }
            }
            empty = false;
        }

        if (!empty) {
            if (count == 0) {
                count = 1;
            } else if (!lastEmpty) {
                count++;
            }
        }

        return count;
    }
}

Tested and very fast.

A straightforward way using Scanner:

static void lineCounter (String path) throws IOException {

        int lineCount = 0, commentsCount = 0;

        Scanner input = new Scanner(new File(path));
        while (input.hasNextLine()) {
            String data = input.nextLine();

            if (data.startsWith("//")) commentsCount++;

            lineCount++;
        }

        System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
    }

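I concluded that wc -l's method of counting newlines is fine, but it returns non-intuitive results for files where the last line doesn't end with a newline.

And @er.vikas's solution based on LineNumberReader, which adds one to the line count, returned non-intuitive results for files where the last line does end with a newline.

I therefore made an algorithm which handles these cases as follows: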

@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}

And it looks like this:

static long countLines(InputStream is) throws IOException {
    try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        //Read will return at least one byte, no need to buffer more
        while((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            //No data read at all, i.e file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                //Ending with newline, deduct one
                return ln;
            }
        }
        //normal case, return line number + 1
        return ln + 1;
    }
}

If you want intuitive results, you may use this. If you just want wc -l compatibility, simply use @er.vikas's solution, but don't add one to the result, and retry the skip:

try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while(lnr.skip(Long.MAX_VALUE) > 0){};
    return lnr.getLineNumber();
}

How about using the Process class from within Java code, and then reading the output of the command?

Process p = Runtime.getRuntime().exec("wc -l " + yourfilename);
p.waitFor();

BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
    System.out.println(line);
    // wc -l prints "<count> <filename>", so parse only the first token
    lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}

I still need to try it, though. I will post the results.
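One way the external-process idea might be fleshed out is sketched below. It assumes a Unix-like system with wc on the PATH; the class WcLineCount and the helper parseWcOutput are hypothetical names introduced here. ProcessBuilder is used with separate arguments so that file names containing spaces are passed through intact.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper; only works where the wc command is available.
final class WcLineCount {

    // wc -l prints "<count> <filename>", sometimes with leading spaces,
    // so only the first whitespace-separated token is the number.
    static long parseWcOutput(String line) {
        return Long.parseLong(line.trim().split("\\s+")[0]);
    }

    static long countLines(String filename) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("wc", "-l", filename).start();
        String firstLine;
        // Read the output before waiting, so the child never blocks on a full pipe.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = r.readLine();
        }
        int exit = p.waitFor();
        if (exit != 0 || firstLine == null) {
            throw new IOException("wc failed with exit code " + exit);
        }
        return parseWcOutput(firstLine);
    }
}
```

Note that wc -l counts newline characters, so a final line without a trailing newline is not included in the result.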

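Without an index structure you can't avoid reading the complete file. But you can optimize it by not reading it line by line and instead using a regular expression to match all line terminators.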
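A minimal sketch of the regex idea, assuming the text is already available as a CharSequence (for a file that fits in memory, for example via new String(Files.readAllBytes(path), charset)). The class name RegexLineCount and the choice to count a final unterminated line are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: counts lines by matching line terminators.
final class RegexLineCount {

    // "\r\n" is listed first so a Windows line ending counts as one terminator.
    private static final Pattern TERMINATOR = Pattern.compile("\r\n|\r|\n");

    // Counts terminators and adds one for a final line that has no terminator.
    static long countLines(CharSequence text) {
        Matcher m = TERMINATOR.matcher(text);
        long count = 0;
        int lastEnd = 0;
        while (m.find()) {
            count++;
            lastEnd = m.end();
        }
        return lastEnd < text.length() ? count + 1 : count;
    }

    public static void main(String[] args) {
        System.out.println(countLines("one\r\ntwo\rthree\nfour")); // prints 4
    }
}
```

This keeps the whole text in memory, so for the multi-gigabyte files discussed in this thread a chunked byte loop is likely the more practical option.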

This funny solution actually works really well!

public static int countLines(File input) throws IOException {
    try (InputStream is = new FileInputStream(input)) {
        int count = 1;
        for (int aChar = 0; aChar != -1;aChar = is.read())
            count += aChar == '\n' ? 1 : 0;
        return count;
    }
}

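On Unix-based systems, use the wc command on the command line.

The only way to know how many lines there are in a file is to count them. You can of course create a metric from your data that gives you the average length of one line, then get the file size and divide it by the average length, but that won't be accurate.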
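The estimate described above (file size divided by average line length) could be sketched like this. The class LineCountEstimate, the sampleSize parameter, and the decision to sample only the start of the file are all assumptions for illustration; the result is an approximation, not a count.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical estimator: extrapolates from the first sampleSize bytes.
final class LineCountEstimate {

    static long estimate(Path file, int sampleSize) throws IOException {
        long fileSize = Files.size(file);
        byte[] buf = new byte[sampleSize];
        int read = 0;
        try (InputStream is = Files.newInputStream(file)) {
            int n;
            // Fill the sample buffer, stopping early at end of file.
            while (read < sampleSize
                    && (n = is.read(buf, read, sampleSize - read)) != -1) {
                read += n;
            }
        }
        long newlines = 0;
        for (int i = 0; i < read; i++) {
            if (buf[i] == '\n') {
                newlines++;
            }
        }
        if (newlines == 0) {
            // No terminator in the sample: either nothing was read or one long line.
            return read == 0 ? 0 : 1;
        }
        // average line length = read / newlines, so lines ~ fileSize * newlines / read
        return Math.round((double) fileSize * newlines / read);
    }
}
```

The estimate is only as good as the sample: if line lengths at the start of the file differ from the rest, the result can be far off.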

Best optimized code for multi-line files that have no newline ('\n') character at EOF.

/**
 * 
 * @param filename
 * @return
 * @throws IOException
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    FileInputStream fis = null;
    InputStream is = null;
    try {
        fis = new FileInputStream(filename);
        is = new BufferedInputStream(fis);
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if ( c[i] == '\n' ) {
                    isLine = false;
                    ++count;
                }else if(!isLine && c[i] != '\n' && c[i] != '\r'){   //Case to handle line count where no New Line character present at EOF
                    isLine = true;
                }
            }
        }
        if(isLine){
            ++count;
        }
    }catch(IOException e){
        e.printStackTrace();
    }finally {
        if(is != null){
            is.close();    
        }
        if(fis != null){
            fis.close();    
        }
    }
    LOG.info("count: "+count);
    return (count == 0 && !empty) ? 1 : count;
}

Scanner with regex:

public int getLineCount() {
    Scanner fileScanner = null;
    int lineCount = 0;
    Pattern lineEndPattern = Pattern.compile("(?m)$");  
    try {
        fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
        while (fileScanner.hasNext()) {
            fileScanner.next();
            ++lineCount;
        }   
    }catch(FileNotFoundException e) {
        e.printStackTrace();
        return lineCount;
    }
    fileScanner.close();
    return lineCount;
}

I haven't timed it.

If you use this

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber(); 
    reader.close();
    return cnt;
}

you cannot count past Integer.MAX_VALUE (about 2.1 billion) lines, because the return value of reader.getLineNumber() is an int. To handle more lines than that, you need a long.
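If a long result is wanted, one option is to keep the same readLine() loop but maintain the counter yourself instead of relying on getLineNumber(). This is a minimal sketch; the class name LongLineCount and the Reader parameter are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative only: a readLine() loop with a long counter.
final class LongLineCount {

    // Counts lines as BufferedReader.readLine() defines them and closes the source.
    static long countLines(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(new StringReader("one\ntwo\nthree"))); // prints 3
    }
}
```

For a file, pass something like new FileReader(filename) as the source.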

License: CC-BY-SA with attribution

Not affiliated with StackOverflow