Number of lines in a file in Java

https://stackoverflow.com/questions/453018

19-08-2019
Question

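I work with huge data files, and sometimes I only need to know the number of lines in such a file. Usually I open the file and read it line by line until I reach the end.

I was wondering whether there is a smarter way to do that.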
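For reference, the line-by-line approach described in the question might look roughly like the sketch below. The class name, the method name, and the choice of UTF-8 are assumptions made for illustration; they do not come from the question itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class BaselineLineCount {
    // Counts lines by reading them one at a time. Every line is decoded
    // into a String and then thrown away, which is what makes this slow.
    static long countLines(Path file) throws IOException {
        long count = 0;
        // Assumption: the file is UTF-8 text; malformed input makes the reader throw.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: the file path is passed as the first argument.
        System.out.println(countLines(Paths.get(args[0])));
    }
}
```

The answers that follow mostly avoid the per-line String allocation by scanning raw bytes or characters for line terminators.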

Solution

This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file it takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}

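EDIT, 9 1/2 years later: I have practically no Java experience, but I nevertheless tried to benchmark this code against the LineNumberReader solution below, since it bothered me that nobody had done it. Especially for large files my solution seems to be faster, although it appears to take a few runs until the optimizer does a decent job. I played with the code a bit and produced a new version that is consistently the fastest: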

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i=0; i<1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i=0; i<readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}

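Benchmark results for a 1.3GB text file, y axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers while countLinesNew has none, and although it is only a bit faster, the difference is statistically significant. LineNumberReader is clearly slower.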
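A measurement of this kind could be set up roughly as in the sketch below. It is not the harness used for the numbers above: the class name, the LineCounter interface, and the warm-up parameter are assumptions made for illustration, and the newline counter is a simplified stand-in for the method under test.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class TimingSketch {
    // Hypothetical interface so that different counting methods can be timed the same way.
    interface LineCounter {
        long count(String filename) throws IOException;
    }

    // Performs some untimed warm-up runs (to give the JIT a chance to optimize),
    // then times `runs` runs and returns each duration in seconds.
    static double[] timeRuns(LineCounter counter, String filename, int warmups, int runs)
            throws IOException {
        for (int i = 0; i < warmups; i++) {
            counter.count(filename);
        }
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            counter.count(filename);
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    // Simplified stand-in for the method being measured: counts '\n' bytes.
    static long countNewlines(String filename) throws IOException {
        try (InputStream is = new BufferedInputStream(Files.newInputStream(Paths.get(filename)))) {
            byte[] buf = new byte[1024];
            long count = 0;
            int n;
            while ((n = is.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        count++;
                    }
                }
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: the file path is passed as the first argument.
        for (double t : timeRuns(TimingSketch::countNewlines, args[0], 5, 100)) {
            System.out.println(t);
        }
    }
}
```

Recording every run separately, rather than only an average, is what makes outliers like the ones mentioned above visible.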

Other tips

I have implemented another solution to the problem, and I found it more efficient at counting rows:

try
(
   FileReader       input = new FileReader("input.txt");
   LineNumberReader count = new LineNumberReader(input);
)
{
   while (count.skip(Long.MAX_VALUE) > 0)
   {
      // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
   }

   result = count.getLineNumber() + 1;                                    // +1 because line index starts at 0
}

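The accepted answer has an off-by-one error for multi-line files that do not end in a newline. A one-line file ending without a newline returns 1, but a two-line file ending without a newline returns 1 too. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted on everything but the final read, but its cost should be trivial compared to the overall function.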

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if(endsWithoutNewLine) {
            ++count;
        } 
        return count;
    } finally {
        is.close();
    }
}

With Java 8, you can use streams:

try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
  long numOfLines = lines.count();
  ...
}

The answer with the method count() above gave me wrong line counts when the file did not have a newline at its end: it failed to count the last line of the file.

This method works better for me:

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}

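I know this is an old question, but the accepted solution did not quite match what I needed it to do. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):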

public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}

This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).

I tested the above methods for counting lines, and here are my observations for the different methods as tested on my system.

File size: 1.6 GB. Methods:

Using Scanner: approx. 35 s
Using BufferedReader: approx. 5 s
Using Java 8: approx. 5 s
Using LineNumberReader: approx. 5 s

Moreover, the Java 8 approach seems quite handy: Files.lines(Paths.get(filePath), Charset.defaultCharset()).count() [return type: long]

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {

    try (Stream<String> lines = Files.lines(file.toPath())) {
        return lines.count();
    }
}

Tested on JDK8_u31. But the performance is indeed slow compared to this method:

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {

    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {

        byte[] c = new byte[1024];
        boolean empty = true,
                lastEmpty = false;
        long count = 0;
        int read;
        while ((read = is.read(c)) != -1) {
            for (int i = 0; i < read; i++) {
                if (c[i] == '\n') {
                    count++;
                    lastEmpty = true;
                } else if (lastEmpty) {
                    lastEmpty = false;
                }
            }
            empty = false;
        }

        if (!empty) {
            if (count == 0) {
                count = 1;
            } else if (!lastEmpty) {
                count++;
            }
        }

        return count;
    }
}

Tested and very fast.

A straightforward way using Scanner

static void lineCounter (String path) throws IOException {

        int lineCount = 0, commentsCount = 0;

        Scanner input = new Scanner(new File(path));
        while (input.hasNextLine()) {
            String data = input.nextLine();

            if (data.startsWith("//")) commentsCount++;

            lineCount++;
        }

        System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
    }

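I concluded that wc -l's method of counting newlines is fine, but it returns non-intuitive results on files where the last line does not end with a newline.

And the solution based on LineNumberReader that adds one to the line count returned non-intuitive results on files where the last line does end with a newline.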
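The difference between these conventions can be illustrated on in-memory strings. The sketch below is only an illustration: the class and method names are invented here, and only '\n' is treated as a line terminator.

```java
class LineCountConventions {
    // wc -l convention: the number of '\n' characters.
    static long newlineCount(String text) {
        long count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                count++;
            }
        }
        return count;
    }

    // "Newlines plus one" convention: always adds one to the newline count.
    static long plusOneCount(String text) {
        return newlineCount(text) + 1;
    }

    // "Intuitive" convention: an unterminated final line counts as a line,
    // but a trailing newline does not start an extra one.
    static long intuitiveCount(String text) {
        long count = newlineCount(text);
        if (!text.isEmpty() && text.charAt(text.length() - 1) != '\n') {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String unterminated = "one\ntwo";
        String terminated = "one\ntwo\n";
        // prints "1 2 2"
        System.out.println(newlineCount(unterminated) + " " + plusOneCount(unterminated)
                + " " + intuitiveCount(unterminated));
        // prints "2 3 2"
        System.out.println(newlineCount(terminated) + " " + plusOneCount(terminated)
                + " " + intuitiveCount(terminated));
    }
}
```

Only the third variant reports two lines for both inputs.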

I therefore wrote an algorithm that behaves as follows:

@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}

And it looks like this:

static long countLines(InputStream is) throws IOException {
    try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        //Read will return at least one byte, no need to buffer more
        while((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            //No data read at all, i.e file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                //Ending with newline, deduct one
                return ln;
            }
        }
        //normal case, return line number + 1
        return ln + 1;
    }
}

If you want intuitive results, you may use this. If you just want wc -l compatibility, simply use @er.vikas's solution, but do not add one to the result, and do retry the skip:

try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while(lnr.skip(Long.MAX_VALUE) > 0){};
    return lnr.getLineNumber();
}

How about using the Process class from within Java code, and then reading the output of the command?

Process p = Runtime.getRuntime().exec("wc -l " + yourfilename);
p.waitFor();

BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
    System.out.println(line);
    // wc -l prints "<count> <filename>", so parse only the first token
    lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}

I still need to try it, though. I will post the results.

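If you do not have any index structures, you will not get around reading the complete file. But you can optimize it by not reading line by line and instead using a regex to match all line terminators.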
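The regex idea could be sketched as follows. This is only an illustration under some assumptions: the class and method names are made up, the pattern treats \r\n, a lone \r, and a lone \n as terminators, and the file variant loads the entire file into memory, which is unlikely to suit very large files.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLineTerminators {
    // "\r\n" comes first so that a CRLF pair is matched as a single terminator.
    private static final Pattern TERMINATOR = Pattern.compile("\r\n|\r|\n");

    // Counts line terminators in the given text.
    static long countTerminators(CharSequence text) {
        Matcher matcher = TERMINATOR.matcher(text);
        long count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Assumption: the whole file fits in memory and is encoded in `charset`.
    static long countTerminators(Path file, Charset charset) throws IOException {
        return countTerminators(new String(Files.readAllBytes(file), charset));
    }

    public static void main(String[] args) {
        // Three terminators: "\r\n", "\r" and "\n"; prints 3
        System.out.println(countTerminators("one\r\ntwo\rthree\nfour"));
    }
}
```

Note that this counts terminators, so a final line without a terminator is not included in the result.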

This funny solution actually works really well!

public static int countLines(File input) throws IOException {
    try (InputStream is = new FileInputStream(input)) {
        int count = 1;
        for (int aChar = 0; aChar != -1;aChar = is.read())
            count += aChar == '\n' ? 1 : 0;
        return count;
    }
}

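On Unix-based systems, use the wc command on the command line.

The only way to know how many lines there are in a file is to count them. You can of course derive a metric from your data that gives you the average length of one line, then take the file size and divide it by that average length, but the result will not be accurate.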
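The estimation idea could look roughly like the sketch below. It is an illustration only: the class name, the method name, and the decision to sample the beginning of the file are assumptions, and the result is an estimate that can be far off when line lengths vary across the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LineCountEstimate {
    // Estimates the line count from the average line length observed
    // in the first `sampleBytes` bytes of the file.
    static long estimateLines(Path file, int sampleBytes) throws IOException {
        long fileSize = Files.size(file);
        if (fileSize == 0) {
            return 0;
        }
        byte[] sample = new byte[(int) Math.min(sampleBytes, fileSize)];
        int read = 0;
        try (InputStream is = Files.newInputStream(file)) {
            int n;
            // read() may return fewer bytes than requested, so loop until the sample is full
            while (read < sample.length
                    && (n = is.read(sample, read, sample.length - read)) != -1) {
                read += n;
            }
        }
        long newlines = 0;
        for (int i = 0; i < read; i++) {
            if (sample[i] == '\n') {
                newlines++;
            }
        }
        if (newlines == 0) {
            return 1; // no terminator in the sample: assume a single line
        }
        double averageLineLength = (double) read / newlines;
        return Math.round(fileSize / averageLineLength);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: estimate from the first 64 KiB of the file given as argument.
        System.out.println(estimateLines(Paths.get(args[0]), 64 * 1024));
    }
}
```

The estimate only touches a small prefix of the file, so it is cheap, but an exact count still requires scanning everything.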

Best optimized code for multi-line files that have no newline ('\n') character at EOF.

/**
 * 
 * @param filename
 * @return
 * @throws IOException
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    FileInputStream fis = null;
    InputStream is = null;
    try {
        fis = new FileInputStream(filename);
        is = new BufferedInputStream(fis);
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if ( c[i] == '\n' ) {
                    isLine = false;
                    ++count;
                }else if(!isLine && c[i] != '\n' && c[i] != '\r'){   //Case to handle line count where no New Line character present at EOF
                    isLine = true;
                }
            }
        }
        if(isLine){
            ++count;
        }
    }catch(IOException e){
        e.printStackTrace();
    }finally {
        if(is != null){
            is.close();    
        }
        if(fis != null){
            fis.close();    
        }
    }
    LOG.info("count: "+count);
    return (count == 0 && !empty) ? 1 : count;
}

Scanner with regex:

public int getLineCount() {
    Scanner fileScanner = null;
    int lineCount = 0;
    Pattern lineEndPattern = Pattern.compile("(?m)$");  
    try {
        fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
        while (fileScanner.hasNext()) {
            fileScanner.next();
            ++lineCount;
        }   
    }catch(FileNotFoundException e) {
        e.printStackTrace();
        return lineCount;
    }
    fileScanner.close();
    return lineCount;
}

I have not clocked it.

If you use this

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber(); 
    reader.close();
    return cnt;
}

you cannot handle a very large number of rows, because reader.getLineNumber() returns an int. You need a long to count more than Integer.MAX_VALUE rows.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow