Binary search in a sorted (memory-mapped?) file in Java

https://stackoverflow.com/questions/736556

09-09-2019

Question

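I am struggling to port a Perl program to Java, and I am learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a 500+ GB sorted text file using binary search (essentially: "seek" to a byte offset in the middle of the file, backtrack to the nearest newline, compare the line's prefix with the search string, "seek" to half or double that byte offset, and repeat until found...).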

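I have experimented with several database solutions, but nothing beats this approach in sheer lookup speed on data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random-access reads in text files?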

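Alternatively, I am not familiar with the new (?) Java I/O libraries, but would it be an option to memory-map the 500 GB text file (I am on a 64-bit machine with memory to spare) and do the binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have with this and similar problems.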

Accepted answer

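I am a big fan of Java's MappedByteBuffers for situations like this. They are blazing fast. Below is a snippet that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going.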

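I have similar code (seek, read, repeat until done) in my own application. I benchmarked java.io streams against MappedByteBuffer in a production environment and posted the results on my blog (Geekomatic posts tagged 'java.nio') with raw data, graphs and all.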

Two-second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.

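To work with files larger than ~2 GB, which is a problem because of the int cast and .position(int pos), I crafted a paging algorithm backed by a list of MappedByteBuffers. You will need a 64-bit system for this to work with files larger than 2-4 GB, because MappedByteBuffers use the OS's virtual memory system to work their magic.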

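import static java.nio.channels.FileChannel.MapMode.READ_ONLY;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class StusMagicLargeFileReader  {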
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte raw[] = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = (new FileInputStream(file)).getChannel(); 
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
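        String candidate = buffer.getString(position--);
        // Compare against the String "\n"; equals('\n') would box a Character and never match.
        while (position >= 0 && !candidate.equals("\n"))
            candidate = buffer.getString(position--);
        // found a newline (it is at position + 1) or reached the start of the file...do other stuff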
    }
    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index *  PAGE_SIZE) ;
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            // List.add(int, E) does not accept a long index; appending keeps the pages in order.
            buffers.add(channel.map(READ_ONLY, start, length));
        }    
    }
    public String getString(long bytePosition) {
        int page  = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw);
    }
}
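
The snippet above stops once the newline before the midpoint has been found. As a rough sketch of how the rest of the lookup might go, the hypothetical `MappedPrefixSearch` class below (the class and method names are illustrative, not part of the answer) runs the whole binary search over a single `ByteBuffer`. It assumes the file is sorted by byte value, uses `\n` line endings, contains ASCII text, and is smaller than 2 GB; for larger files the same logic would have to go through a paged reader like the one above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: prefix lookup in a sorted, newline-separated buffer. */
class MappedPrefixSearch {

    /** Maps a whole file read-only; only valid for files smaller than 2 GB. */
    static ByteBuffer map(Path path) throws IOException {
        try (FileChannel fc = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        }
    }

    /** Returns the first line that starts with prefix, or null if there is none. */
    static String findFirst(ByteBuffer buf, String prefix) {
        byte[] key = prefix.getBytes(StandardCharsets.US_ASCII);
        int low = 0;             // always the start offset of a line
        int high = buf.limit();  // a line start, or the end of the buffer
        while (low < high) {
            int mid = low + (high - low) / 2;
            int start = lineStart(buf, mid);
            if (compare(buf, start, key) < 0) {
                low = lineEnd(buf, start);  // line sorts before the key: skip past it
            } else {
                high = start;               // line matches or sorts after the key
            }
        }
        if (low < buf.limit() && compare(buf, low, key) == 0) {
            return readLine(buf, low);
        }
        return null;
    }

    /** Walks backwards from pos to the first byte of the line containing pos. */
    private static int lineStart(ByteBuffer buf, int pos) {
        while (pos > 0 && buf.get(pos - 1) != '\n') pos--;
        return pos;
    }

    /** Returns the offset just past the newline that ends the line at start. */
    private static int lineEnd(ByteBuffer buf, int start) {
        int pos = start;
        while (pos < buf.limit() && buf.get(pos) != '\n') pos++;
        return pos < buf.limit() ? pos + 1 : pos;
    }

    /** Compares the first key.length bytes of the line at start with key; 0 means the line starts with key. */
    private static int compare(ByteBuffer buf, int start, byte[] key) {
        for (int i = 0; i < key.length; i++) {
            int pos = start + i;
            if (pos >= buf.limit() || buf.get(pos) == '\n') return -1; // line is shorter than key
            int diff = (buf.get(pos) & 0xff) - (key[i] & 0xff);
            if (diff != 0) return diff;
        }
        return 0;
    }

    /** Copies the line at start (without its trailing newline) into a String. */
    private static String readLine(ByteBuffer buf, int start) {
        int end = lineEnd(buf, start);
        int len = end - start;
        if (len > 0 && buf.get(end - 1) == '\n') len--;
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) out[i] = buf.get(start + i);
        return new String(out, StandardCharsets.US_ASCII);
    }
}
```

Keeping `low` and `high` on line boundaries means every probe either discards a whole line or moves `high` back to a line start, so the loop always makes progress.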

Other answers

I have the same problem. I am trying to find all lines in a sorted file that start with some prefix.

Here is a method I cooked up, which is largely a port of the Python code found here: http://www.logarithmic.net/pfh/blog/01186620415

I have tested it, but not thoroughly yet. Note that it does not use memory mapping.

public static List<String> binarySearch(String filename, String string) {
    List<String> result = new ArrayList<String>();
    try {
        File file = new File(filename);
        RandomAccessFile raf = new RandomAccessFile(file, "r");

        long low = 0;
        long high = file.length();

        long p = -1;
        while (low < high) {
            long mid = (low + high) / 2;
            p = mid;
            while (p >= 0) {
                raf.seek(p);

                char c = (char) raf.readByte();
                //System.out.println(p + "\t" + c);
                if (c == '\n')
                    break;
                p--;
            }
            if (p < 0)
                raf.seek(0);
            String line = raf.readLine();
            //System.out.println("-- " + mid + " " + line);
            // readLine() returns null at end of file; treat that as "not before" the key.
            if (line != null && line.compareTo(string) < 0)
                low = mid + 1;
            else
                high = mid;
        }

        p = low;
        while (p >= 0) {
            raf.seek(p);
            if (((char) raf.readByte()) == '\n')
                break;
            p--;
        }

        if (p < 0)
            raf.seek(0);

        while (true) {
            String line = raf.readLine();
            if (line == null || !line.startsWith(string))
                break;
            result.add(line);
        }

        raf.close();
    } catch (IOException e) {
        System.out.println("IOException:");
        e.printStackTrace();
    }
    return result;
}

I am not aware of any library that has this functionality. However, correct code for an external binary search in Java should look similar to this:

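import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Comparator;

class ExternalBinarySearch {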
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
    this.file = new RandomAccessFile(f, "r");
    this.test = test;
}
public String search(String element) throws IOException {
    long l = file.length();
    return search(element, -1, l-1);
}
/**
 * Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
 * In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
 */
private String search(String element, long low, long high) throws IOException {
    if(high - low < 1024) {
        // search directly
        long p = low;
        while(p < high) {
            String line = nextLine(p);
            int r = test.compare(line,element);
            if(r > 0) {
                return null;
            } else if (r < 0) {
                p += line.length();
            } else {
                return line;
            }
        }
        return null;
    } else {
        long m  = low + ((high - low) / 2);
        String line = nextLine(m);
        int r = test.compare(line, element);
        if(r > 0) {
            return search(element, low, m);
        } else if (r < 0) {
            return search(element, m, high);
        } else {
            return line;
        }
    }
}
private String nextLine(long low) throws IOException {
    if(low == -1) { // Beginning of file
        file.seek(0);           
    } else {
        file.seek(low);
    }
    int bufferLength = 65 * 1024;
    byte[] buffer = new byte[bufferLength];
    int r = file.read(buffer);
    int lineBeginIndex = -1;

    // search beginning of line
    if(low == -1) { //beginning of file
        lineBeginIndex = 0;
    } else {
        //normal mode: skip the rest of the current (partial) line
        for(int i = 0; i < Math.min(r, 1024); i++) {
            if(buffer[i] == '\n') {
                lineBeginIndex = i + 1;
                break;
            }
        }
    }
    if(lineBeginIndex == -1) {
        // no line begins within next 1024 bytes
        return null;
    }
    int start = lineBeginIndex;
    for(int i = start; i < r; i++) {
        if(buffer[i] == '\n') {
            // Found end of line; the returned string includes the trailing '\n'
            return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
        }
    }
    throw new IllegalArgumentException("Line too long");
}
}

Note: I wrote this code ad hoc. The corner cases are not tested nearly well enough, and the code assumes that no single line is longer than 64K.

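I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because then there is no need to search for the next line at each step.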
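
A minimal sketch of what such an index file could look like, assuming one 8-byte big-endian offset per line and ASCII or Latin-1 data; the `LineOffsetIndex` class and its method names are made up for illustration. The binary search then runs over line numbers instead of byte offsets, so no backtracking to a newline is needed.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of a line-offset index kept in a separate file. */
class LineOffsetIndex {

    /** Writes the start offset of every line in data as an 8-byte long; returns the line count. */
    static long build(Path data, Path index) throws IOException {
        long lines = 0;
        try (RandomAccessFile in = new RandomAccessFile(data.toFile(), "r");
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(index)))) {
            long pos = 0;
            // RandomAccessFile.readLine() is unbuffered and slow; fine for a sketch,
            // but a real 500 GB build would want a buffered scan.
            while (in.readLine() != null) {
                out.writeLong(pos);
                pos = in.getFilePointer();
                lines++;
            }
        }
        return lines;
    }

    /** Reads line number n (0-based) by looking up its offset in the index. */
    static String lineAt(RandomAccessFile data, RandomAccessFile index, long n)
            throws IOException {
        index.seek(n * Long.BYTES);
        data.seek(index.readLong());
        return data.readLine();
    }

    /** Binary search over line numbers for the first line that starts with prefix. */
    static String findFirst(Path data, Path index, String prefix) throws IOException {
        try (RandomAccessFile d = new RandomAccessFile(data.toFile(), "r");
             RandomAccessFile i = new RandomAccessFile(index.toFile(), "r")) {
            long count = i.length() / Long.BYTES;
            long low = 0, high = count;
            while (low < high) {
                long mid = low + (high - low) / 2;
                if (lineAt(d, i, mid).compareTo(prefix) < 0) {
                    low = mid + 1;
                } else {
                    high = mid;
                }
            }
            if (low < count) {
                String line = lineAt(d, i, low);
                if (line.startsWith(prefix)) return line;
            }
            return null;
        }
    }
}
```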

I know that was not the question, but building a prefix tree data structure such as a (Patricia) trie (on disk/SSD) might be a good way to do the prefix search.
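
To illustrate the idea only, here is a small in-memory trie, assuming the keys fit in RAM. It is a plain, uncompressed trie rather than a Patricia trie, and it does not address the on-disk layout the answer has in mind; the `PrefixTrie` name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory trie; each node maps a character to a child node. */
class PrefixTrie {
    private final Map<Character, PrefixTrie> children = new TreeMap<>();
    private boolean terminal; // true if a stored word ends at this node

    void insert(String word) {
        PrefixTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new PrefixTrie());
        }
        node.terminal = true;
    }

    /** Returns all stored words starting with prefix, in sorted order. */
    List<String> withPrefix(String prefix) {
        PrefixTrie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>();
        }
        List<String> out = new ArrayList<>();
        node.collect(new StringBuilder(prefix), out);
        return out;
    }

    /** Depth-first walk that rebuilds each word from the path taken. */
    private void collect(StringBuilder path, List<String> out) {
        if (terminal) out.add(path.toString());
        for (Map.Entry<Character, PrefixTrie> e : children.entrySet()) {
            path.append(e.getKey());
            e.getValue().collect(path, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

A lookup walks one node per prefix character, so its cost depends on the prefix length rather than on the number of stored keys.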

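This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position of each string. I am assuming the strings are separated by newlines (or carriage returns):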

    RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
    List<Long> indexList = new ArrayList<>();
    long pos = 0;
    while (file.readLine() != null)
    {
        // Record where the line just read started (autoboxed to Long).
        indexList.add(pos);
        pos = file.getFilePointer();
    }
    int indexSize = indexList.size();
    Long[] indexArray = new Long[indexSize];
    indexList.toArray(indexArray);

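The last step is to convert the list to an array, for a slight speed improvement when doing lots of lookups. I would probably also convert the Long[] to a long[], but I did not show that above. Finally, the code to read the string at a given index position: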
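
The conversion to a primitive long[] is not shown in the answer. One way it might look, assuming Java 8 or later (the `OffsetArrays` helper name is made up):

```java
import java.util.List;

/** Hypothetical helper for unboxing the offset list. */
class OffsetArrays {
    /** Copies a list of boxed offsets into a primitive long[]. */
    static long[] toPrimitive(List<Long> offsets) {
        return offsets.stream().mapToLong(Long::longValue).toArray();
    }
}
```

A long[] stores 8 bytes per entry without a separate object for each offset, which matters when the index covers a very large number of lines.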

    int i; // Initialize this appropriately for your algorithm.
    file.seek(indexArray[i]);
    String line = file.readLine();
            // At this point, line contains the string #i.

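If you are dealing with a 500 GB file, you might want to use a lookup method faster than binary search, namely a radix sort, which is essentially a variant of hashing. The best method really depends on your data distribution and the types of lookup, but if you are looking for string prefixes there should be a good way to do this.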

I have posted an example of a radix sort solution for integers, but you can use the same idea: cut down the sort time by dividing the data into buckets, then use an O(1) lookup to retrieve the relevant bucket of data.

Option Strict On
Option Explicit On

Module Module1

Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0

Private Sub generateData()
    ' fill with random numbers between 0 and MAX_SIZE - 1
    For i = 0 To MAX_SIZE - 1
        m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
    Next

End Sub

Private Sub sortData()
    For i As Integer = 0 To MAX_SIZE - 1
        Dim x = m_input(i)
        If m_table(x) Is Nothing Then
            m_table(x) = New List(Of Integer)
        End If
        m_table(x).Add(x)
        ' clearly this is simply going to be MAX_SIZE -1
        m_operations = m_operations + 1
    Next
End Sub

 Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
    If start < 0 Or start > MAX_SIZE - 1 Then
        Throw New Exception("printData - start out of range")
    End If
    If finish < 0 Or finish > MAX_SIZE - 1 Then
        Throw New Exception("printData - finish out of range")
    End If
    For i As Integer = start To finish
        If m_table(i) IsNot Nothing Then
            For Each x In m_table(i)
                Console.WriteLine(x)
            Next
        End If
    Next
End Sub

' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
    m_operations = 0
    generateData()
    Console.WriteLine("Time started = " & Now.ToString())
    sortData()
    Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
    ' print out a random 100 segment from the sorted array
    Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
    printData(start, start + 100)
End Sub

Sub Main()
    test()
    Console.ReadLine()
End Sub

End Module
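
The VB example above buckets integers. For string prefixes, one possible adaptation (my reading of the idea, not something the answer shows) is a table that maps each first character to the position where lines with that character begin, so a lookup jumps to one bucket in O(1) and only searches inside it. The sketch below keeps the lines in memory and assumes they are sorted, non-empty and ASCII-only; the `FirstCharBuckets` name is hypothetical, and for a large file the table would hold byte offsets instead of array indexes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: bucket sorted lines by first character, then search one bucket. */
class FirstCharBuckets {
    private final String[] lines;              // sorted, non-empty, ASCII-only (assumption)
    private final int[] start = new int[129];  // start[c] = index of first line whose first char >= c

    FirstCharBuckets(String[] sortedLines) {
        this.lines = sortedLines;
        int i = 0;
        for (int c = 0; c <= 128; c++) {
            while (i < lines.length && lines[i].charAt(0) < c) i++;
            start[c] = i;
        }
    }

    /** Returns all lines starting with the given non-empty ASCII prefix. */
    List<String> withPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        char c = prefix.charAt(0);
        int low = start[c];          // O(1) jump to the bucket for this first character
        int end = start[c + 1];
        int high = end;
        while (low < high) {         // lower-bound binary search inside the bucket
            int mid = (low + high) >>> 1;
            if (lines[mid].compareTo(prefix) < 0) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        for (int k = low; k < end && lines[k].startsWith(prefix); k++) {
            result.add(lines[k]);
        }
        return result;
    }
}
```

Bucketing on the first two or three bytes instead of one would shrink each bucket further, at the cost of a larger table.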

I have posted a gist: https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c

It is a fairly complete example based on what I found on Stack Overflow and some blogs. Hopefully someone else can use it.

import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;

import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

public class FileUtils {

    private FileUtils() {
    }

    private static boolean found(final String candidate, final String prefix) {
        return isBlank(candidate) || candidate.startsWith(prefix);
    }

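    private static boolean before(final String candidate, final String prefix) {
        // Clamp the length so a candidate shorter than the prefix does not throw.
        final int length = Math.min(prefix.length(), candidate.length());
        return prefix.compareTo(candidate.substring(0, length)) < 0;
    }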

    public static MappedByteBuffer getMappedByteBuffer(final Path path) {
        FileChannel fileChannel = null;
        try {
            fileChannel = FileChannel.open(path, READ);
            return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            closeQuietly(fileChannel);
        }
    }

    public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
        if (buffer == null) {
            return null;
        }
        try {
            long low = 0;
            long high = buffer.limit();
            while (low < high) {
                int mid = (int) ((low + high) / 2);
                final String candidate = getLine(mid, buffer);
                if (found(candidate, prefix)) {
                    return trimToNull(candidate);
                } 
                else if (before(candidate, prefix)) {
                    high = mid;
                } 
                else {
                    low = mid + 1;
                }
            }
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        } 
        return null;
    }

    private static String getLine(int position, final MappedByteBuffer buffer) {
        // search backwards to find the preceding new line
        // then search forwards again until the next new line
        // return the string in between
        final StringBuilder stringBuilder = new StringBuilder();
        // walk it back
        char candidate = (char)buffer.get(position);
        while (position > 0 && candidate != '\n') {
            candidate = (char)buffer.get(--position);
        }
        // we either are at the beginning of the file or a new line
        if (position == 0) {
            // we are at the beginning at the first char
            candidate = (char)buffer.get(position);
            stringBuilder.append(candidate);
        }
        // there is/are char(s) after new line / first char
        if (isInBuffer(buffer, position)) {
            //first char after new line
            candidate = (char)buffer.get(++position);
            stringBuilder.append(candidate);
            //walk it forward
            while (isInBuffer(buffer, position) && candidate != ('\n')) {
                candidate = (char)buffer.get(++position);
                stringBuilder.append(candidate);
            }
        }
        return stringBuilder.toString();
    }

    private static boolean isInBuffer(final Buffer buffer, int position) {
        return position + 1 < buffer.limit();
    }

    public static File getOrCreateDirectory(final String dirName) { 
        final File directory = new File(dirName);
        try {
            forceMkdir(directory);
            isWritable(directory.toPath());
        } 
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        return directory;
    }
}

I had a similar problem, so I created a (Scala) library from the solutions provided in this thread:

https://github.com/avast/bigmap

It contains utilities for sorting huge files and for binary search in those sorted files...

If you truly want to try memory-mapping the file, I found a tutorial on how to use memory mapping in Java NIO.

License: CC-BY-SA with attribution

Not affiliated with Stack Overflow