Binary search in a sorted (memory-mapped?) file in Java

https://stackoverflow.com/questions/736556

09-09-2019

Question

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a 500+ GB sorted text file using binary search (essentially: "seek" to a byte offset in the middle of the file, backtrack to the nearest newline, compare the line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)

I have experimented with several database solutions, but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of an existing Java library that implements this functionality? Failing that, could you point me to some idiomatic example code that does random-access reads in text files?

Alternatively, I am not familiar with the new (?) Java I/O libraries, but would it be an option to memory-map the 500 GB text file (I am on a 64-bit machine with memory to spare) and do the binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.

Solution

I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?

I have similar code (seek, read, repeat until done) in my own application, benchmarked java.io streams against MappedByteBuffer in a production environment, and posted the results on my blog (Geekomatic posts tagged "java.nio") with raw data, graphs and all.

Two-second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.

To work for files larger than ~2 GB, which is a problem because of the cast and .position(int pos), I have crafted a paging algorithm backed by an array of MappedByteBuffers. You will need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.

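import static java.nio.channels.FileChannel.MapMode.READ_ONLY;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class StusMagicLargeFileReader  {
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private final List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte[] raw = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = (new FileInputStream(file)).getChannel();
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
        String candidate = buffer.getString(position--);
        // step backwards one byte at a time until we hit a newline
        // (compare against the String "\n": a String never equals the char '\n')
        while (position >= 0 && !candidate.equals("\n"))
            candidate = buffer.getString(position--);
        //have newline position or start of file...do other stuff
    }

    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        // map the file as a list of pages, each at most PAGE_SIZE bytes long
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index * PAGE_SIZE); // last, partial page
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            // pages are appended in order; List.add(int, E) would not accept a long index
            buffers.add(channel.map(READ_ONLY, start, length));
        }
    }

    public String getString(long bytePosition) {
        // locate the page first, then the offset within that page
        int page  = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw);
    }
}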
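
To show how the paging idea above might be carried through to a full prefix lookup, here is a minimal sketch. It is not the code from my application: the class name PagedPrefixSearch, its methods, and the configurable page size are all hypothetical, and it assumes the file is sorted bytewise (as `LC_ALL=C sort` would do), uses '\n' line endings, and has no single line longer than 2 GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: prefix lookup over a file mapped as a list of pages.
class PagedPrefixSearch {
    private final long pageSize;
    private final long size;
    private final List<MappedByteBuffer> pages = new ArrayList<>();

    PagedPrefixSearch(Path path, long pageSize) throws IOException {
        this.pageSize = pageSize; // must not exceed Integer.MAX_VALUE
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            size = ch.size();
            for (long start = 0; start < size; start += pageSize) {
                long length = Math.min(pageSize, size - start);
                pages.add(ch.map(FileChannel.MapMode.READ_ONLY, start, length));
            }
        } // the mappings stay valid after the channel is closed
    }

    private byte byteAt(long pos) {
        return pages.get((int) (pos / pageSize)).get((int) (pos % pageSize));
    }

    // Offset of the first byte of the line that contains pos.
    private long lineStart(long pos) {
        while (pos > 0 && byteAt(pos - 1) != '\n') pos--;
        return pos;
    }

    // Offset of the line following the one that starts at start (or the file size).
    private long nextLineStart(long start) {
        long p = start;
        while (p < size && byteAt(p) != '\n') p++;
        return Math.min(p + 1, size);
    }

    // < 0: line sorts before key; 0: line starts with key; > 0: line sorts after key.
    private int comparePrefix(long start, byte[] key) {
        for (int i = 0; i < key.length; i++) {
            long p = start + i;
            if (p >= size) return -1;
            byte b = byteAt(p);
            if (b == '\n') return -1; // the line ended first, so it sorts before key
            int d = (b & 0xff) - (key[i] & 0xff);
            if (d != 0) return d;
        }
        return 0;
    }

    // Returns the first line that starts with prefix, or null if there is none.
    String findFirst(String prefix) {
        byte[] key = prefix.getBytes(StandardCharsets.UTF_8);
        long low = 0, high = size; // low and high are always line starts (or size)
        while (low < high) {
            long mid = lineStart(low + (high - low) / 2);
            if (comparePrefix(mid, key) < 0) low = nextLineStart(mid);
            else high = mid;
        }
        if (low >= size || comparePrefix(low, key) != 0) return null;
        int len = (int) (nextLineStart(low) - low);
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) out[i] = byteAt(low + i);
        if (len > 0 && out[len - 1] == '\n') len--; // drop the line terminator
        return new String(out, 0, len, StandardCharsets.UTF_8);
    }
}
```

In real use the page size would be Integer.MAX_VALUE, as in the reader above; keeping it a constructor parameter only makes the page-boundary logic easy to exercise on small files.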

Other tips

I have the same problem. I am trying to find all lines that start with a given prefix in a sorted file.

Here is a method I cooked up, which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415

I have tested it, but not thoroughly just yet. It does not use memory mapping, though.

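// requires: java.io.File, java.io.IOException, java.io.RandomAccessFile,
//           java.util.ArrayList, java.util.List
public static List<String> binarySearch(String filename, String string) {
    List<String> result = new ArrayList<String>();
    File file = new File(filename);
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        long low = 0;
        long high = file.length();

        long p = -1;
        while (low < high) {
            long mid = (low + high) / 2;
            p = mid;
            // walk back from mid to the nearest newline at or before it
            while (p >= 0) {
                raf.seek(p);

                char c = (char) raf.readByte();
                //System.out.println(p + "\t" + c);
                if (c == '\n')
                    break;
                p--;
            }
            if (p < 0)
                raf.seek(0);
            // the file pointer now sits at the start of a line
            String line = raf.readLine();
            //System.out.println("-- " + mid + " " + line);
            // readLine() returns null when mid sits on the final newline;
            // treat that as "past the last line" instead of throwing
            if (line != null && line.compareTo(string) < 0)
                low = mid + 1;
            else
                high = mid;
        }

        // low can equal the file length; clamp it so readByte() does not hit EOF
        p = Math.min(low, file.length() - 1);
        while (p >= 0) {
            raf.seek(p);
            if (((char) raf.readByte()) == '\n')
                break;
            p--;
        }

        if (p < 0)
            raf.seek(0);

        // collect every consecutive line that starts with the prefix
        while (true) {
            String line = raf.readLine();
            if (line == null || !line.startsWith(string))
                break;
            result.add(line);
        }
    } catch (IOException e) {
        System.out.println("IOException:");
        e.printStackTrace();
    }
    return result;
}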

I am not aware of any library that has this functionality. However, correct code for an external binary search in Java should look similar to this:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Comparator;

class ExternalBinarySearch {
    final RandomAccessFile file;
    final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here

    public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
        this.file = new RandomAccessFile(f, "r");
        this.test = test;
    }

    public String search(String element) throws IOException {
        long l = file.length();
        return search(element, -1, l - 1);
    }

    /**
     * Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
     * In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
     */
    private String search(String element, long low, long high) throws IOException {
        if (high - low < 1024) {
            // search directly
            long p = low;
            while (p < high) {
                String line = nextLine(p);
                if (line == null) {
                    return null; // no further line could be read
                }
                int r = test.compare(line, element);
                if (r > 0) {
                    return null;
                } else if (r < 0) {
                    p += line.length();
                } else {
                    return line;
                }
            }
            return null;
        } else {
            long m = low + ((high - low) / 2);
            String line = nextLine(m);
            if (line == null) {
                return null; // no line begins within 1024 bytes of m
            }
            int r = test.compare(line, element);
            if (r > 0) {
                return search(element, low, m);
            } else if (r < 0) {
                return search(element, m, high);
            } else {
                return line;
            }
        }
    }

    /** Returns the first line that begins after offset low, including its trailing \n. */
    private String nextLine(long low) throws IOException {
        if (low == -1) { // Beginning of file
            file.seek(0);
        } else {
            file.seek(low);
        }
        int bufferLength = 65 * 1024; // 1 KB to find the line start plus up to 64 KB of line
        byte[] buffer = new byte[bufferLength];
        int r = file.read(buffer);
        int lineBeginIndex = -1;

        // search beginning of line
        if (low == -1) { // beginning of file
            lineBeginIndex = 0;
        } else {
            // normal mode
            for (int i = 0; i < 1024 && i < r; i++) {
                if (buffer[i] == '\n') {
                    lineBeginIndex = i + 1;
                    break;
                }
            }
        }
        if (lineBeginIndex == -1) {
            // no line begins within next 1024 bytes
            return null;
        }
        int start = lineBeginIndex;
        for (int i = start; i < r; i++) {
            if (buffer[i] == '\n') {
                // Found end of line
                return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
            }
        }
        if (r < bufferLength) {
            // short read, taken here to mean end of file: return the last,
            // unterminated line, or null if nothing is left
            return start < r ? new String(buffer, start, r - start) : null;
        }
        throw new IllegalArgumentException("Line too long");
    }
}

Please note: I wrote this code ad hoc. Corner cases are not tested nearly well enough, the code assumes that no single line is larger than 64 KB, etc.

I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because there is then no need to search for the next line at each step.

I know that was not the question, but building a prefix tree data structure such as (Patricia) tries (on disk/SSD) might be a good idea for doing the prefix search.
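
As a rough illustration of the index-file idea, here is a minimal sketch. The class LineOffsetIndex and its method names are made up for this example; it assumes '\n' line endings and ASCII data sorted bytewise, and it stores each line's start offset as an 8-byte big-endian long so that entry n can be found at byte 8 * n of the index file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: a side file holding the start offset of every line.
class LineOffsetIndex {

    // Scans data once and writes one long per line to index; returns the line count.
    static long build(Path data, Path index) throws IOException {
        long lines = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(data));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(index)))) {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    out.writeLong(pos); // big-endian, matches RandomAccessFile.readLong()
                    lines++;
                }
                atLineStart = (b == '\n');
                pos++;
            }
        }
        return lines;
    }

    // Reads line number n (0-based) with two seeks: one in the index, one in the data.
    static String line(RandomAccessFile data, RandomAccessFile index, long n) throws IOException {
        index.seek(n * Long.BYTES);
        data.seek(index.readLong());
        return data.readLine();
    }

    // Binary search over line numbers; returns the first line starting with prefix, or null.
    static String findFirst(RandomAccessFile data, RandomAccessFile index, String prefix)
            throws IOException {
        long count = index.length() / Long.BYTES;
        long low = 0, high = count;
        while (low < high) {
            long mid = low + (high - low) / 2;
            if (line(data, index, mid).compareTo(prefix) < 0) low = mid + 1;
            else high = mid;
        }
        if (low < count) {
            String candidate = line(data, index, low);
            if (candidate.startsWith(prefix)) return candidate;
        }
        return null;
    }
}
```

Because the search runs over line numbers rather than byte offsets, every probe lands exactly on a line start and no backtracking to a newline is needed.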

This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position of each string. I am assuming the strings are separated by newlines (or carriage returns):

    RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
    List<Long> indexList = new ArrayList<Long>();
    long pos = 0;
    while (file.readLine() != null)
    {
        // pos is the offset at which the line just read started
        indexList.add(pos);
        pos = file.getFilePointer();
    }
    int indexSize = indexList.size();
    Long[] indexArray = new Long[indexSize];
    indexList.toArray(indexArray);

The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] as well, but I did not show that above. Finally, the code to read the string from a given indexed position:

    int i = 0; // Initialize this appropriately for your algorithm.
    file.seek(indexArray[i]);
    String line = file.readLine();
    // At this point, line contains the string #i.
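
For completeness, the conversion to a primitive long[] could look something like the sketch below. The class and method names are only illustrative. A long[] stores 8 bytes per entry with no per-element object, whereas a Long[] holds references to boxed objects, so the primitive array is typically both smaller and faster to scan.

```java
import java.util.List;

// Hypothetical helper for turning the boxed index into a primitive array.
class IndexArrays {
    // Copies the boxed offsets into a long[]; each element is unboxed once here.
    static long[] toPrimitive(List<Long> offsets) {
        long[] result = new long[offsets.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = offsets.get(i);
        }
        return result;
    }
}
```

Keep in mind that Java arrays are indexed by int, so a single array holds at most about 2^31 offsets; a file with more lines than that would need several arrays or an on-disk index.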

If you are dealing with a 500 GB file, you might want to use a faster lookup method than binary search, namely a radix sort, which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.

I have posted an example of a radix sort solution for integers, but you can use the same idea: basically, cut down the sort time by partitioning the data into buckets, then use an O(1) lookup to retrieve the bucket of data that is relevant.

Option Strict On
Option Explicit On

Module Module1

Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0

Private Sub generateData()
    ' fill with random numbers between 0 and MAX_SIZE - 1
    ' (the upper bound passed to Random.Next is exclusive)
    For i = 0 To MAX_SIZE - 1
        m_input(i) = m_randomGen.Next(0, MAX_SIZE)
    Next

End Sub

Private Sub sortData()
    For i As Integer = 0 To MAX_SIZE - 1
        Dim x = m_input(i)
        If m_table(x) Is Nothing Then
            m_table(x) = New List(Of Integer)
        End If
        m_table(x).Add(x)
        ' clearly this is simply going to be MAX_SIZE -1
        m_operations = m_operations + 1
    Next
End Sub

Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
    If start < 0 Or start > MAX_SIZE - 1 Then
        Throw New Exception("printData - start out of range")
    End If
    If finish < 0 Or finish > MAX_SIZE - 1 Then
        Throw New Exception("printData - finish out of range")
    End If
    For i As Integer = start To finish
        If m_table(i) IsNot Nothing Then
            For Each x In m_table(i)
                Console.WriteLine(x)
            Next
        End If
    Next
End Sub

' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
    m_operations = 0
    generateData()
    Console.WriteLine("Time started = " & Now.ToString())
    sortData()
    Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
    ' print out a random 100 segment from the sorted array
    Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
    printData(start, start + 100)
End Sub

Sub Main()
    test()
    Console.ReadLine()
End Sub

End Module
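
To relate the bucket idea to string prefixes in Java, here is one possible sketch. It is an adaptation, not a translation of the module above: the class FirstByteBuckets and its methods are hypothetical, and it assumes the file is sorted bytewise with '\n' line endings. One pass over the file records where the lines for each leading byte begin, and a lookup then narrows the search to a single byte range in O(1); an ordinary binary search can run inside that range.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: one bucket per possible first byte of a line.
class FirstByteBuckets {

    // starts[b] = offset of the first line whose first byte is >= b;
    // starts[256] = file length. Empty lines are not assigned to any bucket.
    static long[] build(Path sortedFile) throws IOException {
        long[] starts = new long[257];
        Arrays.fill(starts, -1);
        long pos = 0;
        boolean atLineStart = true;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(sortedFile))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart && b != '\n' && starts[b] == -1) {
                    starts[b] = pos; // first line that begins with byte b
                }
                atLineStart = (b == '\n');
                pos++;
            }
        }
        starts[256] = pos;
        // A byte that never starts a line gets an empty range.
        for (int b = 255; b >= 0; b--) {
            if (starts[b] == -1) starts[b] = starts[b + 1];
        }
        return starts;
    }

    // Returns {from, to}: the half-open byte range that can contain lines with this prefix.
    static long[] range(long[] starts, String prefix) {
        byte[] key = prefix.getBytes(StandardCharsets.UTF_8);
        if (key.length == 0) {
            return new long[] {0, starts[256]}; // an empty prefix matches the whole file
        }
        int b = key[0] & 0xff;
        return new long[] {starts[b], starts[b + 1]};
    }
}
```

Bucketing on the first two or three bytes instead of one would shrink the ranges further, at the cost of a larger table; which depth pays off depends on how the prefixes are distributed.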

I have posted a gist: https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c

It is a fairly complete example based on what I found on Stack Overflow and some blogs. Hopefully someone else can use it.

import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;

import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

public class FileUtils {

    private FileUtils() {
    }

    private static boolean found(final String candidate, final String prefix) {
        return isBlank(candidate) || candidate.startsWith(prefix);
    }

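    private static boolean before(final String candidate, final String prefix) {
        // compare against at most prefix.length() chars; a candidate shorter than
        // the prefix would otherwise make substring() throw
        final int end = Math.min(prefix.length(), candidate.length());
        return prefix.compareTo(candidate.substring(0, end)) < 0;
    }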

    public static MappedByteBuffer getMappedByteBuffer(final Path path) {
        FileChannel fileChannel = null;
        try {
            fileChannel = FileChannel.open(path, READ);
            return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            closeQuietly(fileChannel);
        }
    }

    public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
        if (buffer == null) {
            return null;
        }
        try {
            long low = 0;
            long high = buffer.limit();
            while (low < high) {
                int mid = (int) ((low + high) / 2);
                final String candidate = getLine(mid, buffer);
                if (found(candidate, prefix)) {
                    return trimToNull(candidate);
                } 
                else if (before(candidate, prefix)) {
                    high = mid;
                } 
                else {
                    low = mid + 1;
                }
            }
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        } 
        return null;
    }

    private static String getLine(int position, final MappedByteBuffer buffer) {
        // search backwards to find the preceding new line
        // then search forwards again until the next new line
        // return the string in between
        final StringBuilder stringBuilder = new StringBuilder();
        // walk it back
        char candidate = (char)buffer.get(position);
        while (position > 0 && candidate != '\n') {
            candidate = (char)buffer.get(--position);
        }
        // we either are at the beginning of the file or a new line
        if (position == 0) {
            // we are at the beginning at the first char
            candidate = (char)buffer.get(position);
            stringBuilder.append(candidate);
        }
        // there is/are char(s) after new line / first char
        if (isInBuffer(buffer, position)) {
            //first char after new line
            candidate = (char)buffer.get(++position);
            stringBuilder.append(candidate);
            //walk it forward
            while (isInBuffer(buffer, position) && candidate != ('\n')) {
                candidate = (char)buffer.get(++position);
                stringBuilder.append(candidate);
            }
        }
        return stringBuilder.toString();
    }

    private static boolean isInBuffer(final Buffer buffer, int position) {
        return position + 1 < buffer.limit();
    }

    public static File getOrCreateDirectory(final String dirName) { 
        final File directory = new File(dirName);
        try {
            forceMkdir(directory);
            isWritable(directory.toPath());
        } 
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        return directory;
    }
}

I had the same problem, so I created a (Scala) library from the solutions provided in this thread:

https://github.com/avast/BigMap

It contains a utility for sorting a huge file and for binary search in that sorted file...

If you really want to try memory-mapping the file, I found a tutorial on how to use memory mapping in Java nio.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow