I am currently developing a tool in Java that will help track and interpret data being sent across an Ethernet connection. I have already successfully developed both the packet sniffer and the packet data interpreter.

I run into a problem when trying to navigate to specific packets within the trace file. Each packet has an associated timestamp, and I would like to be able to navigate to a specific time window. My current method for doing this is below.

public ArrayList<Packet> getTimeWindow(double time, int window) {
    ArrayList<Packet> packets = new ArrayList<Packet>();
    double start = time - window;
    double end = time + window;

    JpcapCaptor captor;
    try {
        captor = JpcapCaptor.openFile(this.traceFile);
    } catch (IOException e) {
        e.printStackTrace();
        return packets;                        // could not open the trace, return empty
    }

    // Linear scan: read packets in order until we pass the end of the window.
    Packet p = captor.getPacket();
    while (p != null) {
        if (p.timestamp > end) break;          // past the window, stop reading
        if (p.timestamp >= start) packets.add(p);
        p = captor.getPacket();
    }
    captor.close();
    return packets;
}

This works fine for small traces, but can get pretty slow when we're dealing with millions of packets. I would like to implement some form of binary search algorithm, but I can't figure out a way to navigate to the middle of the packets without preprocessing them. The packets are not neatly organized by line, and even if I jump to a random point in the file, I can't guarantee I'm at the start of a packet.

In summary: I am looking to develop an efficient way to search for a specific packet in a capture (.pcap or .cap) file. I've scoured the net, and I haven't been able to find anything that can do quite what I'm asking.

If anyone has any ideas / solutions you could suggest, it would be greatly appreciated.

Thanks!


Solution

An easy, small solution is to build a simple index for the files in question. For instance, you can record the offset in the file of the start of every 1000th packet. Store this information (just a sequence of 64-bit indexes into the original trace file) in a small index file. Then when you're doing binary search, you can use this index, together with the original file, to find (within 1000 packets) the correct point to start reading.
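
Below is a rough sketch of how such an index could be built by walking the trace's record headers directly with a RandomAccessFile, recording a (timestamp, byte offset) pair for every 1000th packet. It assumes a classic libpcap .pcap/.cap layout (24-byte global header, 16-byte per-packet record headers with ts_sec, ts_usec, incl_len, orig_len) rather than the newer pcapng format; the PcapIndexer class, STRIDE constant, and index file layout are all made up for the example.

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class PcapIndexer {
    static final int STRIDE = 1000;   // record every 1000th packet in the index

    // Walks the trace once and writes (timestamp, byte offset) pairs for every
    // STRIDE-th packet to a small side file.
    public static void buildIndex(String pcapPath, String indexPath) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(pcapPath, "r");
             DataOutputStream out = new DataOutputStream(new FileOutputStream(indexPath))) {

            byte[] global = new byte[24];                // classic pcap global header
            in.readFully(global);
            long magic = ((global[0] & 0xffL) << 24) | ((global[1] & 0xffL) << 16)
                       | ((global[2] & 0xffL) << 8)  |  (global[3] & 0xffL);
            boolean littleEndian = (magic == 0xd4c3b2a1L);   // byte-swapped magic number

            byte[] rec = new byte[16];                   // per-packet record header
            long count = 0;
            while (in.getFilePointer() + 16 <= in.length()) {
                long offset = in.getFilePointer();       // start of this packet's record
                in.readFully(rec);
                long tsSec   = readUInt32(rec, 0, littleEndian);
                long tsUsec  = readUInt32(rec, 4, littleEndian);
                long inclLen = readUInt32(rec, 8, littleEndian);

                if (count % STRIDE == 0) {
                    out.writeDouble(tsSec + tsUsec / 1_000_000.0);  // timestamp in seconds
                    out.writeLong(offset);                          // where the packet starts
                }
                in.seek(in.getFilePointer() + inclLen);  // skip the captured payload
                count++;
            }
        }
    }

    private static long readUInt32(byte[] b, int i, boolean le) {
        if (le) {
            return (b[i] & 0xffL) | ((b[i + 1] & 0xffL) << 8)
                 | ((b[i + 2] & 0xffL) << 16) | ((b[i + 3] & 0xffL) << 24);
        }
        return ((b[i] & 0xffL) << 24) | ((b[i + 1] & 0xffL) << 16)
             | ((b[i + 2] & 0xffL) << 8) | (b[i + 3] & 0xffL);
    }
}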

Of course, this requires preprocessing (or processing while generating) the trace files.
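
Once the index file exists, the lookup side is an ordinary binary search for the last indexed entry at or before the start of the window; the returned byte offset is where sequential reading should resume, so in the worst case you scan 999 extra packets before reaching the window. The PcapIndex class and the 16-byte-per-entry layout below are just the assumptions carried over from the sketch above.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PcapIndex {
    private final List<Double> timestamps = new ArrayList<>();
    private final List<Long> offsets = new ArrayList<>();

    public PcapIndex(String indexPath) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(indexPath))) {
            while (in.available() >= 16) {               // each entry is 8 + 8 bytes
                timestamps.add(in.readDouble());
                offsets.add(in.readLong());
            }
        }
    }

    // Byte offset of the latest indexed packet at or before windowStart;
    // sequential reading of the trace can resume from there.
    public long offsetFor(double windowStart) {
        if (timestamps.isEmpty()) return 24;             // just past the global header
        int lo = 0, hi = timestamps.size() - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (timestamps.get(mid) <= windowStart) {
                best = mid;                              // still at or before the window
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return offsets.get(best);
    }
}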

Other tips

This is just a guess, but maybe an interval tree or a segment tree would be a good choice, assuming you can fit all the packets into memory. Interval trees are fairly easy to create if you follow the algorithm in Cormen et al.; a segment tree can be more expensive in terms of memory but should give you faster stabbing queries.
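
Note that the query here is a range query over point timestamps rather than true interval stabbing, so if the packets do fit in memory, even a plain sorted map gives O(log n + k) window lookups without building a full interval or segment tree. A minimal sketch of that simpler alternative, assuming a generic packet type and one double timestamp per packet:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sorted in-memory index keyed by timestamp; P would be your packet type.
public class InMemoryTimeIndex<P> {
    private final TreeMap<Double, List<P>> byTime = new TreeMap<>();

    // Called once per packet while loading the trace.
    public void add(double timestamp, P packet) {
        byTime.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(packet);
    }

    // All packets with timestamp in [time - window, time + window].
    public List<P> getTimeWindow(double time, double window) {
        List<P> result = new ArrayList<>();
        for (List<P> bucket : byTime.subMap(time - window, true, time + window, true).values()) {
            result.addAll(bucket);
        }
        return result;
    }
}

Whether this is enough depends on memory: the offset index above keeps only two numbers per thousand packets, while this keeps every packet resident.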

If the packets won't fit into memory, you could use the capture files' timestamps as the broadest intervals and drill down into each file when someone navigates to that interval.
