Question

I'm trying to read some data from a file to fill the following data structures:

typedef uint64_t t_feat; 

typedef pair<vector<t_feat>, int> t_featfam;

Each log file contains several such families, so I want to save them all in a vector. The log files have a very simple format:

line = "-": start a new family

line = "#": family ends here

line = 64bit unsigned integer (as string): add this value to the family

line = "!": mark the following integer as important (exactly one is marked like this in each family); the marking is done by setting the second value of the family to the index of the important element

There are no mistakes in the files, so every ! is followed by an integer, all families start and end properly, and there are no additional spaces or anything (the only exception is a possible empty line at the end of the file).
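For concreteness, a (hypothetical) single family in this format would look like:

```
-
12345
!
67890
11111
#
```

Here the ! marks 67890, so the family's second value becomes 1 (the index of 67890 in the vector).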

Right now I'm using the following code:

void read_data_from_file(const string &fname, vector<t_featfam> &data)
{
    ifstream f;
    f.open(fname, ios::in);
    while (!f.eof())
    {
        string currentline;
        getline(f, currentline);
        if (currentline == "" || currentline == "#")
            continue;
        else if (currentline == "-")
            data.push_back(t_featfam());
        else if (currentline == "!")
            data.back().second = data.back().first.size();
        else
        {
            istringstream iss(currentline);
            t_feat value;
            iss >> value;
            data.back().first.push_back(value);
        }
    }
}

This works, but it feels horribly inefficient, and it probably is... If it were just numbers, I would certainly use only fstreams, but as it is, I'm not sure how to do that properly. Can anyone point me in the right direction? This should be possible somehow. I'm using Visual Studio and don't mind VS-specific solutions, but I don't want to include Boost.

edit2: here is a version that really works, using Steve's code and improved with the ideas from luk32... 4 times faster than the code above...

void read_data_from_file(const string &fname, vector<t_featfam> &data)
{
    ifstream f;
    f.open(fname, ios::in);
    char* currentline = new char [30];
    while (!f.eof())
    {
        f.getline(currentline, 30);
        switch (currentline[0])
        {
        case '\0':
        case '#':
            break;
        case '-':
            data.push_back(t_featfam());
            break;
        case '!':
            data.back().second = data.back().first.size();
            break;
        default:
            data.back().first.push_back(stoull(currentline));
            break;
        }
    }
    delete[] currentline; // must be delete[], since it was allocated with new[]
}

Solution

I would probably do something along the lines of the following:

  • move currentline outside the loop - prevents reallocs every time around the loop
  • use a switch statement on the first char of currentline so we jump instead of multiple if/else statements
  • use std::stoull instead of a stringstream to convert currentline to uint64_t

Here's the function (not tested to see if it compiles, just wrote it up)

void read_data_from_file(const string &fname, vector<t_featfam> &data)
{
    ifstream f;
    f.open(fname, ios::in);
    string currentline;
    while (!f.eof())
    {
        getline(f, currentline);
        switch (currentline.c_str()[0])
        {
            case '\0':
            case '#':
                break;
            case '-':
                data.push_back(t_featfam());
                break;
            case '!':
                data.back().second = data.back().first.size();
                break;
            default:
                data.back().first.push_back(std::stoull(currentline));
                break;
        }
    }
}

OTHER TIPS

Most of the time is lost in memory allocations. You have one allocation when you call getline() and another when you construct the istringstream. Each allocation costs roughly 250 cycles on my system, so you can save roughly 500 cycles per line that you read.

You can eliminate the allocations altogether if you use mmap() to map the entire file into your address space. Once you have everything in a single large array of char, you can relatively easily parse it without any need to copy lines out of that large array.
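A minimal sketch of that mmap() approach, assuming a POSIX system (on Windows the equivalent would be CreateFileMapping/MapViewOfFile) and the typedefs from the question; error handling is kept to a minimum:

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef uint64_t t_feat;
typedef std::pair<std::vector<t_feat>, int> t_featfam;

void read_data_mmap(const std::string &fname, std::vector<t_featfam> &data)
{
    int fd = open(fname.c_str(), O_RDONLY);
    if (fd < 0) return;
    struct stat st;
    fstat(fd, &st);
    size_t len = (size_t)st.st_size;
    const char *buf = (const char *)mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (buf == MAP_FAILED) return;

    const char *p = buf, *end = buf + len;
    while (p < end)
    {
        switch (*p)
        {
        case '\n':                 // empty line
            ++p;
            continue;
        case '#':                  // family ends here
            break;
        case '-':                  // start a new family
            data.push_back(t_featfam());
            break;
        case '!':                  // next value is the important one
            data.back().second = (int)data.back().first.size();
            break;
        default:                   // a number: parse it in place, no copy
        {
            // strtoull stops at the newline; this relies on the format's
            // guarantee that a non-digit follows every number
            char *num_end;
            data.back().first.push_back(std::strtoull(p, &num_end, 10));
            p = num_end;
        }
        }
        while (p < end && *p != '\n') ++p;  // skip past the rest of the line
        ++p;
    }
    munmap((void *)buf, len);
}
```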

How long does it take to process the entire file? If speed is really an issue here, and since you should have plenty of memory if you're running this program on a PC, you could use the equivalent of seek-to-end, tell, seek-to-beginning to get the size of the log file, allocate that much memory, then read the entire file into one large buffer. Then use memchr() to scan for every occurrence of "-" to determine the number of pairs, optionally create an array of pointers (pre-allocated based on the maximum number of pairs for a given file size), and do a one-time resize of the vector of pairs (or a one-time new if using a pointer to a vector of pairs). Then parse the buffer a second time to fill in the pairs via index or iterator instead of push_back(). Although this method scans the file buffer twice, it may compensate by avoiding the internal dynamic resizes that a series of push_back() calls would cause.
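The steps above can be sketched roughly as follows (same typedefs as in the question; note that counting '-' bytes with memchr() only works because the format guarantees '-' appears solely as the family marker, the values being unsigned):

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

typedef uint64_t t_feat;
typedef std::pair<std::vector<t_feat>, int> t_featfam;

void read_data_buffered(const std::string &fname, std::vector<t_featfam> &data)
{
    std::ifstream f(fname, std::ios::binary);
    f.seekg(0, std::ios::end);
    std::streamsize size = f.tellg();          // seek to end + tell = file size
    f.seekg(0, std::ios::beg);
    if (!f || size <= 0) return;

    std::vector<char> buf((size_t)size + 1);   // one big read, +1 for terminator
    f.read(buf.data(), size);
    buf[(size_t)size] = '\0';                  // so strtoull cannot run off the end

    // first pass: count '-' markers with memchr to size the vector once
    size_t families = 0;
    for (const char *q = buf.data(), *e = buf.data() + size;
         (q = (const char *)std::memchr(q, '-', (size_t)(e - q))) != nullptr; ++q)
        ++families;
    data.reserve(data.size() + families);

    // second pass: parse the buffer in place
    const char *p = buf.data(), *end = buf.data() + size;
    while (p < end)
    {
        switch (*p)
        {
        case '\n': ++p; continue;              // empty line
        case '#': break;                       // family ends here
        case '-': data.push_back(t_featfam()); break;
        case '!': data.back().second = (int)data.back().first.size(); break;
        default:                               // a number
        {
            char *num_end;
            data.back().first.push_back(std::strtoull(p, &num_end, 10));
            p = num_end;
        }
        }
        while (p < end && *p != '\n') ++p;     // skip past the rest of the line
        ++p;
    }
}
```

This sketch uses reserve() on the outer vector only; going further, as suggested above, one could also pre-size each family's inner vector from the counts gathered in the first pass.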

Another option would be to put the pair count at the start or end of the log file, which would eliminate the first scan used to get the number of pairs. If you have an idea of the maximum file size, you could simply allocate enough memory to handle the largest expected log file, which would eliminate having to determine the file size before allocating.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow