Question

Here is my source code that loads a text file and splits each line into individual items (words).

How can I further optimize the code? Testing for empty lines (and some other constructions) is, in my opinion, a little inefficient...

#include <cstdio>
#include <cstring>
#include <cctype>
#include <string>
#include <vector>

typedef std::vector < std::string >  TLines;
typedef std::vector < std::vector < std::string > > TItems;

TItems TFloadFile ( const char * file_name )
{
    //Load projection from file
    unsigned int lines = 0;
    char buffer[BUFF];
    FILE * file;
    TItems file_words;
    TLines file_lines;


    file = fopen ( file_name, "r" );

    if ( file != NULL )
    {
            for ( ; fgets ( buffer, BUFF, file ); )
            {
                    //Remove empty lines
                    bool empty_line = true;
                    for ( unsigned i = 0; i < strlen ( buffer ); i++ )
                    {
                            if ( !isspace ( ( unsigned char ) buffer[i] ) )
                            {
                                    empty_line = false;
                                    break;
                            }
                    }

                    if ( !empty_line )
                    {
                            file_lines.push_back ( buffer );
                            lines++;
                    }
            }


            file_words.resize ( lines );
            for ( unsigned int i = 0; i < lines; i++ )
            {
                    char * word = strtok ( &file_lines[i][0], " \t,;\r\n" );
                    for ( ; word; word = strtok ( 0, " \t,;\r\n" ) )
                    {
                            file_words[i].push_back ( word );
                    }
            }

            fclose ( file );
    }

    return file_words;
}

Thanks for your help...

Was it helpful?

Solution

Before optimizing, can you explain how big the file is, how long the code currently takes to execute, and why you think it isn't already I/O-bound (i.e. limited by hard-disk speed)? How long do you think it should take? Some idea of the type of data in the file would help too (such as average line length, average proportion of empty lines, etc.).

That said, combine the remove-empty-lines loop with the word-tokenising loop. Then you can remove TLines altogether and avoid the std::string constructions and the vector push-backs. I haven't checked that this code works, but it should be close enough to give you the idea. It also includes a more efficient empty-line spotter:

if ( file != NULL )
{
    for ( ; fgets ( buffer, BUFF, file ); )
    {
        bool is_empty = true;
        for (char *c = buffer; *c != '\0'; c++)
        {
            if (!isspace((unsigned char)*c))
            {
                is_empty = false;
                break;
            }
        }

        if (is_empty)
            continue;

        file_words.resize ( lines + 1 );
        char * word = strtok ( buffer, " \t,;\r\n" );
        for ( ; word; word = strtok ( 0, " \t,;\r\n" ) )
        {
                file_words[lines].push_back ( word );
        }

        lines++;
    }

    fclose ( file );
}

Other suggestions

The line for ( unsigned i = 0; i < strlen ( buffer ); i++ ) is quite inefficient as you're calculating the length of buffer each time through the loop. However, it's possible that this will be optimised away by the compiler.
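A minimal sketch of that fix: hoist the length out of the loop condition (the accepted answer's pointer scan achieves the same thing):

size_t len = strlen ( buffer );
for ( unsigned i = 0; i < len; i++ )
{
        if ( !isspace ( ( unsigned char ) buffer[i] ) )
        {
                empty_line = false;
                break;
        }
}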

You're pushing items onto your std::vectors without reserve()ing any space. For large files, this will involve a lot of overhead, as the contents of the vectors will need to be copied in order to resize them. I just read @Notinlist's answer, which already talks about the inefficiencies of std::vector::resize().
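As a minimal sketch (the figure of 1000 is an assumed estimate, not taken from the original code), a single reserve() call up front avoids the repeated reallocations:

TLines file_lines;
file_lines.reserve ( 1000 );   //estimated line count; tune for your data

TItems file_words;
file_words.reserve ( 1000 );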

Instead of reading each line into a vector through repeated fgets() calls, could you not simply determine the number of bytes in the file, dynamically allocate a char array to hold them, and then dump the bytes into it? Then, you could parse the words and store them in file_words. This would be more efficient than the method you're currently using.
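A hedged sketch of that idea: size the file with fseek()/ftell(), slurp it with a single fread(), and tokenise in place. Note that treating '\n' as just another delimiter flattens the per-line grouping, so you would need extra logic to rebuild the TItems structure:

FILE * file = fopen ( file_name, "rb" );
if ( file != NULL )
{
        fseek ( file, 0, SEEK_END );
        long size = ftell ( file );
        fseek ( file, 0, SEEK_SET );

        char * data = new char[size + 1];
        size_t bytes_read = fread ( data, 1, size, file );
        data[bytes_read] = '\0';
        fclose ( file );

        //Tokenise the whole buffer in one pass
        for ( char * word = strtok ( data, " \t,;\r\n" ); word; word = strtok ( 0, " \t,;\r\n" ) )
        {
                //... store word ...
        }

        delete [] data;
}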

For one thing,

file_lines.push_back ( buffer );

That is a very expensive line. If you don't have to use a vector, use a list instead, and convert the list to a vector once you have finished the job.

If you absolutely need to use a vector for this purpose, then you should grow it exponentially instead, like this:

if(file_lines.size()<=lines){
    file_lines.resize((int)(lines * 1.3 + 1));
}

That way you will have far fewer CPU-intensive .resize() operations, at the cost of a small memory overhead.

Simplified and converted to use std::list instead of std::vector.

#include <fstream>
#include <list>
#include <string>
#include <cstring>

typedef std::list< std::list< std::string > > TItems;

TItems TFloadFile ( const char * file_name )
{
    using namespace std;
    //Load projection from file
    ifstream file(file_name);
    TItems file_words;
    string line;

    while(getline(file,line))
    {
        //Skip blank lines up front (also keeps &line[0] safe)
        if(line.empty())
            continue;

        file_words.push_back(list<string>());
        list<string> &words(file_words.back());

        //Tokenise in place; &line[0] is a writable buffer
        char *word = strtok(&line[0], " \t,;\r\n");
        for(; word; word=strtok(0, " \t,;\r\n"))
        {
            words.push_back(word);
        }
        //Drop lines that contained only delimiters
        if(words.empty())
            file_words.pop_back();
    }
    return file_words;
}
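For completeness, a quick usage sketch of the list version (the input file name is hypothetical):

#include <iostream>

int main ()
{
    TItems words = TFloadFile ( "projection.txt" );   //hypothetical input file

    for ( TItems::const_iterator line = words.begin(); line != words.end(); ++line )
    {
        for ( std::list<std::string>::const_iterator w = line->begin(); w != line->end(); ++w )
            std::cout << *w << ' ';
        std::cout << '\n';
    }
    return 0;
}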