Domanda

PROBLEM

I have this old piece of pre-stl C++ code that I want to translate into std C++11 without losing efficiency.

using T = unsigned;  // but can be any POD
FILE* fp = fopen( outfile.c_str(), "r" );
T* x = new T[big_n];
fread( x, sizeof(T), big_n, fp );
delete[] x;
fclose( fp );

Note that big_n is really big - like millions of records big, so any inefficiencies are pronounced.

PREVIOUS SOLUTION

In this answer from my previous question, I accepted this solution:

std::vector<T> x(big_n);
fread(x.data(), sizeof(T), big_n, fp);

ISSUE AND ATTEMPTED SOLUTION

That previous solution works, but the constructor actually calls T's default constructor big_n times. This is very slow when big_n is really big (and totally unnecessary as I am about to fread() the entire chunk from disk). FWIW, in my test case for one file, it was taking 3 seconds instead of 200ms.

So I tried to use this instead:

std::vector<T> x;
x.reserve( big_n );
fread(x.data(), sizeof(T), big_n, fp);

This seems to work, but then I run into the issue that size() returns 0 and not big_n.

How do I correct this without losing too much efficiency?

ADDENDUM

I just noticed that std::vector<> can take a custom allocator. Could using that form of the constructor solve my problem? I'm looking into this approach now.

WHAT WORKS FOR ME

I've looked into Ali's custom allocator solution below in addition to jrok's simple array solution. I have decided to adapt jrock's solution for its ease-of-understanding/lower maintenance.

The working code I came up with is below:

#include <vector>
#include <set>
#include <memory>
#include <fstream>
#include <iostream>
#include <cassert>

struct Foo
{
        int m_i;
        Foo() { }
        Foo( int i ) : m_i( i ) { }
        bool operator==( Foo const& rhs ) const { return m_i==rhs.m_i; }
        bool operator!=( Foo const& rhs ) const { return m_i!=rhs.m_i; }
        friend std::ostream& operator<<( std::ostream& os, Foo const& rhs )
        { os << rhs.m_i; }
};


// DESIGN NOTES  /*{{{*/
//
// LIMITATION  T must be a POD so we can fread/fwrite quickly
//
// WHY DO WE NEED THIS CLASS?
//
// We want to write a large number of small PODs to disk and read them back without
//   1. spurious calls to default constructors by std::vector
//   2. writing to disk a gazillion times
//
// SOLUTION
//   A hybrid class containing a std::vector<> for adding new items and a
//   std::unique_ptr<T[]> for fast persistence.  From the user's POV, it looks
//   like a std::vector<>.
//
// Algorithm
//   1. add new items into:
//      std::vector<T>        m_v;
//   2. when writing to disk, write out m_v as a chunk
//   3. when reading from disk, read into m_chunk (m_v will start empty again)
//   4. m_chunk and m_v combined will represent all the data
/*}}}*/

template<typename T>
class vector_chunk
{
// STATE  /*{{{*/
        size_t                m_n_in_chunk;
        std::unique_ptr<T[]>  m_chunk;
        std::vector<T>        m_v;
/*}}}*/

// CONSTRUCTOR, INITIALIZATION  /*{{{*/
public:
        vector_chunk() : m_n_in_chunk( 0 ) { }
/*}}}*/

// EQUALITY /*{{{*/
        public:
                bool operator==( vector_chunk const& rhs ) const
                {
                        if ( rhs.size()!=size() )
                                return false;

                        for( size_t i=0; i<size(); ++i )
                                if ( operator[]( i )!=rhs[i] )
                                        return false;

                        return true;
                }
/*}}}*/

// OSTREAM /*{{{*/
        public:
                friend std::ostream& operator<<( std::ostream& os, vector_chunk const& rhs )
                {
                        for( size_t i=0; i<rhs.m_n_in_chunk; ++i )
                                os << rhs.m_chunk[i] << "\n";
                        for( T const& t : rhs.m_v )
                                os << rhs.t << "\n";
                }
/*}}}*/
// BINARY I/O  /*{{{*/
public:
        void write_as_binary( std::ostream& os ) const
        {
                // write everything out
                size_t const  n_total = size();
                os.write( reinterpret_cast<const char*>( &n_total ), sizeof( n_total ));
                os.write( reinterpret_cast<const char*>( &m_chunk[0] ), m_n_in_chunk * sizeof( T ));
                os.write( reinterpret_cast<const char*>( m_v.data() ), m_v.size() * sizeof( T ));
        }
        void read_as_binary(  std::istream& is )
        {
                // only read into m_chunk, clear m_v
                is.read( reinterpret_cast<char*>( &m_n_in_chunk ), sizeof( m_n_in_chunk ));
                m_chunk.reset( new T[ m_n_in_chunk ] );
                is.read( reinterpret_cast<char*>( &m_chunk[0] ), m_n_in_chunk * sizeof( T ));
                m_v.clear();
        }
/*}}}*/

// DELEGATION to std::vector<T>  /*{{{*/
public:
        size_t size() const                 { return m_n_in_chunk + m_v.size(); }
        void push_back( T const& value )    { m_v.push_back( value ); }
        void push_back( T&&      value )    { m_v.push_back( value ); }
        template< class... Args >
        void emplace_back( Args&&... args ) { m_v.emplace_back( args... ); }
        typename std::vector<T>::const_reference
        operator[]( size_t pos ) const
        { return ((pos < m_n_in_chunk) ? m_chunk[ pos ] : m_v[ pos - m_n_in_chunk]); }

        typename std::vector<T>::reference
        operator[]( size_t pos )
        { return ((pos < m_n_in_chunk) ? m_chunk[ pos ] : m_v[ pos - m_n_in_chunk]); }
/*}}}*/
};

int main()
{
        size_t const n = 10;
        vector_chunk<Foo>  v, w;
        for( int i=0; i<n; ++i )
                v.emplace_back( Foo{ i } );

        std::filebuf                   ofb, ifb;
        std::unique_ptr<std::ostream>  osp;
        std::unique_ptr<std::istream>  isp;

        ofb.open( "/tmp/junk.bin", (std::ios::out | std::ios::binary));
        osp.reset( new std::ostream( &ofb ));
        v.write_as_binary( *osp );
        ofb.close();

        ifb.open( "/tmp/junk.bin", (std::ios::in | std::ios::binary));
        isp.reset( new std::istream( &ifb ));
        w.read_as_binary(  *isp );
        ifb.close();

        assert( v==w );
}
È stato utile?

Soluzione 2

If you don't need the interface of the vector:

auto p = unique_ptr<T[]>{ new T[big_n] };

It won't initialize the array if T is POD, otherwise it calls default constructors (default-initialization).

In C++1y, you'll be able to use std::make_unique.

Altri suggerimenti

Using vector::reserve() and then writing into vector::data() is a dirty hack and undefined behavior. Please don't do that.

The way to solve this problem is to use a custom allocator, such as the one in this answer. I have just tested it, works fine with clang 3.5 trunk but doesn't compile with gcc 4.7.2.

Although, as others have already pointed out, unique_ptr<T[]> will serve your needs just fine.

If using boost is an option for you, since version 1.55 boost::container::vector has had support for explicitly default-initializing elements when resizing using the syntax:

using namespace boost::container;
vector<T> vector(37283, default_init);

at creation or

using namespace boost::container;
vector.resize(37283, default_init);

after creation. This results in the nice syntax:

using T = unsigned;  // but can be any trivially copyable type
FILE* fp = fopen( outfile.c_str(), "r" );
boost::container::vector<T> x(big_n, boost::container::default_init);
fread( x.data(), sizeof(T), big_n, fp );
fclose( fp );

In my tests performance is identical to using std::vector with a default-initializing allocator.

EDIT: Unrelated aside, I'd use an RAII wrapper for FILE*:

struct FILE_deleter {
  void operator () (FILE* f) const {
    if (f) fclose(f);
  }
};
using FILE_ptr = std::unique_ptr<FILE, FILE_deleter>;

using T = unsigned;  // but can be any trivially copyable type
FILE_ptr fp{fopen( outfile.c_str(), "r" )};
boost::container::vector<T> x(big_n, boost::container::default_init);
fread( x.data(), sizeof(T), big_n, fp.get() );

I'm a bit OCD about RAII.

EDIT 2: Another option, if you absolutely MUST produce a std::vector<T>, and not a boost::container::vector<T> or std::vector<T, default_allocator<T>>, is to fill your std::vector<T> from a custom iterator pair. Here's one way to make an fread iterator:

template <typename T>
class fread_iterator :
  public boost::iterator_facade<fread_iterator<T>, T,
                                std::input_iterator_tag, T> {
  friend boost::iterator_core_access;

  bool equal(const fread_iterator& other) const {
    return (file_ && feof(file_)) || n_ <= other.n_;
  }

  T dereference() const {
    // is_trivially_copyable is sufficient, but libstdc++
    // (for whatever reason) doesn't have that trait.
    static_assert(std::is_pod<T>::value,
                 "Jabberwocky is killing user.");
    T result;
    fread(&result, sizeof(result), 1, file_);
    return result;
  }

  void increment() { --n_; }

  FILE* file_;
  std::size_t n_;

public:
  fread_iterator() : file_(nullptr), n_(0) {}
  fread_iterator(FILE* file, std::size_t n) : file_(file), n_(n) {}
};

(I've used boost::iterator_facade to reduce the iterator boilerplate.) The idea here is that the compiler can elide the move construction of dereference's return value so that fread will read directly into the vector's memory buffer. It will likely be less efficient due to calling fread once per item vs. just once for the allocator modification methods, but nothing too terrible since (a) the file data is still only copied once from the stdio buffer into the vector, and (b) the whole point of buffering IO is so that granularity has less impact. You would fill the vector using its assign(iterator, iterator) member:

using T = unsigned;  // but can be any trivially copyable type
FILE_ptr fp{fopen( outfile.c_str(), "r" )};
std::vector<T> x;
x.reserve(big_n);
x.assign(fread_iterator<T>{fp.get(), big_n}, fread_iterator<T>{});

Throwing it all together and testing side-by-side, this iterator method is about 10% slower than using the custom allocator method or boost::container::vector. The allocator and boost method have virtually identical performance.

Since you are upgrading to c++11, why not use file streams as well ? I just tried to read a 17 MB to a char* using ifstream & then write the contents to a file using ofstream.

I ran the same application in a loop 15 times and the maximum time it took is 320 ms and minimum is 120 ms.

std::unique_ptr<char []> ReadToEnd(const char* filename)
{
    std::ifstream inpfile(filename, std::ios::in | std::ios::binary | std::ios::ate);
    std::unique_ptr<char[]> ret;
    if (inpfile.is_open())
    {
        auto sz = static_cast<size_t>(inpfile.tellg());
        inpfile.seekg(std::ios::beg);
        ret.reset(new char[sz + 1]);
        ret[sz] = '\0';
        inpfile.read(ret.get(), sz);
    }

    return ret;
}


int main(int argc, char* argv [])
{

    auto data = ReadToEnd(argv[1]);
    std::cout << "Num of characters in file:" << strlen(data.get()) << "\n";

    std::ofstream outfile("output.txt");
    outfile.write(data.get(), strlen(data.get()));

}

Output

D:\code\cpp\ConsoleApplication1\Release>ConsoleApplication1.exe d:\code\cpp\SampleApp\Release\output.txt
Num of characters in file:18805057
Time taken to read the file, d:\code\cpp\SampleApp\Release\output.txt:152.008 ms.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top