Question

I have data files with about 1.5 GB of floating-point numbers stored as ASCII text separated by whitespace, e.g., 1.2334 2.3456 3.4567 and so on.

Before processing these numbers, I first translate the original file to binary format. This is helpful because I can choose whether to use float or double, reduce the file size (to about 800 MB for double and 400 MB for float), and read in chunks of the appropriate size once I am processing the data.
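For context, the downstream reading step then looks roughly like this (just a sketch, not my actual processing code; the function name and chunk size are arbitrary):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Sketch: read the binary file back in fixed-size chunks of the chosen RealType.
template<typename RealType = float>
void process_binary(const std::string& fbin){
 std::ifstream in(fbin, std::ifstream::binary);
 std::vector<RealType> chunk(1 << 20);                 // about 1M values per read
 while(in.read(reinterpret_cast<char*>(chunk.data()),
               static_cast<std::streamsize>(chunk.size() * sizeof(RealType)))
       || in.gcount() > 0){
  const std::size_t n = static_cast<std::size_t>(in.gcount()) / sizeof(RealType);
  // ... process chunk[0..n) here ...
 }
}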

I wrote the following function to make the ASCII-to-binary translation:

template<typename RealType=float>  
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){    
 RealType value;
 std::fstream src(fsrc.c_str(), std::fstream::in | std::fstream::binary);
 std::fstream dst(fdst.c_str(), std::fstream::out | std::fstream::binary);

 while(src >> value){
  dst.write((char*)&value, sizeof(RealType));
 }
 // RAII closes both files
}

I would like to speed up ascii_to_binary, but I can't seem to come up with anything. I tried reading the file in chunks of 8192 bytes and then processing the buffer in another subroutine. This gets complicated because the last few characters in the buffer may be whitespace (in which case all is good) or a truncated number (which is very bad); the logic to handle the possible truncation seems hardly worth it.
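For reference, the chunked version I attempted looks roughly like this (an untested sketch using C stdio and strtod; the carry-over of a possibly truncated trailing token is exactly the part that seems hardly worth it):

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Sketch: read 8192-byte chunks, parse all complete tokens, and carry any
// trailing partial number over to the next chunk.
template<typename RealType = float>
void ascii_to_binary_chunked(const std::string& fsrc, const std::string& fdst){
 std::FILE* src = std::fopen(fsrc.c_str(), "rb");
 std::FILE* dst = std::fopen(fdst.c_str(), "wb");

 std::vector<char> buf(8192);
 std::string carry;                         // partial token from the previous chunk

 std::size_t n;
 while((n = std::fread(buf.data(), 1, buf.size(), src)) > 0){
  std::string chunk = carry;
  chunk.append(buf.data(), n);
  carry.clear();

  // Split off a possibly truncated trailing token and keep it for the next pass.
  std::size_t last = chunk.find_last_of(" \t\r\n");
  if(last == std::string::npos){ carry = chunk; continue; }
  carry = chunk.substr(last + 1);
  chunk.resize(last + 1);

  // Parse every complete token in the chunk.
  const char* p = chunk.c_str();
  char* q = nullptr;
  while(true){
   RealType value = static_cast<RealType>(std::strtod(p, &q));
   if(q == p) break;                        // no more numbers in this chunk
   std::fwrite(&value, sizeof(RealType), 1, dst);
   p = q;
  }
 }
 if(!carry.empty()){                        // the file may end in the middle of a token
  RealType value = static_cast<RealType>(std::strtod(carry.c_str(), nullptr));
  std::fwrite(&value, sizeof(RealType), 1, dst);
 }
 std::fclose(src);
 std::fclose(dst);
}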

What would you do to speed up this function? I would rather rely on standard C++ (C++11 is OK) with no additional dependencies such as Boost.

Thank you.

Edit:

@DavidSchwartz:

I tried to implement your suggestion as follows:

template<typename RealType=float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){
 std::vector<RealType> buffer;
 typedef typename std::vector<RealType>::iterator VectorIterator;
 buffer.reserve(65536);

 std::fstream src(fsrc, std::fstream::in | std::fstream::binary);
 std::fstream dst(fdst, std::fstream::out | std::fstream::binary);

 while(true){
  size_t k = 0;
  while(k<65536 && src >> buffer[k]) k++;
  dst.write((char*)&buffer[0], buffer.size());
  if(k<65536){
   break;
  }
 }
}

But it does not seem to be writing the data! I'm working on it...


Solution

I did exactly the same thing, except that my fields were separated by tabs ('\t'), and I also had to handle non-numeric comments at the end of each line and header rows interspersed with the data.

Here is the documentation for my utility.

And I also had a speed problem. Here are the things I did to improve performance by around 20x:

  • Replace explicit file reads with memory-mapped files. Map two blocks at once. When you are in the second block after processing a line, remap with the second and third blocks. This way a line that straddles a block boundary is still contiguous in memory. (This assumes no line is larger than a block; you can probably increase the block size to guarantee this.)
  • Use SIMD instructions such as _mm_cmpeq_epi8 to search for line endings or other separator characters. In my case, any line containing an '=' character was a metadata row that needed different processing.
  • Use a barebones number parsing function (I used a custom one for parsing times in HH:MM:SS format; strtod and strtol are perfect for grabbing ordinary numbers). These are much faster than istream formatted extraction functions. There is a sketch of this after the list.
  • Use the OS file write API instead of the standard C++ API.
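As a rough illustration of the parsing point, and not the actual code from my utility, a strtod-based scan over a buffer that is already in memory might look like this. It assumes the buffer is NUL-terminated or padded with trailing whitespace so strtod cannot run past the end:

#include <cstdlib>
#include <vector>

// Hypothetical sketch: pull floating-point values out of an in-memory buffer
// with strtod instead of istream extraction.
template<typename RealType = float>
std::vector<RealType> parse_buffer(const char* data, const char* end){
 std::vector<RealType> out;
 const char* p = data;
 while(p < end){
  char* next = nullptr;
  double v = std::strtod(p, &next);   // strtod skips leading whitespace itself
  if(next == p) break;                // nothing more to parse
  out.push_back(static_cast<RealType>(v));
  p = next;
 }
 return out;
}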

If you dream of throughput in the 300,000 lines/second range, then you should consider a similar approach.

Your executable also shrinks when you don't use C++ standard streams. Mine is 205 KB, including a graphical interface, and depends only on DLLs that ship with Windows (no MSVCRTxx.dll needed). And looking again, I am still using C++ streams for status reporting.

Other tips

Aggregate the writes into a fixed buffer, using a std::vector of RealType. Your logic should work like this:

  1. Allocate a std::vector<RealType> with 65,536 default-constructed entries.

  2. Read up to 65,536 entries into the vector, replacing the existing entries.

  3. Write out as many entries as you were able to read in.

  4. If you read in exactly 65,536 entries, go to step 2.

  5. Otherwise, stop; you are done.

This will prevent you from alternating reads and writes to two different files, minimizing seek activity significantly. It will also allow you to make far fewer write calls, reducing copying and buffering logic.
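A minimal sketch of that loop, assuming the same stream setup as in your question, could look like this (note that the vector is sized rather than reserved, and the write length is in bytes):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

template<typename RealType = float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){
 std::ifstream src(fsrc);
 std::ofstream dst(fdst, std::ofstream::binary);

 std::vector<RealType> buffer(65536);                   // step 1: sized, so buffer[k] is valid
 while(true){
  std::size_t k = 0;
  while(k < buffer.size() && src >> buffer[k]) ++k;     // step 2: fill the buffer
  dst.write(reinterpret_cast<const char*>(buffer.data()),
            k * sizeof(RealType));                      // step 3: write what was actually read
  if(k < buffer.size()) break;                          // steps 4-5: short read means EOF
 }
}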

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow