I did exactly the same thing, except that my fields were separated by tabs ('\t'), and I also had to handle non-numeric comments at the end of each line and header rows interspersed with the data.
Here is the documentation for my utility.
I also had a speed problem. Here are the things I did to improve performance by around 20x (rough sketches of each follow the list):
- Replace explicit file reads with memory-mapped files. Map two blocks at once; when you are in the second block after processing a line, remap with the second and third blocks. This way a line that straddles a block boundary is still contiguous in memory. (Assumes that no line is larger than a block; you can probably increase the block size to guarantee this.)
- Use SIMD instructions such as `_mm_cmpeq_epi8` to search for line endings or other separator characters. In my case, any line containing an `=` character was a metadata row that needed different processing.
- Use a barebones number-parsing function. I used a custom one for parsing times in HH:MM:SS format; `strtod` and `strtol` are perfect for grabbing ordinary numbers. These are much faster than the `istream` formatted extraction functions.
- Use the OS file write API instead of the standard C++ API.
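
For the double-block mapping, a rough sketch of the idea on Windows looks like this (not my actual code; the `MappedWindow` type, the block size, and the lack of error handling and cleanup are purely illustrative):

```cpp
#include <windows.h>
#include <cstdint>

// Block size must be a multiple of the 64 KiB allocation granularity,
// since MapViewOfFile offsets have to be aligned to it.
constexpr uint64_t kBlock = 1 << 20;   // 1 MiB, illustrative

struct MappedWindow {
    HANDLE file = INVALID_HANDLE_VALUE;
    HANDLE mapping = nullptr;
    const char* view = nullptr;        // start of the two mapped blocks
    uint64_t base = 0;                 // file offset of the first mapped block
    uint64_t fileSize = 0;

    bool open(const char* path) {
        file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return false;
        LARGE_INTEGER sz;
        GetFileSizeEx(file, &sz);
        fileSize = (uint64_t)sz.QuadPart;
        mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        return mapping != nullptr && remap(0);
    }

    // Map [base, base + 2*kBlock): a line straddling the boundary between
    // the two blocks is still contiguous in memory.
    bool remap(uint64_t newBase) {
        if (view) UnmapViewOfFile(view);
        base = newBase;
        uint64_t len = 2 * kBlock;
        if (base + len > fileSize) len = fileSize - base;
        view = (const char*)MapViewOfFile(mapping, FILE_MAP_READ,
                                          (DWORD)(base >> 32), (DWORD)base,
                                          (SIZE_T)len);
        return view != nullptr;
    }

    // Call once the parser's cursor has moved past the first block:
    // slide the window forward by one block.
    bool advance() { return remap(base + kBlock); }
    // (Handle cleanup omitted for brevity.)
};
```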
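The SIMD separator search scans 16 bytes at a time; only `_mm_cmpeq_epi8` is the part I'm recommending above, the rest of this sketch (function name, tail loop) is just one way to wire it up:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics

// Returns a pointer to the first occurrence of `target` in [p, end), or `end`.
const char* find_byte(const char* p, const char* end, char target) {
    const __m128i needle = _mm_set1_epi8(target);
    for (; p + 16 <= end; p += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) {
            // Lowest set bit = offset of the first match within the 16 bytes.
            int idx = 0;
            while (!(mask & (1 << idx))) ++idx;
            return p + idx;
        }
    }
    // Scalar tail for the last few bytes.
    for (; p < end; ++p)
        if (*p == target) return p;
    return end;
}
```

The same routine works for '\n', '\t', or '=' by passing a different `target`.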
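A barebones time parser can be as simple as the sketch below (illustrative only, not my exact code; it assumes well-formed HH:MM:SS input with no validation):

```cpp
#include <cstdlib>

inline int parse_two_digits(const char*& p) {
    int v = (p[0] - '0') * 10 + (p[1] - '0');
    p += 2;
    return v;
}

// Parses "HH:MM:SS" at p, advances p past the field, returns seconds since midnight.
inline int parse_hms(const char*& p) {
    int h = parse_two_digits(p); ++p;   // skip ':'
    int m = parse_two_digits(p); ++p;   // skip ':'
    int s = parse_two_digits(p);
    return h * 3600 + m * 60 + s;
}

// Ordinary numeric fields can go straight through strtod/strtol, e.g.:
//   char* next;
//   double value = std::strtod(p, &next);
//   p = next;
```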
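And writing output through the Win32 API instead of `ofstream` is roughly this (simplified sketch, buffering strategy and error handling omitted):

```cpp
#include <windows.h>
#include <string>

// Writes the whole buffer to `path` with WriteFile instead of C++ streams.
bool write_all(const char* path, const std::string& data) {
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, nullptr,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;
    DWORD written = 0;
    BOOL ok = WriteFile(h, data.data(), (DWORD)data.size(), &written, nullptr);
    CloseHandle(h);
    return ok && written == data.size();
}
```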
If you dream of throughput in the 300,000 lines/second range, then you should consider a similar approach.
Your executable also shrinks when you don't use the C++ standard streams. Mine is 205 KB, including a graphical interface, and depends only on DLLs that ship with Windows (no MSVCRTxx.dll needed). Looking again, though, I'm still using C++ streams for status reporting.