I have mmapped a huge file into char string and made a c++ string
Don't. std::string has to copy the memory, so you lose the performance improvement mmap would otherwise give you. Just work on the raw memory as a char array.
I could do it from one thread but I need to optimize it
Are you sure multiple threads will optimize it? Did you profile and confirm it's definitely CPU-bound and not I/O bound?
If you're sure multiple threads is the way to go, I'd suggest doing this:
- create N threads (this should be based on the number of cores and then tweaked according to test results)
- carve your mmap'd region up into N blocks of approximately equal size
- move each block boundary to the nearest newline, so no line is split between two threads (you can just search back and forth from the nominal boundary)
- have each thread n create its own independent output
- combine all the outputs afterwards
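The steps above could be sketched roughly as follows. This is only an illustration under some assumptions of mine: it uses C++17 std::thread and std::string_view rather than the pthreads in your code, it searches forward only for the next newline (not back and forth), and count_lines is a made-up stand-in for whatever parsing you actually do. The names split_at_newlines and count_lines_parallel are hypothetical too.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <thread>
#include <vector>

// Hypothetical per-thread work: count the newlines in one block.
static std::size_t count_lines(std::string_view block) {
    std::size_t n = 0;
    for (char c : block) n += (c == '\n');
    return n;
}

// Carve [data, data + size) into at most n_blocks views of roughly equal
// size. Each nominal boundary is pushed forward to just past the next
// newline, so no line is split between two blocks. Nothing is copied.
static std::vector<std::string_view> split_at_newlines(const char* data,
                                                       std::size_t size,
                                                       std::size_t n_blocks) {
    std::vector<std::string_view> blocks;
    if (n_blocks == 0) n_blocks = 1;
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= n_blocks && begin < size; ++i) {
        std::size_t end = (i == n_blocks) ? size : (size / n_blocks) * i;
        if (end < begin) end = begin;  // previous block already ran past this point
        if (end < size) {
            const void* nl = std::memchr(data + end, '\n', size - end);
            end = nl ? static_cast<std::size_t>(static_cast<const char*>(nl) - data) + 1
                     : size;
        }
        if (end > begin) blocks.emplace_back(data + begin, end - begin);
        begin = end;
    }
    return blocks;
}

// One thread per block, one independent output slot per thread,
// and a single combine step after all threads have been joined.
static std::size_t count_lines_parallel(const char* data, std::size_t size,
                                        std::size_t n_threads) {
    const auto blocks = split_at_newlines(data, size, n_threads);
    std::vector<std::size_t> results(blocks.size(), 0);
    std::vector<std::thread> threads;
    threads.reserve(blocks.size());
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        // i is captured by value, so each thread gets its own copy of the index.
        threads.emplace_back([&results, &blocks, i] {
            results[i] = count_lines(blocks[i]);
        });
    }
    for (auto& t : threads) t.join();
    std::size_t total = 0;
    for (std::size_t r : results) total += r;
    return total;
}
```

Here data would be the pointer you got back from mmap and size the mapped length. Because each thread writes only to its own results[i], no locking is needed until the final combine.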
As for the bug in the code I'm trying to persuade you not to use: you pass (void*)&i as your argument to the thread function. This is a pointer to an automatic local that goes out of scope at the end of create_threads_for_parsing, so it's likely to be random garbage by the time any thread reads it.
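Even if it weren't random garbage (i.e., if create_threads_for_parsing joined all the threads before returning, to keep i in scope), it would be the same pointer for each thread, so every thread would read whatever value i happens to hold at that moment rather than its own id.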
To safely pass a distinct integer id to each thread, you should allocate a distinct integer for each thread and pass its address. It's either that or mess around with intptr_t.
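A minimal sketch of the "distinct integer per thread" approach, assuming pthreads as in your code. WorkerArg, worker and run_workers are names I made up for illustration, and the "work" (multiplying the id by 10) is just a placeholder so you can see that each thread received its own id.

```cpp
#include <pthread.h>

#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread argument: a distinct id plus a slot for the output.
struct WorkerArg {
    int id;
    int result;
};

// Thread function: reads its own id through the pointer it was given.
static void* worker(void* p) {
    WorkerArg* arg = static_cast<WorkerArg*>(p);
    arg->result = arg->id * 10;  // placeholder for the real work
    return nullptr;
}

static std::vector<int> run_workers(int n) {
    // Sized up front and never resized, so &args[i] stays valid for the
    // whole lifetime of the threads.
    std::vector<WorkerArg> args(static_cast<std::size_t>(n));
    std::vector<pthread_t> threads(static_cast<std::size_t>(n));
    int started = 0;
    for (int i = 0; i < n; ++i) {
        args[i].id = i;
        args[i].result = 0;
        // Pass the address of this thread's own element, not the address of i.
        if (pthread_create(&threads[i], nullptr, worker, &args[i]) != 0) break;
        ++started;
    }
    // Join before args goes out of scope, so no thread sees a dangling pointer.
    for (int i = 0; i < started; ++i) pthread_join(threads[i], nullptr);

    std::vector<int> out;
    for (int i = 0; i < started; ++i) out.push_back(args[i].result);
    return out;
}
```

The two things that matter are that every thread gets a different address, and that the storage behind those addresses outlives the threads (here, by joining before the vectors are destroyed).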