Here's what I need to do. Take an example text file like this:
test.txt
The quick brown fox jumped over the lazy dog
I need to split that file into an arbitrary number of sections by byte offset. The file above is 45 bytes (44 characters plus the trailing newline).
So if I split it into 4 parts I'd get something like:
Current
Part1: The quick b (11 bytes)
Part2: rown fox ju (11 bytes)
Part3: mped over t (11 bytes)
Part4: he lazy dog (12 bytes)
(Roughly something like that)
But I want each part to contain only complete words, so it'd look something like this:
Desired
Part1: The quick brown (15 bytes)
Part2: fox jumped (10 bytes)
Part3: over the (8 bytes)
Part4: lazy dog (9 bytes)
Or something roughly like that, as long as each division contains only complete words. If there are 3 words and 6 sections to split into, the first 3 sections should each get one word and the remaining ones should be empty, like this:
file: The quick brown
(Split into 6 parts)
Part1: The
Part2: quick
Part3: brown
Part4-6: ""
Here's what I have, which gives me the "Current" output:
    // Get file size in bytes
    off_t fileSize = statBuf.st_size;

    // Split a section of file to read for each thread
    off_t startSection[NUM_SECTIONS];
    off_t endSection[NUM_SECTIONS];
    for (int i = 0; i < NUM_SECTIONS; i++) {
        if (i == 0) {
            // Start at 0, end at our interval chunk
            startSection[i] = 0;
            endSection[i] = fileSize / NUM_SECTIONS;
        } else {
            // Start at the last section's end
            startSection[i] = endSection[i-1];
            // End after the next chunk
            endSection[i] = (fileSize / NUM_SECTIONS) * (i + 1);
        }
        // At the last section, add any remaining bytes
        if (i == NUM_SECTIONS - 1) {
            endSection[i] += fileSize % NUM_SECTIONS;
        }
    }
I think I'd have to peek into the file contents and look for whitespace or punctuation characters (I want to treat both as word separators). But I couldn't work out how to do that while keeping the sections roughly equal for an arbitrary number of parts (3, 4, 5, 6, etc.).
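To make the idea concrete, here is a minimal sketch of one way the boundary adjustment could work. It assumes the file's bytes are already in memory (for example, read or mapped into a buffer), and the names `is_delim` and `split_on_words` are hypothetical helpers made up for illustration. Each section starts from the same nominal boundary as the code above, `(len / n) * (i + 1)`, and that boundary is pushed forward to the end of the word it lands in. Separators between sections are dropped, so the ranges cover words only.

```c
#include <assert.h>
#include <ctype.h>
#include <sys/types.h>

/* Whitespace and punctuation both count as word separators. */
static int is_delim(char c)
{
    return isspace((unsigned char)c) || ispunct((unsigned char)c);
}

/*
 * Hypothetical helper: split buf[0..len) into n half-open byte ranges
 * [start[i], end[i]) that never cut a word in half. Sections that have
 * no words left come out empty (start[i] == end[i]).
 */
void split_on_words(const char *buf, off_t len, int n,
                    off_t *start, off_t *end)
{
    assert(n > 0 && len >= 0);
    off_t chunk = len / n;   /* nominal section size, as in the original code */
    off_t pos = 0;           /* first byte not yet assigned to a section */

    for (int i = 0; i < n; i++) {
        /* Skip separators left over after the previous section. */
        while (pos < len && is_delim(buf[pos]))
            pos++;
        start[i] = pos;

        /* Nominal boundary; the last section takes whatever remains. */
        off_t stop = (i == n - 1) ? len : chunk * (i + 1);
        if (stop < pos)
            stop = pos;

        /* If the boundary sits at an unconsumed word or inside a word,
         * extend it to the end of that word. */
        if (stop == pos || !is_delim(buf[stop - 1])) {
            while (stop < len && !is_delim(buf[stop]))
                stop++;
        }
        pos = stop;

        /* Trim trailing separators so the range holds words only. */
        while (stop > start[i] && is_delim(buf[stop - 1]))
            stop--;
        end[i] = stop;
    }
}
```

For the 45-byte example with 4 sections, this sketch yields the ranges [0,15), [16,26), [27,35) and [36,44), which correspond to "The quick brown", "fox jumped", "over the" and "lazy dog". Because a section whose nominal boundary has already been passed still takes one word if any remain, the 3-words-into-6-sections case gives one word to each of the first three sections and leaves the rest empty. Whether the trailing newline should count toward the last section is a design choice; here it is trimmed.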
Any help is appreciated. This is on Linux.