Split file into equal byte sections, separated by complete word (C/C++)

https://stackoverflow.com/questions/23431299

14-07-2023
Question

Here's what I need to do. Take an example text file (like so)

test.txt
The quick brown fox jumped over the lazy dog

I need to split that file into an arbitrary number of parts by bytes. The file above is 45 bytes (including the trailing EOL character).

So if I split it into 4 parts I'd get something like:

Current

Part1: The quick b (11 bytes)

Part2: rown fox ju (11 bytes)

Part3: mped over t (11 bytes)

Part4: he lazy dog (12 bytes)

(Roughly something like that)

But I want each part to contain only complete words, so it'd look something like this:

Desired

Part1: The quick brown (15 bytes)

Part2: fox jumped (10 bytes)

Part3: over the (8 bytes)

Part4: lazy dog (9 bytes)

Or something roughly like it, just so that each division contains complete words. If there are 3 words and 6 sections to split into, the first 3 should each have a word and the remaining ones should just be empty, like this:

file: The quick brown

(Split into 6 parts)

Part1: The

Part2: quick

Part3: brown

Part4-6: ""

Here's what I have, which gives me the "Current" output:

// Get file size in bytes
off_t fileSize = statBuf.st_size;

// Split a section of file to read for each thread
off_t startSection[NUM_SECTIONS];
off_t endSection[NUM_SECTIONS];
for (int i = 0; i < NUM_SECTIONS; i++) {
    if (i == 0) {
        // Start at 0, end at our interval chunk
        startSection[i] = 0;
        endSection[i] = fileSize / NUM_SECTIONS;
    } else {
        // Start at the last section's end
        startSection[i] = endSection[i-1];
        // End after the next chunk
        endSection[i] = (fileSize / NUM_SECTIONS) * (i + 1);
    }

    // At the last section, add any remaining bytes
    if (i == NUM_SECTIONS - 1) {
        endSection[i] += fileSize % NUM_SECTIONS;
    }
}

I think I'd have to peek into the file contents and identify whitespace/punctuation characters (I want to treat punctuation and whitespace characters the same way). But I couldn't get this to work while keeping the portions roughly equal for an arbitrary number of parts (could be 3, 4, 5, 6, etc.).

Any help is appreciated. This is on Linux, too.
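
One possible reading of the "peek into the file contents" idea, as a minimal sketch rather than a definitive implementation: with the file already in memory, compute each nominal cut at len * (i + 1) / n and push it forward to the end of the word it lands in. The names is_delim and split_on_words are hypothetical, and the sketch assumes the whole file fits in a buffer (for example after read() or mmap()).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Whitespace and punctuation are both treated as word separators. */
static int is_delim(unsigned char c) {
    return isspace(c) || ispunct(c);
}

/*
 * Hypothetical helper: fill start[i] and end[i] with half-open byte ranges
 * [start, end) for n sections of buf, so that no word is cut in two.
 * Sections that have no word left end up empty (start == end).
 */
void split_on_words(const char *buf, size_t len, size_t n,
                    size_t *start, size_t *end) {
    assert(buf != NULL || len == 0);
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        /* Skip separators so a section never begins with one. */
        while (pos < len && is_delim((unsigned char)buf[pos]))
            pos++;
        start[i] = pos;

        /* Nominal cut for an even split; the last section takes the rest. */
        size_t cut = (i + 1 == n) ? len : len * (i + 1) / n;
        if (cut < pos)
            cut = pos; /* earlier sections already ran past this cut */

        /* Push the cut forward to the end of the word it falls in. */
        while (cut < len && !is_delim((unsigned char)buf[cut]))
            cut++;
        end[i] = cut;
        pos = cut;
    }
}
```

For the 45-byte example and 4 sections this gives the ranges [0,15), [16,26), [27,35) and [36,45), i.e. 15, 10, 8 and 9 bytes. For "The quick brown" and 6 sections, the first three ranges hold one word each and the last three are empty.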

Solution

If you know the size of the file beforehand, this approach would be a good place to start, I think (C-ish pseudo-code only):

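filesize = ???;          // total size of the input in bytes, known up front
nchunks = ???;           // number of output files wanted
fileno = 1;              // output file currently being written (1-based)
bytes_processed = 0;
while (bytes_processed < filesize) {
    copy_one_byte();
    // Multiply before dividing so the last boundary lands on filesize
    if (++bytes_processed >= (filesize * fileno / nchunks)) {
        // Keep copying to the end of the current word or the end of the
        // file, whichever comes first (counting those bytes too),
        // then switch to the next file
        ++fileno;
    }
}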
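
The pseudo-code above could be turned into real C roughly as follows. This is a sketch under stated assumptions, not the answerer's exact code: struct splitter, splitter_next and split_stream are hypothetical names, the file size is assumed to be known in advance (e.g. from st_size), and the separator that ends a word is kept at the end of its chunk so that concatenating the chunks reproduces the input.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical state for a byte-at-a-time splitter. */
struct splitter {
    long filesize; /* total bytes expected in the input */
    int  nchunks;  /* number of output chunks */
    int  chunk;    /* index of the chunk currently being written */
    long done;     /* bytes consumed so far */
};

/*
 * Return the chunk index that byte c belongs to, then advance the state.
 * The switch to the next chunk happens only on a separator, and only once
 * the nominal quota filesize * (chunk + 1) / nchunks has been reached.
 */
int splitter_next(struct splitter *s, unsigned char c) {
    assert(s->nchunks > 0);
    int target = s->chunk;
    s->done++;
    if (s->chunk < s->nchunks - 1 &&
        (isspace(c) || ispunct(c)) &&
        s->done >= s->filesize * (s->chunk + 1) / s->nchunks)
        s->chunk++;
    return target;
}

/* Copy in to out[0..nchunks-1]; returns 0 on success, -1 on I/O error. */
int split_stream(FILE *in, long filesize, int nchunks, FILE *out[]) {
    struct splitter s = { filesize, nchunks, 0, 0 };
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (fputc(c, out[splitter_next(&s, (unsigned char)c)]) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

With the 45-byte example and 4 chunks, the chunks receive 16, 11, 9 and 9 bytes ("The quick brown ", "fox jumped ", "over the ", "lazy dog" plus the newline). If there are fewer words than chunks, the trailing output files simply stay empty. For very large files the product filesize * (chunk + 1) may need a wider integer type.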

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow