Question

Is there an easy way to randomly shuffle fixed-size byte chunks?

I have a large binary file (say, hundreds of gigabytes) made up of many fixed-size records. I do not care about the quality of the randomness, but I want to shuffle the two-byte elements (or elements of any fixed size, up to 8 bytes) in the file. Is there a way to combine Unix core tools to achieve this? If there is no such tool, I might have to write it in C. I would like to hear what people recommend.

Solution

Here's a stupid shell trick to do so.

  1. First, break the file down into 2-byte chunks using xxd
  2. Shuffle it with shuf
  3. Reassemble the file using xxd.

For example:

xxd -p -c 2 input_file | shuf - | xxd -p -r - output_file

I haven't tested it on huge files. You may want to use an intermediary file.
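
The pipeline above can be wrapped in a small function that takes the chunk size as a parameter, together with a check that the output holds exactly the same chunks as the input. This is only a sketch: the names shuffle_chunks and same_chunks are made up for illustration, and it assumes xxd, shuf and cksum are installed. As far as I know, shuf reads its whole input into memory, so this form is probably only practical when the hex dump fits in RAM.

```shell
# shuffle_chunks SIZE INPUT OUTPUT   (hypothetical helper name)
# Shuffle INPUT in SIZE-byte chunks and write the result to OUTPUT.
shuffle_chunks() {
    # xxd -p -c SIZE prints one chunk per line as plain hex,
    # shuf permutes the lines, xxd -r -p turns the hex back into bytes.
    xxd -p -c "$1" "$2" | shuf | xxd -r -p > "$3"
}

# same_chunks SIZE A B   (hypothetical helper name)
# Succeeds if A and B contain the same multiset of SIZE-byte chunks.
same_chunks() {
    [ "$(xxd -p -c "$1" "$2" | sort | cksum)" = \
      "$(xxd -p -c "$1" "$3" | sort | cksum)" ]
}
```

Calling shuffle_chunks 2 input_file output_file and then same_chunks 2 input_file output_file gives a quick sanity check on a small sample before trying a large file. If the file size is not a multiple of the chunk size, the short final chunk is shuffled like any other, which shifts the alignment of everything after it.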

Alternatively, you could use sort -R like so:

xxd -c 2 in_file |sort -R | cut -d' ' -f 2 | xxd -r -p - out_file

This depends on xxd outputting offsets, which make every line unique. sort -R orders lines by a random hash of the key, so identical lines would otherwise end up next to each other.
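
The sort -R variant above only works as written for 2-byte chunks, because xxd groups its hex output two bytes at a time by default and cut keeps just the first group. A possible generalisation, as a sketch: pass -g with the chunk size so each chunk stays one space-free field. The function name shuffle_chunks_sort and the scratch-directory argument are made up for illustration; -T tells sort where to put its temporary files, which matters because sort spills to disk when the input does not fit in memory.

```shell
# shuffle_chunks_sort SIZE INPUT OUTPUT SCRATCH_DIR   (hypothetical helper name)
shuffle_chunks_sort() {
    # -c SIZE: one chunk per line; -g SIZE: keep the chunk's hex digits in a
    # single group so that cut can pick the whole chunk up as field 2.
    xxd -c "$1" -g "$1" "$2" |
        sort -R -T "$4" |      # random order; temporary files go to SCRATCH_DIR
        cut -d' ' -f 2 |       # drop the offset and the ASCII column
        xxd -r -p > "$3"
}
```

For example, shuffle_chunks_sort 8 in_file out_file /path/to/scratch would shuffle 8-byte records, assuming the scratch directory has room for the hex dump, which is several times larger than the input.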

OTHER TIPS

Given the size of the input files, this is a sufficiently complex problem that I wouldn't try to push the limits of shell scripting; it is best to code this in C or another language.

I'm not aware of a tool that can make this easy.

Try:

split -b $CHUNK_SIZE $FILE && find . -name "x*" | perl -MList::Util='shuffle' -e "print shuffle<>" | xargs cat > temp.bin

This creates a large number of files, each $CHUNK_SIZE bytes long (the last one is shorter if the total file size isn't a multiple of $CHUNK_SIZE), named xaa, xab, xac, etc. It then lists the files, shuffles the list, and concatenates them in the shuffled order.

This will take up an extra 2x the input's size in disk space (the pieces plus the joined result) and probably won't work with large files.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow