Question

I am developing a distributed system in which a server distributes a huge task to clients, who process it and return the results.
The server has to accept huge files, on the order of 20 GB in size.

The server has to split this file into smaller pieces and send the paths to the clients, who in turn scp their pieces and process them.

I am using read and write to split the file, and it is ridiculously slow.

Code

// fildes     - source file handle
// filePath   - prefix used to build the per-client chunk file name
// client_id  - appended to filePath to form the chunk file name
// offset     - the point from which the split is to be made
// buffersize - how much to split

// This function is called in a for loop.

#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

void chunkFile(int fildes, char* filePath, int client_id, unsigned long long* offset, int buffersize)
{
    unsigned char* buffer = (unsigned char*) malloc( buffersize * sizeof(unsigned char) );
    char* clientFileName = (char*) malloc( 1024 );
    /* prepare client file name */
    sprintf( clientFileName, "%s%d.txt", filePath, client_id );

    ssize_t readcount = 0;
    if( (readcount = pread64( fildes, buffer, buffersize, *offset ) ) < 0 )
    {
            /* error reading file */
            printf("error reading file\n");
    }
    else
    {
            *offset = *offset + readcount;
            //printf("Read %zd bytes\nAnd offset becomes %llu\n", readcount, *offset);
            int clnfildes = open( clientFileName, O_CREAT | O_TRUNC | O_WRONLY , 0777);

            if( clnfildes < 0 )
            {
                    /* error opening client file */
            }
            else
            {
                    if( write( clnfildes, buffer, readcount ) != readcount )
                    {
                            /* error writing client file */
                    }
                    /* close the chunk file whether or not the write succeeded */
                    close( clnfildes );
            }
    }

    free( clientFileName );
    free( buffer );
    return;
}
  1. Is there any faster way to split the file?
  2. Is there any way for a client to access its chunk of the file without using scp (read without transferring)?

I am using C++, but I am willing to use other languages if they perform faster.


Solution

You can place the file within reach of a web server and then use curl from the clients:

curl --range 10000-20000 http://the.server.ip/file.dat > result

would fetch bytes 10000 through 20000 of the file (the range is inclusive, so that is 10001 bytes).
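If the clients themselves are C++ programs, the same range request can be issued with libcurl instead of shelling out to curl. Here is a minimal sketch, assuming libcurl is available; the URL, the byte range, and the fetchChunk helper name are just placeholders:

#include <cstdio>
#include <string>
#include <curl/curl.h>

// Write callback: append whatever libcurl hands us to the output file.
static size_t writeToFile(char* ptr, size_t size, size_t nmemb, void* userdata)
{
    return fwrite(ptr, size, nmemb, static_cast<FILE*>(userdata));
}

// Fetch bytes [first, last] of the remote file into localPath.
// Call curl_global_init(CURL_GLOBAL_DEFAULT) once at program start first.
bool fetchChunk(const std::string& url, unsigned long long first,
                unsigned long long last, const char* localPath)
{
    CURL* curl = curl_easy_init();
    if (curl == NULL)
        return false;

    FILE* out = fopen(localPath, "wb");
    if (out == NULL)
    {
        curl_easy_cleanup(curl);
        return false;
    }

    // Same inclusive byte-range syntax as curl --range, e.g. "10000-20000".
    std::string range = std::to_string(first) + "-" + std::to_string(last);

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_RANGE, range.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeToFile);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);

    CURLcode rc = curl_easy_perform(curl);

    fclose(out);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK;
}

fetchChunk("http://the.server.ip/file.dat", 10000, 20000, "result") would then pull the same range as the command above.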

If the file is highly redundant and the network is slow, compression could speed up the transfer considerably. For example, executing

nc -l -p 12345 | gunzip > chunk

on the client and then executing

dd skip=10000 count=10000 if=bigfile bs=1 | gzip | nc client.ip.address 12345

on the server, you can transfer a section with gzip compression applied on the fly, without creating any intermediate files. (Note that bs=1 makes dd issue a separate read and write system call for every byte, which is slow in its own right for large sections; if the offset and length are multiples of a larger block size, use that block size and scale skip and count down accordingly.)

EDIT

A single command to get a section of a file from the server, using compression over the network, is

ssh server 'dd skip=10000 count=10000 bs=1 if=bigfile | gzip' | gunzip > chunk

OTHER TIPS

Is rsync over SSH with --partial an option? Then you might not need to split the files at all, since you can simply resume if the transfer is interrupted.

Are the file split sizes known in advance, or are the files split along some marker in the file?

You can place the file on an NFS share, and the clients can mount that share read-only. A client can then open the file and use mmap() or pread() to read its slice (its piece of the file). That way, only the part of the file that a client actually needs is transferred to it.
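For example, here is a minimal sketch of what the client side of that could look like, assuming the share is mounted at /mnt/shared (a placeholder path) and the chunk offset and length have already been handed to the client. The offset given to mmap() must be page-aligned, so the code maps from the preceding page boundary:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    const char* path = "/mnt/shared/bigfile";   // file on the read-only NFS mount
    unsigned long long offset = 10000;          // this client's chunk offset
    size_t length = 10000;                      // this client's chunk size

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // mmap() needs a page-aligned file offset: align down and keep the delta.
    long pageSize = sysconf(_SC_PAGESIZE);
    unsigned long long alignedOffset = offset - (offset % pageSize);
    size_t delta = (size_t)(offset - alignedOffset);

    void* map = mmap(NULL, length + delta, PROT_READ, MAP_PRIVATE, fd, (off_t)alignedOffset);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    const unsigned char* chunk = static_cast<const unsigned char*>(map) + delta;
    /* ... process the 'length' bytes starting at 'chunk' ... */
    printf("first byte of chunk: %d\n", chunk[0]);

    munmap(map, length + delta);
    close(fd);
    return 0;
}

Since pages are only fetched from the server as they are actually read, each client ends up pulling roughly its own slice over the network instead of the whole 20 GB file.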

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow