Question

I would like to upload a directory from an EMR local file system to s3 as a zipped file.

Is there a better way to approach this than the method I'm currently using?

Would it be possible to return a ZipOutputStream as a Reducer output?

Thanks

zipFolderAndUpload("target", "target.zip", "s3n://bucketpath/");


static public void zipFolderAndUpload(String srcFolder, String zipFile, String dst) throws Exception {

    //Zips a directory
    FileOutputStream fileWriter = new FileOutputStream(zipFile);
    ZipOutputStream zip = new ZipOutputStream(fileWriter);
    addFolderToZip("", srcFolder, zip);
    zip.flush();
    zip.close();

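    // Copies the zipped file to the S3 filesystem.
    InputStream in = new BufferedInputStream(new FileInputStream(zipFile));
    Configuration conf = new Configuration();
    // Build the destination from the file name (zipFile), not the ZipOutputStream object (zip).
    FileSystem fs = FileSystem.get(URI.create(dst + zipFile), conf);
    OutputStream out = fs.create(new Path(dst + zipFile));
    // The final 'true' closes both streams once the copy finishes.
    IOUtils.copyBytes(in, out, 4096, true);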

}

static private void addFileToZip(String path, String srcFile, ZipOutputStream zip) throws Exception {

    File folder = new File(srcFile);
    if (folder.isDirectory()) {
        addFolderToZip(path, srcFile, zip);
    } else {
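        byte[] buf = new byte[1024];
        int len;
        FileInputStream in = new FileInputStream(srcFile);
        try {
            zip.putNextEntry(new ZipEntry(path + "/" + folder.getName()));
            while ((len = in.read(buf)) > 0) {
                zip.write(buf, 0, len);
            }
            zip.closeEntry();
        } finally {
            // Close the input so file handles are not leaked on large directories.
            in.close();
        }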
    }
}

static private void addFolderToZip(String path, String srcFolder, ZipOutputStream zip) throws Exception {
    File folder = new File(srcFolder);

    for (String fileName : folder.list()) {
        if (path.equals("")) {
            addFileToZip(folder.getName(), srcFolder + "/" + fileName, zip);
        } else {
            addFileToZip(path + "/" + folder.getName(), srcFolder + "/" + fileName, zip);
        }
    }
}

Solution

The approach you are taking looks fine. If you find that it is too slow because it is single-threaded, then you can create your own Hadoop OutputFormat implementation that writes to zip files.
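
If the intermediate local zip file is not needed, one possible variation is to write the archive straight into whatever OutputStream you already have. The following is only a minimal sketch using the standard library, assuming Java 8 or later; the class name DirectoryZipper and the method zipDirectory are hypothetical names chosen for illustration and are not part of Hadoop or of the code in the question.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Hadoop or the original question.
public class DirectoryZipper {

    /**
     * Zips every regular file under srcDir into 'out'.
     * Entry names start with the directory's own name, which matches
     * the layout produced by addFolderToZip in the question.
     * Closing the zip stream also closes 'out'.
     */
    public static void zipDirectory(Path srcDir, OutputStream out) throws IOException {
        Path base = srcDir.toAbsolutePath().normalize();
        // Relativize against the parent so the top-level folder name is kept.
        Path root = base.getParent() != null ? base.getParent() : base;
        try (ZipOutputStream zip = new ZipOutputStream(out);
             Stream<Path> files = Files.walk(base)) {
            Iterator<Path> it = files.filter(Files::isRegularFile).iterator();
            while (it.hasNext()) {
                Path file = it.next();
                // ZIP entry names use forward slashes regardless of platform.
                String name = root.relativize(file).toString()
                        .replace(File.separatorChar, '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
    }
}
```

In the question's code, the stream returned by fs.create(...) could be passed as the second argument. Whether that actually saves time depends on how the underlying FileSystem implementation buffers writes, so treat it as something to measure rather than a guaranteed improvement.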

One thing to be careful of is that Java SE's ZipOutputStream did not support Zip64 before Java 7, which means that on older JVMs it cannot write ZIP files larger than 4GB. There are other Java implementations of ZIP that do, like TrueZIP.

Licensed under: CC-BY-SA with attribution