Question

Environment: Hadoop 0.20.2-cdh3u5

I am trying to upload log data (10 GB) to HDFS with a customized tool that uses SequenceFile.Writer.

SequenceFile.Writer w = SequenceFile.createWriter(
                hdfs,                          // FileSystem
                conf,                          // Configuration
                p,                             // Path of the output file
                LongWritable.class,            // key class
                Text.class,                    // value class
                4096,                          // buffer size
                hdfs.getDefaultReplication(),  // replication factor
                hdfs.getDefaultBlockSize(),    // block size
                compressionType,               // CompressionType
                codec,                         // CompressionCodec
                null,                          // Progressable (no progress reporting)
                new Metadata());               // file metadata

During the upload process, if the tool crashes (without invoking close() explicitly), will the log data that has already been uploaded be lost?

Should I invoke sync() or syncFs() periodically, and what do these two methods mean?


Solution

Yes, probably: anything still sitting in client-side buffers when the crash happens will be lost.

sync() creates a sync point. As stated in the book "Hadoop: The Definitive Guide" by Tom White (Cloudera):

a sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is "lost" - for example, after seeking to an arbitrary position in the stream.

Now, the implementation of syncFs() is pretty simple:

    public void syncFs() throws IOException {
        if (out != null) {
            out.sync();    // flush contents to file system
        }
    }

where out is an FSDataOutputStream. Again, the same book states:

HDFS provides a method for forcing all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. After a successful return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all readers. In the event of a crash (of the client or HDFS), the data will not be lost.

But a footnote warns you to look at bug HDFS-200, since the visibility mentioned above was not always honored.

Licensed under: CC-BY-SA with attribution