What do sync and syncFs of SequenceFile.Writer mean?
03-07-2021
Question
Environment: Hadoop 0.20.2-cdh3u5
I am trying to upload log data (10 GB) to HDFS with a custom tool that uses SequenceFile.Writer:
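SequenceFile.Writer w = SequenceFile.createWriter(
    hdfs,                          // FileSystem
    conf,                          // Configuration
    p,                             // Path of the output file
    LongWritable.class,            // key class
    Text.class,                    // value class
    4096,                          // buffer size
    hdfs.getDefaultReplication(),  // replication
    hdfs.getDefaultBlockSize(),    // block size
    compressionType,
    codec,
    null,                          // no Progressable
    new Metadata());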
If the tool crashes during the upload (without explicitly invoking the close() method), will the log data that has already been uploaded be lost?
Should I invoke sync() or syncFs() periodically, and what do the two methods mean?
Solution
Yes, probably.
sync() creates a sync point. As stated in the book "Hadoop: The Definitive Guide" by Tom White (Cloudera):
a sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is "lost" - for example after seeking to an arbitrary position in the stream.
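To make the idea of a sync point concrete, here is a toy sketch in plain Java. It is not the SequenceFile on-disk format: the 4-byte MARKER, the one-byte length prefix, and the class and method names are all made up for illustration (a real SequenceFile uses a longer, per-file marker recorded in its header). It only shows how a reader that lands in the middle of a record can scan forward to the next marker and resume at a record boundary.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SyncPointDemo {
    // Hypothetical marker, chosen so it cannot appear in the ASCII payloads used here.
    static final byte[] MARKER = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};

    /** Writes each record as [length byte][payload], with MARKER before every interval-th record. */
    static byte[] write(String[] records, int interval) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.length; i++) {
            if (i % interval == 0) {
                out.write(MARKER, 0, MARKER.length);   // the "sync point"
            }
            byte[] payload = records[i].getBytes(StandardCharsets.US_ASCII);
            out.write(payload.length);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    /** Returns the offset of the first record after the next marker at or after position, or -1. */
    static int sync(byte[] data, int position) {
        for (int i = Math.max(position, 0); i + MARKER.length <= data.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + MARKER.length), MARKER)) {
                return i + MARKER.length;
            }
        }
        return -1;
    }

    /** Reads the record whose length byte is at offset. */
    static String readRecord(byte[] data, int offset) {
        int len = data[offset] & 0xFF;
        return new String(data, offset + 1, len, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = write(new String[] {"a1", "b2", "c3", "d4"}, 2);
        int offset = sync(data, 5);   // position 5 falls inside record "a1"
        System.out.println(readRecord(data, offset));
    }
}
```

Starting from position 5, the reader skips the rest of the first two records and resumes at "c3", the first record after the next marker. Note that this is only about finding record boundaries when reading; it says nothing about whether the data has reached the datanodes.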
Now, the implementation of syncFs() is pretty simple:
public void syncFs() throws IOException {
    if (out != null) {
        out.sync(); // flush contents to file system
    }
}
where out is an FSDataOutputStream. Again, the same book states:
HDFS provides a method for forcing all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. After a successful call return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all readers. In the event of a crash (of the client or HDFS), the data will not be lost.
But a footnote warns the reader to look at bug HDFS-200, since the visibility mentioned above was not always honored.
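Putting this together for the upload tool, one option is to call syncFs() every N records, so that a crash loses at most the records written since the last successful call. Below is a minimal sketch of that policy in plain Java. LogSink and PeriodicSyncWriter are hypothetical names introduced here so the example compiles without Hadoop on the classpath; the comment shows how a sink might delegate to the SequenceFile.Writer from the question. Whether records still buffered inside the writer itself (for instance under block compression) are covered is not something this sketch addresses.

```java
import java.io.IOException;

/** Hypothetical minimal view of the writer: SequenceFile.Writer offers append(...) and syncFs(). */
interface LogSink {
    void append(long key, String line) throws IOException;
    void syncFs() throws IOException;
}

// Adapter sketch for the writer "w" in the question (assumes Hadoop on the classpath):
//   new LogSink() {
//       public void append(long k, String v) throws IOException {
//           w.append(new LongWritable(k), new Text(v));
//       }
//       public void syncFs() throws IOException { w.syncFs(); }
//   }

/** Calls syncFs() on the sink after every interval appended records. */
class PeriodicSyncWriter {
    private final LogSink sink;
    private final int interval;
    private int sinceLastSync = 0;

    PeriodicSyncWriter(LogSink sink, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.sink = sink;
        this.interval = interval;
    }

    void append(long key, String line) throws IOException {
        sink.append(key, line);
        if (++sinceLastSync >= interval) {
            sink.syncFs();        // force what has been written so far out to the file system
            sinceLastSync = 0;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in sink that only prints, so the call pattern is visible.
        LogSink console = new LogSink() {
            public void append(long key, String line) { System.out.println("append " + key); }
            public void syncFs() { System.out.println("syncFs"); }
        };
        PeriodicSyncWriter writer = new PeriodicSyncWriter(console, 2);
        for (long i = 0; i < 5; i++) {
            writer.append(i, "log line " + i);
        }
    }
}
```

The interval is a trade-off: a smaller value narrows the window of data at risk but adds a round trip to the datanodes more often. The tool should still call close() on the normal path.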