Question

I'm attempting to convert some CSV files into AVRO files.

The code I've written runs fine on many of the CSV files I've tested, but for some files data is missing from the resulting Avro file.

Here is an outline of the CSV-to-Avro conversion code. I'm using version 1.7.5 of the Avro C library.

// initialize line counter
lineno = 0;

// make a schema first
avro_schema_from_json_length (...);

// make a generic class from schema
iface = avro_generic_class_from_schema( schema );

// get the record size and verify that it is 109 
avro_schema_record_size (schema);

// get a generic value
avro_generic_value_new (iface, &tuple);

// make me an output file
fp = fopen ( outputfile, "wb" );

// make me a filewriter
avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);

// now for the code to emit the data

while (...)
{
    avro_value_reset (&tuple);

    // get the CSV record into the tuple
    ...

    // write that tuple
    avro_file_writer_append_value (db, &tuple);

    lineno ++;

    // flush the file
    avro_file_writer_flush (db);
}

// close the output file
avro_file_writer_close (db);

// other cleanup
avro_value_iface_decref (iface);
avro_value_decref (&tuple);

// flush and close the underlying FILE (should_close was 0 above)
fflush (fp);
fclose (fp);
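
The outline above discards the return value of every Avro call, so a failed append, flush, or close would go unnoticed. Below is a minimal sketch of a checking helper. step_failed is a hypothetical name and is not part of the Avro C API. It assumes the usual convention that these calls return 0 on success and that avro_strerror() describes the most recent failure.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Avro C): report a failed step.
 * Returns 0 when rc is 0 (success), 1 otherwise. */
static int step_failed(int rc, const char *step, long row, const char *detail)
{
    if (rc == 0)
        return 0;
    fprintf(stderr, "%s failed at row %ld (rc=%d): %s\n",
            step, row, rc, detail ? detail : "(no detail)");
    return 1;
}

/* Intended use inside the write loop of the outline above:
 *
 *   int rc = avro_file_writer_append_value(db, &tuple);
 *   if (step_failed(rc, "append", lineno, avro_strerror()))
 *       break;
 */
```

The same pattern would apply to avro_file_writer_flush and avro_file_writer_close. The call is stored in rc first so that avro_strerror() is only read after the call has finished.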

When I run this program on a CSV file with 448621 rows of data and one header row, it correctly reports that it processed 448621 rows of data.

The reader is a modified version of avrocat.c. Here is the relevant code:

wschema = avro_file_reader_get_writer_schema(reader);
iface = avro_generic_class_from_schema(wschema);
avro_generic_value_new(iface, &value);

int rval;
lineno = 0;

while ((rval = avro_file_reader_read_value(reader, &value)) == 0) {
    lineno ++;
    avro_value_reset(&value);
}

// If it was not an EOF that caused it to fail,
// print the error.
if (rval != EOF)
{
    fprintf(stderr, "Error: %s\n", avro_strerror());
}
else
{
    printf ( "%s %lld\n", filename, lineno );
}

When I run this against the Avro file I just created, it reports only 448609 rows of data, 12 fewer than were written.

Not sure what happened to the rest ...

What am I missing or doing wrong? What additional information would help someone debug this?

I've tried a number of things.

Adding the flush call inside the write loop was one of them. I've also dumped the Avro file (using avrocat) to find out what is missing, and it tends to be rows at the end.
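
To confirm the expected row count independently of the converter, one option is to count the CSV data rows directly and compare the result with what the reader prints. Below is a minimal sketch. count_data_rows is a hypothetical helper, and it assumes a single header line and no quoted fields containing embedded newlines.

```c
#include <stdio.h>

/* Count data rows in a CSV stream: every line after the first (header) line.
 * A final line without a trailing newline still counts.
 * Assumption: no quoted field contains an embedded newline. */
static long count_data_rows(FILE *in)
{
    long lines = 0;
    int c, last = '\n';

    while ((c = getc(in)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    if (last != '\n')   /* unterminated final line */
        lines++;
    return lines > 0 ? lines - 1 : 0;   /* drop the header row */
}
```

If this count matches what the writer reports but not what the reader reports, the rows are being lost between the append calls and the file on disk rather than during CSV parsing.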


Solution

This appears to be a bug in version 1.7.5 of the Avro C library, which was fixed in 1.7.6.

The bug in question is

https://issues.apache.org/jira/browse/AVRO-1364

Solution: upgrade to 1.7.6. I verified that the problem does not occur with that version.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow