Question

I've recently found out about protocol buffers and was wondering if they could be applied to my specific problem.

Basically, I have some CSV data that I need to convert to a more compact format for storage, as some of the files are several gigabytes.

Each field in the CSV has a header, and there are only two types: strings and decimals (because sometimes there are a lot of significant digits and I need to handle all numbers the same way). But each file will have different column names for each field.

As well as capturing the original CSV data I need to be able to add extra information to the file before saving. And I was hoping to make this future proof by handling different file versions.

So, is it possible to use protocol buffers to capture a random number of randomly named columns of data, like a CSV file?

Solution

Well, it's certainly representable. Something like:

message CsvFile {
    repeated CsvHeader header = 1;
    repeated CsvRow row = 2;
}

message CsvHeader {
    required string name = 1;
    required ColumnType type = 2;
}

enum ColumnType {
    DECIMAL = 1;
    STRING = 2;
}

message CsvRow {
    repeated CsvValue value = 1;
}

// Note that the column is implicit based on position within row    
message CsvValue {
    optional string string_value = 1;
    optional Decimal decimal_value = 2;
}

message Decimal {
    // However you want to represent it (there are various options here)
}

I'm not sure how much benefit it will provide, mind you. You can certainly add more information (by adding fields to the CsvFile message), and future-proofing works the "normal PB way": only add optional fields, never reuse or renumber existing field tags, and so on.
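The Decimal message above is deliberately left open. One common option, sketched here in Python purely as an illustration, is to store an unscaled integer plus a scale, so that value = unscaled * 10^-scale and no significant digits are lost. The function names `decimal_to_parts` and `parts_to_decimal` are hypothetical, and how the two parts map onto protobuf fields (for example an integer field for the scale, and a string or bytes field for the unscaled part if it may exceed 64 bits) is an assumption, not something the answer specifies.

```python
from decimal import Decimal


def decimal_to_parts(value: Decimal) -> tuple[int, int]:
    """Split a finite Decimal into (unscaled, scale).

    The result satisfies value == unscaled * 10 ** -scale, and trailing
    zeros are preserved because the exponent is kept as-is.
    """
    sign, digits, exponent = value.as_tuple()
    if not isinstance(exponent, int):
        # NaN and Infinity report a non-integer exponent marker.
        raise ValueError("only finite decimals are supported")
    unscaled = int("".join(str(d) for d in digits))
    if sign:
        unscaled = -unscaled
    return unscaled, -exponent


def parts_to_decimal(unscaled: int, scale: int) -> Decimal:
    """Rebuild the Decimal exactly, without rounding to the context precision."""
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(c) for c in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))


# Round trip that keeps the trailing zeros of the original text.
unscaled, scale = decimal_to_parts(Decimal("123.4500"))
restored = parts_to_decimal(unscaled, scale)
```

Note that this sketch does not preserve the sign of a negative zero; if that matters, the sign would need its own field.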

OTHER TIPS

Well, protobuf-net (my version) is based on regular .NET types, so no (since it won't cope with different schemas all the time). But Jon's version might allow dynamic types. Personally, I'd just use CSV and run it through GZipStream - I expect that will be fine for the purpose.
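As a rough illustration of the "just compress the CSV" idea, here is a minimal sketch that uses Python's standard `gzip` and `csv` modules as a stand-in for .NET's GZipStream. The function names `write_csv_gz` and `iter_csv_gz` are hypothetical. Values stay as text, so decimals with many significant digits round-trip unchanged, and rows are streamed so a multi-gigabyte file never has to fit in memory.

```python
import csv
import gzip


def write_csv_gz(path, header, rows):
    """Stream a header and rows into a gzip-compressed CSV file."""
    # newline="" lets the csv module control line endings itself.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def iter_csv_gz(path):
    """Yield rows one at a time from a gzip-compressed CSV file.

    The first row yielded is the header; every value is a string.
    """
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        yield from csv.reader(f)
```

Typical use would be to call `write_csv_gz("data.csv.gz", header, rows)` once and then loop over `iter_csv_gz("data.csv.gz")`, treating the first row as the column names. How much space this saves depends entirely on the data.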


Edit: actually, I forgot: protobuf-net does support extensible objects, but you need to be a bit careful. Whether that fits would depend on the full context, I expect.

Plus Jon's approach of nested data would probably work too.

Licensed under: CC-BY-SA with attribution