To do bulk loads into Cassandra, I would advise looking at this article from DataStax. Basically, bulk loading requires two things:
- Your output data won't natively fit into Cassandra, so you first need to transform it into SSTables.
- Once you have your SSTables, you need to stream them into Cassandra. Of course, you don't simply want to copy every SSTable to every node; you only want to copy the relevant part of the data to each node.
In your case, when using the BulkOutputFormat, it should do all that for you, as it uses the sstableloader behind the scenes. I've never used it with MultipleOutputs, but it should work fine.
I think the error in your case is that you're not using MultipleOutputs correctly: you're still calling context.write, when you should really be writing to your MultipleOutputs object. The way you're doing it right now, since you're writing to the regular Context, your output gets picked up by the default output format, TextOutputFormat, and not the one you defined for your MultipleOutputs. More information on how to use MultipleOutputs in your reducer here.
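To make the fix concrete, here is a hedged sketch of a reducer that writes through the MultipleOutputs object instead of context.write. The named output "cassandraOutput", the Text input types, and the Mutation-building details are my assumptions for illustration, not from your question; adapt them to your actual job.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class BulkLoadReducer extends Reducer<Text, Text, ByteBuffer, List<Mutation>> {

    private MultipleOutputs<ByteBuffer, List<Mutation>> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        ByteBuffer rowKey = ByteBuffer.wrap(key.toString().getBytes("UTF-8"));

        for (Text value : values) {
            // Build a single-column mutation (column name "value" is a placeholder)
            Column column = new Column();
            column.setName(ByteBuffer.wrap("value".getBytes("UTF-8")));
            column.setValue(ByteBuffer.wrap(value.toString().getBytes("UTF-8")));
            column.setTimestamp(System.currentTimeMillis() * 1000);

            ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
            cosc.setColumn(column);
            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(cosc);

            // Write to the named output bound to BulkOutputFormat --
            // NOT context.write(...), which goes to the default TextOutputFormat
            out.write("cassandraOutput", rowKey, Collections.singletonList(mutation));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing flushes the underlying record writers so the SSTables get streamed
        out.close();
    }
}
```

Note the cleanup step: if you forget to close the MultipleOutputs object, the record writers may never be flushed and nothing gets streamed.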
Once you write to the correct output format of BulkOutputFormat as you defined it, your SSTables should get created and streamed to Cassandra from each node in your cluster. You shouldn't need any extra step; the output format will take care of it for you.
Also, I would advise looking at this post, where they also explain how to use BulkOutputFormat. They use a ConfigHelper, which you might want to take a look at to configure your Cassandra endpoint more easily.