Question

I need to back up my DynamoDB table data to S3 using Amazon Data Pipeline.

My question is: can I use a single data pipeline to back up multiple DynamoDB tables to S3, or do I have to make a separate pipeline for each of them?

Also, since my tables have a year_month prefix (e.g. 2014_3_tableName), I was thinking of using the Data Pipeline SDK to change the table name in the pipeline definition once the month changes. Will this work? Is there an alternative or better way?

Thanks!

Solution

If you are setting up your Data Pipeline through the DynamoDB console's Import/Export button, you will have to create a separate pipeline per table. If you are using Data Pipeline directly (either through the Data Pipeline API or through the Data Pipeline console), you can export multiple tables in the same pipeline. For each table, simply add an additional DynamoDBDataNode and an EmrActivity linking that data node to the output S3DataNode.
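To illustrate the shape of such a definition, here is a minimal sketch using boto3. The pipeline ID, object IDs, and table names are placeholders, and the shared S3DataNode, EmrCluster, and Default objects (plus the actual export step) are omitted for brevity:

```python
import boto3

client = boto3.client('datapipeline')

def table_export_objects(table_name, suffix):
    """One DynamoDBDataNode plus one EmrActivity per table."""
    return [
        {
            'id': 'DDBSourceTable%s' % suffix,
            'name': 'DDBSourceTable%s' % suffix,
            'fields': [
                {'key': 'type', 'stringValue': 'DynamoDBDataNode'},
                {'key': 'tableName', 'stringValue': table_name},
            ],
        },
        {
            'id': 'TableBackupActivity%s' % suffix,
            'name': 'TableBackupActivity%s' % suffix,
            'fields': [
                {'key': 'type', 'stringValue': 'EmrActivity'},
                {'key': 'input', 'refValue': 'DDBSourceTable%s' % suffix},
                {'key': 'output', 'refValue': 'S3OutputLocation'},
                {'key': 'runsOn', 'refValue': 'EmrClusterForBackup'},
                # The 'step' field (the export command itself) is
                # omitted here; see the console-generated pipeline
                # for the exact value.
            ],
        },
    ]

pipeline_objects = []
for i, table in enumerate(['2014_3_tableA', '2014_3_tableB']):
    pipeline_objects += table_export_objects(table, i)
# ...append the shared S3OutputLocation, EmrClusterForBackup, and
# Default objects here...

client.put_pipeline_definition(
    pipelineId='df-XXXXXXXXXXXX',  # placeholder pipeline ID
    pipelineObjects=pipeline_objects,
)
```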

Regarding your year_month prefix use case, using the Data Pipeline SDK to change the table names periodically seems like the best approach. Another approach would be to copy the script that the export EmrActivity runs (the script location appears in the activity's "step" field) and modify the Hive script so that it determines the table name from the current date. You would need to host the modified script in your own S3 bucket and point the EmrActivity to that location instead of the default. I have not tried either approach, but both are theoretically possible.
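For the SDK approach, a small script run once a month (from cron, say) could recompute the prefix and rewrite the tableName field before reactivating the pipeline. A minimal sketch with boto3, assuming the pipeline ID is known and the field layout from the sketch above:

```python
import datetime
import boto3

client = boto3.client('datapipeline')
PIPELINE_ID = 'df-XXXXXXXXXXXX'  # placeholder pipeline ID

# Build the current prefix, e.g. "2014_3" for March 2014.
today = datetime.date.today()
table_name = '%d_%d_tableName' % (today.year, today.month)

# Fetch the existing definition, swap the tableName field, re-put it.
definition = client.get_pipeline_definition(pipelineId=PIPELINE_ID)
for obj in definition['pipelineObjects']:
    for field in obj.get('fields', []):
        if field.get('key') == 'tableName':
            field['stringValue'] = table_name

client.put_pipeline_definition(
    pipelineId=PIPELINE_ID,
    pipelineObjects=definition['pipelineObjects'],
)
client.activate_pipeline(pipelineId=PIPELINE_ID)
```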

More general information about exporting DynamoDB tables can be found in the DynamoDB Developer Guide, and more detailed information can be found in the AWS Data Pipeline Developer Guide.

OTHER TIPS

It's an old question, but I was looking for the answer recently. When adding multiple DynamoDBDataNodes, you can still use a single S3DataNode as the output. Just differentiate the folders in the S3 bucket by specifying a different output.directoryPath in each EmrActivity Step field.

Like this: #{output.directoryPath}/newFolder

Each new folder will be created automatically in the S3 bucket.
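As a concrete sketch, each table's EmrActivity Step can append its own subfolder to the shared output path. The jar location and class below follow the stock console-generated export step and should be treated as assumptions, not verified values:

```python
# Stock DynamoDB export step (jar path/version are assumptions); each
# activity appends its own subfolder to the shared output path.
EXPORT_STEP_TEMPLATE = (
    's3://dynamodb-emr-us-east-1/emr-ddb-storage-handler/2.1.0/'
    'emr-ddb-2.1.0.jar,'
    'org.apache.hadoop.dynamodb.tools.DynamoDbExport,'
    '#{output.directoryPath}/%(folder)s,'
    '#{input.tableName},'
    '#{input.readThroughputPercent}'
)

# One Step value per table; all activities share the same S3DataNode.
steps = {
    name: EXPORT_STEP_TEMPLATE % {'folder': name}
    for name in ('2014_3_tableA', '2014_3_tableB')
}
# Each export then lands in <directoryPath>/2014_3_tableA, and so on.
```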

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow