Question

Does anyone know if it's possible to import a large dataset into Amazon S3 from a URL?

Basically, I want to avoid downloading a huge file and then re-uploading it to S3 through the web portal. I just want to supply the download URL to S3 and have Amazon fetch the file into the bucket on its own. It seems like an easy thing to do, but I just can't find the documentation on it.


Solution

Since you obviously possess an AWS account, I'd recommend the following:

  • Create an EC2 instance (any size)
  • Use wget (or curl) to fetch the file(s) to that EC2 instance. For example: wget http://example.com/my_large_file.csv
  • Install s3cmd
  • Use s3cmd to upload the file to S3. For example: s3cmd put my_large_file.csv s3://my.bucket/my_large_file.csv (note that put, not cp, is the s3cmd command for uploading a local file)

Since connections between AWS services travel over AWS's internal network, uploading from an EC2 instance to S3 is fast, much faster than uploading from your own computer. This approach lets you avoid downloading the file to your machine and spending potentially significant time pushing it back up through the web interface.
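
If you would rather script those two steps, here is a minimal Python sketch that streams the download straight into S3 using boto3 and requests instead of wget and s3cmd. The URL, bucket name, and key below are placeholders, and it assumes boto3 can find credentials on the instance (for example, via an IAM instance role):

    import boto3
    import requests

    url = "http://example.com/my_large_file.csv"  # placeholder source URL
    bucket = "my.bucket"                          # placeholder bucket name
    key = "my_large_file.csv"                     # object key to create

    s3 = boto3.client("s3")  # credentials come from the instance role or env

    # Stream the HTTP response body directly into S3, so the file never
    # needs to fit on the instance's disk or in memory all at once.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        s3.upload_fileobj(response.raw, bucket, key)

upload_fileobj switches to multipart uploads for large objects automatically, which is what makes this streaming variant practical for big files.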

OTHER TIPS

  1. Launch an EC2 instance with enough storage

  2. ssh to the instance

  3. Obtain the curl command corresponding to the download from your local machine. You can use the developer tools in Google Chrome: Network tab -> right-click the request -> Copy -> Copy as cURL (this step is necessary for some websites that require authentication, such as Kaggle)

  4. From the instance terminal, run the curl command (append -o output_file to it). This will download and save the file.

  5. Configure AWS credentials to connect the instance to S3 (one way is to run the command aws configure and provide the AWS access key ID and secret access key).

  6. Use this command to upload the file to S3 (a Python equivalent is sketched just after this step):

    aws s3 cp path-to-file s3://bucket-name/
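
For reference, here is a hedged Python equivalent of steps 5 and 6 using boto3. The access keys, file path, and bucket name are all placeholders, and on EC2 an IAM instance role is usually preferable to putting keys in code:

    import boto3

    # Explicit credentials mirror what `aws configure` writes to disk.
    # Both values below are placeholders, not real keys.
    session = boto3.Session(
        aws_access_key_id="AKIA...",
        aws_secret_access_key="...",
    )

    s3 = session.client("s3")
    # Equivalent of: aws s3 cp path-to-file s3://bucket-name/
    s3.upload_file("path-to-file", "bucket-name", "path-to-file")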
    

Refer to the AWS documentation at http://aws.amazon.com/code: there are SDKs available for most programming languages. So you can create a bucket and, in your code, fetch data from the URL and write it to that bucket in S3.

For example, in Python (with the boto library):

import urllib2, boto  # note: boto (classic) and urllib2 are Python 2-era libraries
from boto.s3.key import Key
conn = boto.connect_s3()  # reads credentials from env vars or ~/.boto
bucket = conn.get_bucket('my-bucket')  # placeholder name of an existing bucket
url_data = urllib2.urlopen('http://example.com/my_large_file.csv').read()
k = Key(bucket)
k.key = 'foobar'
k.set_contents_from_string(url_data)  # buffers the whole file in memory

Ref: https://boto.readthedocs.org/en/latest/s3_tut.html

You can mount your S3 bucket on your EC2 instance and then cd to /path/to/s3_mounted_on_a_folder; there you can simply use the command:

wget https://your.download.url/

To mount S3 on your EC2 instance, use s3fs (the s3fs-fuse project).

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange