Question

I'm trying to migrate some MySQL tables to Amazon Redshift, but I've run into some problems.

The steps are simple:

1. Dump the MySQL table to a CSV file
2. Upload the CSV file to S3
3. Copy the data file to Redshift
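For step 1, for example, a plain MySQL export along these lines does the job (the output path and quoting here are just for illustration):

SELECT *
FROM TABLE_A
INTO OUTFILE '/tmp/TABLE_A.csv'
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';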

Error occurs in step 3:

The SQL command is:

copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ',' csv;

The error info:

An error occurred when executing the SQL command: copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx ERROR: COPY CSV is not supported [SQL State=0A000] Execution time: 0.53s 1 statement(s) failed.

I don't know if there are any limitations on the format of the CSV file, such as the delimiters and quotes; I cannot find this in the documentation.

Can anyone help?


Solution

The problem was finally resolved by using:

copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ',' removequotes;

More information can be found here http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

OTHER TIPS

Amazon Redshift now supports the CSV option for the COPY command. It's better to use this option to import CSV-formatted data correctly. The format is shown below.

COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV;

The default delimiter is a comma ( , ) and the default quote character is a double quote ( " ). You can also import TSV-formatted data with the CSV and DELIMITER options like this.

COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV DELIMITER '\t';

The old way (DELIMITER and REMOVEQUOTES) has a disadvantage: REMOVEQUOTES does not support newlines or delimiter characters within an enclosed field. If the data can include such characters, you should use the CSV option.
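For example, a field that contains a comma or a newline has to be enclosed in the quote character, and only the CSV option parses that correctly. A sketch (placeholders as above; QUOTE AS simply makes the default double quote explicit):

COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
CSV QUOTE AS '"';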

See the following link for the details.

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

If you want to save yourself some code, or you have a very basic use case, you can use Amazon Data Pipeline. It starts a spot instance and performs the transformation within the Amazon network, and it's a really intuitive tool (but very simple, so you can't do complex things with it).

You can try this:

copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' csv;

CSV itself means comma-separated values, so there is no need to provide a delimiter with this option. Please refer to the link:

http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-format

I always use this code:

COPY clinical_survey
FROM 's3://milad-test/clinical_survey.csv' 
iam_role 'arn:aws:iam::123456789123:role/miladS3xxx'
CSV
IGNOREHEADER 1
;

Description:
1- COPY is followed by the name of your target table
2- FROM gives the address of the file in S3
3- iam_role is a substitute for CREDENTIALS. Note that the iam_role should be defined in the IAM management menu of your console, and then assigned to the user in the trust menu as well (that is the hardest part!)
4- CSV uses the comma delimiter
5- IGNOREHEADER 1 is a must! Otherwise it will throw an error. (It skips one row of my CSV and treats it as the header.)

Since the resolution has already been provided, I'll not repeat the obvious.

However, if you receive further errors that you're not able to figure out, simply execute the following on your workbench while connected to any of your Redshift accounts:

select * from stl_load_errors [where ...];

stl_load_errors contains a history of all Amazon Redshift load errors. A normal user can view details corresponding to his or her own account, while a superuser has access to all of them.
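For example, a query like the following (the column names are those documented for stl_load_errors) shows the most recent failures together with the offending line and the reason:

select starttime, filename, line_number, colname, err_code, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 10;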

The details are described in the Amazon STL Load Errors documentation.

A little late to comment, but it can be useful:

You can use an open-source project to copy tables directly from MySQL to Redshift: sqlshift.

It only requires Spark, and if you have YARN then that can also be used.

Benefits: it will automatically decide the distkey and interleaved sortkey using the primary key.

It looks like you are trying to load a local file into a Redshift table. The CSV file has to be on S3 for the COPY command to work.

If you can extract data from the table to a CSV file, you have one more scripting option: you can use a Python/boto/psycopg2 combo to script your CSV load to Amazon Redshift.

In my MySQL_To_Redshift_Loader I do the following:

  1. Extract data from MySQL into temp file.

    from subprocess import Popen, PIPE

    # Build the mysql client command line from the connection options
    loadConf=[ db_client_dbshell ,'-u', opt.mysql_user,'-p%s' % opt.mysql_pwd,'-D',opt.mysql_db_name, '-h', opt.mysql_db_server]
    ...
    # SELECT ... INTO OUTFILE query that writes the table to a delimited temp file
    q="""
    %s %s
    INTO OUTFILE '%s'
    FIELDS TERMINATED BY '%s'
    ENCLOSED BY '%s'
    LINES TERMINATED BY '\r\n';
    """ % (in_qry, limit, out_file, opt.mysql_col_delim,opt.mysql_quote)
    # Pipe the query text into the mysql client
    p1 = Popen(['echo', q], stdout=PIPE,stderr=PIPE,env=env)
    p2 = Popen(loadConf, stdin=p1.stdout, stdout=PIPE,stderr=PIPE)
    ...
    
  2. Compress and load data to S3 using boto Python module and multipart upload.

    import boto
    from boto.s3.key import Key

    # Connect to S3 and upload the (compressed) file to the target bucket/key
    conn = boto.connect_s3(AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY)
    bucket = conn.get_bucket(bucket_name)
    k = Key(bucket)
    k.key = s3_key_name
    k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
                             reduced_redundancy=use_rr )
    
  3. Use psycopg2 COPY command to append data to Redshift table.

    sql="""
    copy %s from '%s' 
    CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' 
    DELIMITER '%s' 
    FORMAT CSV %s 
    %s 
    %s 
    %s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)
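
With hypothetical values substituted for the placeholders above, the generated statement would look roughly like this (the bucket, table, and optional clauses are illustrative):

copy target_table from 's3://my-bucket/data/file.csv.gz'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
DELIMITER ','
FORMAT CSV QUOTE AS '"'
GZIP
TIMEFORMAT 'auto'
IGNOREHEADER 1;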
    
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow