Question

I need to get data from a Postgres database into an Accumulo database. We're hoping to use sequence files and a MapReduce job to do this, but we aren't sure how to start. For internal technical reasons, we need to avoid Sqoop.

Will this be possible without Sqoop? Again, I'm really not sure where to start. Do I write a Java class that reads all the records (millions of them) via JDBC and somehow outputs them to an HDFS sequence file?
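To make the question concrete: is the rough idea something like the untested sketch below? (The table, paths, and connection details are made up, and it assumes the Hadoop client libraries and the PostgreSQL JDBC driver are on the classpath.)

// Untested sketch of the idea. Class name, table, paths, and connection
// details are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PgToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Writes to HDFS when the Hadoop configuration on the classpath
        // points at the cluster (fs.defaultFS).
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/my_table.seq")),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class));

        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/mydb", "user", "s3cr3t")) {
            // Stream rows rather than buffering millions of them: the
            // Postgres JDBC driver only uses a cursor when autocommit is
            // off and a fetch size is set.
            db.setAutoCommit(false);
            Statement st = db.createStatement();
            st.setFetchSize(10000);

            try (ResultSet rs = st.executeQuery("SELECT id, value FROM my_table")) {
                IntWritable k = new IntWritable();
                Text v = new Text();
                while (rs.next()) {
                    k.set(rs.getInt(1));
                    v.set(rs.getString(2));
                    writer.append(k, v);  // no delimiter parsing involved
                }
            }
        } finally {
            writer.close();
        }
    }
}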

Thanks for any input!

P.S. - I should have mentioned that using a delimited file is the problem we're having now. Some of our fields are long character fields that contain the delimiter and therefore don't parse correctly; a field may even have a tab in it. We wanted to go from Postgres straight to HDFS without parsing.


Solution 2

You can export the data from your database as CSV, tab-delimited, pipe-delimited, or Ctrl-A (Unicode 0x0001) delimited files. Then you can copy those files into HDFS and run a very simple MapReduce job, perhaps consisting of just a Mapper, configured to read the file format you used and to output sequence files.

This would let you distribute the load of creating the sequence files across the servers of the Hadoop cluster.
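A bare-bones sketch of such a map-only job might look like the following (this assumes Ctrl-A-delimited input with the key in the first field and Text keys and values; the class and argument handling are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class DelimitedToSequenceFile {

    // Map-only job: each Ctrl-A-delimited line becomes one (key, value) record.
    public static class ToKeyValueMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            // Split on the first Ctrl-A only; the rest of the line is the value.
            String[] parts = line.toString().split("\u0001", 2);
            outKey.set(parts[0]);
            outValue.set(parts.length > 1 ? parts[1] : "");
            context.write(outKey, outValue);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "delimited-to-seqfile");
        job.setJarByClass(DelimitedToSequenceFile.class);
        job.setMapperClass(ToKeyValueMapper.class);
        job.setNumReduceTasks(0);                       // mapper only, no reduce
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would submit it with the HDFS input and output directories as arguments, for example hadoop jar yourjob.jar DelimitedToSequenceFile /data/bar_export /data/bar_seq (jar name and paths are placeholders).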

Also, this will most likely not be a one-time deal; you will have to load the data from the Postgres database into HDFS on a regular basis. Then you would be able to tweak your MapReduce job to merge the new data in.

OTHER TIPS

You can serialize your data using Avro, although it won't be very fast (especially when using Python, as in the example below), and then load it into HDFS.

Assuming you have a database foo:

postgres=# \c foo
You are now connected to database "foo" as user "user".
foo=# 

foo=# \d bar
                              Table "public.bar"
Column |          Type           |                     Modifiers                     
--------+-------------------------+---------------------------------------------------
key    | integer                 | not null default nextval('bar_key_seq'::regclass)
value  | character varying(1024) | not null

You can create an Avro schema like the one below:

{"namespace": "foo.avro",
 "type": "record",
 "name": "bar",
 "fields": [
     {"name": "id", "type": "int"},
     {"name": "value", "type": "string"}
 ]
}

And then serialize your data row by row:

import psycopg2
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Load the schema defined above.
schema = avro.schema.parse(open("foo.avsc").read())

# Avro container files are binary, so open the output file in binary mode.
writer = DataFileWriter(open("foo.avro", "wb"), DatumWriter(), schema)

c = psycopg2.connect(user='user', password='s3cr3t', database='foo')
cur = c.cursor()
cur.execute('SELECT * FROM bar')

# Iterate over the cursor instead of fetchall() to avoid holding
# millions of rows in memory at once.
for row in cur:
    writer.append({"id": row[0], "value": row[1]})

writer.close()
cur.close()
c.close()

Alternatively, you can serialize your data as plain JSON.
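Since the question mentions Java, here is a rough sketch of the same row-by-row idea with newline-delimited JSON written from JDBC instead of Python (Jackson and the PostgreSQL JDBC driver are assumed to be on the classpath; connection details are made up):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

public class PgToJsonLines {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/foo", "user", "s3cr3t");
             BufferedWriter out = new BufferedWriter(new FileWriter("bar.jsonl"))) {

            // Stream rows instead of buffering them all in memory.
            db.setAutoCommit(false);
            Statement st = db.createStatement();
            st.setFetchSize(10000);

            try (ResultSet rs = st.executeQuery("SELECT key, value FROM bar")) {
                while (rs.next()) {
                    Map<String, Object> record = new LinkedHashMap<>();
                    record.put("id", rs.getInt("key"));
                    record.put("value", rs.getString("value"));
                    out.write(mapper.writeValueAsString(record)); // one JSON object per line
                    out.newLine();
                }
            }
        }
    }
}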

There is http://sqoop.apache.org/, which should do what you ask.
