Question

I need to get data from a Postgres database into an Accumulo database. We're hoping to use sequence files and a MapReduce job to do this, but we aren't sure how to start. For internal technical reasons, we need to avoid Sqoop.

Will this be possible without Sqoop? Again, I'm really not sure where to start. Do I write a Java class that reads all the records (millions of them) via JDBC and somehow outputs them to an HDFS sequence file?
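To make the question concrete: is the rough idea something like the untested sketch below? (The table, paths, and connection details are made up, and it assumes the Hadoop client libraries and the PostgreSQL JDBC driver are on the classpath.)

// Untested sketch of the idea. Class name, table, paths, and connection
// details are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PgToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Writes to HDFS when the Hadoop configuration on the classpath
        // points at the cluster (fs.defaultFS).
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/my_table.seq")),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class));

        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/mydb", "user", "s3cr3t")) {
            // Stream rows rather than buffering millions of them: the
            // Postgres JDBC driver only uses a cursor when autocommit is
            // off and a fetch size is set.
            db.setAutoCommit(false);
            Statement st = db.createStatement();
            st.setFetchSize(10000);

            try (ResultSet rs = st.executeQuery("SELECT id, value FROM my_table")) {
                IntWritable k = new IntWritable();
                Text v = new Text();
                while (rs.next()) {
                    k.set(rs.getInt(1));
                    v.set(rs.getString(2));
                    writer.append(k, v);  // no delimiter parsing involved
                }
            }
        } finally {
            writer.close();
        }
    }
}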

Thanks for any input!

P.S. - I should have mentioned that using a delimited file is the problem we're having now. Some of our fields are long character fields that contain the delimiter and therefore don't parse correctly; a field may even have a tab in it. We wanted to go from Postgres straight to HDFS without parsing.


Solution 2

You can export the data from your database as CSV, tab-delimited, pipe-delimited, or Ctrl-A (Unicode 0x0001) delimited files. Then you can copy those files into HDFS and run a very simple MapReduce job, perhaps consisting of just a Mapper, configured to read the file format you used and to output sequence files.

This would let you distribute the load of creating the sequence files across the servers of the Hadoop cluster.
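A bare-bones sketch of such a map-only job might look like the following (this assumes Ctrl-A-delimited input with the key in the first field and Text keys and values; the class and argument handling are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class DelimitedToSequenceFile {

    // Map-only job: each Ctrl-A-delimited line becomes one (key, value) record.
    public static class ToKeyValueMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            // Split on the first Ctrl-A only; the rest of the line is the value.
            String[] parts = line.toString().split("\u0001", 2);
            outKey.set(parts[0]);
            outValue.set(parts.length > 1 ? parts[1] : "");
            context.write(outKey, outValue);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "delimited-to-seqfile");
        job.setJarByClass(DelimitedToSequenceFile.class);
        job.setMapperClass(ToKeyValueMapper.class);
        job.setNumReduceTasks(0);                       // mapper only, no reduce
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would submit it with the HDFS input and output directories as arguments, for example hadoop jar yourjob.jar DelimitedToSequenceFile /data/bar_export /data/bar_seq (jar name and paths are placeholders).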

Also, this will most likely not be a one-time deal; you will have to load the data from the Postgres database into HDFS on a regular basis. Then you would be able to tweak your MapReduce job to merge the new data in.

OTHER TIPS

You can serialize your data using Avro, although it won't be very fast (especially when using Python, as in the example below), and then load it into HDFS.

Assuming you have a database foo:

postgres=# \c foo
You are now connected to database "foo" as user "user".
foo=# 

foo=# \d bar
                              Table "public.bar"
Column |          Type           |                     Modifiers                     
--------+-------------------------+---------------------------------------------------
key    | integer                 | not null default nextval('bar_key_seq'::regclass)
value  | character varying(1024) | not null

You can create an Avro schema like the one below:

{"namespace": "foo.avro",
 "type": "record",
 "name": "bar",
 "fields": [
     {"name": "id", "type": "int"},
     {"name": "value", "type": "string"}
 ]
}

And then serialize your data row by row:

import psycopg2
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Load the schema defined above.
schema = avro.schema.parse(open("foo.avsc").read())

# Avro container files are binary, so open the output file in binary mode.
writer = DataFileWriter(open("foo.avro", "wb"), DatumWriter(), schema)

c = psycopg2.connect(user='user', password='s3cr3t', database='foo')
cur = c.cursor()
cur.execute('SELECT * FROM bar')

# Iterate over the cursor instead of fetchall() to avoid holding
# millions of rows in memory at once.
for row in cur:
    writer.append({"id": row[0], "value": row[1]})

writer.close()
cur.close()
c.close()

Alternatively, you can serialize your data as plain JSON.
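Since the question mentions Java, here is a rough sketch of the same row-by-row idea with newline-delimited JSON written from JDBC instead of Python (Jackson and the PostgreSQL JDBC driver are assumed to be on the classpath; connection details are made up):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

public class PgToJsonLines {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/foo", "user", "s3cr3t");
             BufferedWriter out = new BufferedWriter(new FileWriter("bar.jsonl"))) {

            // Stream rows instead of buffering them all in memory.
            db.setAutoCommit(false);
            Statement st = db.createStatement();
            st.setFetchSize(10000);

            try (ResultSet rs = st.executeQuery("SELECT key, value FROM bar")) {
                while (rs.next()) {
                    Map<String, Object> record = new LinkedHashMap<>();
                    record.put("id", rs.getInt("key"));
                    record.put("value", rs.getString("value"));
                    out.write(mapper.writeValueAsString(record)); // one JSON object per line
                    out.newLine();
                }
            }
        }
    }
}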

There is http://sqoop.apache.org/, which should do what you ask.
