Question

I am working on a machine learning project that requires me to transfer data from an old Java app (which is also the custodian of the data in the current setup) to a Python service that will do all the machine-learning work. So I have data to the tune of a few GB that needs to flow through the network.

What would be the most efficient way of transferring that data?

This information may be useful:

  1. The Java application is deployed as a 3-tier setup on AWS and uses Elasticsearch, PostgreSQL, and Neo4j.
  2. The Python application will be deployed on a separate AWS instance.
  3. The data lives in Neo4j; it is currently not encoded, but it can be written to CSVs or transformed into objects.

Help is appreciated! Thanks in advance!

Solution

I see this as a perfect use case for a streaming pipeline such as Apache Kafka (or a stream processor such as Apache Flink).

The rationale for using them is that you can add as many producers (the Java app) or consumers (the Python app) as you need. You also do not have to worry about the two sides working at different speeds, because Kafka buffers the data.
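
On the Python side, consuming from Kafka can be as simple as the sketch below (using the kafka-python package; the topic name, broker address, and consumer group are placeholders, and I am assuming the messages arrive as JSON):

    # Minimal consumer sketch for the Python ML service using kafka-python.
    # Topic name, broker address, and consumer group are placeholders.
    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "neo4j-export",                      # hypothetical topic the Java app writes to
        bootstrap_servers="kafka-broker:9092",
        group_id="ml-service",
        auto_offset_reset="earliest",        # start from the beginning of the topic
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    for message in consumer:
        record = message.value               # already a dict thanks to the deserializer
        # hand the record over to the ML pipeline here
        print(record)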

Before passing the data to Kafka you might have to serialize it (if it is not already plain text). For that you might want to use JSON (alternatively, you could use something like Apache Avro, which lets you easily partition the data).
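
To illustrate the serialization idea, here is a rough Python sketch (the actual producer would of course be the Java app using the Kafka Java client; the topic, broker, and record contents here are made up):

    # Illustrative producer sketch showing JSON serialization of each record.
    # In your setup the Java app would do this with the Kafka Java client.
    import json

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka-broker:9092",
        value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
    )

    # Each graph record becomes one JSON message on the topic.
    record = {"node_id": 42, "label": "User", "properties": {"name": "alice"}}
    producer.send("neo4j-export", value=record)
    producer.flush()  # make sure buffered messages are actually sent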

Other suggestions

Why not just let both apps read from the same database? Or, if you cannot do that, you could write the data to S3 from one app and read it from S3 with the other. The target app can listen for S3 events for every file that is written and then just load it. (Rough sketches of both options follow.)
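
For the first option, reading straight from Neo4j in the Python service might look like this (using the official neo4j Python driver; the URI, credentials, and Cypher query are placeholders):

    # Sketch of the "read from the same database" option with the neo4j driver.
    # URI, credentials, and the Cypher query are placeholders.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://neo4j-host:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Pull exactly the data the ML service needs straight out of Neo4j.
        result = session.run("MATCH (n:User) RETURN n.name AS name, n.age AS age")
        rows = [record.data() for record in result]

    driver.close()
    print(f"fetched {len(rows)} rows")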

Maybe I am oversimplifying it but it seems easy...(?)
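
If you go the S3 route instead, a minimal hand-off with boto3 could look like this (the bucket name and object keys are made up, and the upload could equally be done from the Java side with the AWS SDK for Java):

    # Sketch of the S3 hand-off: one side uploads the exported CSV,
    # the other downloads it. Bucket name and object key are placeholders.
    import boto3

    s3 = boto3.client("s3")

    # Export side (could also be the Java app via the AWS SDK for Java):
    s3.upload_file("export/graph_data.csv", "my-transfer-bucket", "exports/graph_data.csv")

    # Python/ML side, e.g. triggered by an S3 event notification:
    s3.download_file("my-transfer-bucket", "exports/graph_data.csv", "/tmp/graph_data.csv")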

There is also Snowflake (www.snowflake.com), which you can connect to AWS using their Snowpipe feature: it basically lets you write files to a storage location in AWS (such as an S3 bucket) and have them loaded into the Snowflake database automatically (but this is probably overkill in this case).

Licensed under: CC-BY-SA with attribution