Question

Brief overview of general data flow

The general goal of my system is to allow users to upload many different types of files containing data (PDF, CSV, ZIP, etc.), then index it and perform some basic analysis to make it searchable and to be able to draw links between the files.

The general flow goes like this:

  1. User uploads files to S3 bucket through web server
  2. Web server completes upload
  3. Web server sends a message over the message bus containing the S3 bucket name and file name
  4. Document processing service is notified of the file over the message bus
  5. Document processing service parses document and extracts metadata using Apache Tika
  6. If the extracted content contains text, the text should be processed with Apache Spark to perform some analysis (e.g., using Spark-NLP)

I have a general proof of concept working, but have some concerns before continuing to build out the system.

Concerns

  • Files to be processed may be very small, or very large
  • Running Apache Tika inside a Spark job may not be the best idea. IIRC, there's no native integration there.
  • Since I'm new to Spark, I think I might be using it for a job it isn't really designed to do
    • These files are small, and need to be processed as they come in, but I don't want to send the whole file content over Kafka, etc. to use Spark Streaming
    • Getting a file name and bucket location doesn't feel like the right way to start the Spark job (at least not with my limited set of knowledge)
  • Current processing of these files is unpredictable
    • Sometimes it gets completed in 0.5 seconds, sometimes 5 seconds. Likely because I have three worker nodes on the system.
  • These jobs are initiated inside of a microservice. When the service starts, I create a SparkSession and re-use it every time a file processing request comes in on the message bus.

Overall, I'm concerned that I'm using Spark in an incorrect way. Does it even make sense to use Spark given my overall architecture?


Solution

Answers to each question below:

1. Running Apache Tika inside a Spark job may not be the best idea. IIRC, there's no native integration there.

I haven't used Tika before, but it looks like it's a framework implemented in Java, and it appears to have a Python binding as well.

If you are using Spark with Scala or Java, using Tika is as simple as adding it to your JVM classpath and using the relevant API calls you need (MediaType detect(java.io.InputStream input, Metadata metadata) seems like a useful one in your situation).
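As an illustration of the JVM route, here is a minimal sketch of calling Tika from a Spark job in Scala. It assumes the Tika jars are shipped to the executors (for example via your build's assembly), and the bucket path is a placeholder; the Tika facade class used here wraps the detector and parser APIs.

    import org.apache.spark.sql.SparkSession
    import org.apache.tika.Tika

    object TikaOnSpark {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("tika-extract").getOrCreate()

        // binaryFiles yields (path, PortableDataStream) pairs, so each executor opens
        // the file contents itself instead of the driver shipping raw bytes around.
        val docs = spark.sparkContext.binaryFiles("s3a://my-bucket/uploads/*")

        val extracted = docs.map { case (path, pds) =>
          val tika = new Tika()                          // created inside the task: Tika is not serializable
          val detectStream = pds.open()
          val mediaType = try tika.detect(detectStream) finally detectStream.close()
          val parseStream = pds.open()
          val text = try tika.parseToString(parseStream) finally parseStream.close()
          (path, mediaType, text.take(200))              // keep a short preview of the extracted text
        }

        extracted.collect().foreach(println)
        spark.stop()
      }
    }

The same pattern applies if you call the lower-level Detector API mentioned above and need the Metadata object populated.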

If using Spark with Python, you just need to make sure the Tika Python binding is installed and importable in the environment of each worker.

2. Since I'm new to Spark, I think I might be using it for a job it isn't really designed to do

Actually, the motivation for creating Spark was the need for iterative data processing without writing to disk on each iteration (as MapReduce does).

Spark can leverage in-memory processing and caching to run iterative algorithms (which most ML algorithms are) efficiently and quickly. A similar iterative job in MapReduce can take several hours because of the cost of writing to disk on each iteration.
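To make that concrete, here is a toy sketch of the iterative, in-memory style Spark is built for: the dataset is cached once and each pass reads it from executor memory rather than reloading it from storage. The numbers and filter are placeholders, not your workload.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object IterativeDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("iterative-demo").getOrCreate()

        // cache() keeps the dataset in executor memory after the first action,
        // so later iterations avoid re-reading it from storage.
        val numbers = spark.range(0, 1000000).cache()

        var threshold = 10L
        for (_ <- 1 to 5) {
          val count = numbers.filter(col("id") % threshold === 0).count()
          println(s"threshold=$threshold count=$count")
          threshold *= 2
        }

        spark.stop()
      }
    }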

While I'm not familiar with the Spark-NLP APIs, if it has what you need to perform the processing, it can be a good way to process large quantities of text efficiently.

3. These files are small, and need to be processed as they come in, but I don't want to send the whole file content over Kafka, etc. to use Spark Streaming

You don't really need to use Kafka to notify Spark Streaming of a new file in an S3 bucket. Spark Streaming can use HDFS or an S3 bucket directly as a data source.

Whenever a new file is created in the bucket, it will be streamed. However, changes to existing files will not be streamed.
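A minimal Structured Streaming sketch of that file-source approach is below. It assumes the s3a connector is configured on the cluster; the bucket, prefix, and checkpoint location are placeholders, and the console sink stands in for your real processing.

    import org.apache.spark.sql.SparkSession

    object S3FileStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("s3-file-stream").getOrCreate()

        // Every new file that lands under this prefix becomes part of the stream;
        // modifications to files that already exist are not picked up.
        val lines = spark.readStream
          .format("text")
          .load("s3a://my-bucket/incoming/")

        val query = lines.writeStream
          .format("console")   // replace with your own sink / downstream processing
          .option("checkpointLocation", "s3a://my-bucket/checkpoints/incoming/")
          .start()

        query.awaitTermination()
      }
    }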

It is generally recommended to process your data as close to the source as you can, so running Spark on AWS and reading the data from S3 can be a good way of reducing the cost and latency of transmitting data over the network.

However, it sounds like you need some way to identify whether a document contains text, and in which language, before processing it with Spark Streaming. So you may want some way to filter documents before sending them to the S3 bucket.

I would need to understand your use case better, but you could probably use a separate application that consumes the messages from Kafka, reads the file from the location they point to, and detects the language before moving the file to an S3 bucket for processing by Spark Streaming. (I'm making a lot of assumptions here, so I'd suggest digging a bit deeper into your use case to see what fits your needs.)
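Purely as a hedged sketch of that idea (the topic name, bucket names, message format, and the English-only filter are all assumptions), such a standalone application might look roughly like this, using the Kafka consumer client, the AWS S3 SDK, and Tika's 1.x LanguageIdentifier helper:

    import java.time.Duration
    import java.util.{Collections, Properties}

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.tika.Tika
    import org.apache.tika.language.LanguageIdentifier

    object LanguageFilter {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("group.id", "language-filter")
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(Collections.singletonList("uploaded-files"))

        val s3 = AmazonS3ClientBuilder.defaultClient()
        val tika = new Tika()

        while (true) {
          val records = consumer.poll(Duration.ofSeconds(1))
          records.forEach { record =>
            // Assumed message format: "bucket/key"
            val Array(bucket, key) = record.value().split("/", 2)

            val in = s3.getObject(bucket, key).getObjectContent
            val text = try tika.parseToString(in) finally in.close()

            val language = new LanguageIdentifier(text).getLanguage
            if (language == "en") {
              // Stage matching documents where the Spark Streaming job is watching.
              s3.copyObject(bucket, key, "my-processing-bucket", key)
            }
          }
        }
      }
    }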

4. Current processing of these files is unpredictable

I can't really comment on performance without knowing more details of the job or the cluster you are running it on. Skew in processing speed can exist across the nodes in a cluster, and your job will only be as fast as the slowest node in the cluster. You may need to improve your Spark code: proper partitioning and reducing data skew might help.
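As a small illustration of the partitioning point (the path and partition count are placeholders, not a tuning recommendation):

    import org.apache.spark.sql.SparkSession

    object RepartitionDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("repartition-demo").getOrCreate()

        val df = spark.read.text("s3a://my-bucket/extracted-text/")

        // Too few partitions leaves executors idle, and skewed partitions leave one
        // node doing most of the work; repartitioning spreads records more evenly.
        val balanced = df.repartition(spark.sparkContext.defaultParallelism * 2)

        println(s"partitions: ${balanced.rdd.getNumPartitions}")
        spark.stop()
      }
    }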

5. These jobs are initiated inside of a microservice. When the service starts, I create a SparkSession and re-use it every time a file processing request comes in on the message bus.

I haven't really used Spark in a microservice before, and I think you might be using Spark as a service rather than as the processing framework it is designed to be. In any case, I recommend running your Spark job as close to the data as possible to reduce the cost and delay of network transmission, and to allow your job to scale more effectively with larger files.

Licensed under: CC-BY-SA with attribution