Question

Objectives
I need to implement a file storage and processing back end for a web application. The application has these characteristics:

(#1) Clients will store files of various formats and sizes (possibly in the gigabyte range)
(#2) Sometimes clients will need to retrieve the file itself
(#3) Sometimes clients will need to retrieve output data ("OD" from here on), where processing is performed on a previously stored file to generate the OD. Important note: the OD size is typically a very small fraction of the original file size (a 2GB file may produce a 1MB OD).
(#4) Sometimes clients will apply transformations to the file (e.g. file patching).

Considering a solution
I could use a storage cluster (e.g. a SAN) to achieve #1 and #2, and a compute cluster for #3 and #4. But shuttling lots of data between the SAN and the compute cluster (imagine hundreds of users requesting ODs or patching files) doesn't seem right to me, especially because the file data can be huge and most of the time clients only need small ODs or nothing at all (a patching operation consumes client input but returns no data to the client).

So I think what I need is a cluster where each node is both a big storage node AND a competent processing node, which avoids traffic between separate storage and processing clusters (because now they are one). A node is responsible for processing the files it stores, so network bandwidth is saved. If a node happens to be overloaded with processing requests, it may offload some work to neighboring nodes (still incurring bandwidth costs, but only when necessary).
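Here is a rough sketch of the routing rule I have in mind. All of the names here (Node, Router, the load threshold) are hypothetical, not a real framework:

```java
// Hypothetical sketch of the routing rule: a request lands on the node that
// stores the file; that node processes locally unless it is overloaded.
interface Node {
    double currentLoad();                        // e.g. queue depth or CPU load
    Node leastLoadedNeighbor();
    byte[] process(String fileId, byte[] request);
}

class Router {
    private static final double LOAD_THRESHOLD = 0.8; // made-up cutoff

    byte[] route(Node owner, String fileId, byte[] request) {
        if (owner.currentLoad() < LOAD_THRESHOLD) {
            // Data-local path: the owner stores the file, no bulk transfer.
            return owner.process(fileId, request);
        }
        // Overloaded: offload to a neighbor, paying one file transfer.
        return owner.leastLoadedNeighbor().process(fileId, request);
    }
}
```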

Questions
(1) Wikimedia uses "file servers" and separate "image scaler" servers... but in my case I'm worried about large amounts of unnecessary bandwidth. Is my worry justified, and is the separation of storage and processing nodes therefore inappropriate in my case?

(2) Is my approach (cluster of big storage + powerful processing nodes) desirable? Or should I be considering a different architecture?

(3) I've considered Hadoop but don't know if it is suited to the task (huge bandwidth cost, and I'm not really processing big data). But if Hadoop is right for the task, please say why.

(4) Are there open-source or other frameworks I can use for managing these server clusters?

(5) If there aren't, I suppose I will have to develop an in-house solution. How might I get started?

Whew. That was a lot. Thanks in advance!


Solution

Hadoop, using both HDFS and MapReduce, is possibly a workable solution for you. Some caveats and considerations, though:

  1. Are the algorithms that you will use to create your "OD" parallelizable in general? If they are not, you won't benefit from data locality, and Hadoop will end up copying a file's data from the datanodes holding it to the single node doing the processing. If they are, a map-only job keeps the work next to the data (see the first sketch after this list).

  2. Using MapReduce, you will not be able to modify files in place. So you will also have to consider a post-processing step where the output file is renamed to the input file, plus similar housekeeping (see the second sketch after this list).

  3. Managing/deploying a cluster is not very difficult. Check out Cloudera Manager and the Hortonworks Data Platform (HDP). These should give you everything from deployment to management and monitoring. However, the Cloudera offering might have licensing costs beyond a certain number of nodes; HDP has no such restrictions AFAIK.
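To make the data-locality point in (1) concrete, here is a minimal map-only sketch (reducers disabled, so each mapper's output is written straight out). Since your actual OD algorithm isn't known here, the mapper just emits each record's byte length as a stand-in for real OD computation:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OdJob {
    public static class OdMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            // Hadoop schedules this map task on a datanode holding the split,
            // so the record is read locally. Replace this line with your real
            // OD computation; emitting the record length is just a stand-in.
            ctx.write(new Text("od"), new Text(Integer.toString(record.getLength())));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "od-extraction");
        job.setJarByClass(OdJob.class);
        job.setMapperClass(OdMapper.class);
        job.setNumReduceTasks(0);  // map-only: no shuffle, no reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```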
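And for the housekeeping in (2), a minimal sketch of the swap step, assuming a job has already written the transformed file to a temporary path (the paths and class name are illustrative). Note that delete-then-rename is not atomic in HDFS, so in production you'd want to guard against a failure between the two calls:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplaceInPlace {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path original = new Path(args[0]);  // e.g. the stored file
        Path patched  = new Path(args[1]);  // e.g. the job's output file
        // HDFS files are immutable once written, so "modify in place" means:
        // write a new file, then swap it in for the old one.
        if (fs.delete(original, false) && fs.rename(patched, original)) {
            System.out.println("swapped " + patched + " -> " + original);
        } else {
            System.err.println("swap failed; check whether the original remains");
        }
    }
}
```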

Licensed under: CC-BY-SA with attribution