Is Hadoop designed only for “simple” data processing jobs, where communication between the distributed nodes is sparse?

softwareengineering.stackexchange https://softwareengineering.stackexchange.com/questions/246038

04-10-2020

Question

I am not a professional coder, but rather an engineer/mathematician who uses computers to solve numerical problems. So far most of my problems are math-related, such as solving large-scale linear systems, performing linear algebra operations, FFTs, mathematical optimization, etc. At smaller scales, these are best handled by Matlab (and similar high-level math packages). At large scale, however, I have to resort to old-fashioned (and VERY TEDIOUS) lower-level languages such as Fortran or C/C++. That is not all: when the problem becomes excessively large, parallelization becomes necessary, and MPI in that regard is a nightmare to wrestle through.

I have been hearing about Hadoop and "big data" on a regular basis recently. But my impression was that Hadoop is NOT a general-purpose parallel computing paradigm, in the sense that it handles data that needs minimal mutual communication (correct me if I am wrong). For example, in a general linear algebra operation on a large data set distributed among many nodes, rather than processing its own piece of data independently, each node must communicate with EVERY other node to send/get information before its own data can be correctly processed, and the whole data set is updated only after ALL communications are done. I hope I made my point clear: in numerical applications, data processing is naturally GLOBAL, and the communication "graph" is often a FULL GRAPH (though there ARE exceptions).
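To illustrate the "full graph" communication pattern described above, here is a minimal sketch (plain Python, not MPI) of a dense matrix-vector product `y = A @ x` where the rows of `A` and the entries of `x` are partitioned across four hypothetical nodes; the partitioning and node count are illustrative assumptions, not anything from the original post:

```python
# Sketch: why a dense matrix-vector product is "full graph" communication.
# Assume rows of A and chunks of x are spread across 4 hypothetical nodes.

def matvec_row(row, x_full):
    """A node can compute its output entry only from the FULL vector x."""
    return sum(a * b for a, b in zip(row, x_full))

# Node p owns row p of A and chunk x_parts[p] of x.
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x_parts = [[1], [1], [1], [1]]   # each node holds one entry of x

# Before ANY node can compute its row, EVERY node must receive EVERY other
# node's chunk -- an all-to-all "allgather", i.e. a full communication graph.
x_full = [v for part in x_parts for v in part]

y = [matvec_row(row, x_full) for row in A]
print(y)   # [10, 26, 42, 58]
```

In real MPI code the gather step would be a collective such as `MPI_Allgather`; the point is that no node's output depends only on its own slice of the data.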

My question is: is Hadoop suitable for such a purpose, or is it only designed for "simple" data-processing jobs, where communications between the distributed nodes are sparse?


Solution

The canonical use of Hadoop processes data as a tree. The initial query range is split into sub-ranges. The query for each sub-range is sent to the node that stores the data for that sub-range. The job runs on that node, against that node's sub-range of the data, and reports its answer back. The sub-answers are aggregated at the higher level to answer the initial question. So, in general, it does not support your "full graph" case.
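The scatter/aggregate shape described above can be sketched in plain Python (this is a toy word count, not the Hadoop API; the shard contents and function names are made up for illustration):

```python
# Sketch of the tree pattern: each "node" processes only its own shard
# independently, and the partial answers are merged at a higher level.
from collections import Counter
from functools import reduce

def map_phase(shard):
    """Runs independently on the node that stores this shard of the data."""
    return Counter(word for line in shard for word in line.split())

def reduce_phase(a, b):
    """Merges two sub-answers; no node-to-node communication is needed."""
    return a + b

shards = [["big data big"], ["data jobs"], ["big jobs jobs"]]  # one per node
partials = [map_phase(s) for s in shards]          # embarrassingly parallel
total = reduce(reduce_phase, partials, Counter())
# total['big'] == 3, total['data'] == 2, total['jobs'] == 3
```

Notice that the only data movement is "down" (dispatch the job) and "up" (return the sub-answer); the nodes never talk to each other, which is exactly why an all-to-all numerical kernel does not fit this model.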

I say "query" because that's the typical case. The framework can process any workload that can be dispersed and then amalgamated.

The algorithm is known as MapReduce. Here's an SO question on it.

Licensed under: CC-BY-SA with attribution