Question

I have millions of tweets stored in HDFS and I plan to analyze them with Spark (data mining, text mining, frequent term-based text clustering, social network analysis). However, I do not know whether there is any benefit in using a database instead of HDFS to hold the data.

Is there any justification (in terms of efficiency, workload, etc.) for working with the data from a database (perhaps MongoDB) instead of reading it directly from HDFS (stored in JSON format), given that I will run the analysis from Spark?


Solution

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

Spark works mostly in memory.

As a first answer, I would say that you do not need to put the data in a database to run analytics on it. Spark can read the JSON files directly from HDFS.
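To illustrate reading straight from HDFS, here is a minimal PySpark sketch, not a definitive implementation. It assumes the tweets are stored one JSON object per line and that each has a "text" field; the function names, the tokenizing regex, and the HDFS path in the usage note are all hypothetical choices of mine, not something from the question.

```python
import json
import re

# Hypothetical tokenizer: lowercase words, hashtags and mentions.
TOKEN = re.compile(r"[a-z0-9_#@']+")


def tweet_terms(line):
    """Parse one JSON-encoded tweet and return its lowercase terms.

    Malformed lines and tweets without a string "text" field yield [].
    """
    try:
        tweet = json.loads(line)
    except ValueError:
        return []
    if not isinstance(tweet, dict):
        return []
    text = tweet.get("text")
    if not isinstance(text, str):
        return []
    return TOKEN.findall(text.lower())


def top_terms(path, n=20):
    """Return the n most frequent terms in the tweets stored under `path`."""
    # Imported lazily so the parsing helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tweet-term-counts").getOrCreate()
    # textFile builds an RDD of strings, one element per line of the input files.
    lines = spark.sparkContext.textFile(path)
    counts = (
        lines.flatMap(tweet_terms)            # one element per term
        .map(lambda term: (term, 1))
        .reduceByKey(lambda a, b: a + b)      # sum the counts per term
    )
    # Sort by descending count and bring only the top n back to the driver.
    return counts.takeOrdered(n, key=lambda kv: -kv[1])
```

On a cluster, a call such as top_terms("hdfs:///data/tweets/*.json") (a made-up path) would run the whole job against the files in place, with no database in between.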

Licensed under: CC-BY-SA with attribution