Question

I am developing a database for an analysis job that requires parsing terabytes of information. My job is to create and populate the database and use it to answer analysts' questions. In the course of answering those questions, we identify deficiencies in the way the data is represented in the database. I then have to alter the database structure, which often involves re-parsing and re-loading everything into the database.

The parsing is time-consuming but straightforward to parallelize; since I got a 48-node cluster, it doesn't take very long any more. However, I started the project using MariaDB (frankly, no one knew the data sets would get so big). The big holdup is now writing into the database: billions of writes every time I modify the structure, taking over 24 hours with only 2 months of collected data. Since we will be gathering data for many more months and years, I can see the write requirements soon becoming prohibitive.

I have to make recommendations about how/if we will expand the database in the future. However, I know next to nothing about distributed databases. I would like to know: can switching to a distributed database be expected to decrease the write time by increasing the number of nodes to which I can write?

Regarding the data itself, in case it is relevant: it consists of an indexed master table (~10^7 rows as of now) and several un-indexed subsidiary tables (~10^9 rows) designed to be joined with the master.

Just so you know: because the database supports strategic analysis and report generation, not real-time information, there is not much incentive to speed up database read times. Leaving the data un-indexed and accepting long join times is just fine; report generation runs overnight. But the long database re-building time is starting to chew up whole work days, which is not as fine.


Solution

...be expected to decrease the write time by increasing the number of nodes to which I can write?

Yes. While the article is dated, Netflix performed a scalability experiment back in 2011, as detailed here:

Benchmarking Cassandra Scalability on AWS - Over a million writes per second

Essentially, they found that Cassandra was indeed linearly scalable. That is, the number of operations per second that your cluster can sustain is directly proportional to the number of nodes in the cluster.
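To see what linear scalability would mean for your rebuild times, here is a back-of-envelope sketch. The numbers are purely illustrative (your 24-hour figure is on a single MariaDB server, and per-node Cassandra throughput will differ), but it shows the shape of the trade-off if write throughput really is proportional to node count:

```python
def rebuild_hours(baseline_hours: float, baseline_nodes: int, nodes: int) -> float:
    """Estimated wall-clock rebuild time, assuming write throughput
    scales linearly with the number of nodes (the Netflix result)."""
    return baseline_hours * baseline_nodes / nodes

# Hypothetical: a rebuild that takes 24 hours on 1 node would take
# about 2 hours on 12 nodes under perfectly linear scaling.
print(rebuild_hours(24.0, baseline_nodes=1, nodes=12))  # 2.0
```

In practice you would benchmark a small cluster first and extrapolate from its measured per-node write rate rather than from the single-server MariaDB figure.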

the database supports strategic analysis and report generation, and not real time information

This is the part that gives me pause. Cassandra is typically not a good fit as an analytics back-end. The reason is that, due to its distributed nature, your table structures must be modeled according to your read queries.

Essentially, that means one table serves one query. Additional queries that require different columns in the WHERE clause typically have to be served by different tables hosting duplicate data. This can quickly become cumbersome in an OLAP environment.
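As a rough illustration (table and column names here are invented, not from your schema), the same event data might have to be written twice in Cassandra, once per query pattern, because each query's WHERE clause must lead with the partition key of the table it reads:

```sql
-- Hypothetical CQL: one table per query pattern, duplicate data.

-- Serves: SELECT ... WHERE sensor_id = ? AND ts >= ?
CREATE TABLE events_by_sensor (
    sensor_id  text,
    ts         timestamp,
    payload    text,
    PRIMARY KEY (sensor_id, ts)
);

-- Serves: SELECT ... WHERE day = ? AND ts >= ?
CREATE TABLE events_by_day (
    day        date,
    ts         timestamp,
    sensor_id  text,
    payload    text,
    PRIMARY KEY (day, ts)
);
```

Every new ad-hoc question from an analyst that filters on a new column tends to mean a new table and another full copy of the data, which is exactly the flexibility an OLAP workload usually needs.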

So while Cassandra's masterless, distributed architecture can help scale your writes, your reads quickly lose flexibility.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange