Vertica for non-analytics

https://stackoverflow.com/questions/11924929

25-06-2021

Question

I have a big analytics module in my system and plan to use Vertica for it. Someone suggested that we also use Vertica in the rest of our app (a standard CRUD app with models from our domain) so that we do not have to manage multiple databases.

Would Vertica fit this dual scenario?

Solution

As is often the case these days, a meaningful answer depends on what you need to do. In a general sense, 'big data' solutions have grown out of the deficiencies of RDBMS systems at large data volumes. No 'big data' solution can compete with the core capabilities of RDBMS systems, i.e. complex analytics, but RDBMS systems are poor (expensive) solutions for large-volume data processing. Practical solutions, for now, have to be hybrid. Vertica can be good once data is loaded, but I believe (I am not an expert) it requires denormalisation and pre-sorting of data before loading to perform at its best. For large data volumes this may add significantly to the required resources. There is a definite benefit to using one system for all your needs, but there are also benefits to keeping your options open.

The approach I take is to store and index new data and then provide specific feeds to various reporting/analytic engines as required. This separates the collection and storage of raw data from the complex analytic processing. I am happy to provide more details if you are interested. This separation addresses a core problem which has always been present in database systems. In the past you used to hear 'store fast and report slowly, or store slowly and report fast, but you cannot do both'. The search for a complete solution has, in the last few years, spawned the many NoSQL offerings, which typically address the 'store fast' task. Some systems also provide impressive query performance by storing data in memory or cache, but this requires many servers for large data volumes. I believe NoSQL and SQL solutions can and will be integrated, but this is still some way off.
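The separation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite stands in for the raw store, and the `events` table, its columns, and the `feed_since` helper are hypothetical names, not part of any particular product.

```python
import sqlite3


def make_raw_store():
    """Create an in-memory stand-in for the raw (collection) store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
    )
    return conn


def store_events(conn, rows):
    """Collection side: append raw rows as they arrive."""
    conn.executemany("INSERT INTO events (user_id, amount) VALUES (?, ?)", rows)
    conn.commit()


def feed_since(conn, last_id, batch_size=1000):
    """Feed side: yield batches of rows with id > last_id for an analytic engine."""
    cur = conn.execute(
        "SELECT id, user_id, amount FROM events WHERE id > ? ORDER BY id",
        (last_id,),
    )
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch


conn = make_raw_store()
store_events(conn, [(1, 10.0), (2, 5.5), (1, 2.5)])
# Each batch would be handed to a bulk loader of the analytic engine.
batches = list(feed_since(conn, last_id=0, batch_size=2))
```

The analytic engine only ever sees batches from the feed, so the raw store can be tuned for fast writes while the reporting side is loaded in bulk.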

To give you some context, I work with scenarios where at least 1 billion records a day are loaded. If you are dealing with say 100 million records a day (big is relative), then your Vertica approach will probably suffice, otherwise I think you need to expand your options.

OTHER TIPS

High-frequency UPDATEs are probably where Vertica lags behind the most. I would avoid using it for such data models.

Alec - I would like to respectfully challenge your comments on Vertica. You do not need to denormalize or sort data before loading at all. Vertica also holds the record for the fastest data loading of any database.

You also talk about Vertica not being able to do complex analytics as well as an RDBMS. Vertica IS an RDBMS and can do analytics faster than any other RDBMS and they prove it over and over.

As for your numbers, in my use case I load roughly 5 million records per second into my Vertica cluster and have hundreds of billions of records.

So Yaron - I would highly recommend you look at Vertica before you rule it out based on this information.
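To put the volumes quoted in this thread on the same scale, here is a small conversion sketch. It assumes load is spread evenly over a 24-hour day, which real workloads rarely are; the function names are only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def records_per_second(records_per_day):
    """Average sustained rate needed for a given daily volume (rounded down)."""
    return records_per_day // SECONDS_PER_DAY


def records_per_day(records_per_second_rate):
    """Daily volume reached if a per-second rate is sustained all day."""
    return records_per_second_rate * SECONDS_PER_DAY


one_billion_a_day = records_per_second(1_000_000_000)    # about 11.5 thousand/s
hundred_million_a_day = records_per_second(100_000_000)  # about 1.2 thousand/s
five_million_a_second = records_per_day(5_000_000)       # 432 billion/day
```

On that even-load assumption, 1 billion records a day averages out to roughly 11,574 records per second, far below the 5 million records per second quoted above.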

Test it. Each use case is different. Assuming Vertica is a solution for every use case is almost as bad as using MongoDB for every use case.

Vertica is a high-performance, column-oriented analytics database, designed to analyze incredibly large datasets and scale horizontally. It's also expensive, hard to administer, and its documentation is spotty. In the right environment the payoff can easily be worth the work, obviously.

MySQL is a traditional, row-oriented RDBMS, designed to model relationships between structured data, and it works well at single-node scale (though many companies have retrofitted it to great success, exempli gratia, Facebook). It's incredibly well documented, seemingly works on any platform and with any language or framework, and can be used by anyone.

My guess is that using Vertica for an employee address book database is like showing up to a blue-collar job in a $3000 suit. Sure it works, but is it the right tool for the job? Maybe if you already have a Vertica license and your applications already have the requisite data adaptors, ORM, etc., go ahead and give it a shot. It's still a SQL database, so it should work fine in those situations. If your goal is minimal programming as opposed to optimal performance, then why use Vertica at all? It sounds like something simpler would be a better fit. Vertica may or may not give better performance in a regular CRUD application environment, since it's not optimized for that, but you can always test both and see.

Vertica has many issues with high concurrency (many small transactions per minute). In MPP systems the data is segmented across the cluster, and a cluster-level lock has to be taken from time to time (mainly at commit time), so many commits mean many cluster-level X locks. High concurrency is rarely a requirement in DWH and reporting, so Vertica is perfect for that. In most cases OLTP solutions (CRM and the like) are required to provide high concurrency, and for that Vertica is a bad choice.

Thanks
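If commits really are the expensive step, as claimed above, grouping many small writes into fewer transactions reduces how often that cost is paid. The sketch below only counts commits; it uses SQLite as a stand-in (it has no cluster-level locks), and the table `t` and the `load_rows` helper are hypothetical.

```python
import sqlite3


def load_rows(conn, rows, batch_size):
    """Insert rows, committing once per batch; return the number of commits."""
    commits = 0
    for start in range(0, len(rows), batch_size):
        conn.executemany(
            "INSERT INTO t (v) VALUES (?)", rows[start:start + batch_size]
        )
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
rows = [(i,) for i in range(1000)]

per_row_commits = load_rows(conn, rows, batch_size=1)    # one commit per row
batched_commits = load_rows(conn, rows, batch_size=250)  # one commit per 250 rows
```

A typical CRUD application cannot batch like this, because each user action needs its own small transaction, which is the mismatch this answer points at.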

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow