Question

How can NoSQL databases like MongoDB be used for data analysis? What are the features in them that can make data analysis faster and powerful?

Was it helpful?

Solution

To be perfectly honest, most NoSQL databases are not very well suited to applications in big data. For the vast majority of all big data applications, the performance of MongoDB compared to a relational database like MySQL is significantly is poor enough to warrant staying away from something like MongoDB entirely.

With that said, there are a couple of really useful properties of NoSQL databases that certainly work in your favor when you're working with large data sets, though the chance of those benefits outweighing the generally poor performance of NoSQL compared to SQL for read-intensive operations (most similar to typical big data use cases) is low.

  • No Schema - If you're working with a lot of unstructured data, it might be hard to actually decide on and rigidly apply a schema. NoSQL databases in general are very supporting of this, and will allow you to insert schema-less documents on the fly, which is certainly not something an SQL database will support.
  • JSON - If you happen to be working with JSON-style documents instead of with CSV files, then you'll see a lot of advantage in using something like MongoDB for a database-layer. Generally the workflow savings don't outweigh the increased query-times though.
  • Ease of Use - I'm not saying that SQL databases are always hard to use, or that Cassandra is the easiest thing in the world to set up, but in general NoSQL databases are easier to set up and use than SQL databases. MongoDB is a particularly strong example of this, known for being one of the easiest database layers to use (outside of SQLite). SQL also deals with a lot of normalization and there's a large legacy of SQL best practices that just generally bogs down the development process.

Personally I might suggest you also check out graph databases such as Neo4j that show really good performance for certain types of queries if you're looking into picking out a backend for your data science applications.

OTHER TIPS

One benefit of the schema-free NoSQL approach is that you don't commit prematurely and you can apply the right schema at query time using an appropriate tool like Apache Drill. See this presentation for details. MySQL wouldn't be my first choice in a big data setting.

Consider, try, and perhaps even use multiple databases. It's not just a "performance" issue at play here. It's really going to come down to your requirements. How much data are you talking about? what kind of data? how fast do you need it? Are you more read heavy or write heavy?

Here's one thing you can't do in a SQL database: Calculate sentiment. http://www.slideshare.net/shift8/mongodb-machine-learning

Of course the speed in that case may not be fast enough for your needs, but it is something that's possible. With some caching of specific aggregate values, it was quite acceptable even. Why would you do this? Convenience.

Convenience really is something that you're going to be persuaded by. That's exactly why (in my opinion) NoSQL databases were created. Performance too of course, but I'm trying to discount benchmarks and focus more on other concerns.

MongoDB (and some other NoSQL) databases have some very powerful features such as built-in map/reduce. This could result in a savings both in cost and time over using something like Hadoop. Or it could provide a prototype or MVP to launch a larger business.

What about graph databases? They're "NoSQL" too. Look at databases like OrientDB. If you want to argue performance ...I don't think you're gonna show me a SQL database that's faster there =) ...and graph databases have some really amazing application based on what you need to do.

Rule of technology (and the internet) don't get too comfortable with one thing. You're gonna be limited and set yourself up for failure.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top