Question

I have a set of n sentences (n >= 100k).

I want to run queries of the following sort: given a set of n_i sentences, I return the set of m_i words that are present in one subset of those sentences and not present in the other.

Which database is appropriate for my use case? I have looked into SQLite using SQLAlchemy, and I haven't quite been able to figure out how to store these m words against each of the n strings.

Also, I would prefer something that has Python APIs. Should I be using a graph database? I am a total noob at this stuff, so any help would be greatly appreciated.


Solution

This sounds more like a programming problem than a data storage and retrieval problem. It probably won't matter where you store the data. You could just use something like Azure Blob storage for the data, and provision a Spark cluster to process the data at scale, using Python. - David Browne
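To illustrate the "programming problem" view, here is a minimal sketch in plain Python, assuming that a "word" is just a lowercase run of letters, digits or apostrophes, and that the two subsets are handed over as lists of strings. The function names (words, vocabulary, words_only_in) and the sample sentences are made up for this example and are not from the question. At around 100k sentences this should fit comfortably in memory on one machine, so a cluster would only be needed for much larger data.

```python
import re

def words(sentence):
    """Lowercase a sentence and split it into a set of words (assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def vocabulary(sentences):
    """Union of the word sets of all the given sentences."""
    vocab = set()
    for s in sentences:
        vocab |= words(s)
    return vocab

def words_only_in(subset_a, subset_b):
    """Words that occur somewhere in subset_a but nowhere in subset_b."""
    return vocabulary(subset_a) - vocabulary(subset_b)

# Hypothetical sample data, only to show the call.
subset_a = ["the cat sat on the mat", "a dog barked"]
subset_b = ["the dog sat", "a bird sang on the mat"]
print(sorted(words_only_in(subset_a, subset_b)))
```

If the same sentences are queried many times, it may be worth computing words(sentence) once per sentence and caching the resulting sets, so that each query is only a handful of set unions and one difference.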

If you want an Open Source database, then go for PostgreSQL - you will find that it's infinitely more capable than MySQL for... well, virtually everything! You might be able to do some sort of CROSS JOIN (LATERAL) between your 50-mers using LAG and proceed from there.

Using joins is pure speculation on my part. Python would probably be better for this. With respect to David Browne's suggestion, a programming solution may be required. Having said that, SQL (+ WITH RECURSIVE clause) is Turing complete, so in theory, anything's possible.
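For the narrower "words in one subset of sentences but not in the other" query, a plain set operation in SQL may already be enough, without joins or recursion. The following is only a sketch under assumptions: it uses Python's built-in sqlite3 module with an in-memory database, splits words on whitespace, and stores one row per (sentence, word) pair. The table names (sentence, sentence_word) and the helpers add_sentence and words_only_in_sql are hypothetical, chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentence (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    -- One row per distinct word in each sentence.
    CREATE TABLE sentence_word (
        sentence_id INTEGER NOT NULL REFERENCES sentence(id),
        word        TEXT    NOT NULL,
        PRIMARY KEY (sentence_id, word)
    );
""")

def add_sentence(body):
    """Store a sentence and its distinct (whitespace-split, lowercased) words."""
    sid = conn.execute("INSERT INTO sentence (body) VALUES (?)", (body,)).lastrowid
    conn.executemany(
        "INSERT OR IGNORE INTO sentence_word (sentence_id, word) VALUES (?, ?)",
        [(sid, w) for w in body.lower().split()],
    )
    return sid

def words_only_in_sql(ids_a, ids_b):
    """Words attached to sentences in ids_a but to no sentence in ids_b."""
    marks_a = ",".join("?" * len(ids_a))
    marks_b = ",".join("?" * len(ids_b))
    sql = f"""
        SELECT word FROM sentence_word WHERE sentence_id IN ({marks_a})
        EXCEPT
        SELECT word FROM sentence_word WHERE sentence_id IN ({marks_b})
    """
    return {row[0] for row in conn.execute(sql, [*ids_a, *ids_b])}

# Hypothetical sample data, only to show the call.
ids_a = [add_sentence("the cat sat on the mat"), add_sentence("a dog barked")]
ids_b = [add_sentence("the dog sat"), add_sentence("a bird sang on the mat")]
print(sorted(words_only_in_sql(ids_a, ids_b)))
```

SQLite limits the number of bound parameters per statement, so for very large subsets it would probably be better to load the sentence ids into a temporary table and filter against that instead of using long IN lists. The same EXCEPT query should carry over to PostgreSQL with only the placeholder syntax of the driver changing.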

I suggest that you start a new question with some sample data. Put that sample data in dbfiddle.co.uk (or similar). The sample data doesn't have to be the same size as your real data. To start with, think of it as a proof of concept, so that people can understand the core of the problem - for this, actual size is unimportant. Use, say, sequences of 100 and 5-mers as an example? Concentrate on asking about the SQL problem rather than which server to use - you could though state a preference for Open Source? Python works with everything! - Vérace

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange