Question

Our project is looking to build a large database, and I am seeking the terms, techniques, and/or technologies to research for our implementation. I doubt our project is anything new, so I want to leverage existing common practice rather than learn everything from scratch.

A contrived, but very applicable, example of our project is a hierarchy like this:

  • There will exist a few PublishingHouses (order of 10, total)
  • PublishingHouses will have Publishers (order of 100, total)
  • Publishers will have Authors (order of 1,000, total)
  • And Authors will have Books (order of 10,000, total)
  • There will be Readers, who will have a record/review of Books (order of 5M, total)

A common reporting item for our system will be for a Publisher or Author to log into the system and gather the reviews of Readers. The trick is, they must only be able to see the Readers associated with the Books they control.

Our concern is that each query for a reporting action will have to sift through 5M Reader reviews to know if they match the PublishingHouse, Publisher, Author and/or Book in question.

What are the terms, techniques and/or technologies best suited to solve this problem? Could you explain why that would apply to our problem-set? I have more research to do, but hopefully your experience and answers will point us in the right direction.

Thanks!

(I still need more information, but my current solution is a set of joining tables for PublishingHouses to Publishers, Publishers to Authors, and Authors to Books, with cascading JOINs when finding the Readers to ensure I have the right set. I've also heard talk of "Views" that might apply here.)
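To make the cascading-JOIN idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever relational database is eventually chosen. All table, column, and function names (review_owners, reviews_for_publisher, and so on) are hypothetical, and the sketch assumes each Author belongs to exactly one Publisher and each Publisher to exactly one PublishingHouse, so plain foreign-key columns replace separate joining tables. The view wraps the JOIN chain once, so a report only has to filter on the level it cares about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one foreign-key column per level of the hierarchy.
# (SQLite does not enforce REFERENCES unless PRAGMA foreign_keys = ON.)
conn.executescript("""
CREATE TABLE publishing_houses (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE publishers (id INTEGER PRIMARY KEY,
                         house_id INTEGER REFERENCES publishing_houses(id),
                         name TEXT);
CREATE TABLE authors (id INTEGER PRIMARY KEY,
                      publisher_id INTEGER REFERENCES publishers(id),
                      name TEXT);
CREATE TABLE books (id INTEGER PRIMARY KEY,
                    author_id INTEGER REFERENCES authors(id),
                    title TEXT);
CREATE TABLE reviews (id INTEGER PRIMARY KEY,
                      book_id INTEGER REFERENCES books(id),
                      reader TEXT, body TEXT);

-- The view hides the cascading JOINs behind a single name.
CREATE VIEW review_owners AS
SELECT r.id AS review_id, r.reader, r.body,
       b.id AS book_id, a.id AS author_id,
       p.id AS publisher_id, p.house_id AS house_id
FROM reviews r
JOIN books b ON b.id = r.book_id
JOIN authors a ON a.id = b.author_id
JOIN publishers p ON p.id = a.publisher_id;
""")

# Tiny made-up data set: two publishers in one house, one author and book each.
conn.execute("INSERT INTO publishing_houses VALUES (1, 'House A')")
conn.executemany("INSERT INTO publishers VALUES (?, ?, ?)",
                 [(1, 1, "Pub One"), (2, 1, "Pub Two")])
conn.executemany("INSERT INTO authors VALUES (?, ?, ?)",
                 [(1, 1, "Alice"), (2, 2, "Bob")])
conn.executemany("INSERT INTO books VALUES (?, ?, ?)",
                 [(1, 1, "Alice Book"), (2, 2, "Bob Book")])
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)",
                 [(1, 1, "reader1", "Great"),
                  (2, 2, "reader2", "Okay"),
                  (3, 1, "reader3", "Fine")])


def reviews_for_publisher(conn, publisher_id):
    """Return (reader, body) pairs for books controlled by one publisher."""
    return conn.execute(
        "SELECT reader, body FROM review_owners "
        "WHERE publisher_id = ? ORDER BY review_id",
        (publisher_id,),
    ).fetchall()


print(reviews_for_publisher(conn, 1))
```

The same view can be filtered on author_id or house_id for the other report levels. If an Author can work with several Publishers, the foreign-key columns would need to become real joining tables, and the view's JOIN chain would grow accordingly.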


Solution

This sounds like a classic use case for a relational database (MySQL, Oracle, etc.). I wouldn't worry too much about having 5M rows: if the lookup columns are indexed (i.e. you spend some additional disk space to get fast lookups), you will be able to search and join without a problem.
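As a rough illustration of what indexing the lookup column buys, the sketch below asks SQLite (via Python's built-in sqlite3 module) for its query plan before and after creating an index. The reviews table, its columns, and the index name idx_reviews_book_id are assumptions made up for the example, and MySQL or Oracle have their own EXPLAIN syntax and output, so treat this as a demonstration of the idea rather than of any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    "id INTEGER PRIMARY KEY, book_id INTEGER, reader TEXT, body TEXT)"
)
# 10,000 made-up reviews spread evenly over 100 hypothetical books.
conn.executemany(
    "INSERT INTO reviews (book_id, reader, body) VALUES (?, ?, ?)",
    [(i % 100, f"reader{i}", "text") for i in range(10_000)],
)

QUERY = "SELECT reader, body FROM reviews WHERE book_id = ?"


def plan(conn):
    """Return SQLite's plan for QUERY as one string (detail is column 3)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    return " | ".join(row[3] for row in rows)


# Without an index the only option is to read every row of the table.
plan_before = plan(conn)

# The index costs extra storage but lets the engine jump to matching rows.
conn.execute("CREATE INDEX idx_reviews_book_id ON reviews (book_id)")
plan_after = plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should report a scan of the whole table and the second a search using the new index. The same check is worth running on the real database for each foreign-key column the reports filter or join on.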

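If your 'order of' values are per-parent figures rather than totals (10 houses × 100 publishers each × 1,000 authors each × 10,000 books each), you are looking at something like 10,000,000,000 books, so this would be your main size issue. At a measly 1,000 characters per review, with even one review per book, that is about 10TB of data for the reviews alone. At that scale, it might be worth starting to look at 'Big Data' solutions such as Hadoop/HBase. However, these are typically not optimised for fast lookups and are designed more for batch-job analytics, so they would need some tweaking for what you want. If the figures really are totals, as the question states, 5M reviews at 1,000 characters each come to roughly 5GB, which a single relational database handles comfortably.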
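The size estimate is easy to check with a few lines of arithmetic. The sketch below works through both readings of the numbers in the question: per-parent counts multiplied down the hierarchy, and the totals as literally stated. It assumes one byte per character, one review per book in the multiplied case, and decimal units (1TB = 10^12 bytes); the variable names are only illustrative.

```python
CHARS_PER_REVIEW = 1_000  # assumed average review length, 1 byte per character

# Reading 1: each 'order of' figure is per parent, so the levels multiply.
houses = 10
publishers_per_house = 100
authors_per_publisher = 1_000
books_per_author = 10_000
books_multiplied = (
    houses * publishers_per_house * authors_per_publisher * books_per_author
)
# One review per book is already enough to reach this size.
bytes_multiplied = books_multiplied * CHARS_PER_REVIEW

# Reading 2: the figures are totals, as the question states.
reviews_total = 5_000_000
bytes_total = reviews_total * CHARS_PER_REVIEW

print(f"multiplied: {books_multiplied:,} books, {bytes_multiplied / 1e12:.0f} TB")
print(f"totals:     {reviews_total:,} reviews, {bytes_total / 1e9:.0f} GB")
```

The gap between the two readings is more than three orders of magnitude, so it is worth pinning down which one applies before choosing between a single indexed relational database and a distributed system.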

Hope that helps!

Licensed under: CC-BY-SA with attribution