Question

I have never designed a database schema before, neither in NoSQL nor in an RDBMS, but for my recent big data project I need to store and query a large amount of data in MongoDB (3.6). I tried it myself, but querying was hard and slow, and I need some advice. Please help me out.


The first collection is 'news'.

It has 'oid', 'aid', 'date', 'section', 'rank', 'title'.

'oid' and 'aid' together form the unique ID, and I need to be able to look up 'date', 'section', 'rank', and 'title' by querying them.

My current schema design is:

{
  "1234567890123": {    // oid + aid
    "date": "20180126",
    "section": "100",
    "rank": "1",
    "title": "Linus: Intel patch is a piece of shit ",
    "oid": "123",
    "aid": "4567890123"
  },
  "1234567890124": {   // oid + aid
    "date": "20180126",
    "section": "101",
    "rank": "2",
    "title": "Intel sues Torvalds for claiming fact",
    "oid": "123",
    "aid": "4567890124"
  }
}

About 180 of these news items will be stored per day.


The second collection is 'comments'. Obviously, it has comments for items in 'news'.

Each comment document holds the comments for a specific item ('oid', 'aid') in the 'news' collection.

The document-level fields are 'oid', 'aid', 'date', 'commentCount' and 'deletedCommentsCount', and each comment inside it has 'uid', 'id', 'time', 'upvotes', 'downvotes' and 'contents'.

My current schema design is:

{
  "date": "20180126",   //Same with one in 'news'
  "oid": "123",
  "aid": "4567890123",
  "commentCount": "7",
  "deletedCommentsCount": "0",
  "reply": {
    "12345678901230": {         // oid + aid + commentid
      "id": "john_skeet",
      "time": "2018-01-26 07:32:45",
      "upvotes": "2",
      "downvotes": "0",
      "contents": "So true!"
    },
    "12345678901231": {         // oid + aid + commentid
      "id": "sundar_pichai",
      "time": "2018-01-26 07:35:46",
      "upvotes": "5",
      "downvotes": "0",
      "contents": "@john_skeet Get back to work!"
    }
  }
}

Per day there are about 180 news items × roughly 1,160 comments each, so about 200,000 comments will be stored (around 65 MB/day).


I need to:

  • query all comments for a specific 'oid'
  • query all comments for a specific 'oid' and 'aid'
  • query all comments containing a specific word in 'contents'
  • query all comments by a specific commenter 'id'

... and so on.

Can you tell me the proper way to design a database for this data? And, if you can, the query statements too. Any comments or answers will be appreciated. Thanks.


Solution

Arthur!

I agree that the most convenient data model here is two collections, but as it stands you effectively have 1.5 collections.

Let me explain. Your comments collection stores almost all of your data; only the section, rank and title fields live exclusively in the news collection. You are also embedding all of an article's comments inside a single document in the comments collection, which is inefficient given the volume of comments you have.
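
For the news collection itself, one document per article works well. Using the fields and sample values from your question, an article document could look like this (just a sketch, keep your real values):

{
  "oid": "123",
  "aid": "4567890123",
  "date": "20180126",
  "section": "100",
  "rank": "1",
  "title": "Linus: Intel patch is a piece of shit"
}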

The schema I recommend you for the comments collection is simply to store one document per comment, referencing it to its news, so:

{
  "news_date": "20180126",
  "news_oid": "123",
  "news_aid": "4567890123",
  "commentCount": "7",
  "deletedCommentsCount": "0",
  "comment_id": "0",
  "id": "john_skeet",
  "time": "2018-01-26 07:32:45",
  "upvotes": "2",
  "downvotes": "0",
  "contents": "So true!"
}
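
The next comment on the same article is just another document in the collection; for example, the second comment from your sample would become (the comment_id value here is only illustrative):

{
  "news_date": "20180126",
  "news_oid": "123",
  "news_aid": "4567890123",
  "commentCount": "7",
  "deletedCommentsCount": "0",
  "comment_id": "1",
  "id": "sundar_pichai",
  "time": "2018-01-26 07:35:46",
  "upvotes": "5",
  "downvotes": "0",
  "contents": "@john_skeet Get back to work!"
}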

In this way your required queries would be:

db.comments.find({ 'news_oid' : '123' })
db.comments.find({ 'news_oid' : '123', 'news_aid' : '4567890123' })
db.comments.find({ 'comment_id' : '0' })
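
And for your fourth requirement (all comments written by a specific commenter), you would filter on the 'id' field, assuming it keeps holding the commenter's handle as in the sample document:

db.comments.find({ 'id' : 'john_skeet' })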

The indexes you'd create would be:

db.comments.createIndex({ 'news_oid' : 1 }, { background : true })
db.comments.createIndex({ 'news_oid' : 1, 'news_aid' : 1 }, { background : true })
db.comments.createIndex({ 'comment_id' : 1 }, { background : true })
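
If you also query by commenter 'id' regularly, an index on that field would help as well:

db.comments.createIndex({ 'id' : 1 }, { background : true })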

For searching within comment 'contents', I recommend using MongoDB's full text search capabilities.
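
For example, a text index on 'contents' lets you search comments for a word (the search term below is just an example):

db.comments.createIndex({ 'contents' : 'text' })
db.comments.find({ $text : { $search : 'work' } })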

Please tell me if you need anything else!

Licensed under: CC-BY-SA with attribution