I am working on creating a very large inverted index of terms. Which of the following table designs would you suggest?

First

termId -> docId
  a        doc2[locations],doc5[locations],doc12[locations] 
  b        doc5[locations],doc7[locations],doc4[locations] 

Second

termId -> docId
  a        doc2[locations]
  a        doc5[locations]
  a        doc12[locations]
  b        doc5[locations]
  b        doc7[locations] 
  b        doc4[locations]  

P.S. Lucene is not an option.


Solution

The right table design depends on how you plan on using the data. If you plan on using strings like "doc2[locations],doc5[locations],doc12[locations]" as is -- without any further postprocessing, then your First design is fine.

But if, as your question tacitly suggests, you may at times want to treat doc2[locations], doc5[locations], etc. as separate entities, then you should definitely use your Second design.
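As a rough illustration of the Second design, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table name `postings`, the column names, and the comma-separated encoding of locations are all assumptions made for the example; they do not come from the question.

```python
import sqlite3

# Hypothetical schema for the Second design: one row per (term, document) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE postings (
        term_id   TEXT NOT NULL,
        doc_id    TEXT NOT NULL,
        locations TEXT NOT NULL,   -- e.g. '3,17,42'; this encoding is an assumption
        PRIMARY KEY (term_id, doc_id)
    )
    """
)

# Sample data mirroring the terms and documents in the question.
rows = [
    ("a", "doc2", "1,9"),
    ("a", "doc5", "4"),
    ("a", "doc12", "2,30"),
    ("b", "doc5", "7"),
    ("b", "doc7", "11,12"),
    ("b", "doc4", "5"),
]
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)

# Each posting comes back as its own row, so nothing needs to be split.
docs_for_a = [
    doc_id
    for (doc_id,) in conn.execute(
        "SELECT doc_id FROM postings WHERE term_id = ? ORDER BY rowid", ("a",)
    )
]
print(docs_for_a)
```

The composite primary key on (term_id, doc_id) also prevents the same document from being listed twice for a term, which the packed-string layout cannot enforce.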

Here are some use cases which show why the Second design is better:

  • If you use First and ask for all docs with termId = a, then you get back a string like doc2[locations],doc5[locations],doc12[locations], which you then have to split.

    If you use Second, you get each doc as a separate row. No splitting!

    The Second structure is more convenient.

  • Or, suppose at some point doc5[locations] changes and you need to update your table. If you use the First design, you'd have to use some relatively complicated MySQL string function to find and replace the substring in every row that contains it. (Note that MySQL versions before 8.0 do not come with regex substitution built in; REGEXP_REPLACE() was added in 8.0.)

    If you use the Second design, updating is easy:

    UPDATE your_table SET docId = 'newdoc5[locations]' WHERE docId = 'doc5[locations]';
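The two use cases above can be sketched side by side. This is only an illustration under stated assumptions: it uses Python's sqlite3 module rather than MySQL, and the table names `first_design` and `second_design`, the column names, and the sample location values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# First design: all postings for a term packed into one string.
conn.execute("CREATE TABLE first_design (term_id TEXT PRIMARY KEY, postings TEXT)")
conn.executemany(
    "INSERT INTO first_design VALUES (?, ?)",
    [("a", "doc2[1],doc5[4],doc12[2]"), ("b", "doc5[4],doc7[11],doc4[5]")],
)

# Second design: one row per (term, document) pair.
conn.execute("CREATE TABLE second_design (term_id TEXT, doc_id TEXT)")
conn.executemany(
    "INSERT INTO second_design VALUES (?, ?)",
    [("a", "doc2[1]"), ("a", "doc5[4]"), ("a", "doc12[2]"),
     ("b", "doc5[4]"), ("b", "doc7[11]"), ("b", "doc4[5]")],
)

# Lookup with First: the application has to split the string itself.
(packed,) = conn.execute(
    "SELECT postings FROM first_design WHERE term_id = 'a'"
).fetchone()
first_docs = packed.split(",")

# Lookup with Second: rows arrive already separated.
second_docs = [
    d for (d,) in conn.execute(
        "SELECT doc_id FROM second_design WHERE term_id = 'a' ORDER BY rowid"
    )
]

# Update with Second: a plain equality match touches exactly the right rows.
cur = conn.execute(
    "UPDATE second_design SET doc_id = ? WHERE doc_id = ?",
    ("doc5[8]", "doc5[4]"),
)
updated_rows = cur.rowcount  # number of rows changed by the UPDATE
```

Both lookups yield the same list of postings, but only the Second design gets there without string handling in the application, and its update is a single statement with an exact match.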
    
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow