Question

I have an application that performs association rule mining against a large text corpus. Itemsets that are generated have the following structure:

Item1, Item2, Item3, Item4, Frequency

In this case, all items are words (strings) while the frequency is an integer. So far, I have used MySQL to store the itemsets. However, the database gets extremely large, and it was suggested that I use a NoSQL database, with a focus on Redis since it has excellent support for various data types (note that I do not have much experience with Redis or any other NoSQL database).

My questions are:

  1. What is the most appropriate data structure to use to store these itemsets?
  2. How would I query my database to retrieve itemsets that start with a particular word?

Edit: Example data would be (items separated with " - ", the last item is the frequency):

In - this - case - 3
Other - items - 2
This - is - an - 5
Lorem - ipsum - 3
In - other - terms - 2

A query would be:
Find all itemsets and their frequency where the first item is the word 'In'. The query should return:

In - this - case - 3
In - other - terms - 2

Solution

To achieve behavior similar to a SQL LIKE condition, any of the following could be done:

Solution 1:

Example Dataset

In - this - case - 3
Other - items - 2
This - is - an - 5
Lorem - ipsum - 3
In - other - terms - 2

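Ans 1: A list or a set could be used as the data structure, depending on usage. In your case several itemsets share the same first word ("In"), so all itemsets that start with a given word are stored in one list keyed by that word.

Ans 2: This is how a list can be used:

Keep in mind that a Redis list behaves like a linked list: pushing at either end is cheap, while access by index in the middle of a long list is slower.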

$ redis-cli lpush In.list "In - this - case - 3"
(integer) 1

$ redis-cli lpush Other.list "Other - items - 2"
(integer) 1

$ redis-cli lpush This.list "This - is - an - 5"
(integer) 1

$ redis-cli lpush Lorem.list "Lorem - ipsum - 3"
(integer) 1

$ redis-cli lpush In.list "In - other - terms - 2"
(integer) 2

$ redis-cli lrange In.list 0 -1
1) "In - other - terms - 2"
2) "In - this - case - 3"
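
The same idea can be sketched from application code. This is a minimal Python sketch, not a definitive implementation: the helper names `store_itemset` and `find_by_first_word` are hypothetical, and `client` is assumed to be a redis-py style object created with `decode_responses=True` so that values come back as strings. It also assumes that no word contains the " - " separator.

```python
# Sketch only: `client` is assumed to be a redis-py style object, for example
# redis.Redis(decode_responses=True). Any object with rpush/lrange works.

def store_itemset(client, items, frequency):
    """Append one itemset to the list keyed by its first word."""
    key = f"{items[0]}.list"
    # Same "word - word - frequency" layout as the redis-cli example.
    value = " - ".join(list(items) + [str(frequency)])
    client.rpush(key, value)


def find_by_first_word(client, word):
    """Return (items, frequency) pairs for itemsets starting with `word`."""
    results = []
    for raw in client.lrange(f"{word}.list", 0, -1):
        # The last field is the frequency, everything before it is an item.
        *items, freq = raw.split(" - ")
        results.append((items, int(freq)))
    return results
```

Because `rpush` appends, `find_by_first_word(client, "In")` returns the matching itemsets in insertion order, whereas the `lpush` transcript above returns the newest entry first.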

Solution 2:

Another solution would be to use lists again:

We will have four main lists which behave like columns in a database table, plus a separate list for each first word. Each per-word list stores the row indexes at which that word appears in the first (key) column.

Sample data can be depicted as:

Index Column1 Column2 Column3 Column4

 1    In         this    case     3
 2    Other      items   " "      2
 3    This       is      an       5
 4    Lorem      ipsum   " "      3
 5    In         other   terms    2

This depiction is valid if each row has at most 4 values. We can also have dynamic columns. For dynamic columns, the 1st column would be the key, the 2nd column would be the numeric part (the frequency), and the remaining columns would hold the strings.

Index Column1 Column2 Column3 Column4 Column5

 1    In         3      this    case     " "
 2    Other      2      items   " "      " "
 3    This       5      is      an       " "
 4    Lorem      3      ipsum   " "      " "
 5    In         2      other   terms    " "
 6    Hello      4      world   !         !

Continuing with fixed 4 columns solution:

   # Redis list indexes are 0-based and rpush appends to the tail, so
   # row N of the table above is stored at index N-1 in every column list.

   # first row (index 0)
   $ redis-cli rpush column1 "In"
   (integer) 1

   $ redis-cli rpush In.list 0
   (integer) 1

   $ redis-cli rpush column2 "this"
   (integer) 1
   $ redis-cli rpush column3 "case"
   (integer) 1
   $ redis-cli rpush column4 3
   (integer) 1

   # second row (index 1)
   $ redis-cli rpush column1 "Other"
   (integer) 2

   $ redis-cli rpush Other.list 1
   (integer) 1

   $ redis-cli rpush column2 "items"
   (integer) 2
   $ redis-cli rpush column3 " "
   (integer) 2
   $ redis-cli rpush column4 2
   (integer) 2

   # add the 3rd and 4th rows the same way, then the 5th row (index 4)
   $ redis-cli rpush column1 "In"
   (integer) 5

   $ redis-cli rpush In.list 4
   (integer) 2

   $ redis-cli rpush column2 "other"
   (integer) 5
   $ redis-cli rpush column3 "terms"
   (integer) 5
   $ redis-cli rpush column4 2
   (integer) 5

   To fetch data you can do something like:
   $ redis-cli lrange In.list 0 -1
   1) "0"
   2) "4"

   Using these two values as indexes, query the columns:
   $ redis-cli lindex column1 4
   "In"

   $ redis-cli lindex column2 4
   "other"
   $ redis-cli lindex column3 4
   "terms"
   $ redis-cli lindex column4 4
   "2"

The second solution introduces the cost of inserting each string into a separate list, although you could use bulk operations (such as pipelining) to perform the inserts. Blanks are also stored so that every row has the same well-defined shape.
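
The column layout of Solution 2 could look roughly like this from application code. This is a hedged sketch under several assumptions: the helper names `add_row` and `rows_starting_with` are hypothetical, `client` is a redis-py style object created with `decode_responses=True`, rows are appended with `rpush` so that indexes are 0-based, and there is a single writer (concurrent writers would need a transaction to keep the columns aligned).

```python
# Sketch only: `client` is assumed to be a redis-py style object, for example
# redis.Redis(decode_responses=True). Assumes a single writer at a time.

COLUMNS = ("column1", "column2", "column3", "column4")


def add_row(client, item1, item2, item3, frequency):
    """Append one row across the four column lists and index its first word."""
    # rpush returns the new list length, so the 0-based row index is length - 1.
    row_index = client.rpush("column1", item1) - 1
    client.rpush("column2", item2)
    client.rpush("column3", item3)
    client.rpush("column4", frequency)
    # Per-word list holding the row indexes where this word is the first item.
    client.rpush(f"{item1}.list", row_index)
    return row_index


def rows_starting_with(client, word):
    """Return one tuple per row whose first item is `word`."""
    rows = []
    for idx in client.lrange(f"{word}.list", 0, -1):
        i = int(idx)
        rows.append(tuple(client.lindex(col, i) for col in COLUMNS))
    return rows
```

Each lookup costs one `lrange` plus four `lindex` calls per matching row, which is the read-side price of splitting a row across several lists. Frequencies come back as strings and need converting with `int()` if used numerically.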

Solution 3:

Create a structure for each row, serialize it, and store it in the list for its key.

row 1 "In,this,case,3"

 lpush In.list StructureRepresent1stRow

This solution could be chosen if you want to use structures and have complex values to store.
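
One way to serialize a row is JSON. The following is a minimal sketch rather than the method the answer prescribes: the format, the field names `items` and `frequency`, and the helper names `push_row` and `load_rows` are all assumptions made for illustration, and `client` is again assumed to be a redis-py style object.

```python
import json

# Sketch only: `client` is assumed to be a redis-py style object.
# JSON is one possible serialization; the answer does not name a format.


def push_row(client, items, frequency):
    """Serialize one itemset and append it to the list for its first word."""
    payload = json.dumps({"items": list(items), "frequency": frequency})
    client.rpush(f"{items[0]}.list", payload)


def load_rows(client, word):
    """Deserialize every itemset stored under the given first word."""
    return [json.loads(raw) for raw in client.lrange(f"{word}.list", 0, -1)]
```

Compared with the " - " separated strings of Solution 1, a serialized structure keeps the frequency as a number and does not break if a word happens to contain the separator.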

Licensed under: CC-BY-SA with attribution