Question

I am facing the challenge of writing a smarter/more advanced "related content" algorithm and don't know where to start, so I decided to post a question in the hope that someone can point me in the right direction.

Our database contains a lot of articles, and until now we have queried related articles using keywords/tags, but we found that we usually don't get very relevant results because most of the keywords are too general (e.g. government, taxes, ...).

The big idea is that we would somehow query the whole contents and try to match the content that is most relevant to the topic currently displayed. But at the same time the algorithm should also "know" if the matched content somehow has a negative meaning.

For example let's look at 3 imaginary articles:

  • an article that says how you can fly cheaper if you book tickets over the internet
  • an article that says that prices for flight tickets are dropping because of ....
  • an article that says that more than 300 people were killed in an airplane crash

In this case all three articles (their whole content) are somehow related to flying and planes, but the third one has a negative meaning. So the first two should be related to each other, but the third one should not be related to the first two in any way.

So my question is - how can something like this be done programmatically on a database with more than a million articles? I understand that this cannot be done with just an SQL query - you would somehow need a dictionary or something, but I don't know where to start exploring this topic. So please, can someone point me in the right direction?


Solution

TL;DR
Read about TF*IDF on Wikipedia, then about cosine similarity.

Long Answer (with examples)

What is TF*IDF

TF*IDF stands for Term Frequency * Inverse Document Frequency.
It is one way to create good tags for a document inside a large collection.
The idea behind it is to extract the words used in a single document that are descriptive of that document. To do this it combines two different statistics; the first one is the Term Frequency.

The Term Frequency is an indication of the importance of a word inside a single document or sentence.
For example, take the sentence

SQL Post. Asking about semantic in SQL with generic document example, SQL generic

Extracting the single words and removing the clutter (noise words), we have

Word     | Count | Frequency
----------------------------
SQL      |   3   |  0.231
generic  |   2   |  0.154
Post     |   1   |  0.077
Asking   |   1   |  0.077
semantic |   1   |  0.077
document |   1   |  0.077
example  |   1   |  0.077

Term frequency can be calculated in many ways; the table shows the simplest one, the raw count, and a normalized one, the frequency within the sentence. The second one is the better choice if we want the TF*IDF normalized between 0 and 1.
The example shows how the word "SQL" characterizes the sample sentence, so it is a good candidate for a tag.
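
To get a feel for the numbers, the raw count and frequency above can be reproduced with plain string splitting, without the full-text machinery used later. A minimal sketch, assuming SQL Server 2016+ for STRING_SPLIT and a sentence already stripped of punctuation by hand:

-- Quick check of the Term Frequency table above.
-- Noise words (about, in, with) still appear as rows here, while the table
-- above drops them; they are counted in the denominator in both cases.
DECLARE @txt NVARCHAR(200) =
    N'SQL Post Asking about semantic in SQL with generic document example SQL generic';

SELECT value                            AS word
     , COUNT(*)                         AS word_count
     , CAST(COUNT(*) AS DECIMAL(10, 8))
       / SUM(COUNT(*)) OVER ()          AS frequency   -- 3/13 = 0.231 for "SQL"
FROM   STRING_SPLIT(@txt, ' ')
GROUP BY value
ORDER BY word_count DESC;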

The Inverse Document Frequency is an indicator of the importance of a word across the whole document corpus: a word that appears in every document is not helpful for a search, while one that appears in only a few documents brings a lot more information.
The inverse document frequency is calculated as

IDF = ln(Document count / Document with the word)

For example, let's use as our corpus three new sentences plus our previous one

SQL Post. Asking about semantic in SQL with generic document example, SQL generic
C# Post. This is a C# answer with an example
SQL Post. Asking a good SQL question with an example
Math Post. This is a Math answer with an example of equation

If we calculate the IDF of the word "example", we have four documents in total and four documents in which "example" appears

IDF = ln(4/4) -> ln(1) -> 0

The word "example" is in every sentence, so it's not a good search item for our documents. Using the word "question" or "answer" instead we have

IDF = ln(4/1) -> ln(4) -> 1.386 for "question"
IDF = ln(4/2) -> ln(2) -> 0.693 for "answer"

Both are better choices to search for in our documents.
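
The IDF column can be checked the same way, again with STRING_SPLIT instead of the full-text parser; the #corpus table below is just the four example sentences re-typed without punctuation:

-- Hypothetical IDF check over the four example sentences.
SELECT 1 AS id, N'SQL Post Asking about semantic in SQL with generic document example SQL generic' AS txt
INTO   #corpus
UNION ALL SELECT 2, N'C# Post This is a C# answer with an example'
UNION ALL SELECT 3, N'SQL Post Asking a good SQL question with an example'
UNION ALL SELECT 4, N'Math Post This is a Math answer with an example of equation';

SELECT s.value AS word
     , COUNT(DISTINCT c.id) AS docs_with_word
       -- LOG() is the natural logarithm: ln(4/4) = 0 for "example", ln(4/1) = 1.386 for "question"
     , LOG(CAST((SELECT COUNT(*) FROM #corpus) AS FLOAT)
           / COUNT(DISTINCT c.id)) AS IDF
FROM   #corpus c
       CROSS APPLY STRING_SPLIT(c.txt, ' ') s
GROUP BY s.value
ORDER BY IDF DESC, word;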

Combining them we get a single indicator of how important a word is for a document and how good a choice it is within the document corpus.
Updating the above table with the IDF, we have

Word     | Frequency|  IDF  | TF*IDF
-------------------------------------
SQL      |  0.231   | 0.693 | 0.160
generic  |  0.154   | 1.386 | 0.213
Post     |  0.077   | 0     | 0
Asking   |  0.077   | 0.693 | 0.053
semantic |  0.077   | 1.386 | 0.107
document |  0.077   | 1.386 | 0.107
example  |  0.077   | 0     | 0

Using the TF*IDF we see that even though "SQL" was the most prominent word within the sentence, it is trumped by "generic" when we consider the whole list of documents, as "SQL" is present in two of them.

TF*IDF gives, for every sentence, a list of relevant words that are meaningful within the whole library of sentences/documents.

Calculating the list of word / TF*IDF pairs for every document is the starting line; to check the similarity of two or more documents we can use the cosine similarity.

What is Cosine Similarity

The cosine similarity can be viewed as a metric: a way to calculate the distance between two points. In our case the points are our sentences. To measure a distance we need the coordinates of the points; for the sentences, the coordinates are their lists of word TF*IDF values.

The formula of the cosine similarity is

sim = (A*B)/(||A||*||B||)

where A and B are the sentence coordinate vectors.
Expanding the formula from its vector form, it becomes

sim = Sum(A[word] * B[word])/(Sqrt(Sum(A[word]^2)) * Sqrt(Sum(B[word]^2)))

or

sim = cross_product/(norm(A) * norm(B))

where

cross_product = Sum(A[word] * B[word])
norm(X) = Sqrt(Sum(X[word]^2))

(despite the name, cross_product here is the ordinary dot product of the two word vectors)

For example, we can use the first and the third sentences from before; the word vector for the third one is

Word     | Frequency|  IDF  | TF*IDF
-------------------------------------
SQL      |  0.2     | 0.693 | 0.139
Asking   |  0.1     | 0.693 | 0.069
good     |  0.1     | 1.386 | 0.139
question |  0.1     | 1.386 | 0.139

For the cross product we need to do the math only for the words that appear in both vectors; for every other word one of the two factors is 0.

cross_product = 0.160*0.139 (SQL) + 0.053*0.069 (Asking) = 0.0259
norm(1) = sqrt(0.160^2 + 0.213^2 + 0.053^2 + 0.107^2 + 0.107^2) = 0.31090
norm(3) = sqrt(0.139^2 + 0.069^2 + 0.139^2 + 0.139^2) = 0.24992
sim = cross_product/(norm(1) * norm(3)) = 0.333

As the TF*IDF values are never negative, the cosine similarity can never be negative.

As the TF*IDF used here is normalized, the cosine similarity is bounded in [0;1], where 0 means no shared information and 1 means that the documents are identical, apart from noise words.

A SQL Server Example

SQL Server, with its full-text search, can be used to do the job; the basis for it is the function sys.dm_fts_parser which, given a string, returns a table with the single words, a value that indicates whether a word is noise, and other information.
To calculate the TF*IDF the most important thing is to split a document into words; every DB that can do this can do the job, and my choice of this one is merely based on my own experience.

Attention: sys.dm_fts_parser can be executed only by users that have the sysadmin role.

We begin by creating a temp table with test data

SELECT 1 AS Id, N'SQL Post. Asking about semantic in SQL with generic document'
              + N' example, SQL generic' AS txt 
INTO #testTable
UNION ALL SELECT 2, N'C# Post. This is a C# answer with an example' 
UNION ALL SELECT 3, N'SQL Post. Asking a good SQL question with an example' 
UNION ALL SELECT 4, N'Math Post. This is a Math answer with an example of'
                  + N' equation'

Then we go on to create the temp table with the sentence/word TF*IDF values

With TF AS (
SELECT DISTINCT id, display_term, special_term
     , CAST(COUNT(display_term) 
            OVER (PARTITION BY id, display_term) AS DECIMAL(10, 8))
     / COUNT(occurrence) OVER (PARTITION BY id) TF
FROM   #testTable
       CROSS APPLY sys.dm_fts_parser('"'+REPLACE(txt,'"','""')+'"', 1033, 0,0)
WHERE  TXT IS NOT NULL
  AND  display_term NOT LIKE 'nn%'
  AND  special_term <> 'End Of Sentence'
), IDF AS (
SELECT display_term word
     , sentences = COUNT(DISTINCT tt.ID)
     , sentence_with_word 
     = COUNT(DISTINCT CASE WHEN tt.txt LIKE '%' + tf.display_term + '%' 
                           THEN tt.id 
                           ELSE NULL 
                      END)
   , IDF = LOG(CAST(COUNT(DISTINCT tt.ID) AS DECIMAL (10, 8)) 
             / COUNT(DISTINCT CASE WHEN tt.txt LIKE '%' + tf.display_term + '%' 
                                   THEN tt.id 
                                   ELSE NULL 
                              END))
FROM   #testTable tt
       CROSS JOIN TF
WHERE  TF.special_term = 'Exact Match'
group by display_term
)
SELECT tf.Id sentence, word
     , TD = TF.TF, IDF.IDF
     , TD_IDF = TF.TF * IDF.IDF
INTO   #sentence_word_TD_IDF
FROM   TF
       INNER JOIN IDF ON tf.display_term = IDF.word
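
To compare the query output with the hand-built table from the first part, a quick look at sentence 1 is enough (values may differ slightly from the table because of rounding and the noise-word handling of the parser):

-- Inspect the vector of sentence 1, ordered like the TF*IDF table above
SELECT word, TD, IDF, TD_IDF
FROM   #sentence_word_TD_IDF
WHERE  sentence = 1
ORDER BY TD_IDF DESC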

Having the TF*IDF of every word in every sentence, we can go on to the cosine similarity between sentences 1 and 3

WITH S1 AS (
SELECT word, TD_IDF 
FROM   #sentence_word_TD_IDF 
WHERE  sentence = 1
), S2 AS (
SELECT word, TD_IDF 
FROM   #sentence_word_TD_IDF 
WHERE  sentence = 3
), cat AS (
SELECT word = COALESCE(S1.word, S2.word)
     , word_S1_TD_IDF = COALESCE(S1.TD_IDF, 0)
     , word_S2_TD_IDF = COALESCE(S2.TD_IDF, 0)
FROM   S1
       FULL JOIN S2 ON S1.word = S2.word
)
SELECT cross_product = SUM(word_S1_TD_IDF * word_S2_TD_IDF)
     , norm_1 = SQRT(SUM(word_S1_TD_IDF * word_S1_TD_IDF))
     , norm_2 = SQRT(SUM(word_S2_TD_IDF * word_S2_TD_IDF))
     , co_sim = SUM(word_S1_TD_IDF * word_S2_TD_IDF)
              / (SQRT(SUM(word_S1_TD_IDF * word_S1_TD_IDF)) 
               * SQRT(SUM(word_S2_TD_IDF * word_S2_TD_IDF)))
FROM   cat

The queries keep some partial steps to make them easier to check; for example, in the CTE IDF there are the columns 'sentences' and 'sentence_with_word' instead of only IDF.

Those queries can run on a table without a full-text index.
With a full-text index it's possible to get the stems, and so work with the roots of the words instead of the inflected forms that appear in the sentences.

To get a faster response to a query it's better to create a vector table with the TF*IDF value for every word of every document, and a many-to-many join table with the cosine similarity between every document and every other document, so that only the values for a new document or for a search query have to be calculated.
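
A possible layout for those two permanent tables (table and column names here are only a suggestion):

-- One row per (document, word) with its TF*IDF,
-- one row per document pair with the precomputed cosine similarity.
CREATE TABLE document_word_vector (
    document_id INT           NOT NULL,
    word        NVARCHAR(64)  NOT NULL,
    tf_idf      DECIMAL(10,8) NOT NULL,
    CONSTRAINT PK_document_word_vector PRIMARY KEY (document_id, word)
)

CREATE TABLE document_similarity (
    document_id_a INT           NOT NULL,
    document_id_b INT           NOT NULL,
    cosine_sim    DECIMAL(10,8) NOT NULL,
    CONSTRAINT PK_document_similarity PRIMARY KEY (document_id_a, document_id_b)
)

-- "Related content" for an article is then a plain indexed lookup, e.g.
-- SELECT TOP (5) document_id_b
-- FROM   document_similarity
-- WHERE  document_id_a = @current_article
-- ORDER BY cosine_sim DESC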

Further step: Polarize

All of that is not enough: the OP asked for a way to tell apart similar sentences with a different "emotion", that is, good news and bad news should never be related, even if the similarity is high.
To do this we have to trick the cosine similarity; the lever is in the observation made above

As the TF*IDF values are never negative, the cosine similarity can never be negative.

It is possible to categorize an article by searching for bad keywords; if something is found, the TF*IDF values for the whole document are changed to negative.
The cosine similarity between a good news item and a bad one will then be calculated with positive (for the good news) and negative (for the bad news) TF*IDF values; that means that if there are common words the cross product will be negative, and a negative cross product becomes a negative similarity, as SQRT(TD_IDF ^ 2) stays positive regardless of the sign of TD_IDF.
For the cosine similarity of two news items with the same "emotion" we'll have TF*IDF values with the same sign, which combine into a positive cross product and a positive similarity.
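
A minimal sketch of that step against the temp table built above, assuming a hypothetical #bad_keywords table with one negative keyword per row (building a good keyword list is the hard part and is not covered here):

-- Flip the sign of every TD_IDF value of a sentence that contains
-- at least one "bad" keyword; #bad_keywords(word) is hypothetical.
UPDATE v
SET    TD_IDF = -ABS(v.TD_IDF)
FROM   #sentence_word_TD_IDF v
WHERE  EXISTS (SELECT 1
               FROM   #sentence_word_TD_IDF w
                      INNER JOIN #bad_keywords b ON b.word = w.word
               WHERE  w.sentence = v.sentence)

The cosine similarity query shown before then needs no change.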

With this change the boundaries of the cosine similarity become ]-1;1]

  • -1 can never be reached, as at least one word needs to be different to carry the different emotion
  • a negative number means similar documents with opposite "emotion"
  • 0 means no correlation between the documents
  • a positive number means similar documents with the same "emotion"
  • 1 means identical documents, apart from noise words (and those have to appear in the same numbers)