TL;DR
Read about TF*IDF on Wikipedia, then about cosine similarity.
Long Answer (with examples)
What is TF*IDF
TF*IDF stands for Term Frequency * Inverse Document Frequency.
It's one method to create good tags for a document within a large collection.
The idea behind it is to extract which words used in a single document are descriptive of that same document.
To do this it uses two different statistics; the first one is the Term Frequency.
The Term Frequency is an indication of the importance of a word inside a single document or sentence.
For example the sentence
SQL Post. Asking about semantic in SQL with generic document example, SQL generic
extracting the single words and removing the clutter (noise words), we get
Word | Count | Frequency
----------------------------
SQL | 3 | 0.231
generic | 2 | 0.154
Post | 1 | 0.077
Asking | 1 | 0.077
semantic | 1 | 0.077
document | 1 | 0.077
example | 1 | 0.077
Term frequency can be calculated in many ways; the table shows the simplest one, the raw count, and a normalized one, the frequency within the sentence. The normalized version is the better choice, as it keeps the TF*IDF bounded between 0 and 1.
The example shows how the word "SQL" characterizes the sample sentence, so it is a good candidate for a tag.
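As a sanity check, the TF table above can be reproduced with a short script (a minimal sketch; the noise-word list is hand-picked for this example, a real system would use a proper stop-word list):

```python
from collections import Counter

sentence = ("SQL Post. Asking about semantic in SQL with generic "
            "document example, SQL generic")
noise = {"about", "in", "with", "a", "an", "the", "is", "this", "of"}

# split on whitespace and strip punctuation
words = [w.strip(".,") for w in sentence.split()]
total = len(words)                      # 13 words, noise included
counts = Counter(w for w in words if w.lower() not in noise)

for word, count in counts.most_common():
    print(f"{word:10} {count} {count / total:.3f}")
# SQL 3 0.231, generic 2 0.154, the remaining words 1 0.077
```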
The Inverse Document Frequency is an indicator of the importance of a word across the whole document corpus: a word that appears in every document is not helpful for a search, while one that appears in few documents carries a lot more information.
It is calculated as
IDF = ln(Document count / Document with the word)
For example, let's use three more sentences as our corpus, in addition to the previous one
SQL Post. Asking about semantic in SQL with generic document example, SQL generic
C# Post. This is a C# answer with an example
SQL Post. Asking a good SQL question with an example
Math Post. This is a Math answer with an example of equation
If we calculate the IDF of the word "example" from the previous sentence, we have four documents in total and four documents in which "example" appears
IDF = ln(4/4) -> ln(1) -> 0
The word "example" is in every sentence, so it's not a good search term for our documents. Using the word "question" or "answer" instead we get
IDF = ln(4/1) -> ln(4) -> 1.386 for "question"
IDF = ln(4/2) -> ln(2) -> 0.693 for "answer"
They are both better choices to search for in our documents.
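The same check works for the IDF (a sketch using the four-sentence corpus above; tokenization is a naive split with punctuation stripped, no stemming):

```python
import math

corpus = [
    "SQL Post. Asking about semantic in SQL with generic document example, SQL generic",
    "C# Post. This is a C# answer with an example",
    "SQL Post. Asking a good SQL question with an example",
    "Math Post. This is a Math answer with an example of equation",
]

def tokens(text):
    # naive word split; punctuation stripped, no stemming
    return {w.strip(".,") for w in text.split()}

def idf(word, docs):
    containing = sum(1 for d in docs if word in tokens(d))
    # ln(document count / documents with the word)
    return math.log(len(docs) / containing)

print(f'{idf("example", corpus):.3f}')   # 0.000 -- in every document
print(f'{idf("question", corpus):.3f}')  # 1.386 -- in one document
print(f'{idf("answer", corpus):.3f}')    # 0.693 -- in two documents
```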
Combining the two, we get a single indicator of how important a word is for a document and how good a choice it is within the document corpus.
Updating the above table with IDF we have
Word | Frequency| IDF | TF*IDF
-------------------------------------
SQL | 0.231 | 0.693 | 0.160
generic | 0.154 | 1.386 | 0.213
Post | 0.077 | 0 | 0
Asking | 0.077 | 0.693 | 0.053
semantic | 0.077 | 1.386 | 0.107
document | 0.077 | 1.386 | 0.107
example | 0.077 | 0 | 0
Using the TF*IDF we see that even though "SQL" was the most prominent word within the sentence, it is trumped by "generic" when we consider the whole set of documents, as "SQL" is present in two of them.
TF*IDF gives, for every sentence, a list of relevant words that are meaningful within the whole library of sentences/documents.
Calculating the list of word / TF*IDF pairs for every document is the starting line; to check the similarity of two or more documents we can use the cosine similarity.
What is Cosine Similarity
The cosine similarity can be viewed as a metric: a way to calculate the distance between two points. In our case the points are our sentences. To measure a distance we need the coordinates of the points; for the sentences, the coordinates are their lists of word TF*IDF values.
The formula of the cosine similarity is
sim = (A*B)/(||A||*||B||)
where A and B are the sentences coordinates.
Expanding the formula from its vector form, it becomes
sim = Sum(A[word] * B[word])/(Sqrt(Sum(A[word]^2)) * Sqrt(Sum(B[word]^2)))
or
sim = cross_product/(norm(A) * norm(B))
where
cross_product = Sum(A[word] * B[word])
norm(X) = Sqrt(Sum(X[word]^2))
(despite the name used here, cross_product is the dot product of the two vectors)
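The formula is only a few lines of code when each sentence is represented as a map of word to TF*IDF weight (a sketch; the function name is my own):

```python
import math

def cosine_similarity(a, b):
    """a, b: dicts mapping word -> TF*IDF weight."""
    # only words present in both vectors contribute to the numerator
    cross = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return cross / (norm_a * norm_b)

print(cosine_similarity({"x": 1.0}, {"x": 2.0}))  # 1.0 -- same direction
print(cosine_similarity({"x": 1.0}, {"y": 1.0}))  # 0.0 -- no shared words
```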
For example we can use the first and the third sentences from before; the word vector for the third one is
Word | Frequency| IDF | TF*IDF
-------------------------------------
SQL | 0.2 | 0.693 | 0.139
Asking | 0.1 | 0.693 | 0.069
good | 0.1 | 1.386 | 0.139
question | 0.1 | 1.386 | 0.139
For the cross product we only need to do the math for the words that appear in both vectors; for every other word one of the two factors is 0, so it contributes nothing.
cross_product = 0.160*0.139 (SQL) + 0.053*0.069 (Asking) = 0.02590
norm(1) = sqrt(0.160^2 + 0.213^2 + 0.053^2 + 0.107^2 + 0.107^2) = 0.31090
norm(3) = sqrt(0.139^2 + 0.069^2 + 0.139^2 + 0.139^2) = 0.24992
sim = cross_product/(norm(1) * norm(3)) = 0.333
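The worked example can be verified with the exact fractions instead of the rounded table entries (a sketch; sentence 1 has 13 words, sentence 3 has 10, and the IDF values are ln(2) and ln(4) as computed earlier):

```python
import math

ln2, ln4 = math.log(2), math.log(4)

# Sentence 1: TF = count / 13, times the IDF of each word
s1 = {"SQL": 3/13 * ln2, "generic": 2/13 * ln4, "Asking": 1/13 * ln2,
      "semantic": 1/13 * ln4, "document": 1/13 * ln4}
# Sentence 3: TF = count / 10
s3 = {"SQL": 2/10 * ln2, "Asking": 1/10 * ln2,
      "good": 1/10 * ln4, "question": 1/10 * ln4}

cross = sum(s1[w] * s3[w] for w in s1.keys() & s3.keys())
norm1 = math.sqrt(sum(v * v for v in s1.values()))
norm3 = math.sqrt(sum(v * v for v in s3.values()))
print(round(cross / (norm1 * norm3), 3))   # 0.333
```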
As the TF*IDF values are never negative, the cosine similarity values will never be negative.
As the TF*IDF used is normalized, the cosine similarity is bounded in [0;1], where 0 means no shared information and 1 means the documents are identical, apart from noise words.
A SQL Server Example
SQL Server, with its full-text search, can be used to do the job; the base for it is the function sys.dm_fts_parser that, given a string, returns a table with the single words, a value indicating whether a word is noise, and other information.
To calculate the TF*IDF the most important thing is to split a document into words; every DB that can do this can do the job, so my choice here is merely based on my own expertise.
Attention: sys.dm_fts_parser can be executed only by users that have the sysadmin role.
We begin by creating a temp table with test data
SELECT 1 AS Id, N'SQL Post. Asking about semantic in SQL with generic document'
+ N' example, SQL generic' AS txt
INTO #testTable
UNION ALL SELECT 2, N'C# Post. This is a C# answer with an example'
UNION ALL SELECT 3, N'SQL Post. Asking a good SQL question with an example'
UNION ALL SELECT 4, N'Math Post. This is a Math answer with an example of'
+ N' equation'
Then we go on creating the temp table with the sentence-word TF*IDF
With TF AS (
SELECT DISTINCT id, display_term, special_term
, CAST(COUNT(display_term)
OVER (PARTITION BY id, display_term) AS DECIMAL(10, 8))
/ COUNT(occurrence) OVER (PARTITION BY id) TF
FROM #testTable
CROSS APPLY sys.dm_fts_parser('"'+REPLACE(txt,'"','""')+'"', 1033, 0,0)
WHERE TXT IS NOT NULL
AND display_term NOT LIKE 'nn%'
AND special_term <> 'End Of Sentence'
), IDF AS (
SELECT display_term word
, sentences = COUNT(DISTINCT tt.ID)
, sentence_with_word
= COUNT(DISTINCT CASE WHEN tt.txt LIKE '%' + tf.display_term + '%'
THEN tt.id
ELSE NULL
END)
, IDF = LOG(CAST(COUNT(DISTINCT tt.ID) AS DECIMAL (10, 8))
/ COUNT(DISTINCT CASE WHEN tt.txt LIKE '%' + tf.display_term + '%'
THEN tt.id
ELSE NULL
END))
FROM #testTable tt
CROSS JOIN TF
WHERE TF.special_term = 'Exact Match'
GROUP BY display_term
)
SELECT tf.Id sentence, word
, TD = TF.TF, IDF.IDF
, TD_IDF = TF.TF * IDF.IDF
INTO #sentence_word_TD_IDF
FROM TF
INNER JOIN IDF ON tf.display_term = IDF.word
Having the TF*IDF of every word in every sentence, we can go on with the cosine similarity between sentences 1 and 3
WITH S1 AS (
SELECT word, TD_IDF
FROM #sentence_word_TD_IDF
WHERE sentence = 1
), S2 AS (
SELECT word, TD_IDF
FROM #sentence_word_TD_IDF
WHERE sentence = 3
), cat AS (
SELECT word = COALESCE(S1.word, S2.word)
, word_S1_TD_IDF = COALESCE(S1.TD_IDF, 0)
, word_S2_TD_IDF = COALESCE(S2.TD_IDF, 0)
FROM S1
FULL JOIN S2 ON S1.word = S2.word
)
SELECT cross_product = SUM(word_S1_TD_IDF * word_S2_TD_IDF)
, norm_1 = SQRT(SUM(word_S1_TD_IDF * word_S1_TD_IDF))
, norm_2 = SQRT(SUM(word_S2_TD_IDF * word_S2_TD_IDF))
, co_sim = SUM(word_S1_TD_IDF * word_S2_TD_IDF)
/ (SQRT(SUM(word_S1_TD_IDF * word_S1_TD_IDF))
* SQRT(SUM(word_S2_TD_IDF * word_S2_TD_IDF)))
FROM cat
The queries keep some partial steps to make them easier to check; for example, in the CTE
IDF there are the columns 'sentences' and 'sentence_with_word' instead of only the IDF.
These queries can run on tables without a full-text index.
With a full-text index it's possible to get the stems, to work with the roots of the words instead of the inflected forms that appear in the sentences.
To get a faster response to a query, it's better to create a vector table with the TF*IDF value for every word of every document, plus a many-to-many join table with the cosine similarity of every document to every other document, so that only the values for a new document or for the search query have to be calculated.
Further step: Polarize
All of that is not enough: the OP asked for a way to keep apart similar sentences with different "emotion", that is, good news and bad news should never be related, even if the similarity is high.
To do this we have to trick the cosine similarity; the lever is in the earlier observation:
As the TF*IDF values are never negative, the cosine similarity values will never be negative.
It is possible to categorize an article by searching for bad keywords; if one is found, then the TF*IDF values for the whole document are flipped to negative.
The cosine similarity between a good news item and a bad one will then be calculated with positive (for the good news) and negative (for the bad news) TF*IDF values. That means that if there are common words the cross product will be negative, and a negative cross product becomes a negative similarity, as the norms Sqrt(Sum(X[word]^2)) stay positive regardless of the sign of the TF*IDF.
For the cosine similarity of two news items with the same "emotion" we'll have TF*IDF values with the same sign, which combine into a positive cross product and a positive similarity.
With this change the boundaries of the cosine similarity become ]-1;1]
- -1 can never be reached, as at least one word needs to differ to carry the different emotion
- a negative number means similar documents with opposite "emotion"
- 0 means no correlation between the documents
- a positive number means similar documents with the same "emotion"
- 1 means identical documents, apart from noise words (and those have to appear in the same numbers)
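The sign trick can be demonstrated with the same similarity formula: negating every TF*IDF weight of the "bad" document flips the sign of the result (a sketch; the vectors and the good/bad labels are made-up illustrations, not output of the queries above):

```python
import math

def cosine_similarity(a, b):
    """a, b: dicts mapping word -> TF*IDF weight."""
    cross = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return cross / (norm_a * norm_b)

good_news = {"market": 0.2, "growth": 0.3, "record": 0.1}
bad_news_raw = {"market": 0.2, "growth": 0.1, "crash": 0.4}

# a "bad" keyword was found, so flip the whole vector to negative
bad_news = {w: -v for w, v in bad_news_raw.items()}

print(cosine_similarity(good_news, bad_news_raw) > 0)  # True: related
print(cosine_similarity(good_news, bad_news) < 0)      # True: related, opposite emotion
```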