Given a column containing ngrams in a VARCHAR with utf8mb4_unicode_ci collation:

+---------------------------+
| ngram                     |
+---------------------------+
| stack overflow            |
| stack                     |
| overflow                  |
| stack overflow protection |
| overflow protection       |
| protection                |
+---------------------------+

And a query:

SELECT * FROM ngrams WHERE ngram IN ('stack', 'stack overflow', 'protection', 'overflow')

Given the rows returned by this query, how can I keep only the rows with the longest ngrams from the returned rows?

In this example, I get 3 rows: stack, stack overflow, and protection.

Then, I need to filter rows like this:

  • I filter out stack, because stack overflow exists in the returned rows
  • I keep stack overflow, because no other returned row is a ngram containing stack overflow (there is stack overflow protection in the table, but it's not in the returned rows)
  • I keep protection too
  • I filter out overflow, because stack overflow exists in the returned rows

It must be done in MySQL because of collations (comparisons outside of MySQL wouldn't give the same results than in MySQL). (Unless I'm not aware of some MySQL function allowing to expose the collated version of a string.)


I can think of the following solution: (sql fiddle)

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection')
AND     NOT EXISTS (
    SELECT  1
    FROM    ngrams n2
    WHERE   n2.ngram IN ('stack', 'stack overflow', 'protection')
    AND     LENGTH(n2.ngram) > LENGTH(n1.ngram)
    AND     CONCAT(' ', n2.ngram, ' ') LIKE CONCAT('% ', n1.ngram, ' %')
)

It's inefficient, though, since the sub-query will be executed for every matched ngram.


So I'm searching for

  • either a way to make this query efficient
  • or a way to do this reliably outside of MySQL (taking collations into account)
有帮助吗?

解决方案

If I understand your logic correctly, this query should give you the correct result:

SELECT n1.ngram
FROM
  ngrams n1 LEFT JOIN ngrams n2
  ON
    n2.ngram IN ('stack', 'stack overflow', 'protection')
    AND n2.ngram LIKE CONCAT('%', n1.ngram, '%')
    AND CHAR_LENGTH(n1.ngram) < CHAR_LENGTH(n2.ngram)
WHERE
  n1.ngram IN ('stack', 'stack overflow', 'protection')
  AND n2.ngram IS NULL;

Please see fiddle here. But since I expect that your table could have a lot of records, while your list of words is certanly much limited, why not remove the shortest ngrams from this list before executing the actual query? My idea is to reduce the list

('stack', 'stack overflow', 'protection')

to

('stack overflow', 'protection')

and this query should do the trick:

SELECT *
FROM
  ngrams
WHERE
  ngram IN (
    SELECT s1.ngram
    FROM (
      SELECT DISTINCT ngram
      FROM ngrams
      WHERE ngram IN ('stack','stack overflow','protection')
    ) s1 LEFT JOIN (
      SELECT DISTINCT ngram
      FROM ngrams
      WHERE ngram IN ('stack','stack overflow','protection')
    ) s2
      ON s2.ngram LIKE CONCAT('%', s1.ngram, '%')
         AND CHAR_LENGTH(s1.ngram) < CHAR_LENGTH(s2.ngram)
    WHERE
      s2.ngram IS NULL
  );

Yes I'm querying the table ngrams twice before joining the result back to ngrams again, because we have to make sure that the longest value actually exists in the table, but if you have a proper index on the ngram column the two derived queries that use DISTINCT should be very efficient:

ALTER TABLE ngrams ADD INDEX idx_ngram (ngram);

Fiddle is here.

Edit:

As samuil correctly noted, if you just need to find the shortest ngram and not the whole rows associated to it, then you don't need the outer query, and you can just execute the inner query. With the proper index, two SELECT DISTINCT queries will be very efficient, and even if the JOIN cannot be optimized (n2.ngram LIKE CONCAT('%', n1.ngram, '%') can't take advantage of an index) it will be executed only on a few already filtered records and should be quite fast.

其他提示

After doing this without first looking at the other solutions, I see that it's similar to your existing best solution, but slightly simpler to read and possibly a bit more efficient;

SELECT n1.ngram
FROM ngrams n1
LEFT JOIN ngrams n2
  ON n2.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
 AND n1.ngram <> n2.ngram
 AND INSTR(n2.ngram, n1.ngram) > 0
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
 AND n2.ngram IS NULL;

An SQLfiddle to test with.

Since there is no calculation on the AND n1.ngram <> n2.ngram line, the query should be able to use indexes a bit more efficiently.

You are trying to filter the ngrams in the query itself. It is probably more efficient to do it in two steps. Start with a table with all possible ngrams:

CREATE TABLE original (ngram varchar(100) NOT NULL)
GO

CREATE TABLE refined (ngram varchar(100) NOT NULL PRIMARY KEY)
GO

INSERT INTO original (ngram)
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack', 'stack overflow', 'protection')
GO

INSERT INTO refined (ngram)
SELECT ngram
FROM original

Then delete the ones you do not want. For each ngram, generate all possible substrings. For each substring, delete that entry (if any) from the list. It takes a couple of nested loops, but unless your ngrams contain an extremely large number of words, it should not take much time.

CREATE PROCEDURE refine()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE words varchar(100);
    DECLARE posFrom, posTo int;
    DECLARE cur CURSOR FOR SELECT ngram FROM original;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    OPEN cur;

    read_loop: LOOP
        FETCH cur INTO words;
        IF done THEN
            LEAVE read_loop;
        END IF;

        SET posFrom = 1;
        REPEAT
            SET posTo = LOCATE(' ', words, posFrom);
            WHILE posTo > 0 DO
                DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom, posTo - posFrom);
                SET posTo = LOCATE(' ', words, posTo + 1);
            END WHILE;
            IF posFrom > 1 THEN
                DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom);
            END IF;
            SET posFrom = LOCATE(' ', words, posFrom) + 1;
        UNTIL posFrom = 1 END REPEAT;
    END LOOP;

    CLOSE cur;
END

What's left, is a table with only the longest ngrams:

CALL refine;

SELECT ngram FROM refined;

SQL Fiddle: http://sqlfiddle.com/#!2/029dc/1/1


EDIT: I added an index on table refined; now it should run in O(n) time.

I think you can use self inner join on LIKE %original string% and choose only those rows that have ngram length equal to the longest joined ngram length.

SELECT n1.* FROM ngrams n1
  INNER JOIN ngrams n2 ON
    n2.ngram LIKE CONCAT('%', `n1`.`ngram`, '%')
    AND n2.ngram IN ('stack overflow', 'stack')
  WHERE n1.ngram IN ('stack overflow', 'stack')
  GROUP BY n1.ngram
  HAVING MAX(CHAR_LENGTH(n2.ngram)) = CHAR_LENGTH(n1.ngram);

Downside of this solution is that you need to provide your string list twice.


It turns out that you don't need to provide list twice:

SELECT n1.*
  FROM ngrams n1
  INNER JOIN ngrams n2 ON
    n2.ngram LIKE CONCAT('%', `n1`.`ngram`, '%')
    AND n2.ngram IN ('stack overflow', 'stack')
  GROUP BY n1.ngram
  HAVING MAX(CHAR_LENGTH(n2.ngram)) = CHAR_LENGTH(n1.ngram);

This slight modification to your query:

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection') AND
        NOT EXISTS (SELECT  1
                    FROM    ngrams n2
                    WHERE   n2.ngram IN ('stack', 'stack overflow', 'protection') AND
                            n2.ngram <> n1.ngram AND
                            n2.ngram LIKE CONCAT('% ', n1.ngram, ' %')
                   );

Should be pretty optimally fast with an index on ngrams(ngram). Note that this simplifies the like condition. I see no reason why you should be worried about word boundaries. Wouldn't "stacks" be a longer version of "stack"? (Although the items referred to by n-grams can be words, I associate them with letters unless otherwise noted.)

With the index, this should be equivalent in performance to other solutions using join.

If I had to do this zillions of times and the ngram table were not too big, I would preprocess it to get all pairs of "generalizations" -- ngram_pairs. This changes the above to

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection') AND
        NOT EXISTS (SELECT  1
                    FROM    ngram_pairs np
                    WHERE   np.ngram1 = n1.ngram and
                            np.ngram2 in ('stack', 'stack overflow', 'protection') 
                   )

This should perform much better than the like with an index on ngram_pairs(ngram1, ngram2). The following is the code for generating ngram_pairs:

create table ngram_pairs as
    select n1.ngram as ngram1, n2.ngram as ngram2
    from ngrams n1 join
         ngrams n2
         on length(n1.ngram) < length(n2.ngram) and
            n2.ngram like concat('%', n1.ngram, '%');

create index ngram_pairs_ngram1_ngram2 on ngram_pairs(ngram1, ngram2);

Try this query using user variable

select 
  ngram
from 
  (select 
    ngram, 
    @t:=if(@prev=rank, @t+1, 1) as num,
    @prev:=rank
  from 
    (select 
      ngram,
      @rank:=if(@prev like concat(ngram,'%'), @rank, @rank+1) as rank,
      CHAR_LENGTH(ngram) as size,
      @prev:=ngram
    from 
      tbl 
    join 
      (select 
         @prev:='', 
         @rank:=1) t 
    where 
       ngram in ('stack overflow', 'stack', 'protection')
    order by 
       rank, size desc
   )t
  join 
    (select 
       @t:=0, 
       @prev:=0) t1
    ) t 
  where 
    num =1

Fiddle

|          NGRAM |
|----------------|
| stack overflow |
|     protection |

The following query only scans the data once and provides the correct results (fiddle):

SELECT my_ngrams.ngram
  FROM (SELECT CASE WHEN @v LIKE CONCAT('%',n1.ngram,'%') THEN 1 ELSE 0 END AS ngram_match
             , @v:=concat(@v,',',n1.ngram) AS ngram_concat
             , n1.ngram
          FROM    ngrams n1, (SELECT @v := '') r
         WHERE   n1.ngram IN ('stack', 'stack overflow', 'overflow', 'protection', 'overflow protection')
      ORDER BY length(n1.ngram) DESC) my_ngrams
 WHERE my_ngrams.ngram_match <> 1
;

However, it relies on the behavior of user-defined variables in MySQL (http://dev.mysql.com/doc/refman/5.5/en/user-variables.html) and should be used with some caution as a result.

The "order by" is important to the solution as that impacts how the user-defined variable is evaluated on a row-by-row basis which impacts which rows get matched by the case and later filtered.

It also concatenates all results together to search through for ngram matches before filtering so you should be aware that you could end up with a concatenated string that is wider than the maximum allowed by MySQL (http://dev.mysql.com/doc/refman/5.5/en/char.html).

This should be very efficient even for large tables as long as the column is indexed properly.

Here is an alternative using a LEFT JOIN.

The table is self-joined on the condition that no ngram exists that is contained within another ngram and that it is not equal to the ngram in the self-joined table. Sub-queries have been avoided, keeping performance in mind.

EDIT:

Added filter conditions.

SELECT n1.ngram
FROM ngrams n1
LEFT JOIN 
(
  SELECT ngram
  FROM ngrams
  WHERE ngram IN ('stack', 'stack overflow', 'protection')) n2
ON n2.ngram like Concat('%', n1.ngram, '%') and n1.ngram <> n2.ngram
WHERE n2.ngram IS NULL
AND n1.ngram IN ('stack', 'stack overflow', 'protection');

If you are checking to see if only the start of the ngram is contained in another ngram, you can replace the JOIN condition with ON n2.ngram like Concat(n1.ngram, '%') and n1.ngram <> n2.ngram.

I added more values in the SQL Fiddle:

  1. 'xyz' (which is not contained to any other ngram)
  2. 'stack overflow exception' (which is another parent of 'stack overflow')
  3. 'stack overflow exception handling' (which is the parent of 'stack overflow exception')

SQL Fiddle demo

Reference:

JOIN syntax on MySQL Reference Manual

SELECT * FROM   ngrams a WHERE  a.n NOT IN (SELECT DISTINCT a.n 
                 FROM   ngrams b
                 WHERE b.n != a.n 
                    AND b.n LIKE CONCAT('%', a.n, '%'));
SELECT  a.ngram FROM ngram a  CROSS JOIN (SELECT ngram AS ngram1 FROM ngram) b 
ON b.ngram1 LIKE CONCAT('%', a.ngram, '%') 
WHERE length(a.ngram) <= length(b.ngram1) 
GROUP BY a.ngram HAVING COUNT(a.ngram) = 1 ORDER BY LENGTH(b.ngram1) DESC

Try this one: Fiddle

SELECT * 
FROM   tab 
WHERE  ngram NOT IN (SELECT DISTINCT b.ngram 
                     FROM   tab a, 
                            tab b 
                     WHERE  a.ngram != b.ngram 
                            AND a.ngram LIKE Concat('%', b.ngram, '%'));

If you want to include only those on the list which exists in the table then try this query:-

SELECT b.ngram ab 
FROM   (SELECT * 
        FROM   tab 
        WHERE  ngram IN ( 'stack', 'stack overflow', 'protection' )) a, 
       (SELECT * 
        FROM   tab 
        WHERE  ngram IN ( 'stack', 'stack overflow', 'protection' )) b 
WHERE  a.ngram LIKE Concat('%', b.ngram, '%') 
GROUP  BY b.ngram 
HAVING Count(*) = 1 

Demo2

Try

 ORDER BY LENGTH(ngram) DESC and use LIMIT 1

EDIT:

try that :

  SELECT n1.ngram
  FROM ngrams n1 
  INNER JOIN ngrams n2
  ON LENGTH(n2.ngram) < LENGTH(n1.ngram)
  WHERE   n2.ngram IN ('stack', 'stack overflow', 'protection')
  GROUP BY n1.ngram
许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top