If I understand your logic correctly, this query should give you the correct result:
    SELECT n1.ngram
    FROM
      ngrams n1 LEFT JOIN ngrams n2
      ON
        n2.ngram IN ('stack', 'stack overflow', 'protection')
        AND n2.ngram LIKE CONCAT('%', n1.ngram, '%')
        AND CHAR_LENGTH(n1.ngram) < CHAR_LENGTH(n2.ngram)
    WHERE
      n1.ngram IN ('stack', 'stack overflow', 'protection')
      AND n2.ngram IS NULL;
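If you want to try the query locally, here is a minimal sketch of some test data. The table definition is only an assumption on my part (the `id` column and the `VARCHAR(100)` size are hypothetical, so adapt them to your real schema):

```sql
-- Hypothetical schema: only the ngram column is taken from the question.
CREATE TABLE ngrams (
  id INT AUTO_INCREMENT PRIMARY KEY,
  ngram VARCHAR(100) NOT NULL
);

-- 'overflow' is not in the search list, so the WHERE clause filters it out.
INSERT INTO ngrams (ngram) VALUES
  ('stack'),
  ('stack overflow'),
  ('protection'),
  ('overflow');
```

With these rows, 'stack' is dropped because it is contained in the longer 'stack overflow', so the query above should return only 'stack overflow' and 'protection'.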
Please see fiddle here. But since I expect that your table could have a lot of records, while your list of words is certainly much smaller, why not remove the shortest ngrams from this list before executing the actual query? My idea is to reduce the list
('stack', 'stack overflow', 'protection')
to
('stack overflow', 'protection')
and this query should do the trick:
    SELECT *
    FROM
      ngrams
    WHERE
      ngram IN (
        SELECT s1.ngram
        FROM (
          SELECT DISTINCT ngram
          FROM ngrams
          WHERE ngram IN ('stack','stack overflow','protection')
        ) s1 LEFT JOIN (
          SELECT DISTINCT ngram
          FROM ngrams
          WHERE ngram IN ('stack','stack overflow','protection')
        ) s2
        ON s2.ngram LIKE CONCAT('%', s1.ngram, '%')
           AND CHAR_LENGTH(s1.ngram) < CHAR_LENGTH(s2.ngram)
        WHERE
          s2.ngram IS NULL
      );
Yes, I'm querying the table ngrams twice before joining the result back to ngrams again, because we have to make sure that the longest value actually exists in the table. But if you have a proper index on the ngram column, the two derived queries that use DISTINCT should be very efficient:
ALTER TABLE ngrams ADD INDEX idx_ngram (ngram);
Fiddle is here.
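To check whether the index is actually picked up by the derived queries, one option is to run EXPLAIN on one of them. This is just a sketch: it assumes the idx_ngram index created above, and the plan the optimizer chooses depends on your MySQL version and on the data in the table.

```sql
-- If the index is used, the "key" column of the output should show idx_ngram.
EXPLAIN
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack', 'stack overflow', 'protection');
```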
Edit:
As samuil correctly noted, if you just need to find the longest ngrams and not the whole rows associated with them, then you don't need the outer query, and you can just execute the inner query. With the proper index, the two SELECT DISTINCT queries will be very efficient, and even if the JOIN cannot be optimized (s2.ngram LIKE CONCAT('%', s1.ngram, '%') can't take advantage of an index), it will be executed only on a few already filtered records and should be quite fast.
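For reference, this is the inner query on its own, taken unchanged from the second query above, only without the outer SELECT:

```sql
-- Returns only the ngram values from the list that exist in the table
-- and are not contained in a longer ngram from the same list.
SELECT s1.ngram
FROM (
  SELECT DISTINCT ngram
  FROM ngrams
  WHERE ngram IN ('stack', 'stack overflow', 'protection')
) s1 LEFT JOIN (
  SELECT DISTINCT ngram
  FROM ngrams
  WHERE ngram IN ('stack', 'stack overflow', 'protection')
) s2
ON s2.ngram LIKE CONCAT('%', s1.ngram, '%')
   AND CHAR_LENGTH(s1.ngram) < CHAR_LENGTH(s2.ngram)
WHERE
  s2.ngram IS NULL;
```

Because both derived tables use DISTINCT, each matching ngram comes back once, no matter how many rows in the table contain it.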