Domanda

I have several million filenames I need to search through. They look like this:

LG_MARGINCALL_HD2CH_127879834_EN.mov

If someone searches for any of the following, it should match:

  • margin
  • margin call
  • margin call mov
  • margin call hd en
  • margin call hd en mov

What I am currently using is a mysql %LIKE% search. Something like:

SELECT filename FROM path WHERE filename LIKE '%margin%' AND filename LIKE '%mov%'

It is deathly slow (can take up to ten seconds for a search). Note that it does work though.

What would be a better way to do the above search? Either using mysql or another program.

È stato utile?

Soluzione

Your search strategy is, as you noticed, slow. It's slow because

 LIKE '%something%'

has to scan the table to find matches. Leading % signs on LIKE searches are an excellent way to wreck performance.

I don't know how many columns are in your path table. If there are a lot of columns you could do two quick things to improve performance:

  1. get rid of SELECT * and list the names of the columns you want in your resultset.
  2. create a compound index consisting of your filename column followed by the other columns you need to retrieve.

(This won't help if you only have a few columns in your table.)

You can't use straight-out-of-the-software-package FULLTEXT searching for this stuff, because that's designed for language text.

If I had to make this work fast for production, I would do this:

First, create a new table called "searchterm" containing

 filename_id INT   the id number of a row in your path table
 searchterm  VARCHAR(20)  a fragment of a filename.

Second, write a program that reads the filename_id and filename values, and inserts a bunch of different rows for each one into searchterm. For the item you've shown the values should be:

LG_MARGINCALL_HD2CH_127879834_EN.mov   (original)
LG  MARGINCALL  HD2CH  127879834  EN  mov   (split on punctuation)
 HD 2 CH                                    (split on embedded numerics)
 MARGIN CALL                                (split on an app-specific list of words)

So, you'd have a bunch of entries in your searchterm table, all with the same filename_id value and lots of different little chunks of text.

Finally, when searching you could do this.

 SELECT path.id, path.filename, path.whatever,
        COUNT(DISTINCT searchterms.term) AS termcount
   FROM path
   JOIN searchterm ON path.filenanme_id = search.filename_id
  WHERE searchterm.term  IN ('margin','call','hd','en', 'mov')
  GROUP BY path.id, path.filename, path.whatever
  ORDER BY path.filename, COUNT(DISTINCT searchterms.term) DESC

This little query finds all the matching fragments to what you're search for. It returns multiple file names, and it presents them in order of what matches the most terms.

What I'm suggesting is that you create your own application-specific kinda- sorta- full text search system. If your really have several million multimedia files, this is surely worth your effort.

Altri suggerimenti

It seems clear that you need full-text search functionality.

There are multiple solutions out there that can respond to this, one of the best at the moment being Elastic Search.

It has all the capabilities to handle real time full-text search. And it goes largely beyond just this by providing auto-suggestions, autocomplete, etc.

And it's open source.

Stop using like statement instead use match() and use a full text index for your searching column and your table must be a MYISAM one(I don't know whether it is or not)

I suggest 2 things to try for better performance. The 1st one is to use keyword EXPLAIN in front of select. This may gives you some help about the slow performance of the query. But I think it will not do much help. The 2nd thing is to use REGEXP. An example of all these:

EXPLAIN SELECT filename FROM path WHERE filename LIKE REGEXP '^.*MAR{1}.*mov{1}'

but you have to search a little more to optimize the regular expression.

Try to use SPHINX for fulltext search. http://sphinxsearch.com/

This might be faster than using AND:

SELECT filename FROM path WHERE filename LIKE '%margin%call%hd%en%mov%'

But having a "%" in the start of the string will always make it slow.

You should use a full-text search index on the field, and then use something like:

SELECT filename FROM path WHERE MATCH(filename) AGAINST('+margin +call +hd +en +mov' IN BOOLEAN MODE);
Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top