Question

I use a database to represent a list of files, with some metadata associated with each of them. I need to update this list of files regularly, adding only the new files and deleting files that no longer exist (I must not touch the existing rows in the table, as I would lose the associated metadata).

My current queries take only seconds when I have around 10000 files, but take an hour with my 150000-file table.

After some research on the Internet, I came up with the following process:

  1. Populate a table "newfiles" with the results of the scan (see the bulk-load sketch after this list)
  2. DELETE FROM files WHERE path NOT IN (SELECT path FROM newfiles);
  3. INSERT INTO files SELECT * FROM newfiles WHERE path NOT IN (SELECT path FROM files);
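
For reference, here is a minimal sketch of step 1 as a bulk load. The COPY source file is hypothetical, and it assumes "newfiles" mirrors the "files" schema with defaults for the metadata columns:

BEGIN;
TRUNCATE newfiles;
-- Server-side bulk load; use psql's \copy instead if the file lives on the client
COPY newfiles (path) FROM '/tmp/scan_results.txt';
COMMIT;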

I also have indexes:

CREATE INDEX "files_path" ON "files" ("path");
CREATE INDEX "files_path_like" ON "files" ("path" varchar_pattern_ops);
CREATE INDEX "files_path" ON "newfiles" ("path");
CREATE INDEX "files_path_like" ON "newfiles" ("path" varchar_pattern_ops);

(I mostly use these indexes for searching in the database; my application includes a search engine over the files.)

Each of these two queries takes more than an hour when I have 150000 files. How can I optimize that?

Thank you.

Solution

Try NOT EXISTS instead of NOT IN, as in:

DELETE FROM files WHERE NOT EXISTS
  (SELECT 1 FROM newfiles WHERE newfiles.path=files.path);
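
If the DELETE improves, the slow INSERT can be rewritten with NOT EXISTS the same way; a minimal sketch using the same tables:

INSERT INTO files
SELECT * FROM newfiles WHERE NOT EXISTS
  (SELECT 1 FROM files WHERE files.path=newfiles.path);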

Also, if newfiles is repopulated from scratch each time, make sure you ANALYZE newfiles before issuing any query that uses it, so that the optimizer can work with good statistics.
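
For example, as the last step of each staging-table refresh:

TRUNCATE newfiles;
-- ... repopulate newfiles from the scan ...
ANALYZE newfiles;  -- give the planner fresh statistics before the DELETE/INSERT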

If that doesn't solve it, try EXPLAIN or EXPLAIN ANALYZE on your queries to get the execution plan, and append it to the question.
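
Note that EXPLAIN ANALYZE actually executes the statement, so a destructive query can be wrapped in a transaction and rolled back:

BEGIN;
EXPLAIN ANALYZE
DELETE FROM files WHERE NOT EXISTS
  (SELECT 1 FROM newfiles WHERE newfiles.path=files.path);
ROLLBACK;  -- discard the deletions; only the plan output is shown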

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow