Domanda

I have the following table storing data about images:

images
 - id (int)
 - sample_1_1 (int)
 - sample_1_2 (int)
 - sample_1_3 (int)
 - sample_2_1 (int)
 - sample_2_2 (int)
 - sample_2_3 (int)
 - ... # Up until sample_25_3

The task is to calcuate the distance between the collected data. Currently, I am using a 75-dimensional (that's right, 3 * 25 = 75) euclid distance calculation programmed as stored procedures into the database:

CREATE DEFINER=`root`@`localhost`
FUNCTION `distanceBetween`(compareId INT, toId INT) RETURNS double
    READS SQL DATA
    DETERMINISTIC
BEGIN
 DECLARE distance DOUBLE;
SELECT euclidDistance(
 i1.sample_1_1, i1.sample_1_2, i1.sample_1_3,
 i2.sample_1_1, i2.sample_1_2, i2.sample_1_3,
 ...
 ) INTO distance
FROM images i1, (SELECT * FROM images WHERE id = toId) i2
WHERE i1.id = compareId;
RETURN distance;
END

With another subroutine calculating the actual distance between 2 75-dim. vectors:

CREATE DEFINER=`root`@`localhost`
    FUNCTION `euclidDistance`(
 img1_sample1_1 INT, img1_sample1_2 INT, img1_sample1_3 INT,
 img2_sample1_1 INT, img2_sample1_2 INT, img2_sample1_3 INT,
 ...
 ) RETURNS double
RETURN SQRT(
   quadDiff(img1_sample1_1, img2_sample1_1)
 + quadDiff(img1_sample1_2, img2_sample1_2)
 + quadDiff(img1_sample1_3, img2_sample1_3)
 + ...
)

And another subroutine to calculate the squared difference between two values:

CREATE DEFINER=`root`@`localhost`
FUNCTION `quadDiff`(var1 INT, var2 INT) RETURNS int(11)
RETURN POW(var1 - var2, 2)

The functions itself are perfectly fine and return deterministic results that are both mathematically and logically correct.

The problem comes when I want to get the "closest" images to a given image - meaning the images that have the least distance to any given image. To do so, I use another procedure:

CREATE DEFINER=`root`@`localhost`
PROCEDURE `getSimilarImages`(imageId INT, `limit` INT)
BEGIN
    SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance
    FROM images i1, (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2
    WHERE i1.id = imageId
    ORDER BY distance
    LIMIT 10;
END

The database currently features around 30.000 images. This means that a CALL getSimilarImages(123, 10); takes about 12 seconds to finish. This is way too long for any application, be it web or application based.

So, I want to speed things up. What are my options? Do you see any potential to optimize the process of image comparing or calculating the distances?

I thought about caching the result of the procedure, but I have no clue how to do so. I could also compare every image to every other image as soon as new images are added, but that will adding images a very long process, which is also not acceptable.

I can provide more information on the system setup if it helps, but I appreciate any pointers you can provide. The current situation is not good and I really need to do something, because the image database will only grow with every hour the system is up.

È stato utile?

Soluzione

As you've discovered, your ORDER BY distance(a,b) operation is computing a WHOLE lot of those 75-dimensional distances, and unsurprisingly, it's taking a long time. It has to compute the whole lot so it can do the ORDER operation. Ouch.

Here's an observation about Euclidean distance that may help you: the bounding box is a useful approximation. When you're using GetSimilarImages, can you include only images that are within a particular threshold distance of the imageId you're using?

Let's say you only care about images within rad distance of your imageId. Then you could rework your GetSimilarImages query like this.

PROCEDURE `getSimilarImages`(imageId INT, `limit` INT, rad INT)
BEGIN
    SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance
    FROM images i1, 
    (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2
    WHERE i1.id = imageId
      AND i1.img_sample_1_1 BETWEEN i2.img_sample_1_1 - rad
                                AND i2.img_sample_1_1 + rad
      AND i1.img_sample_1_2 BETWEEN i2.img_sample_1_2 - rad
                                AND i2.img_sample_1_2 + rad
      AND i1.img_sample_1_3 BETWEEN i2.img_sample_1_3 - rad
                                AND i2.img_sample_1_3 + rad
    ORDER BY distance
    LIMIT 10;
END

In this example code I have arbitrarily chosen the first three of your 75 dimensions to use for the bounding-box inclusion code (the three BETWEEN clauses). For this optimization to work, you'll need to create indexes on at least some of the dimensions used for bounding-box inclusion.

There's nothing special about choosing three dimensions, or about choosing any particular dimensions. If you know from your data that certain of your dimensions discriminate better between your images, those are the ones to choose. You could choose as many dimensions as you need. But, of course there's indexing overhead.

Altri suggerimenti

Code a UDF C function or call a native C function that calls a a GPU function.

Optimizations tips for this case:

Add index on columns id and duplicateImageId preferable a clustured index.

Try to avoid multiple selects on huge table images.

You can increase performance slightly by reducing the number of function calls as you are calling function inside function. All these function calls need to be maintained in memory stack which consumes hardware resources.

Benchmark for performance after every change you make in code for optimization.

CREATE PROCEDURE getSimilarImages(imageId INT unsigned, limit INT unsigned)
BEGIN
    SELECT i2.id, i2.filename, euclidDistance(
 i1.sample_1_1, i1.sample_1_2, i1.sample_1_3,
 i2.sample_1_1, i2.sample_1_2, i2.sample_1_3,
 ...
 ) AS distance
    FROM images i1, (SELECT id, filename 
                     FROM images 
                     WHERE id <> imageId AND 
                           duplicateImageId IS NULL
                    ) i2
    WHERE i1.id = imageId
    ORDER BY distance
    LIMIT 10;
END;

CREATE FUNCTION euclidDistance(
 img1_sample1_1 INT, img1_sample1_2 INT, img1_sample1_3 INT,
 img2_sample1_1 INT, img2_sample1_2 INT, img2_sample1_3 INT,
 ...
 ) RETURNS double
RETURN SQRT(
   POW(img1_sample1_1 - img2_sample1_1, 2)
 + POW(img1_sample1_2 - img2_sample1_2, 2)
 + POW(img1_sample1_3 - img2_sample1_3, 2)
 + ...
);

Hope this helps.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top