PostgreSQL GiST for compressed entry type

https://dba.stackexchange.com/questions/150595

03-10-2020
|

Domanda

I am using PostgreSQL 9.5 and I'm trying to understand how to implement a GiST index where I have a representation that is a lossily compressed version of the indexed type. For instance, say I have images stored in BYTEA type, and for the index I store the colour ranges (rmin, rmax, gmin, gmax, bmin, bmax), and I want to compare images based upon colour similarity - e.g. with a === operator that returns true when the colours ranges are exactly the same, allowing me to facilitate queries like:

SELECT COUNT(*)
FROM icons, avatars
WHERE icon.image === avatar.image AND avatar.id = 123;

where icons and avatars are both tables with an image field of type BYTEA.

Having looked through the implementation documentation is looks like this should be possible. Using the above example situation, I think I could do as follows:

the union method would generate the bounding range of all entries
picksplit and penalty would just try to minimise the ranges, similar to a R-Tree
compress would take the BYTEA data and calculate the colour range
decompress would be an identity function
consistent (for the === operator) would return true if the entry's colour range contained the query range for internal nodes, and only if the ranges exactly matched for leaf nodes.

Is this the right approach? I'm not clear about when the compression steps take place. For instance, consistent is presumably called multiple times on different nodes of the tree. So does this mean that the query will have re-calculate the colour range of the query data every time? And in the index, will the leaf nodes contain a copy of the image data or just its colour range?

Note The example given is just for illustrative purposes. My question is about lossy representations in GiST, not indexing images.

Soluzione

As Erwin said, defining a custom gist index for this might be overkill.

I'm not clear about when the compression steps take place. For instance, consistent is presumably called multiple times on different nodes of the tree. So does this mean that the query will have re-calculate the colour range of the query data every time?

The last paragraph of the documentation you reference tells you how to cache the calculated value if you want to. For a worked example, see the use of fn_extra in contrib/pg_trgm/trgm_gist.c

And in the index, will the leaf nodes contain a copy of the image data or just its colour range?

That depends on how you implement the compress function. The function has access to the knowledge of if it is called on a leaf entry or a non-leaf entry. If you only compress things in non-leafs, then the leafs won't be compressed. So it is up to you.

Altri suggerimenti

Since you already store the values (rmin, rmax, gmin, gmax, bmin, bmax) for the image column, a btree index on those covers equality checks just fine:

CREATE INDEX foo1 ON icons (rmin, rmax, gmin, gmax, bmin, bmax);

This query will use the index:

SELECT COUNT(*)
FROM   avatars a
JOIN   icons   i
WHERE  a.id = 123
AND   (a.rmin, a.rmax, a.gmin, a.gmax, a.bmin, a.bmax)
    = (i.rmin, i.rmax, i.gmin, i.gmax, i.bmin, i.bmax);

Of course, you need another index on avatars.id, or possibly on (id, rmin, rmax, gmin, gmax, bmin, bmax) to allow index-only scans.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a dba.stackexchange