Question

I'm not familiar with PostgreSQL functions and stored procedures, but I've done my best to learn during the past few days. I tried hard before asking here.

Basically, I have a situation in which I cannot use simple SQL and in which a function would be most helpful. (That is because I am sending the query via AJAX to a Postgres-based webservice that returns JSONP, and, because the query is built in JavaScript based on a non-predetermined number of variables, it can grow past the 2000 or so URL character limit allowed in MSIE.)

Say, I have a table called clients:

+-------------+-------------------+-------------+---------------+
|   CLIENT    | MONTHLY_PURCHASES | SALES_VALUE | RETURNS_VALUE |
+-------------+-------------------+-------------+---------------+
| Mercury Ltd | 3                 | 400000      | 30000         |
| Saturn Plc  | 11                | 150000      | 30000         |
| Uranus Ltd  | 4                 | 80000       | 1000          |
+-------------+-------------------+-------------+---------------+

The query is supposed to return clients ranked by various criteria contained in the columns. The number of columns may grow in the future.

For instance, if I want to get the top 10 best clients, ranked from 100 (best) to 0 (worst), the SQL query would be:

WITH var AS (
    --we need the min and max values for each criteria, to calculate the rank later
    SELECT 
      MIN(monthly_purchases) AS min_pur,
      MAX(monthly_purchases) AS max_pur,
      MIN(sales_value) AS min_sales,
      MAX(sales_value) AS max_sales,
      MIN(returns_value) AS min_returns,
      MAX(returns_value) AS max_returns
    FROM clients
),
--standardise values to a 0 to 100 range, so we can compare apples with oranges, and assign weights to each criteria (from 0 to 1)
weights AS (        
    SELECT client,
      --the higher the number of purchases the better. Weight: 0.2 out of 1.
      0.2 * (clients.monthly_purchases - var.min_pur) / (var.max_pur - var.min_pur) * 100 AS rnk_pur,
      --the higher the value of sales, the better. Weight: 0.4 out of 1.
      0.4 * (clients.sales_value - var.min_sales) / (var.max_sales - var.min_sales) * 100 AS rnk_sales,
      --the lower the value of returns the better. Weight: 0.4 out of 1.
      0.4 * (1 - (clients.returns_value - var.min_returns) / (var.max_returns - var.min_returns)) * 100 AS rnk_returns
    FROM clients, var
)
SELECT weights.client, weights.rnk_pur + weights.rnk_sales + weights.rnk_returns as overall_rank FROM weights ORDER BY overall_rank DESC LIMIT 10

All good, but in reality the number of columns is larger (about 40), and the user could use at once anything between 1 and 15 for ranking.

As such, the SQL route is not feasible. I tried creating a function which would at least do the standardisation of values:

--Firstly, a function to find the highest value in an array
DROP FUNCTION IF EXISTS array_max(float[]);

CREATE OR REPLACE FUNCTION array_max(float[])
RETURNS float
AS $$
  select max(x) from unnest($1)x order by 1;
$$
LANGUAGE 'sql';

--Secondly, a function to find the lowest value in an array
DROP FUNCTION IF EXISTS array_min(float[]);

CREATE OR REPLACE FUNCTION array_min(float[])
RETURNS float
AS $$
  select min(x) from unnest($1)x order by 1;
$$
LANGUAGE 'sql';

--Finally, our function
DROP FUNCTION IF EXISTS standardise(float[], float);

CREATE OR REPLACE FUNCTION standardise(myarray float[], val float)
RETURNS float AS
$$

DECLARE
  minimum float;
  maximum float;
  calc_result float;
BEGIN
  minimum = array_min(myarray);
  maximum = array_max(myarray);

  calc_result = (val - minimum) / (maximum - minimum) * 100;

  RETURN calc_result;
END;
$$
LANGUAGE 'plpgsql' IMMUTABLE;

Not surprisingly, the function is pretty slow. If used like this:

SELECT 0.5 * standardise((SELECT array(SELECT sales_value FROM clients)), clients.sales_value) AS rnk_sales
FROM clients

... it's acceptable. Anything that involves ordering slows it down to a crawl. I.e:

SELECT 0.5 * standardise((SELECT array(SELECT sales_value FROM clients)), clients.sales_value) AS rnk_sales
FROM clients ORDER BY rnk_sales LIMIT 10

Is there any way in which I can improve the speed of above function. Or, perhaps a totally different approach? Any help would be much appreciated. Thanks!

UPDATE:

I ran EXPLAIN ANALYZE with the last query. For this I chose only a sample from the whole table, as it was taking way too long. I cancelled the query after waiting for 10 minutes. This is on a table with 1000 clients:

EXPLAIN ANALYZE SELECT 0.5 * standardise((SELECT array(SELECT sales_value FROM clients_sample)), clients_sample.sales_value) AS rnk_sales
FROM clients_sample ORDER BY rnk_sales LIMIT 10

Result:

Limit  (cost=78.82..78.83 rows=10 width=8) (actual time=357.806..357.822 rows=10 loops=1)
  InitPlan 2 (returns $1)
    ->  Result  (cost=12.00..12.00 rows=1 width=0) (actual time=1.267..1.268 rows=1 loops=1)
          InitPlan 1 (returns $0)
            ->  Seq Scan on clients_sample clients_sample_1  (cost=0.00..12.00 rows=1000 width=8) (actual time=0.002..0.666 rows=1000 loops=1)
  ->  Sort  (cost=66.82..67.32 rows=1000 width=8) (actual time=357.805..357.809 rows=10 loops=1)
        Sort Key: ((0.5::double precision * standardise($1, clients_sample.sales_value)))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on clients_sample  (cost=0.00..62.50 rows=1000 width=8) (actual time=1.870..356.742 rows=1000 loops=1)
Total runtime: 357.850 ms
Was it helpful?

Solution

Sanitized helper functions

CREATE OR REPLACE FUNCTION array_max(float[])
  RETURNS float AS
'SELECT max(x) from unnest($1) x'
LANGUAGE sql;

ORDER BY 1 would be useless, since max(x) returns a single row anyway.
Same for array_min(float[]);

However, do not use these functions. It is cheaper to get min() and max() in a single call.

Main function:

Use a simple SQL function instead:

CREATE OR REPLACE FUNCTION standardise(_arr float[], _val float)
  RETURNS float AS
$func$
SELECT ((_val - min_x) * 100) / (max_x - min_x)
FROM (
   SELECT min(x) AS min_x, max(x) AS max_x
   FROM   unnest($1) x
   ) sub
$func$
LANGUAGE sql IMMUTABLE;
  • Using a subquery to get both aggregates at once.
  • Multiply first for typically higher precision.
  • Don't quote the language name.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top