Question

Vertica's shiny feature:

I have two tables that I would like to join with INTERPOLATE and expect data from the second table to be interpolated with the latest available. But unfortunately I am not able to get my desired result. I have checked out the Vertica documentation related to the INTERPOLATE feature and tried an example which worked fine.

CREATE TABLE a
( family int,
  date DATE,
  id int
);

CREATE TABLE b
( Id int,
  date DATE,
  datapoint float
);

INSERT INTO a VALUES (1, '20130603', 1);
INSERT INTO a VALUES (1, '20130604', 1);
INSERT INTO a VALUES (1, '20130605', 1);
INSERT INTO a VALUES (1, '20130606', 1);
INSERT INTO a VALUES (1, '20130607', 1);

INSERT INTO b VALUES (1, '20130603', 3.00);

SELECT a.family, a.date, a.id, a.date, b.datapoint
  FROM a
  LEFT
  JOIN b
    ON a.id = b.id
   AND a.date INTERPOLATE PREVIOUS VALUE b.date;

vertdeva01:20130612-095628 > \g
 family |    date    | id |    date    | data
--------+------------+----+------------+------
      1 | 2013-06-03 |  1 | 2013-06-03 |    3
      1 | 2013-06-04 |  1 | 2013-06-04 |    3
      1 | 2013-06-05 |  1 | 2013-06-05 |    3
      1 | 2013-06-06 |  1 | 2013-06-06 |    3
      1 | 2013-06-07 |  1 | 2013-06-07 |    3

There I get the results as expected. The values in table b are interpolated according to the dates in table a.

But when I try something similar in a slightly more complex scenario, I don't get what I want.

What I intend to achieve:

What I intend to achieve is to select the latest available data from b for every id in a, as of the corresponding date in a. So if a has an (id, date) combination, I would like to fetch the data from b for that id and date. If b has no data for that id on that date, I want whatever was most recently available as of that date, i.e. the data that would be valid as of the date in a. In other words, a look-back (last value carried forward) behavior: a data point for an id in b stays valid until a later data point for that id appears. I hope that makes sense. I know a way to do this using MAX() and GROUP BY; I would like to know if the same is possible using INTERPOLATE.

An Example:

Just to give you an idea of what it is, I played with an example. This time I just modified the previously created tables to have more fields.

CREATE TABLE a
( family int,
  family_name varchar(50),
  industry varchar(15),
  style_flag varchar(1),
  id int,
  id_name varchar(50),
  id1 int,
  id2 int,
  id3 int,
  date DATE,
  id4 int
);

CREATE TABLE b
( id4 int,
  flag int,
  period int,
  date DATE,
  datapoint float
);

INSERT INTO a VALUES (1, '1family', 'comp', 'A', 1, '1 id', 101, 201, 301, '20130603', 401);
INSERT INTO a VALUES (1, '1family', 'comp', 'A', 2, '2 id', 102, 202, 302, '20130603', 402);
INSERT INTO a VALUES (1, '1family', 'comp', 'A', 3, '3 id', 103, 203, 303, '20130603', 403);

INSERT INTO a VALUES (2, '2family', 'bio', 'A', 5, '5 id', 105, 205, 305, '20130603', 405);
INSERT INTO a VALUES (2, '2family', 'bio', 'A', 7, '7 id', 107, 207, 307, '20130603', 407);
INSERT INTO a VALUES (2, '2family', 'bio', 'A', 9, '9 id', 109, 209, 309, '20130603', 409);

INSERT INTO b VALUES (401, 1, 10, '20130501', 2.00);
INSERT INTO b VALUES (401, 1, 20, '20130501', 1.50);
INSERT INTO b VALUES (401, 2, 10, '20130409', 12.34);
INSERT INTO b VALUES (401, 2, 20, '20130401', 10.56);

INSERT INTO b VALUES (402, 1, 10, '20130501', 2.00);
INSERT INTO b VALUES (402, 2, 20, '20130409', 12.34);
INSERT INTO b VALUES (402, 2, 20, '20130401', 10.56);
INSERT INTO b VALUES (402, 2, 20, '20130515', 20.55);

When I run the following query:

SELECT a.family, a.family_name, a.industry, a.style_flag,
       a.id, a.id_name,
       a.id1, a.id2, a.id3, a.date,
       b.id4, b.flag, b.period, b.datapoint
  FROM a
  LEFT
  JOIN b
    ON a.id4 = b.id4
   AND a.date INTERPOLATE PREVIOUS VALUE b.date;

I get the following:

family | family_name | industry | style_flag | id | id_name | id1 | id2 | id3 |    date    | id4 | flag | period | datapoint
--------+-------------+----------+------------+----+---------+-----+-----+-----+------------+-----+------+--------+-----------
      2 | 2family     | bio      | A          |  5 | 5 id    | 105 | 205 | 305 | 2013-06-03 |     |      |        |
      1 | 1family     | comp     | A          |  1 | 1 id    | 101 | 201 | 301 | 2013-06-03 | 401 |    1 |     10 |         2
      1 | 1family     | comp     | A          |  3 | 3 id    | 103 | 203 | 303 | 2013-06-03 |     |      |        |
      2 | 2family     | bio      | A          |  9 | 9 id    | 109 | 209 | 309 | 2013-06-03 |     |      |        |
      2 | 2family     | bio      | A          |  7 | 7 id    | 107 | 207 | 307 | 2013-06-03 |     |      |        |
      1 | 1family     | comp     | A          |  2 | 2 id    | 102 | 202 | 302 | 2013-06-03 | 402 |    2 |     20 |     20.55

But I need the latest value available in b for each combination of (id4, flag, period), rather than the single row per id4 that the query currently returns. Is there a way to use the INTERPOLATE feature for this, or should I take a completely different approach? The problem is that the data in table b is sparse: we may not have a data point every day, whereas in a we have a data point every day.

I also tried filling in the gaps between the data points in b using the TIMESERIES clause and TS_FIRST_VALUE(datapoint, 'const'). But there again, the latest date available in b for a combination of (id4, flag, period) could be way back in time compared to the date for an id in a, so I end up with the same problem as demonstrated above.
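For readers following along, a gap-filling query of the kind described above might look roughly like the sketch below. The exact query that was tried is not shown in the post, so this is a guess at its shape: the 1-day slice length, the aliases `ts` and `src`, and the cast of `date` to TIMESTAMP for the TIMESERIES ordering are all assumptions.

```sql
-- Sketch only: fill daily gaps in b per (id4, flag, period).
-- Assumption: 1-day slices; date is cast to TIMESTAMP for the ORDER BY.
SELECT slice_time,
       id4, flag, period,
       TS_FIRST_VALUE(datapoint, 'const') AS datapoint
FROM (
    SELECT id4, flag, period, date::TIMESTAMP AS ts, datapoint
    FROM b
) AS src
TIMESERIES slice_time AS '1 day'
    OVER (PARTITION BY id4, flag, period ORDER BY ts);
```

TIMESERIES only generates slices between the first and last timestamp that exist in each partition, so a group whose last data point in b is 2013-05-01 gets no slice for 2013-06-03. That is the limitation described above.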

Any guidance would be highly appreciated.


Solution

Eakan,

I don't have access to a Vertica environment to test this, but I think the issue with your second example is that your result set only has one date in it, since there's only one date in the a table, while there are multiple id4 values. So when your query asks to interpolate across date gaps, there are no gaps to interpolate over. The gaps that you see in the values from your b table actually belong to different id4 values from the a table.
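A quick way to check this, assuming the a table exactly as defined in the question, is to count the distinct dates per id4. The column aliases here are only illustrative.

```sql
-- How many distinct dates does each id4 have in a?
-- A single date per id4 leaves nothing to interpolate across.
SELECT id4,
       COUNT(DISTINCT date) AS n_dates,
       MIN(date)            AS first_date,
       MAX(date)            AS last_date
FROM a
GROUP BY id4
ORDER BY id4;
```

With the six rows inserted in the question, every id4 should come back with a single date (2013-06-03).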

I'm not sure whether you can have more than one INTERPOLATE predicate in the join clause, but how about this:

SELECT     a.family, a.family_name, a.industry, a.style_flag, a.id, a.id_name,
           a.id1, a.id2, a.id3, a.date, a.id4, b.id4, b.flag, b.period, b.datapoint
FROM       a
LEFT JOIN  b
ON         a.id4 INTERPOLATE PREVIOUS VALUE b.id4 AND
           a.date INTERPOLATE PREVIOUS VALUE b.date;

I think this will fill down the values from the b table, but this might not be what you want. Perhaps you only mean to interpolate over dates, and you think it didn't work just because you don't have more than one date in your a table. Put a few more dates in the a table and re-run the query to see what I mean.

Contact me offline if you want to discuss in more detail. If you can hook me up with access to a working Vertica environment to play with, then I can try out some more ideas for you.

Alan

FOLLOW UP:

Ok, so I managed to get access to a test Vertica environment.

Firstly it seems that you can’t have more than one interpolated join predicate …

vmartdb=> SELECT     a.family, a.family_name, a.industry, a.style_flag, a.id, a.id_name,
vmartdb->            a.id1, a.id2, a.id3, a.date, a.id4, b.id4, b.flag, b.period, b.datapoint
vmartdb-> FROM       est_cal.a AS a
vmartdb-> LEFT JOIN  est_cal.b AS b
vmartdb-> ON         a.id4 INTERPOLATE PREVIOUS VALUE b.id4 AND
vmartdb->            a.date INTERPOLATE PREVIOUS VALUE b.date;
ERROR 2093:  A join can have only one set of interpolated predicates
vmartdb=>

So that’s that.

Then I tried to add a few more dates to the a table and saw that your original query did indeed interpolate over the gaps for the b table fields …

CREATE TABLE est_cal.a (
  family int,
  family_name varchar(50),
  industry varchar(15),
  style_flag varchar(1),
  id int,
  id_name varchar(50),
  id1 int,
  id2 int,
  id3 int,
  date DATE,
  id4 int
);

CREATE TABLE est_cal.b (
  id4 int,
  flag int,
  period int,
  date DATE,
  datapoint float
);

INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 1, '1 id', 101, 201, 301, '20130603', 401);
INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 2, '2 id', 102, 202, 302, '20130603', 402);
INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 3, '3 id', 103, 203, 303, '20130603', 403);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 5, '5 id', 105, 205, 305, '20130603', 405);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 7, '7 id', 107, 207, 307, '20130603', 407);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 9, '9 id', 109, 209, 309, '20130603', 409);
INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 1, '1 id', 101, 201, 301, '20130604', 401);
INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 2, '2 id', 102, 202, 302, '20130604', 402);
INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 3, '3 id', 103, 203, 303, '20130604', 403);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 5, '5 id', 105, 205, 305, '20130605', 405);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 7, '7 id', 107, 207, 307, '20130605', 407);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 9, '9 id', 109, 209, 309, '20130605', 409);
INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 1, '1 id', 101, 201, 301, '20130605', 401);
INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 2, '2 id', 102, 202, 302, '20130605', 402);
INSERT INTO est_cal.a VALUES (1, '1family', 'comp', 'A', 3, '3 id', 103, 203, 303, '20130605', 403);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 5, '5 id', 105, 205, 305, '20130605', 405);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 7, '7 id', 107, 207, 307, '20130605', 407);
INSERT INTO est_cal.a VALUES (2, '2family', 'bio', 'A', 9, '9 id', 109, 209, 309, '20130605', 409);

INSERT INTO est_cal.b VALUES (401, 1, 10, '20130501', 2.00);
INSERT INTO est_cal.b VALUES (401, 1, 20, '20130501', 1.50);
INSERT INTO est_cal.b VALUES (401, 2, 10, '20130409', 12.34);
INSERT INTO est_cal.b VALUES (401, 2, 20, '20130401', 10.56);
INSERT INTO est_cal.b VALUES (402, 1, 10, '20130501', 2.00);
INSERT INTO est_cal.b VALUES (402, 2, 20, '20130409', 12.34);
INSERT INTO est_cal.b VALUES (402, 2, 20, '20130401', 10.56);
INSERT INTO est_cal.b VALUES (402, 2, 20, '20130515', 20.55);


SELECT     a.family, a.family_name, a.industry, a.style_flag, a.id, a.id_name,
           a.id1, a.id2, a.id3, a.date AS a_date, b.date AS b_date, a.id4 AS a_id4, b.id4 AS b_id4, b.flag, b.period, b.datapoint
FROM       est_cal.a AS a
LEFT JOIN  est_cal.b AS b
ON         a.id4 = b.id4 AND
           a.date INTERPOLATE PREVIOUS VALUE b.date;

Which generated the following results ...

family | family_name | industry | style_flag | id | id_name | id1 | id2 | id3 |   a_date   |   b_date   | a_id4 | b_id4 | flag | period | datapoint
--------+-------------+----------+------------+----+---------+-----+-----+-----+------------+------------+-------+-------+------+--------+-----------
      2 | 2family     | bio      | A          |  7 | 7 id    | 107 | 207 | 307 | 2013-06-03 |            |   407 |       |      |        |
      2 | 2family     | bio      | A          |  7 | 7 id    | 107 | 207 | 307 | 2013-06-04 |            |   407 |       |      |        |
      2 | 2family     | bio      | A          |  7 | 7 id    | 107 | 207 | 307 | 2013-06-05 |            |   407 |       |      |        |
      2 | 2family     | bio      | A          |  7 | 7 id    | 107 | 207 | 307 | 2013-06-05 |            |   407 |       |      |        |
      2 | 2family     | bio      | A          |  5 | 5 id    | 105 | 205 | 305 | 2013-06-03 |            |   405 |       |      |        |
      2 | 2family     | bio      | A          |  5 | 5 id    | 105 | 205 | 305 | 2013-06-04 |            |   405 |       |      |        |
      2 | 2family     | bio      | A          |  5 | 5 id    | 105 | 205 | 305 | 2013-06-05 |            |   405 |       |      |        |
      2 | 2family     | bio      | A          |  5 | 5 id    | 105 | 205 | 305 | 2013-06-05 |            |   405 |       |      |        |
      2 | 2family     | bio      | A          |  9 | 9 id    | 109 | 209 | 309 | 2013-06-03 |            |   409 |       |      |        |
      2 | 2family     | bio      | A          |  9 | 9 id    | 109 | 209 | 309 | 2013-06-04 |            |   409 |       |      |        |
      2 | 2family     | bio      | A          |  9 | 9 id    | 109 | 209 | 309 | 2013-06-05 |            |   409 |       |      |        |
      2 | 2family     | bio      | A          |  9 | 9 id    | 109 | 209 | 309 | 2013-06-05 |            |   409 |       |      |        |
      1 | 1family     | comp     | A          |  1 | 1 id    | 101 | 201 | 301 | 2013-06-03 | 2013-05-01 |   401 |   401 |    1 |     10 |         2
      1 | 1family     | comp     | A          |  1 | 1 id    | 101 | 201 | 301 | 2013-06-04 | 2013-05-01 |   401 |   401 |    1 |     10 |         2
      1 | 1family     | comp     | A          |  1 | 1 id    | 101 | 201 | 301 | 2013-06-05 | 2013-05-01 |   401 |   401 |    1 |     10 |         2
      1 | 1family     | comp     | A          |  3 | 3 id    | 103 | 203 | 303 | 2013-06-03 |            |   403 |       |      |        |
      1 | 1family     | comp     | A          |  3 | 3 id    | 103 | 203 | 303 | 2013-06-04 |            |   403 |       |      |        |
      1 | 1family     | comp     | A          |  3 | 3 id    | 103 | 203 | 303 | 2013-06-05 |            |   403 |       |      |        |
      1 | 1family     | comp     | A          |  2 | 2 id    | 102 | 202 | 302 | 2013-06-03 | 2013-05-15 |   402 |   402 |    2 |     20 |     20.55
      1 | 1family     | comp     | A          |  2 | 2 id    | 102 | 202 | 302 | 2013-06-04 | 2013-05-15 |   402 |   402 |    2 |     20 |     20.55
      1 | 1family     | comp     | A          |  2 | 2 id    | 102 | 202 | 302 | 2013-06-05 | 2013-05-15 |   402 |   402 |    2 |     20 |     20.55
(21 rows)

So this shows that the date interpolation is working as it should.
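For the remaining requirement in the question (the latest value per (id4, flag, period) as of each date in a), one alternative that does not rely on INTERPOLATE is a range join ranked with ROW_NUMBER(). The following is a minimal sketch, not something that was run as part of the test above. It assumes the a and b tables as defined in the question, that a has at most one row per (id4, date), and that an inequality predicate is acceptable in the outer join. The aliases `ranked`, `rn`, `a_date` and `b_date` are only illustrative.

```sql
-- Sketch only: as-of lookup per (id4, flag, period) without INTERPOLATE.
SELECT family, family_name, id, id_name,
       a_date, id4, flag, period, b_date, datapoint
FROM (
    SELECT a.family, a.family_name, a.id, a.id_name,
           a.date AS a_date, a.id4,
           b.flag, b.period, b.date AS b_date, b.datapoint,
           -- Rank b rows newest first within each (a row, flag, period) group
           ROW_NUMBER() OVER (
               PARTITION BY a.id4, a.date, b.flag, b.period
               ORDER BY b.date DESC
           ) AS rn
    FROM a
    LEFT JOIN b
      ON a.id4 = b.id4
     AND b.date <= a.date      -- only data known as of a.date
) AS ranked
WHERE rn = 1;                  -- keep the newest row in each group
```

Rows in a with no match in b still come through with NULLs, because the LEFT JOIN puts each of them in its own partition. If a can hold duplicate (id4, date) rows, as est_cal.a does above, the partition would need an additional key to keep them apart.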

Alan

Licensed under: CC-BY-SA with attribution