Question

I'm trying to write a complex query using PostgreSQL 9.2.4, and I'm having trouble getting it working. I have a table which contains a time range, as well as several other columns. When I store data in this table, if all of the columns are the same and the time ranges overlap or are adjacent, I combine them into one row.

When I retrieve them, though, I want to split the ranges at day boundaries - so for example:

2013-01-01 00:00:00 to 2013-01-02 23:59:59

would be selected as two rows:

2013-01-01 00:00:00 to 2013-01-01 23:59:59
2013-01-02 00:00:00 to 2013-01-02 23:59:59

with the values in the other columns the same for both retrieved entries.

I have seen this question which seems to more or less address what I want, but it's for a "very old" version of PostgreSQL, so I'm not sure it's really still applicable.

I've also seen this question, which does exactly what I want, but as far as I know the CONNECT BY statement is an Oracle extension to the SQL standard, so I can't use it.

I believe I can achieve this using PostgreSQL's generate_series, but I'm hoping there's a simple example out there demonstrating how it can be used to do this.

This is the query I'm working on at the moment, which currently doesn't work (because I can't reference the FROM table in a joined subquery), but I believe this is more-or-less the right track.

Here's the fiddle with the schema, sample data, and my working query.

Update: I just found out a fun fact, thanks to this question, that if you use a set-returning function in the SELECT part of the query, PostgreSQL will "automagically" do a cross join on the set and the row. I think I'm close to getting this working.

Was it helpful?

Solution

First off, your upper border concept is broken. A timestamp with 23:59:59 is no good. The data type timestamp has fractional digits. What about 2013-10-18 23:59:59.123::timestamp?

Include the lower border and exclude the upper border everywhere in your logic. Compare:

Building on this premise:

Postgres 9.2 or older

SELECT id
     , stime
     , etime
FROM   timesheet_entries t
WHERE  etime <= stime::date + 1  -- this includes upper border 00:00

UNION ALL
SELECT id
     , CASE WHEN stime::date = d THEN stime ELSE d END     -- AS stime
     , CASE WHEN etime::date = d THEN etime ELSE d + 1 END -- AS etime
FROM (
   SELECT id
        , stime
        , etime
        , generate_series(stime::date, etime::date, interval '1d')::date AS d
   FROM   timesheet_entries t
   WHERE  etime > stime::date + 1
   ) sub
ORDER  BY id, stime;

Or simply:

SELECT id
     , CASE WHEN stime::date = d THEN stime ELSE d END     -- AS stime
     , CASE WHEN etime::date = d THEN etime ELSE d + 1 END -- AS etime
FROM (
   SELECT id
        , stime
        , etime
        , generate_series(stime::date, etime::date, interval '1d')::date AS d
   FROM   timesheet_entries t
   ) sub
ORDER  BY id, stime;

The simpler one may even be faster.
Note a corner case difference when stime and etime both fall on 00:00 exactly. Then a row with a zero time range is added at the end. There are various ways to deal with that. I propose:

SELECT *
FROM  (
   SELECT id
        , CASE WHEN stime::date = d THEN stime ELSE d END     AS stime
        , CASE WHEN etime::date = d THEN etime ELSE d + 1 END AS etime
   FROM (
      SELECT id
           , stime
           , etime
           , generate_series(stime::date, etime::date, interval '1d')::date AS d
      FROM   timesheet_entries t
      ) sub1
   ORDER  BY id, stime
   ) sub2
WHERE  etime <> stime;

Postgres 9.3+

In Postgres 9.3+ you would better use LATERAL for this

SELECT id
     , CASE WHEN stime::date = d THEN stime ELSE d END     AS stime
     , CASE WHEN etime::date = d THEN etime ELSE d + 1 END AS etime
FROM   timesheet_entries t
     , LATERAL (SELECT d::date
                FROM   generate_series(t.stime::date, t.etime::date, interval '1d') d
                ) d
ORDER  BY id, stime;

Details in the manual.
Same corner case as above.

SQL Fiddle demonstrating all.

OTHER TIPS

There is simply solution (if intervals starts in same time)

postgres=# select i, i + interval '1day' - interval '1sec' 
  from generate_series('2013-01-01 00:00:00'::timestamp, '2013-01-02 23:59:59', '1day') g(i);
          i          │      ?column?       
─────────────────────┼─────────────────────
 2013-01-01 00:00:00 │ 2013-01-01 23:59:59
 2013-01-02 00:00:00 │ 2013-01-02 23:59:59
(2 rows)

I wrote a table function, that do it for any interval. It is fast - two years range divide to 753 ranges in 10ms

create or replace function day_ranges(timestamp, timestamp)
returns table(t1 timestamp, t2 timestamp) as $$
begin
  t1 := $1;
  if $2 > $1 then
    loop
      if t1::date = $2::date then
        t2 := $2;
        return next;
        exit;
      end if;
      t2 := date_trunc('day', t1) + interval '1day' - interval '1sec';
      return next;
      t1 := t2 + interval '1sec';
    end loop;
  end if;
  return;
end;
$$ language plpgsql;

Result:

postgres=# select * from day_ranges('2013-10-08 22:00:00', '2013-10-10 23:00:00');
         t1          │         t2          
─────────────────────┼─────────────────────
 2013-10-08 22:00:00 │ 2013-10-09 23:59:59
 2013-10-09 00:00:00 │ 2013-10-09 23:59:59
 2013-10-10 00:00:00 │ 2013-10-10 23:00:00
(3 rows)

Time: 6.794 ms

and faster (and little bit longer) version based on RETURN QUERY

create or replace function day_ranges(timestamp, timestamp)
returns table(t1 timestamp, t2 timestamp) as $$
begin
  t1 := $1; t2 := $2;
  if $1::date = $2::date then
    return next;
  else
    -- first day
    t2 := date_trunc('day', t1) + interval '1day' - interval '1sec';
    return next;
    if $2::date > $1::date + 1 then
      return query select d, d + interval '1day' - interval '1sec'
                      from generate_series(date_trunc('day', $1 + interval '1day')::timestamp,
                                           date_trunc('day', $2 - interval '1day')::timestamp,
                                           '1day') g(d);
    end if;
    -- last day 
    t1 := date_trunc('day', $2); t2 := $2;
    return next;
  end if;
  return;
end;
$$ language plpgsql;

You may build the result by simply generating all days in the entire set of ranges and join this to the ranges themselves using the OVERLAPS operator. This will both discard the days for which there is no data and do the required cartesian product for the days with data.

OVERLAPS behavior is explained in the doc as this:

(start1, end1) OVERLAPS (start2, end2)
(start1, length1) OVERLAPS (start2, length2)
This expression yields true when two time periods (defined by their endpoints) overlap, false when they do not overlap. The endpoints can be specified as pairs of dates, times, or time stamps; or as a date, time, or time stamp followed by an interval. When a pair of values is provided, either the start or the end can be written first; OVERLAPS automatically takes the earlier value of the pair as the start. Each time period is considered to represent the half-open interval start <= time < end, unless start and end are equal in which case it represents that single time instant. This means for instance that two time periods with only an endpoint in common do not overlap.

With your timesheet_entries table, the query would be:

select days.day, timesheet_entries.* from
  (select day from generate_series(
         (select min(stime) from timesheet_entries),
         (select max(etime) from timesheet_entries),
         '1 day'::interval) day
  ) days
 join timesheet_entries 
  on (stime,etime) overlaps (days.day,days.day+'1 day'::interval)
 order by 1;

This technique makes it also easy to add the days without data, calendar-like. To do so, just replace the join by a left join.

As pointed by @Erwin in his answer, etime should not represent the last second (...:59:59) of the interval but the next second as an excluded upper bound. It may not matter with your current data, but if certain rows had stime at ...:59:59 or if you happened to need sub-second resolution, that would be a problem.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top