Parsing COPY's binary format to access a tsrange

https://dba.stackexchange.com/questions/190780

10-10-2020

Question

How is tsrange stored in binary?

For example, create a table:

CREATE TABLE public.test (t tsrange);
INSERT INTO test VALUES ('[2010-01-01 14:30, 2010-01-01 15:30)');
INSERT INTO test VALUES ('[2011-01-01 14:31, 2015-11-01 15:30)');
INSERT INTO test VALUES ('[2017-01-01 14:31, 2018-11-01 15:30)');
COPY test TO '/tmp/pgcopy' WITH (FORMAT binary);
COPY test TO '/tmp/pgcopy.csv' WITH (FORMAT csv);

It outputs:

cat /tmp/pgcopy.csv
"[""2010-01-01 14:30:00"",""2010-01-01 15:30:00"")"
"[""2011-01-01 14:31:00"",""2015-11-01 15:30:00"")"
"[""2017-01-01 14:31:00"",""2018-11-01 15:30:00"")"


hexdump -C /tmp/pgcopy
00000000  50 47 43 4f 50 59 0a ff  0d 0a 00 00 00 00 00 00  |PGCOPY..........|
00000010  00 00 00 00 01 00 00 00  19 02 00 00 00 08 00 01  |................|
00000020  1f 19 f9 a9 aa 00 00 00  00 08 00 01 1f 1a d0 3d  |...............=|
00000030  4e 00 00 01 00 00 00 19  02 00 00 00 08 00 01 3b  |N..............;|
00000040  c8 89 51 11 00 00 00 00  08 00 01 c6 7b 1a 3a 0e  |..Q.........{.:.|
00000050  00 00 01 00 00 00 19 02  00 00 00 08 00 01 e8 08  |................|
00000060  0d 77 11 00 00 00 00 08  00 02 1c 9a dc 4d 0e 00  |.w...........M..|
00000070  ff ff                                             |..|
00000072

One field is:

00 00 00 19 02 00 00 00 08 00 01 e8 08 0d 77 11 00 00 00 00 08  00 02 1c 9a dc 4d 0e 00

Here:

00000019 - the field length, 25 bytes

02 - brackets

00000008 - subfield length

0001e8080d771100 and 00021c9adc4d0e00 - the stored timestamps, with microseconds.

How do I convert these to an integer timestamp?

Solution

As a minor note, the binary output of COPY doesn't store brackets as such. That byte is the range's flags, which represent, amongst other things, the brackets.

`COPY ... (WITH BINARY)`

From the docs on COPY:

To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the *send and *recv functions for each column's data type (typically these functions are found in the src/backend/utils/adt/ directory of the source distribution).

Further, the docs say the binary format (currently) has a header made of

11 bytes of signature
4 bytes of flags
4 bytes giving the length of the header extension area. That area is currently not in use, so the length is zero (\0\0\0\0) and we simply skip these four bytes. This is technically not robust: if these four bytes held 15, we'd have to skip over not just the four, but an additional 15.

Then the tuple has

2 bytes for field count

Then the fields have

4 byte length qualifier followed by that many bytes of field data. (The length is something we already know in the case of timestamp or tsrange.)

So essentially we skip 25 bytes (11 + 4 + 4 + 2 + 4) to get to the data of the first column.
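The skip count above can be checked with a small Python sketch. This is only an illustration: the helper name `locate_first_field` is hypothetical, and the sample bytes are the first 25 bytes of the hexdump in the question. Unlike the fixed 25-byte skip, it honours the header extension length rather than assuming it is zero.

```python
import struct

# The 11-byte signature that starts every binary COPY stream.
SIGNATURE = b"PGCOPY\n\xff\r\n\x00"

def locate_first_field(buf):
    """Return (offset, length) of the first field's data in the first tuple.

    Hypothetical helper for illustration only; it does not handle the
    end-of-data marker (field count -1) or NULL fields (length -1).
    """
    if buf[:11] != SIGNATURE:
        raise ValueError("not a PGCOPY binary stream")
    pos = 11
    # 4-byte flags field, then 4-byte header extension length (big-endian).
    flags, ext_len = struct.unpack_from(">II", buf, pos)
    pos += 8 + ext_len            # skip the extension area too, if any
    (field_count,) = struct.unpack_from(">h", buf, pos)
    pos += 2
    (field_len,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    return pos, field_len

# First 25 bytes of the hexdump from the question.
data = bytes.fromhex(
    "5047434f50590aff0d0a00"  # signature
    "00000000"                # flags
    "00000000"                # header extension length
    "0001"                    # field count
    "00000019"                # length of the first field
)
print(locate_first_field(data))
```

With the dump from the question this reports offset 25 and a field length of 25, which is why the fixed skip works as long as the extension area stays empty.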

The `tsrange`

So it's in the format specified by range_send. You can see that explained a bit in the comments above range_recv:

Binary representation: The first byte is the flags, then the lower bound (if present), then the upper bound (if present). Each bound is represented by a 4-byte length header and the binary representation of that bound (as returned by a call to the send function for the subtype).
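As a rough sketch of that layout in Python, the function below splits a range field into its flags byte and the raw bytes of each bound. The name `parse_range` is hypothetical, and the flag bit values are my reading of `rangetypes.h` in the PostgreSQL source, so treat them as an assumption to verify against your server version. The sample field is the one quoted in the question, without its 4-byte length prefix.

```python
import struct

# Flag bits as I read them in src/include/utils/rangetypes.h (assumption).
RANGE_EMPTY = 0x01
RANGE_LB_INC = 0x02   # lower bound inclusive, i.e. "["
RANGE_UB_INC = 0x04   # upper bound inclusive, i.e. "]"
RANGE_LB_INF = 0x08   # no lower bound
RANGE_UB_INF = 0x10   # no upper bound

def parse_range(field):
    """Split a range field into (flags, lower_bytes, upper_bytes).

    Hypothetical helper; a bound that is absent comes back as None.
    """
    flags = field[0]
    pos = 1
    bounds = []
    for infinite in (RANGE_LB_INF, RANGE_UB_INF):
        if flags & (RANGE_EMPTY | infinite):
            bounds.append(None)       # bound not present in the stream
            continue
        (n,) = struct.unpack_from(">i", field, pos)  # 4-byte length header
        pos += 4
        bounds.append(field[pos:pos + n])
        pos += n
    return flags, bounds[0], bounds[1]

# The field from the question (third row), minus the leading 00 00 00 19.
field = bytes.fromhex("02" "00000008" "0001e8080d771100" "00000008" "00021c9adc4d0e00")
flags, lower, upper = parse_range(field)
print(hex(flags), lower.hex(), upper.hex())
```

Under that reading of the flag bits, `02` means only the lower bound is inclusive, which matches the `[ ... )` form of the ranges inserted in the question.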

The `timestamp` subtype

In your case, that subtype is timestamp and the send is timestamp_send.

You can see a timestamp is stored as 8 bytes, and that's just sent with a simple pq_sendint64 (a 64 bit/8 byte int). You'll have to read how timestamp_recv works to see how you should handle a binary representation of a timestamp. Hint: it gets into a struct for in-memory representation in timestamp2tm

/* timestamp2tm()
 * Convert timestamp data type to POSIX time structure.
 * Note that year is _not_ 1900-based, but is an explicit full value.
 * Also, month is one-based, _not_ zero-based.
 * Returns:
 *   0 on success
 *  -1 on out of range

I'm not going to go into this any further here, but maybe next time around.
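For readers who only need the conversion and not the full `timestamp2tm` logic, here is a minimal Python sketch. It assumes the server uses integer datetimes, so the 8 bytes are a big-endian signed count of microseconds since 2000-01-01 00:00:00, and it does not handle the special `infinity` values. The function names are hypothetical.

```python
import struct
from datetime import datetime, timedelta

# PostgreSQL counts timestamps from 2000-01-01, not from the Unix epoch.
PG_EPOCH = datetime(2000, 1, 1)
# Seconds between 1970-01-01 and 2000-01-01.
UNIX_TO_PG_EPOCH_SECONDS = 946_684_800

def pg_timestamp_to_datetime(raw):
    """Decode 8 bytes as sent by timestamp_send into a naive datetime."""
    (micros,) = struct.unpack(">q", raw)   # signed 64-bit, network byte order
    return PG_EPOCH + timedelta(microseconds=micros)

def pg_micros_to_unix_seconds(micros):
    """Convert the int8 value to whole seconds since 1970-01-01.

    Treats the timestamp as if it were UTC, since timestamp without
    time zone carries no zone information.
    """
    return micros // 1_000_000 + UNIX_TO_PG_EPOCH_SECONDS

raw = bytes.fromhex("00011f19f9a9aa00")   # lower bound of the first row
print(pg_timestamp_to_datetime(raw))
print(pg_micros_to_unix_seconds(315671400000000))
```

The sample bytes are the lower bound of the first row in the question's hexdump, so the decoded value should be 2010-01-01 14:30:00.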

Playing around

First we try putting DEADBEEF in it, in isolation, to track down the 8-byte marker.

psql -d test -c 'COPY ( SELECT E'\''DEADBEEF'\'' ) TO STDOUT WITH ( FORMAT BINARY );' |
  od --skip-bytes=25 --endian big --read-bytes=8 -c

Now we swap that out for a timestamp:

psql -d test -c 'COPY ( SELECT $$2010-01-01 14:30$$::timestamp without time zone ) TO STDOUT WITH ( FORMAT BINARY );' |
  od --skip-bytes=25 --endian big --read-bytes=8 --format=d8 -x

Result (parenthetical comments added):

0000031      315671400000000  (timestamp in int8)
         0001 1f19 f9a9 aa00  (hex representation)
0000041

And that's your number for the first timestamp. For tsrange, as per the above section, we have

one byte for the flags on the range
lower: 4 bytes (header) + 8 bytes (timestamp)
upper: 4 bytes (header) + 8 bytes (timestamp)

So to access the first internal timestamp, we skip an additional 5 bytes (the flags byte plus the lower bound's 4-byte header), on top of the 25 bytes we already skip, for 30 bytes total.

psql -d test -c 'COPY ( SELECT $$[2010-01-01 14:30, 2010-01-01 15:30)$$::tsrange ) TO STDOUT WITH ( FORMAT BINARY );' |
  od --skip-bytes=30 --endian big --read-bytes=8 --format=d8 -x

This gives us the same result as above:

0000036      315671400000000
         0001 1f19 f9a9 aa00
0000046

Simply change --skip-bytes to 42 to skip over that 8 byte timestamp and the next 4 bytes of header for the upper bound, and you'll get the other timestamp.
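The offset arithmetic used with `od` can be replayed in Python as a sanity check. This sketch hard-codes the first 50 bytes of the hexdump from the question and assumes, as above, an empty header extension and a single non-null tsrange column with both bounds present; the constant names are made up for the example.

```python
import struct

HEADER = 11 + 4 + 4                  # signature + flags + extension length
TUPLE_PREFIX = 2 + 4                 # field count + field length
FIELD_START = HEADER + TUPLE_PREFIX  # where the range's flags byte sits
LOWER_AT = FIELD_START + 1 + 4       # skip flags byte and lower length header
UPPER_AT = LOWER_AT + 8 + 4          # skip lower timestamp and upper length header

# First 50 bytes of /tmp/pgcopy as shown in the question's hexdump.
dump = bytes.fromhex(
    "5047434f50590aff0d0a00" "00000000" "00000000"
    "0001" "00000019"
    "02"
    "00000008" "00011f19f9a9aa00"
    "00000008" "00011f1ad03d4e00"
)

(lower,) = struct.unpack_from(">q", dump, LOWER_AT)
(upper,) = struct.unpack_from(">q", dump, UPPER_AT)
print(FIELD_START, LOWER_AT, UPPER_AT)
print(lower, upper)
```

The two values differ by 3,600,000,000 microseconds, which is the one hour between 14:30 and 15:30 in the first row.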

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange
