Parsing COPY's binary format to access a tsrange
-
10-10-2020 - |
Domanda
How is tsrange stored in binary?
For example create table
CREATE TABLE public.test (t tsrange);
INSERT INTO test VALUES ('[2010-01-01 14:30, 2010-01-01 15:30)');
INSERT INTO test VALUES ('[2011-01-01 14:31, 2015-11-01 15:30)');
INSERT INTO test VALUES ('[2017-01-01 14:31, 2018-11-01 15:30)');
COPY test TO '/tmp/pgcopy' WITH (FORMAT binary);
COPY test TO '/tmp/pgcopy.csv' WITH (FORMAT csv);
It outputs:
cat /tmp/pgcopy.csv
"[""2010-01-01 14:30:00"",""2010-01-01 15:30:00"")"
"[""2011-01-01 14:31:00"",""2015-11-01 15:30:00"")"
"[""2017-01-01 14:31:00"",""2018-11-01 15:30:00"")"
hexdump -C /tmp/pgcopy
00000000 50 47 43 4f 50 59 0a ff 0d 0a 00 00 00 00 00 00 |PGCOPY..........|
00000010 00 00 00 00 01 00 00 00 19 02 00 00 00 08 00 01 |................|
00000020 1f 19 f9 a9 aa 00 00 00 00 08 00 01 1f 1a d0 3d |...............=|
00000030 4e 00 00 01 00 00 00 19 02 00 00 00 08 00 01 3b |N..............;|
00000040 c8 89 51 11 00 00 00 00 08 00 01 c6 7b 1a 3a 0e |..Q.........{.:.|
00000050 00 00 01 00 00 00 19 02 00 00 00 08 00 01 e8 08 |................|
00000060 0d 77 11 00 00 00 00 08 00 02 1c 9a dc 4d 0e 00 |.w...........M..|
00000070 ff ff |..|
00000072
One field is:
00 00 00 19 02 00 00 00 08 00 01 e8 08 0d 77 11 00 00 00 00 08 00 02 1c 9a dc 4d 0e 00
There:
00000019
- is 25 bytes length
02
- brackets
00000008
- subfield length
0001e808 0d771100
and 00021c9a dc4d0e00
- stored timestamp with miroseconds.
How to convert it to integer timestamp?
Soluzione
As a minor note, COPY .. (WITH BINARY)
doesn't have brackets. It's the flags (which represent amongst other things the brackets).
COPY ... (WITH BINARY)
From the docs on COPY
To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the
*send
and*recv
functions for each column's data type (typically these functions are found in thesrc/backend/utils/adt/
directory of the source distribution).
Further the docs say binary format (currently) has
- 11 bytes signature
- 4 bytes for flags
- 4 byte potential variable-width field, currently not in use so we skip the size (4 byte
\0\0\0\0
) this technically not nice. If these four bytes had 15, we'd have to skip over not just the four, but an addition 15.
Then the tuple has
- 2 bytes for field count
Then the fields have
- 4 byte length qualifier followed by that many bytes of field data. (Which we already in the case of
timestamp
ortsrange
)
So essentially we skip 25 bytes to get to the first column
The tsrange
So it's in the format specified by range_send
You can see that explained a bit below in the comments above range_recv
Binary representation: The first byte is the flags, then the lower bound (if present), then the upper bound (if present). Each bound is represented by a 4-byte length header and the binary representation of that bound (as returned by a call to the send function for the subtype).
The timestamp
subtype
In your case, that subtype is timestamp and the send is timestamp_send
.
You can see a timestamp is stored as 8 bytes, and that's just sent with a simple pq_sendint64
(a 64 bit/8 byte int). You'll have to read how timestamp_recv
works to see how you should handle a binary representation of a timestamp. Hint: it gets into a struct for in-memory representation in timestamp2tm
/* timestamp2tm()
* Convert timestamp data type to POSIX time structure.
* Note that year is _not_ 1900-based, but is an explicit full value.
* Also, month is one-based, _not_ zero-based.
* Returns:
* 0 on success
* -1 on out of range
I'm not going to more-entertain this here, but maybe next go around.
Playing around
We try first DEADBEEF
in it which an isolation to track the marker 8 byte marker.
psql -d test -c 'COPY ( SELECT E'\''DEADBEEF'\'' ) TO STDOUT WITH ( FORMAT BINARY );' |
od --skip-bytes=25 --endian big --read-bytes=8 -c
Now we swap that out..
psql -d test -c 'COPY ( SELECT $$2010-01-01 14:30$$::timestamp without time zone ) TO STDOUT WITH ( FORMAT BINARY );' |
od --skip-bytes=25 --endian big --read-bytes=8 --format=d8 -x
Result: paren-comments added.
0000031 315671400000000 (timestamp in int8)
0001 1f19 f9a9 aa00 (hex representation)
0000041
And that's your number for the first time zone. For tsrange
as per the above section we have
- one byte for on the interval
- upper: 4 bytes (header) + 8 bytes (timestamp)
- lower: 4 bytes (header) + 8 bytes (timestamp)
So to access the first internal timestamp, we skip an addition 5 bytes, on top of already 25 bytes of skipage for 30 bytes total.
psql -d test -c 'COPY ( SELECT $$[2010-01-01 14:30, 2010-01-01 15:30)$$::tsrange ) TO STDOUT WITH ( FORMAT BINARY );' |
od --skip-bytes=30 --endian big --read-bytes=8 --format=d8 -x
This gives us the same result as above..
0000036 315671400000000
0001 1f19 f9a9 aa00
0000046
Simply change --skip-bytes
to 42 to skip over that 8 byte time stamp, and the next 4 bytes of header for the lower
and you'll get another time stamp.