Question

I have a short question concerning my database size. I need to insert data into a database, and some calculations need to be done before the insert.

The point is: 50 MB of plain data (~700,000 lines) results in a 600 MB database. That is a factor of 12! I am sure I am doing something wrong here. Could you help me reduce the size of my database? The database size is reported by the web Postgres admin interface.
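For reference, the reported size can also be read directly in SQL with the built-in size functions (a minimal sketch):

SELECT pg_size_pretty(pg_database_size(current_database()));   -- total size of the current database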

Here's the table definition:

CREATE TYPE CUSTOMER_TYPE AS ENUM
('enum1', 'enum2', 'enum3', '...', 'enum15');       -- max length of enum names ~15

CREATE TABLE CUSTOMER(
   CUSTOMER_ONE    TEXT PRIMARY KEY NOT NULL,       -- max 35 char String
   ATTRIBUTE_ONE   TEXT UNIQUE,                     -- max 35 char String
   ATTRIBUTE_TWO   TEXT UNIQUE,                     -- max 51 char String
   ATTRIBUTE_THREE TEXT UNIQUE,                     -- max 52 char String
   ATTRIBUTE_FOUR  TEXT UNIQUE,                     -- max 64 char String
   ATTRIBUTE_FIFE  TEXT UNIQUE,                     -- 1-80 char String
   CUSTOMER_TYPE   CUSTOMER_TYPE                    -- see enum above
);

I don't really need that enum, since I can insert the data without it. Does an enum have an effect on the database size?

Is there a way to reduce the size? Is it possible to reach a factor of 4 (instead of 12)? If not, I could delete some of the columns.

Maybe there are other Postgres data types for character data?

After the feedback here, my updated table now looks like this:

CREATE TABLE CUSTOMER(
   CUSTOMER_ONE    TEXT PRIMARY KEY NOT NULL,       -- max 35 char String
   ATTRIBUTE_ONE   TEXT UNIQUE,                     -- max 35 char String
   ATTRIBUTE_TWO   TEXT,                            -- max 51 char String
   ATTRIBUTE_THREE TEXT,                            -- max 52 char String
   ATTRIBUTE_FOUR  TEXT,                            -- max 64 char String
   ATTRIBUTE_FIFE  TEXT,                            -- 1-80 char String
   CUSTOMER_TYPE   CUSTOMER_TYPE                    -- see enum above
);

Before: 12x

Now: 7x :)

Are there any more possible optimizations (other than deleting columns)? Maybe other data types that use less space?


Solution

text is the best type for plain text. Text values occupy about the same space internally as in the plain file (depending on encoding details, of course: UTF-8?), plus a small overhead of 1 byte per value in your case: values up to 126 bytes use a 1-byte header, longer values a 4-byte header.
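You can verify this with pg_column_size() (a minimal sketch):

SELECT pg_column_size('abc'::text);              -- 4: 1-byte header + 3 bytes of data
SELECT pg_column_size(repeat('x', 200)::text);   -- 204: 4-byte header for values over 126 bytes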

The enum column contributes very little to the total size: 4 bytes per row (like a real internally). The length of the enum labels is practically irrelevant; each label is stored only once, in the system catalog, with a maximum of NAMEDATALEN (normally 63 characters). The overhead for the type declaration itself is minimal. So the enum column actually occupies less space inside the DB than in your text file.
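This is easy to check against the enum type from the question:

SELECT pg_column_size('enum1'::CUSTOMER_TYPE);   -- 4 bytes per value, regardless of label length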
There are various types of overhead in Postgres storage; it has to be like that. Some are one-time costs, others scale with the number of rows or relations: catalog tables (system data), indexes, table statistics, bookkeeping data, etc.

12x the size is more than average for 50 MB of raw text data, but that really depends on details. Many small rows carry a lot of overhead; a few large rows carry much less and can even be compressed to below the raw size. a_horse already pointed to a major reason: 6 indexes, some of which are obviously unnecessary.
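You can see how the space splits between the table proper and its indexes (a minimal sketch, using the table name from the question):

SELECT pg_size_pretty(pg_table_size('customer'))          AS table_size,
       pg_size_pretty(pg_indexes_size('customer'))        AS indexes_size,
       pg_size_pretty(pg_total_relation_size('customer')) AS total_size;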

Each row in a table has a tuple header occupying 23 bytes (24 bytes after alignment padding), possibly more alignment padding between columns and between tuples, 4 bytes for the item identifier, plus some overhead per data page (typically 8 kB page size). Roughly ~35 bytes per row in your case.
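You can approximate the actual per-row footprint by measuring whole-row values (a sketch; the composite header that pg_column_size() reports is close to, but not identical with, the on-disk tuple header):

SELECT avg(pg_column_size(c.*)) AS avg_row_bytes
FROM   CUSTOMER c;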

About as much again for every index tuple (the index tuple header is only 8 bytes, but indexes tend to bloat more). With 6 indexes, that's ~150 bytes per table row, minimum. Plus, every indexed text value is stored twice: once in the table and once in the index. And B-tree indexes start with a FILLFACTOR of 90, so add roughly 10 % on top of everything.
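To list the indexes on the table and the size of each (a minimal sketch):

SELECT i.indexrelid::regclass                          AS index_name,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM   pg_index i
WHERE  i.indrelid = 'customer'::regclass;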

Your average row size is ~75 bytes (50 MB / 700,000, assuming "lines" correspond to rows). So your table and indexes occupy about 6 times the size of the raw text; the rest is the other overhead mentioned above.
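After an ANALYZE, you can inspect the per-column average widths Postgres has gathered (a sketch):

ANALYZE CUSTOMER;

SELECT attname, avg_width
FROM   pg_stats
WHERE  tablename = 'customer';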

That's all assuming no table bloat (yet). Once you work with the DB (updates, deletes, rolled-back transactions, etc.) there will be dead rows and the like, possibly multiplying total space. That very much depends on your write patterns.
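You can keep an eye on dead rows via the statistics views and reclaim space with VACUUM (a minimal sketch):

SELECT n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'customer';

VACUUM CUSTOMER;   -- marks dead space reusable; VACUUM FULL rewrites the table to actually shrink it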

You can save a lot by removing unnecessary indexes.
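The indexes on the UNIQUE columns are owned by their constraints, so drop the constraint rather than the index itself. A sketch, assuming Postgres' default constraint names (<table>_<column>_key); verify the real names first (e.g. with \d customer in psql):

ALTER TABLE CUSTOMER DROP CONSTRAINT customer_attribute_four_key;   -- assumed default name
ALTER TABLE CUSTOMER DROP CONSTRAINT customer_attribute_fife_key;   -- assumed default name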

And you can save a couple of bytes per row by moving the enum column to the top of the table: less alignment padding. That should average out at about 1.5 bytes per row, roughly 1 MB total, so hardly worth it.
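A sketch of the reordered definition (same columns, fixed-width enum column first):

CREATE TABLE CUSTOMER(
   CUSTOMER_TYPE   CUSTOMER_TYPE,                   -- fixed-width 4 bytes, right after the row header
   CUSTOMER_ONE    TEXT PRIMARY KEY,
   ATTRIBUTE_ONE   TEXT UNIQUE,
   ATTRIBUTE_TWO   TEXT,
   ATTRIBUTE_THREE TEXT,
   ATTRIBUTE_FOUR  TEXT,
   ATTRIBUTE_FIFE  TEXT
);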

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange