PostgreSQL Query Optimization and the Postmaster Process'
-
03-07-2019 - |
Question
I currently working with a larger wikipedia-dump derived PostgreSQL database; it contains about 40 GB of data. The database is running on an HP Proliant ML370 G5 server with Suse Linux Enterprise Server 10; I am querying it from my laptop over a private network managed by a simple D-Link router. I assigned static DHCP (private) IPs to both laptop and server.
Anyway, from my laptop, using pgAdmin III, I send off some SQL commands/queries; some of these are CREATE INDEX, DROP INDEX, DELETE, SELECT, etc. Sometimes I send a command (like CREATE INDEX), it returns, telling me that the query was executed perfectly, etc. However, the postmaster process assigned to such a command seems to remain sleeping on the server. Now, I do not really mind this, for I say to myself that PostgreSQL maintains a pool of postmasters ready to process queries. Yet, if this process eats up 6 GB of it 9.4 GB assigned RAM, I worry (and it does so for the moment). Now maybe this is a cache of data that is kept in [shared] memory in case another query happens to need to use that same data, but I do not know.
Another thing is bothering me.
I have 2 tables. One is the page table; I have an index on its page_id column. The other is the pagelinks tables which has the pl_from column that references either nothing or a variable in the page.page_id column; unlike the page_id column, the pl_from has no index (yet). To give you an idea of the scale of the tables and the necessity for me to find a viable solution, page table has 13.4 million rows (after I deleted those I do not need) while the pagelinks table has 293 million.
I need to execute the following command to clean the pagelinks table of some of its useless rows:
DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);
So basically, I wish to rid the pagelinks table of all links coming from a page not in the page table. Even after disabling the nested loops and/or sequential scans, the query optimizer always gives me the following "solution":
Nested Loop (cost=494640.60..112115531252189.59 rows=3953377028232000 width=6)
Join Filter: ("outer".pl_from <> "inner".page_id)"
-> Seq Scan on pagelinks (cost=0.00..5889791.00 rows=293392800 width=17)
-> Materialize (cost=494640.60..708341.51 rows=13474691 width=11)
-> Seq Scan on page (cost=0.00..402211.91 rows=13474691 width=11)
It seems that such a task would take more than weeks to complete; obviously, this is unacceptable. It seems to me that I would much rather it use the page_id index to do its thing...but it is a stubborn optimizer and I might be wrong.
Solution 2
Indeed, I decided to CREATE a temporary table to speed up query execution:
CREATE TABLE temp_to_delete AS(
(SELECT DISTINCT pl_from FROM pagelinks)
EXCEPT
(SELECT page_id FROM page));
DELETE FROM pagelinks USING temp_to_delete
WHERE pagelinks.pl_from IN (temp_to_delete.pl_from);
Surprisingly, this query completed in about 4 hours while the initial query had remained active for about 14hrs before I decided to kill it. More specifically, the DELETE returned:
Query returned successfully: 31340904 rows affected, 4415166 ms execution time.
As for the first part of my question, it seems that the postmaster process indeed keeps some info in cache; when another query requires info not in the cache and some memory (RAM), the cache is emptied. And the postmasters are indeed but a pool of process'.
It has also occurred to me that the gnome-system-monitor is a myth for it gives incomplete information and is worthless in informational value. It is mostly due to this application that I have been so confused lately; for example, it does not consider the memory usage of other users (like the postgres user!) and even tells me that I have 12 GB of RAM left when this is so untrue. Hence, I tried out a couple of system monitors for I like to know how postgreSQL is using its resources, and it seems that xosview is indeed a valid tool.
Hope this helps!
OTHER TIPS
To your second question; you could try creating a new table with just the records you need with a CREATE TABLE AS statement; if the new table is sufficiently small, it might be faster- but it might not help either.
Your postmaster process will stay there as long as the connection to the client is open. Does pgadmin close the connection ? I don't know.
Memory used could be shared_buffers (check your config settings) or not.
Now, the query. For big maintenance operations like this, feel free to set work_mem to something large like a few GB. You look like you got lots of RAM, so use it.
set work_mem to '4GB'; EXPLAIN DELETE FROM pagelinks WHERE pl_from NOT IN (SELECT page_id FROM page);
It should seq scan page, hash it, and seq scan pagelinks, peeking in the hash to check for page_ids. It should be quite fast (much faster than 4 hours !) but you need a large work_mem for the hash.
But since you delete a significant portion of your table, it might be faster to do it like this :
CREATE TABLE pagelinks2 AS SELECT a.* FROM pagelinks a JOIN pages b ON a.pl_from = b.page_id;
(you could use a simple JOIN instead of IN)
You can also add an ORDER BY on this query, and your new table will be nicely ordered on disk for optimal access later.