Question

I am currently working with a large Wikipedia-dump-derived PostgreSQL database; it contains about 40 GB of data. The database runs on an HP ProLiant ML370 G5 server with SUSE Linux Enterprise Server 10; I query it from my laptop over a private network managed by a simple D-Link router. I assigned static DHCP (private) IPs to both the laptop and the server.

Anyway, from my laptop, using pgAdmin III, I send off some SQL commands/queries, such as CREATE INDEX, DROP INDEX, DELETE, SELECT, etc. Sometimes I send a command (like CREATE INDEX) and it returns, telling me that the query executed successfully. However, the postmaster process assigned to that command seems to remain sleeping on the server. I do not really mind this in itself, since I tell myself that PostgreSQL maintains a pool of postmasters ready to process queries. But when such a process eats up 6 GB of the server's 9.4 GB of assigned RAM, I worry (and right now it does exactly that). Maybe this is a cache of data kept in [shared] memory in case another query happens to need the same data, but I do not know.

Another thing is bothering me.

I have two tables. One is the page table, which has an index on its page_id column. The other is the pagelinks table, whose pl_from column references either nothing or a value in the page.page_id column; unlike page_id, the pl_from column has no index (yet). To give you an idea of the scale of the tables and the need for a viable solution: the page table has 13.4 million rows (after I deleted the ones I do not need), while the pagelinks table has 293 million.
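
For context, the relevant parts of the two tables look roughly like this (a minimal sketch; the column names are as above, while the exact types and constraints are my assumptions about the imported MediaWiki schema):

CREATE TABLE page (
    page_id integer PRIMARY KEY   -- indexed
    -- ... other page columns ...
);

CREATE TABLE pagelinks (
    pl_from integer NOT NULL      -- refers to page.page_id, no index (yet)
    -- ... other pagelinks columns ...
);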

I need to execute the following command to clean the pagelinks table of some of its useless rows:

DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);

So basically, I wish to rid the pagelinks table of all links coming from a page not in the page table. Even after disabling nested loops and/or sequential scans, the query optimizer always gives me the following "solution":

Nested Loop  (cost=494640.60..112115531252189.59 rows=3953377028232000 width=6)
  Join Filter: ("outer".pl_from <> "inner".page_id)
  ->  Seq Scan on pagelinks  (cost=0.00..5889791.00 rows=293392800 width=17)
  ->  Materialize  (cost=494640.60..708341.51 rows=13474691 width=11)
        ->  Seq Scan on page  (cost=0.00..402211.91 rows=13474691 width=11)

It seems that such a task would take weeks to complete; obviously, this is unacceptable. It seems to me that the planner should use the page_id index to do its thing... but it is a stubborn optimizer and I might be wrong.
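
For reference, disabling those plan types is done with session-level planner settings along these lines (they raise the estimated cost of such plans rather than forbid them outright):

SET enable_nestloop TO off;
SET enable_seqscan TO off;
EXPLAIN DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);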

Solution 2

Indeed, I decided to CREATE a temporary table to speed up query execution:

CREATE TABLE temp_to_delete AS(
    (SELECT DISTINCT pl_from FROM pagelinks) 
        EXCEPT 
    (SELECT page_id FROM page));
DELETE FROM pagelinks USING temp_to_delete 
    WHERE pagelinks.pl_from IN (temp_to_delete.pl_from);

Surprisingly, this query completed in about 4 hours, while the initial query had remained active for about 14 hours before I decided to kill it. More specifically, the DELETE returned:

Query returned successfully: 31340904 rows affected, 4415166 ms execution time.
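
If I were to run this again, I would probably ANALYZE the new table (and perhaps index its join column) before the DELETE, so that the planner has statistics to work with; a sketch using the same table names (the index name is made up):

ANALYZE temp_to_delete;
CREATE INDEX temp_to_delete_pl_from_idx ON temp_to_delete (pl_from);  -- optional
DELETE FROM pagelinks USING temp_to_delete
    WHERE pagelinks.pl_from = temp_to_delete.pl_from;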

As for the first part of my question, it seems that the postmaster process does indeed keep some information cached; when another query needs data that is not in the cache and requires memory (RAM), the cache is emptied. And the postmasters are indeed just a pool of processes.
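
Two quick checks that help confirm this (the exact columns of pg_stat_activity vary between PostgreSQL versions):

-- upper bound on PostgreSQL's shared cache
SHOW shared_buffers;
-- one row per backend process currently serving a connection
SELECT * FROM pg_stat_activity;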

It has also occurred to me that gnome-system-monitor gives incomplete information and has little informational value. It is mostly because of this application that I have been so confused lately; for example, it does not account for the memory usage of other users (like the postgres user!) and even tells me that I have 12 GB of RAM free, which is simply not true. Hence, I tried out a couple of system monitors, since I like to know how PostgreSQL is using its resources, and xosview does seem to be a valid tool.

Hope this helps!

OTHER TIPS

On your second question: you could try creating a new table containing just the records you need with a CREATE TABLE AS statement; if the new table is sufficiently small, it might be faster, but it might not help either.

Your postmaster process will stay there as long as the connection to the client is open. Does pgAdmin close the connection? I don't know.

Memory used could be shared_buffers (check your config settings) or not.

Now, the query. For big maintenance operations like this, feel free to set work_mem to something large, like a few GB. It looks like you have lots of RAM, so use it.

SET work_mem TO '4GB';
EXPLAIN DELETE FROM pagelinks WHERE pl_from NOT IN (SELECT page_id FROM page);

It should seq scan page, hash it, then seq scan pagelinks, probing the hash to check the page_ids. It should be quite fast (much faster than 4 hours!), but you need a large work_mem for the hash.
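
If NOT IN does not give you a hashed subplan on your version, the equivalent NOT EXISTS form is worth trying as well; on more recent PostgreSQL versions it can be planned as a hash anti join (a sketch, same table names):

EXPLAIN DELETE FROM pagelinks
    WHERE NOT EXISTS (SELECT 1 FROM page WHERE page.page_id = pagelinks.pl_from);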

But since you are deleting a significant portion of your table, it might be faster to do it like this:

CREATE TABLE pagelinks2 AS SELECT a.* FROM pagelinks a JOIN page b ON a.pl_from = b.page_id;

(i.e. using a simple JOIN instead of NOT IN)

You can also add an ORDER BY to this query, and your new table will be nicely ordered on disk for optimal access later.
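
A sketch of the full rebuild-and-swap, assuming pagelinks has no dependent views, foreign keys, or triggers, and that you recreate whatever indexes you need afterwards (the index name below is made up):

BEGIN;
CREATE TABLE pagelinks_new AS
    SELECT a.*
    FROM pagelinks a
    JOIN page b ON a.pl_from = b.page_id
    ORDER BY a.pl_from;              -- lays the new table out in pl_from order
DROP TABLE pagelinks;
ALTER TABLE pagelinks_new RENAME TO pagelinks;
COMMIT;
CREATE INDEX pagelinks_pl_from_idx ON pagelinks (pl_from);
ANALYZE pagelinks;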

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow