Question

Say I run the following query on psql:

> select a.c1, b.c2 into temp_table from db.A as a inner join db.B as b 
> on a.x = b.x limit 10;

I get the following message:

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'c1' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

  1. What is a DISTRIBUTED BY column?
  2. Where is temp_table stored? Is it stored on my client or on the server?
Was it helpful?

Solution

  1. DISTRIBUTED BY is how Greenplum determines which segment will store each row. Because Greenplum is an MPP database in most production databases you will have multiple segment servers. You want to make sure that the Distribution column is the column you will join on usaly.

  2. temp_table is a table that will be created for you on the Greenplum cluster. If you haven't set search_path to something else it will be in the public schema.

OTHER TIPS

For your first question, the DISTRIBUTE BY clause is used for telling the database server how to store the database on the disk. (Create Table Documentation)

I did see one thing right away that could be wrong with the syntax on your Join clause where you say on a.x = s.x --> there is no table referenced as s. Maybe your problem is as simple as changing this to on a.x = b.x?

As far as where the temp table is stored, I believe it is generally stored on the database server. This would be a question for your DBA as it is a setup item when installing the database. You can always dump your data to a file on your computer and reload at a later time if you want to save your results (without printing.)

As I know, tmp table is stored in memory. It is faster when there are less data and it is recommended to use temp table. In the opposite, as temp table is stored into memory, if there are too much data it will consume very large memory. It is recommended to use regular tables with distributed clause. As it will be distributed across your cluster.

In addition, tmp table is stored into a special schema, so you don't need to specify the schema name when creating the temp table, and it only exist in the current connection, after you close the current connection, postgresql will drop the table automatically.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top