Question

I have a file universities.txt which looks like this:

Alabama

Air University
Alabama A&M University
Alabama State University
Concordia College-Selma
Faulkner University
Huntingdon College
Jacksonville State University
Judson College
Miles College
Oakwood College
Samford University
Southeastern Bible College
Southern Christian University
Spring Hill College
Stillman College
Talladega College
University of North Alabama
University of South Alabama
University of West Alabama

Alaska

Alaska Bible College
Alaska Pacific University
Sheldon Jackson College
University of Alaska - Anchorage
University of Alaska - Fairbanks
University of Alaska - Southeast

Arizona

American Indian College of the Assemblies of God
Arizona State University
Arizona State University East
Arizona State University West
DeVry University-Phoenix
Embry-Riddle Aeronautical University
Grand Canyon University
Northcentral University
Northern Arizona University

.. and so on, where in this case Alabama, Alaska and Arizona are locations and everything else is a university. What I want to do is load the locations into a table called Location and the universities into a table called University, where the University table has a FK referencing the Id of the Location table, like this:

CREATE TABLE Location (
Id          SERIAL PRIMARY KEY,
Name        TEXT
);

CREATE TABLE University (
Id          SERIAL PRIMARY KEY,
Location    INTEGER REFERENCES Location (Id) NOT NULL,
Name        TEXT
);

So what I want to do in Postgres is something like this:

for (int i = 0; i < universities.size(); i++) {
    // each entry in the universities vector is a tuple, with the first entry being
    // the country/state and the second entry being a vector of the universities as Strings
    Vector tuple = (Vector) universities.get(i);
    String state = (String) tuple.get(0);   // insert into the Location table
    Vector u = (Vector) tuple.get(1);
    for (int j = 0; j < u.size(); j++) {
        // insert into the University table with i as FK to the Location table
    }
}
Does anyone know how to do this?

Solution

Here is a pure SQL solution.

Use COPY to import your file into a temporary table, and a single DML statement with data-modifying CTEs (requires PostgreSQL 9.1 or later) to do the rest. Both steps should be fast:

Temporary table with a single text column (dropped automatically at the end of the session):

CREATE TEMP TABLE tmp (txt text);

Import data from file:

COPY tmp FROM '/path/to/file.txt';

If you are doing this from a remote client, use the meta command \copy of psql instead.
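
For example, from a psql session:

\copy tmp FROM '/path/to/file.txt'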

My solution depends on the data format displayed in the question, i.e. there is an empty row before and after each state. I assume actual empty strings in the import file. Make sure to have a leading row with an empty string before the first state to avoid a special case.

Rows are inserted in the order of the file. I rely on that for the following window functions, which are used without explicit ordering.

WITH x AS (
    SELECT txt
          ,row_number() OVER () AS rn
          ,lead(txt) OVER () = '' AND
           lag(txt) OVER () = '' AS state
    FROM   tmp                -- don't remove empty rows just yet
    ), y AS (
    SELECT txt, state
          ,sum(state::int) OVER w AS id
    FROM   x
    WHERE  txt <> ''          -- remove empty rows now
    WINDOW w AS (ORDER BY rn)
    ), l AS (
    INSERT INTO location (id, name)
    SELECT id, txt
    FROM   y
    WHERE  state
    ), u AS (
    INSERT INTO university (location, name)
    SELECT id, txt
    FROM   y
    WHERE  NOT state
    )
SELECT setval('location_id_seq', max(id))
FROM   y;

Voilà.

  • CTE x marks states, based on an empty string in the rows before and after them.

  • CTE y adds a running count of states (id), thereby forming a perfectly valid id for each state and its universities (illustrated after this list).

  • CTEs l and u do the inserting, which is now easy.

  • The final SELECT sets the sequence attached to location.id to the current maximum id. We bypassed the sequence by inserting ids explicitly, so we must update it or we run into duplicate key errors with future INSERTs to location.
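
For illustration, the result of CTE y for the sample data would look something like this (id being the running count of state rows so far):

 txt                    | state | id
------------------------+-------+----
 Alabama                | t     |  1
 Air University         | f     |  1
 Alabama A&M University | f     |  1
 ...
 Alaska                 | t     |  2
 Alaska Bible College   | f     |  2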

OTHER TIPS

Transforming the original into a table is the safest way... then you can just upload it with COPY.

# bl = previous line was blank; body = inside a university list; header = current state
BEGIN { bl=0; body=0; header=""; }
# blank line while inside a list: the current state is done, reset for the next one
$0 == "" && body==1 && header!="" { header=""; body=0; bl=1; next; }
# blank line while not in a list (e.g. right after a state name): just remember it
$0 == "" && body==0 { bl=1; next; }
# first non-blank line with no current state: this is the state name
$0 != "" && header=="" { header=$0; bl=0; next; }
# non-blank line with a state set and a blank line seen: a university - emit "state,university"
$0 != "" && bl==1 && header!="" { body=1; print header "," $0 }

Something like that AWK script would turn your file into a CSV, which you could then upload with a straight COPY statement:

COPY university_data_file_table FROM '/path/to/awk-mashed-file' WITH (FORMAT csv);
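
That assumes the staging table exists already, with the two columns used below; a minimal definition:

CREATE TABLE university_data_file_table (country text, university text);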

then you could transform that table into your separate ones:

CREATE TABLE country AS
  SELECT row_number() OVER () AS id, country FROM (SELECT DISTINCT country FROM university_data_file_table) c;
CREATE TABLE university AS
  SELECT c.id, udft.university FROM country c JOIN university_data_file_table udft ON udft.country = c.country;
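
If you also want the keys from the question's schema, a rough sketch:

-- ids are unique and non-null by construction, so they can carry the keys
ALTER TABLE country ADD PRIMARY KEY (id);
ALTER TABLE university ADD FOREIGN KEY (id) REFERENCES country (id);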

Something like that would be easily scriptable with psql. As I said, you do have to do the initial transform first.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow