Create integer id columns from existing string columns (integer coding?)

https://dba.stackexchange.com/questions/242888

06-02-2021

Question

I have a PostgreSQL server with an existing table that has two non-unique, fixed-width string columns (the width varies between data sets), such as this:

| ID_STRING_A | ID_STRING_B |
|   'AAAA'    |   'BBBB'    |
|   'BBBB'    |   'CCCC'    |
|   'AAAA'    |   'DDDD'    |

Now I want to compute an integer representation for the elements of both columns and store it in additional columns. The result should look like this:

| ID_STRING_A | ID_STRING_B | ID_INT_A | ID_INT_B |
|   'AAAA'    |   'BBBB'    |     1    |     2    |
|   'BBBB'    |   'CCCC'    |     2    |     3    |
|   'AAAA'    |   'DDDD'    |     1    |     4    |

My first approach, based on the answers, is shown below. Unfortunately, the update part seems to be highly inefficient, although there are indexes on ID_STRING_A and ID_STRING_B. While the query itself finishes in minutes, the update does not seem to end. Here's the code:

ALTER TABLE mytable ADD COLUMN ID_INT_B integer;
ALTER TABLE mytable ADD COLUMN ID_INT_A integer;

UPDATE mytable SET ID_INT_A = g.ID_INT_A , ID_INT_B = g.ID_INT_B FROM
(
    WITH T( n , s ) AS 
    ( 
        SELECT ROW_NUMBER() OVER ( ORDER BY s ) , s
        FROM 
        ( 
            SELECT ID_STRING_A FROM mytable
            UNION 
            SELECT ID_STRING_B FROM mytable
        ) AS X( s )
    )
    SELECT m.ctid AS id_
         , m.ID_STRING_A AS ID_STRING_A
         , m.ID_STRING_B AS ID_STRING_B
         , T1.n AS ID_INT_A
         , T2.n AS ID_INT_B
    FROM mytable AS m
    JOIN T AS T1 ON m.ID_STRING_A = T1.s
    JOIN T AS T2 ON m.ID_STRING_B = T2.s
) AS g
WHERE mytable.ctid = g.id_;

Solution

I guess you can use the ASCII function:

SELECT ID_STRING_A, ID_STRING_B
     , ASCII(ID_STRING_A) - 64 AS ID_INT_A
     , ASCII(ID_STRING_B) - 64 AS ID_INT_B
FROM ...

Perhaps the intention is clearer using:

     , ASCII(ID_STRING_A) - ASCII('A') + 1 AS ID_INT_A
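
As a quick check of what this expression computes, the sketch below runs it over the sample strings from the question, supplied inline through a VALUES list (the alias t and the column names s and id_int are made up for illustration). Note that ASCII() only looks at the first character of its argument, so this mapping works only while every string is identified by its first letter; 'AAAA' and 'AABB' would both map to 1.

```sql
-- ASCII() returns the code of the first character only.
SELECT s
     , ASCII(s) - ASCII('A') + 1 AS id_int
FROM (VALUES ('AAAA'), ('BBBB'), ('CCCC'), ('DDDD')) AS t (s);
-- AAAA -> 1, BBBB -> 2, CCCC -> 3, DDDD -> 4
```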

EDIT: since the question was changed, something like this is possible:

WITH T (n, s) as ( 
    SELECT row_number() over (order by s), s
    FROM ( 
        SELECT ID_STRING_A FROM mytable
        UNION 
        SELECT ID_STRING_B FROM mytable
    ) as X (s)
)
SELECT m.ID_STRING_A, m.ID_STRING_B, T1.n, T2.n
FROM mytable as m
JOIN T as T1
    ON m.ID_STRING_A = T1.s
JOIN T as T2
    ON m.ID_STRING_B = T2.s
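
To try the query on the sample data from the question, a throwaway setup along these lines can be used. The temporary table name mytable_demo and the output aliases id_int_a / id_int_b are assumptions for the demo; apart from those names and the final ORDER BY, the query is the one above. Because the numbering runs over the sorted union of both columns, a string gets the same number whichever column it appears in.

```sql
-- Demo data copied from the question; TEMP so it vanishes at session end.
CREATE TEMP TABLE mytable_demo (
    id_string_a text NOT NULL,
    id_string_b text NOT NULL
);

INSERT INTO mytable_demo (id_string_a, id_string_b) VALUES
    ('AAAA', 'BBBB'),
    ('BBBB', 'CCCC'),
    ('AAAA', 'DDDD');

WITH t (n, s) AS (
    -- UNION (not UNION ALL) removes duplicates, so each distinct
    -- string is numbered exactly once.
    SELECT row_number() OVER (ORDER BY s), s
    FROM (
        SELECT id_string_a FROM mytable_demo
        UNION
        SELECT id_string_b FROM mytable_demo
    ) AS x (s)
)
SELECT m.id_string_a, m.id_string_b, t1.n AS id_int_a, t2.n AS id_int_b
FROM mytable_demo AS m
JOIN t AS t1 ON m.id_string_a = t1.s
JOIN t AS t2 ON m.id_string_b = t2.s
ORDER BY id_int_a, id_int_b;
-- AAAA | BBBB | 1 | 2
-- AAAA | DDDD | 1 | 4
-- BBBB | CCCC | 2 | 3
```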

EDIT: updating the table

I have a gut feeling that this can be done in a simpler way, but I cross joined the CTE with itself and filtered with WHERE to update both columns at once:

ALTER TABLE mytable
    ADD ID_INT_A INT;

ALTER TABLE mytable
    ADD ID_INT_B INT;

WITH cte (n, s) as ( 
    SELECT row_number() over (order by s), s
    FROM ( 
        SELECT ID_STRING_A FROM mytable
        UNION 
        SELECT ID_STRING_B FROM mytable
    ) as X (s)
), cte2 (n1,s1,n2,s2) as (
    SELECT c1.n, c1.s, c2.n, c2.s
    FROM cte c1
    CROSS JOIN cte c2
)
UPDATE mytable
    SET ID_INT_A = cte2.n1
      , ID_INT_B = cte2.n2
FROM cte2
WHERE mytable.ID_STRING_A = cte2.s1
  AND mytable.ID_STRING_B = cte2.s2
;
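
One possibly simpler variant, offered only as a sketch: PostgreSQL allows several items in the FROM list of an UPDATE, so the numbering CTE can be referenced twice directly (the aliases c1 and c2 are arbitrary) instead of building the cross product in a second CTE. Whether this is actually faster depends on the PostgreSQL version and the plan the optimizer picks, so it is worth comparing both forms with EXPLAIN.

```sql
WITH cte (n, s) AS (
    SELECT row_number() OVER (ORDER BY s), s
    FROM (
        SELECT ID_STRING_A FROM mytable
        UNION
        SELECT ID_STRING_B FROM mytable
    ) AS X (s)
)
UPDATE mytable
    SET ID_INT_A = c1.n
      , ID_INT_B = c2.n
FROM cte AS c1, cte AS c2
WHERE mytable.ID_STRING_A = c1.s  -- look up the number for column A
  AND mytable.ID_STRING_B = c2.s  -- look up the number for column B
;
```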

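It should be noted that this is a one-time operation. If you decide to add a value such as 'AABB' later on, the enumeration will be wrong: the numbers come from row_number() over the sorted strings, so a new value that sorts between existing ones shifts every number after it.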

OTHER TIPS

CREATE TEMP TABLE map (
   id serial PRIMARY KEY,
   str text NOT NULL
);

INSERT INTO map (str)
SELECT DISTINCT id_string_a
FROM mytab;

ALTER TABLE mytab ADD id_int_a integer;

UPDATE mytab
SET id_int_a = map.id
FROM map
WHERE mytab.id_string_a = map.str;

DROP TABLE map;

id_string_b is left as an exercise to the reader.
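
For completeness, one way the exercise might be done so that both columns share a single numbering (as in the question's expected output) is to fill the map from the union of both columns and then look each column up separately. This is a sketch only: it reuses the mytab and map names from above, assumes neither string column contains NULLs (the NOT NULL constraint on str would otherwise reject the insert), and relies on ids being handed out in insertion order, which PostgreSQL does in practice for a sorted INSERT ... SELECT.

```sql
CREATE TEMP TABLE map (
   id serial PRIMARY KEY,
   str text NOT NULL UNIQUE   -- the unique index also helps the lookups below
);

-- UNION removes duplicates across both columns;
-- ORDER BY makes the numbering follow the sort order of the strings.
INSERT INTO map (str)
SELECT id_string_a FROM mytab
UNION
SELECT id_string_b FROM mytab
ORDER BY 1;

-- IF NOT EXISTS so this also works when id_int_a was already added above.
ALTER TABLE mytab ADD COLUMN IF NOT EXISTS id_int_a integer;
ALTER TABLE mytab ADD COLUMN IF NOT EXISTS id_int_b integer;

-- Reference the map twice: once per string column.
UPDATE mytab
SET id_int_a = a.id
  , id_int_b = b.id
FROM map AS a, map AS b
WHERE mytab.id_string_a = a.str
  AND mytab.id_string_b = b.str;

DROP TABLE map;
```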

Licensed under: CC-BY-SA with attribution
