Question

I am new to pgAdmin and Azure databases. I have a huge CSV file with around 220 columns, and I want to create a table from it in pgAdmin 4 to push it to Azure.

However, I don't know how to automatically detect the column types.

Here is an example:

IDI GKID    S01Q01  S02Q01_Gender ...
100093  enq030059569748fc89091fdd91cc337cac44eca90  Yes, I agree    Female ...

I'm writing the script to create the table so that I can then load the CSV into the database in pgAdmin 4 for Microsoft Azure.

Then, after transposing the CSV header, I get:

    IDI
    GKID
    S01Q01
    S02Q01_Gender
    ...

Given the number of columns, I would like, in the best case, to automate the detection of each column's type so I can write it into the script; in the worst case, to fall back on a generic type such as TEXT.

So far, I've tried:

output = ""
file_name = "columns.txt"
string_to_add = " TINYTEXT,"

with open(file_name, 'r') as f:
    file_lines = [''.join([x.strip(), string_to_add, '\n']) for x in f.readlines()]

with open(file_name, 'w') as f:
    f.writelines(file_lines) 

It gives me back:

IDI TINYTEXT,
GKID TINYTEXT,
S01Q01 TINYTEXT,
S02Q01_Gender TINYTEXT,
...

And then I can do:

CREATE TABLE my_table (
IDI TINYTEXT,
GKID TINYTEXT,
S01Q01 TINYTEXT,
S02Q01_Gender TINYTEXT,
...

But I'm not sure that this is enough to make a table able to receive my CSV file.
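
For reference, here is a minimal sketch of that "generic TEXT" fallback, reading the header straight from the CSV instead of a separate columns.txt. The file name and delimiter are assumptions, and TEXT is used rather than TINYTEXT because the Azure target here is PostgreSQL, which has no TINYTEXT type:

import csv

csv_path = "data.csv"       # hypothetical file name -- adjust to the real CSV
table_name = "my_table"

with open(csv_path, newline="", encoding="utf-8") as f:
    # Assumes the first line holds the column names;
    # pass delimiter="\t" to csv.reader if the file is tab-separated
    header = next(csv.reader(f))

# Quote the identifiers so unusual characters or mixed case survive in PostgreSQL
columns = ",\n".join('    "{}" TEXT'.format(name.strip()) for name in header)
print("CREATE TABLE {} (\n{}\n);".format(table_name, columns))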

Solution

There are command-line tools that do field type inference.

One is xsv (https://github.com/BurntSushi/xsv/).

Running this command on this sample CSV (https://gist.githubusercontent.com/aborruso/3b1af402f0d2ed49465f218d19be81d9/raw/c0e95b320924e9e49902633d16e7ab253046ca16/input.csv):

xsv stats input.csv --everything | xsv table

you get:

field       type     sum                 min       max         min_length  max_length  mean                stddev              median              mode  cardinality
id          Integer  5050                1         100         1           3           50.5                28.86607004772212   50.5                N/A   100
first_name  Unicode                      Annabal   Willabella  3           11                                                                      N/A   98
last_name   Unicode                      Albinson  Zaniolini   3           13                                                                      N/A   100
f           Float    2063.2419999999984  0.2656    51.1245     4           7           20.632419999999996  12.603955889545158  17.930799999999998  N/A   100

Using csvkit (https://csvkit.readthedocs.io/en/latest/index.html) and running

csvstat --csv input.csv

you get:

column_id column_name type   nulls unique min   max    sum       mean   median stdev  len freq
1         id          Number False 100    1     100    5,050     50.5   50.5   29.011 -   1, 2, 3, 4, 5
2         first_name  Text   False 98     -     -      -         -      -      -      11  Caren, Weylin, Heall, Flori, Lydia
3         last_name   Text   False 100    -     -      -         -      -      -      13  Saxby, Joderli, Kleinzweig, Coyle, Kleinplac
4         f           Number False 100    0.266 51.124 2,063.242 20.632 17.931 12.667 -   5.356, 12.596, 32.1245, 5.32, 0.2656

csvkit also includes

csvsql -i postgresql input.csv

which gives you:

CREATE TABLE input (
        id DECIMAL NOT NULL,
        first_name VARCHAR NOT NULL,
        last_name VARCHAR NOT NULL,
        f DECIMAL NOT NULL
);
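
csvsql can also create the table and load the rows in one step when pointed at a live database (something like csvsql --db postgresql://user:pass@host/dbname --insert input.csv). Alternatively, once the table exists, the CSV can be streamed in from Python. Below is a sketch using psycopg2; the connection details are placeholders for an Azure Database for PostgreSQL server, and the table name matches the "input" table generated above:

import psycopg2

# Placeholder connection details for an Azure Database for PostgreSQL server
conn = psycopg2.connect(
    host="myserver.postgres.database.azure.com",
    dbname="mydb",
    user="myuser",          # older Azure single servers expect the user@servername form
    password="secret",
    sslmode="require",
)

with conn, conn.cursor() as cur, open("input.csv", newline="", encoding="utf-8") as f:
    # Stream the file into the table created above; HEADER skips the column-name row
    cur.copy_expert("COPY input FROM STDIN WITH (FORMAT csv, HEADER true)", f)

conn.close()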

OTHER TIPS

Where does the CSV originate from? I would guess some other database; can't you use the CREATE TABLE statements used for that database?

That aside (though a second data set would be helpful):

  • IDI is obviously an integer
  • S02Q01_Gender: depending on your scenario, most likely an enum type

The other columns look like string data, so a text type seems not wrong ... (note, though, that TINYTEXT is a MySQL type; in PostgreSQL you would use TEXT or VARCHAR).

I understand that some 220 columns across x tables is a lot of work to do manually, but I really suggest investing the time and setting appropriate data types, especially for date/time columns and for columns likely to be used in foreign-key relations (an integer key keeps rows small, which speeds up joins).
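
If installing xsv or csvkit is not an option, the same rough inference can be scripted directly. Here is a minimal sketch; the file name, sample size, and the INTEGER/NUMERIC/TEXT mapping are all assumptions, and date/time columns would still need to be set by hand:

import csv

def infer_pg_type(values):
    """Very rough inference: INTEGER, then NUMERIC, then TEXT as a fallback."""
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "TEXT"
    try:
        for v in non_empty:
            int(v)
        return "INTEGER"
    except ValueError:
        pass
    try:
        for v in non_empty:
            float(v)
        return "NUMERIC"
    except ValueError:
        return "TEXT"

csv_path = "data.csv"                    # hypothetical file name
with open(csv_path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f)               # pass delimiter="\t" for tab-separated data
    header = next(reader)
    sample = [row for _, row in zip(range(1000), reader)]   # inspect the first 1000 rows

types = [
    infer_pg_type([row[i] for row in sample if i < len(row)])
    for i in range(len(header))
]
columns = ",\n".join('    "{}" {}'.format(n.strip(), t) for n, t in zip(header, types))
print("CREATE TABLE my_table (\n{}\n);".format(columns))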

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange