How to automate the detection of the column type of a csv file for a table creation script?
18-02-2021
Question
I am new to pgAdmin and Azure database. I have a huge csv file with around 220 columns and I want to create a table out of it in pgAdmin4 to push it to Azure.
However I don't know how to detect automatically the types of columns.
Here is an example:
IDI GKID S01Q01 S02Q01_Gender ...
100093 enq030059569748fc89091fdd91cc337cac44eca90 Yes, I agree Female ...
I'm writing the script to create the table, and given the number of columns I would like to automate generating it, so that I can load the CSV into the database in pgAdmin4 for Microsoft Azure.
Then, after transposing the csv header, I get:
IDI
GKID
S01Q01
S02Q01_Gender
...
I'm writing the script to create the table. However, given the number of columns, I would like, in the best case, to automate the detection of each column's type so I can write it in the right place; in the worst case, to fall back to a generic type such as TEXT.
So far, I've tried:
file_name = "columns.txt"      # one column name per line
string_to_add = " TINYTEXT,"   # type to append after each column name

# Read every line, strip surrounding whitespace, and append the type
with open(file_name, 'r') as f:
    file_lines = [x.strip() + string_to_add + '\n' for x in f]

# Write the result back to the same file
with open(file_name, 'w') as f:
    f.writelines(file_lines)
It gives me back:
IDI TINYTEXT,
GKID TINYTEXT,
S01Q01 TINYTEXT,
S02Q01_Gender TINYTEXT,
...
And, then, I can do:
CREATE TABLE my_table (
IDI TINYTEXT,
GKID TINYTEXT,
S01Q01 TINYTEXT,
S02Q01_Gender TINYTEXT,
...
But I'm not sure that this is enough to make a table able to receive my CSV file.
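For reference, here is a minimal sketch of the TEXT-fallback idea described above: it samples rows with Python's csv module and emits INTEGER or DOUBLE PRECISION where every sampled value parses as a number, and TEXT otherwise. The demo data and table name are placeholders, not the real file:

```python
import csv
import io

def infer_pg_type(values):
    """Guess a PostgreSQL type from sample string values; TEXT is the fallback."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"
    try:
        for v in non_empty:
            int(v)
        return "INTEGER"
    except ValueError:
        pass
    try:
        for v in non_empty:
            float(v)
        return "DOUBLE PRECISION"
    except ValueError:
        pass
    return "TEXT"

def create_table_sql(csv_text, table_name, sample_size=1000):
    """Build a CREATE TABLE statement from CSV text, sampling rows to guess types."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    samples = [[] for _ in header]
    for i, row in enumerate(reader):
        if i >= sample_size:
            break
        for col, value in zip(samples, row):
            col.append(value)
    columns = ",\n".join(
        '    "%s" %s' % (name, infer_pg_type(col))
        for name, col in zip(header, samples)
    )
    return "CREATE TABLE %s (\n%s\n);" % (table_name, columns)

demo = 'IDI,GKID,S01Q01,S02Q01_Gender\n100093,enq0300595,"Yes, I agree",Female\n'
print(create_table_sql(demo, "my_table"))
```

Sampling only the first N rows keeps this fast on a huge file; raise sample_size if early rows are not representative.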
Solution
There are command-line tools that do field type inference.
One is xsv (https://github.com/BurntSushi/xsv/).
Running this command on this sample file (https://gist.githubusercontent.com/aborruso/3b1af402f0d2ed49465f218d19be81d9/raw/c0e95b320924e9e49902633d16e7ab253046ca16/input.csv):
xsv stats input.csv --everything | xsv table
you get:
field       type     sum                 min       max         min_length  max_length  mean                stddev              median              mode  cardinality
id          Integer  5050                1         100         1           3           50.5                28.86607004772212   50.5                N/A   100
first_name  Unicode                      Annabal   Willabella  3           11                                                                      N/A   98
last_name   Unicode                      Albinson  Zaniolini   3           13                                                                      N/A   100
f           Float    2063.2419999999984  0.2656    51.1245     4           7           20.632419999999996  12.603955889545158  17.930799999999998  N/A   100
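Since xsv stats (without piping through xsv table) emits plain CSV that includes field and type columns, that output can be turned into a CREATE TABLE statement with a short script. This is a sketch: it assumes the type names Integer / Float / Unicode shown above, and the PostgreSQL mapping is my own choice, not something xsv prescribes:

```python
import csv
import io

# Assumed mapping from xsv's inferred types (as shown above) to PostgreSQL
# column types; anything unrecognized falls back to TEXT.
XSV_TO_PG = {"Integer": "BIGINT", "Float": "DOUBLE PRECISION", "Unicode": "TEXT"}

def ddl_from_xsv_stats(stats_csv, table_name):
    """Turn the CSV emitted by `xsv stats input.csv` into a CREATE TABLE statement."""
    rows = csv.DictReader(io.StringIO(stats_csv))
    cols = ",\n".join(
        '    "%s" %s' % (row["field"], XSV_TO_PG.get(row["type"], "TEXT"))
        for row in rows
    )
    return "CREATE TABLE %s (\n%s\n);" % (table_name, cols)

# Trimmed-down stand-in for real `xsv stats` output
stats = "field,type\nid,Integer\nfirst_name,Unicode\nf,Float\n"
print(ddl_from_xsv_stats(stats, "input"))
```

In practice you would feed it the captured stats, e.g. the contents of a file written by `xsv stats input.csv > stats.csv`.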
Using csvkit (https://csvkit.readthedocs.io/en/latest/index.html) and running
csvstat --csv input.csv
you get:
column_id  column_name  type    nulls  unique  min    max     sum        mean    median  stdev   len  freq
1          id           Number  False  100     1      100     5,050      50.5    50.5    29.011  -    1, 2, 3, 4, 5
2          first_name   Text    False  98      -      -       -          -       -       -       11   Caren, Weylin, Heall, Flori, Lydia
3          last_name    Text    False  100     -      -       -          -       -       -       13   Saxby, Joderli, Kleinzweig, Coyle, Kleinplac
4          f            Number  False  100     0.266  51.124  2,063.242  20.632  17.931  12.667  -    5.356, 12.596, 32.1245, 5.32, 0.2656
csvkit also provides
csvsql -i postgresql input.csv
which gives you:
CREATE TABLE input (
id DECIMAL NOT NULL,
first_name VARCHAR NOT NULL,
last_name VARCHAR NOT NULL,
f DECIMAL NOT NULL
);
Other tips
Where does the CSV originate from? I would guess some other database; can't you reuse the CREATE TABLE statements used for that database?
That aside (though a second data set would be helpful):
- IDI is obviously an integer
- S02Q01_Gender: depending on your scenario, most likely an enum type
- the other columns look like string data, so a text type seems reasonable (note that TINYTEXT is a MySQL type; the PostgreSQL equivalent is TEXT)
I understand that some 220 columns in x tables is a lot of work to do manually, but I really suggest investing the time and setting appropriate data types, especially for date/time columns and for columns likely used in foreign-key relations (integer size influences the speed of joined queries).
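For the date/time columns mentioned above, one cheap heuristic is to probe sample values against a list of candidate formats with Python's standard library. The format list here is an assumption; extend it to match your data:

```python
from datetime import datetime

# Candidate formats to probe (assumption: each date column uses one
# consistent string format; add more formats as needed).
DATE_FORMATS = ["%Y-%m-%d", "%d-%m-%Y", "%Y-%m-%d %H:%M:%S"]

def detect_date_format(values):
    """Return the first format that parses every non-empty sample value, else None."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for fmt in DATE_FORMATS:
        try:
            for v in non_empty:
                datetime.strptime(v, fmt)
            return fmt
        except ValueError:
            continue
    return None

print(detect_date_format(["2021-02-18", "2020-12-01"]))  # -> %Y-%m-%d
print(detect_date_format(["Female", "Male"]))            # -> None
```

A column that matches a format can then be declared DATE or TIMESTAMP instead of TEXT in the generated DDL.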