parquet format: advice on log content
31-10-2019
Question
I'm using a Python script to log the IO of a grid job.
The log is formatted like this:
timestamp;fullpath;event;size
1526994189.49;/tmp/folder/;IN_ISDIR;6
1526994189.49;/tmp/folder2/File;IN_ACCESS;36
Those files are millions of lines long. I'm using Spark to generate graphs and detect anomalies in job IO. But before doing that, I need to add the job ID and the job name as columns, making the file:
timestamp;fullpath;event;size;jobid;jobname
1526994189.49;/tmp/folder/;IN_ISDIR;6;123456;afakejobname
1526994189.49;/tmp/folder2/File;IN_ACCESS;36;123456;afakejobname
The thing is, I'm new to big data technologies, and I would like to know which is better with the Parquet format: putting both jobname and jobid in the log, or, given that there are only 15 distinct jobname/jobid pairs in one log, storing only the jobid in the log and converting on the fly with SparkSQL by joining against a very small table containing just jobname;jobid.
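For reference, the second option can be sketched in plain Python, independent of Spark: the log keeps only the compact jobid column, and jobname is attached at analysis time by a lookup against the tiny table (in Spark this would typically be a broadcast join). All job values and the lookup table below are invented for illustration:

```python
import csv
import io

# Hypothetical tiny lookup table with the ~15 known jobs: jobid -> jobname
job_names = {"123456": "afakejobname"}

# Log already enriched with jobid only (no jobname column stored on disk)
raw_log = """timestamp;fullpath;event;size;jobid
1526994189.49;/tmp/folder/;IN_ISDIR;6;123456
1526994189.49;/tmp/folder2/File;IN_ACCESS;36;123456
"""

def join_jobname(log_text, lookup):
    """Equivalent of the SparkSQL join: add jobname by looking up jobid."""
    reader = csv.reader(io.StringIO(log_text), delimiter=";")
    header = next(reader) + ["jobname"]
    # row[-1] is the jobid column; append the matching jobname
    rows = [row + [lookup[row[-1]]] for row in reader]
    return [header] + rows

table = join_jobname(raw_log, job_names)
```

Note that Parquet dictionary-encodes repetitive string columns, so a stored jobname column with only 15 distinct values is also quite cheap on disk; the join approach mainly saves work in the enrichment step.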
No correct solution
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange