Question

I'm using a Python script to log the I/O of a grid job.
The log is formatted like this:

timestamp;fullpath;event;size
1526994189.49;/tmp/folder/;IN_ISDIR;6
1526994189.49;/tmp/folder2/File;IN_ACCESS;36

Those files are millions of lines long. I'm using Spark to generate graphs and to detect anomalies in job I/O. But before doing that, I need to add the job ID and the job name as columns, turning the file into:

timestamp;fullpath;event;size;jobid;jobname
1526994189.49;/tmp/folder/;IN_ISDIR;6;123456;afakejobname
1526994189.49;/tmp/folder2/File;IN_ACCESS;36;123456;afakejobname
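
Here is a rough PySpark sketch of how I add those two columns; the paths and job values are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("io-log-enrich").getOrCreate()

    # Read the semicolon-delimited log; the first line is the header.
    log = (spark.read
           .option("sep", ";")
           .option("header", True)
           .option("inferSchema", True)
           .csv("/data/io_logs/job.log"))  # hypothetical path

    # Stamp both constants onto every row of this job's log.
    enriched = (log
                .withColumn("jobid", F.lit(123456))
                .withColumn("jobname", F.lit("afakejobname")))

    enriched.write.mode("overwrite").parquet("/data/io_logs/parquet/")  # hypothetical path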

The thing is, I'm new to big data technologies, and I would like to know which is better when using the Parquet format: storing both the job name and the job ID in every row,
or, knowing that a single log contains only 15 distinct job name/job ID pairs, storing only the job ID and recovering the job name on the fly with Spark SQL via a join against a very small table of just jobname;jobid pairs.
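
Here is a rough sketch of that second approach as I imagine it: the big table keeps only jobid, and the job name is joined back in at query time. Spark should broadcast a table this small on its own, but it can be hinted explicitly. Again, paths and values are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("io-log-join").getOrCreate()

    # Tiny lookup table: one row per (jobid, jobname) pair, ~15 rows total.
    jobs = spark.createDataFrame(
        [(123456, "afakejobname")],  # made-up pair
        ["jobid", "jobname"])

    # The big log stores only jobid, not jobname.
    log = spark.read.parquet("/data/io_logs/parquet/")  # hypothetical path

    # Recover jobname at query time with a broadcast join.
    joined = log.join(F.broadcast(jobs), on="jobid", how="left")
    joined.show(5)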

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange