Domanda

I have CSV file with millions of records. One of the fields (e.g. $2) contains date. I need the date to be from the last week.

I thought to use AWK to substitute the current date with a random date from the last 7 days. For this, I wanted to create an array indexed 1..7 and it's values will be the last 7 dates (e.g. 2013-10-08, 2013-10-07...).

Is there a function which can calculate dates from a given date? Or some code that can do the job?

My timestamp in the file is "yyyy-mm-dd hh:mm:ss".

Update - answer

This what fitted my needs at last:

BEGIN { srand() } 
{ 
  split($2, a, "-") 
  t  = mktime(a[1] " " a[2] " " substr(a[3], 1, 2) " 00 00 00") 
  $2 = strftime("%Y-%m-%d %H:%M:%S", t - int(7*rand()+1) * 3600 * 24 + int(3600*24*rand())) 
}
1
È stato utile?

Soluzione

I assume you have something like this:

data1,2013-10-04,data2,...

and you want to change the second column date to a random day in the past 7 days.

This is fairly straightforward with GNU awk where you have the mktime and strftime commands.

Here is a gawk script (rnd_date.awk) that works on the input listed above:

BEGIN { srand() }
{
  split($2, a, /-/)
  t  = mktime(a[1] " " a[2] " " a[3] " 00 00 00")
  $2 = strftime("%F", t - int(7*rand()+1) * 3600 * 24)
}
1

Run it like this:

awk -f rnd_date.awk FS=, OFS=, infile

Example output:

data1,2013-09-29,data2,...

There are a couple of caveats:

  1. srand seeds the pseudo-random number generator with "seconds since the epoch", so if you rapidly invoke this script there will be no randomness between each invocation. You can counter this be supplying a better seed, e.g. with GNU date:

    awk -f rnd_date -v s=$(date +%N) 'BEGIN { srand(s) } ...'

  2. Setting $2 will force a reformatting of the whole line, depending on your data format this may or may not be a problem.

Edit

I see, your data is differently formatted, you have:

data1,2013-10-04 01:01:01,data2,...

You can adapt the above like this (with better named variables):

BEGIN { srand() }
{
  split($2, date, /-| |:/)
  since_epoch = mktime(date[1] " " date[2] " " date[3] " " date[4] " " date[5] " " date[6])
  $2 = strftime("%F %T", since_epoch - int(7*rand()+1) * 3600 * 24)
}
1

Or if you do not care about the HH:MM:SS part

BEGIN { srand() }
{
  split($2, a, /-| /)
  t  = mktime(a[1] " " a[2] " " a[3] " 00 00 00")
  $2 = strftime("%F %T", t - int(7*rand()+1) * 3600 * 24)
}
1

Example output:

data1,2013-09-27 00:00:00,data2,...

Altri suggerimenti

You can do some like this awk

awk '$2>=from && $2<=to' from="$(date +%b" "%e" "%H:%M:%S -d -7days)" to="$(date +%b" "%e" "%H:%M:%S)" file

You need to tune the date command to give same output as the date in field $2 that contains the date. Then this will give you the last 7 days of data

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top