I need your help: I worked with AWK many years ago and my knowledge is rusty now. Despite refreshing my memory by reading several guides, I'm sure that my code contains mistakes. Most related questions that I've read on SO deal with parsing standard JSON, so their advice is not applicable to my case. The only answer close to what I'm looking for is the accepted answer to this SO question: using awk sed to parse update puppet file. However, I'm trying to implement two-pass parsing, which I don't see in that answer (or I don't understand it well enough to recognize it).
After considering other options (from R itself to m4 and various template engines in between), I thought about implementing the solution purely in R via the jsonlite and stringr packages, but that is not elegant. I've decided to write a short AWK script that parses my R project's data collection configuration files before they are read by my R code. Such a file is for the most part a JSON file, but with some additions:
1) it contains embedded variables that are parameters, referring to values of JSON elements in the same file (which for simplicity I decided to place in the root of the JSON tree);
2) parameters are denoted by placing a star character ('*') immediately before the corresponding elements' names.
Initially I planned two types of embedded variables, which you can see here: internal (references to JSON elements in the same file, format: ${var}) and external (user-supplied, format: %{var}). However, the mechanism and benefits of passing values for the external parameters are still unclear to me, so for now I focus on parsing a configuration file with internal variables only. Please disregard the external variables for now.
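To make the notation concrete, here is a minimal sketch (not part of my script) that only classifies the three forms described above: a "*name" definition, a ${name} internal reference and a %{name} external reference. The two sample lines and the assumption that names consist of letters, digits and underscores are illustrative choices, not requirements of the format.

```shell
# Hypothetical two-line sample using the notation described above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"*schema":"sf0314",
"requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = %{group_id}"
EOF

report=$(mktemp)
awk '
{
    rest = $0
    # A definition is a key whose name starts with a star: "*name":
    if (match(rest, /"[*][A-Za-z0-9_]+":/))
        print "definition " substr(rest, RSTART + 2, RLENGTH - 4)
    # References: ${name} is internal, %{name} is external.
    while (match(rest, /[$%][{][A-Za-z0-9_]+[}]/)) {
        kind = (substr(rest, RSTART, 1) == "$") ? "internal" : "external"
        print kind " " substr(rest, RSTART + 2, RLENGTH - 3)
        rest = substr(rest, RSTART + RLENGTH)   # continue after this reference
    }
}' "$sample" > "$report"

kinds=$(cat "$report")
echo "$kinds"
rm -f "$sample" "$report"
```

Bracket expressions such as [$] and [{] are used instead of backslash escapes so that the patterns should behave the same across awk implementations.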
Example configuration file:
{
"*source":"SourceForge",
"*action":"import",
"*schema":"sf0314",
"data":[
{
"indicatorName":"test1",
"indicatorDescription":"Test Indicator 1",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
},
{
"indicatorName":"test2",
"indicatorDescription":"Test Indicator 2",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
},
{
"indicatorName":"totalProjects",
"indicatorDescription":"Total number of unique projects",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
},
{
"indicatorName":"totalDevs",
"indicatorDescription":"Total number of developers per project",
"indicatorType":"numeric",
"resultType":"data.frame",
"requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = %{group_id}"
}
]
}
AWK script:
#!/usr/bin/awk -f
BEGIN {
    first_pass = true;
    param = "\"\*[a-zA-Z^0-9]+?\"";
    regex = "\$\{[a-zA-Z^0-9]+?\}";
    params[""] = 0;
}
{
    if (first_pass)
        if (match($0, param)) {
            print(substr($0, RSTART, RLENGTH));
            params[param] = substr($0, RSTART, RLENGTH);
        }
    else
        gsub(regex, params[regex], $0);
}
END {
    if (first_pass) {
        ARGC++;
        ARGV[ARGIND++] = FILENAME;
        first_pass = false;
        nextfile;
    }
}
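For comparison, below is a minimal sketch of the two-pass behaviour I am after, written with the common idiom of naming the same file twice on the command line, so that NR == FNR is true only during the first read. It is a sketch under assumptions, not a confirmed solution: the trimmed sample file is hypothetical, names are assumed to consist of letters, digits and underscores, and each definition is assumed to sit on one line as "*name":"value". It also avoids a few things in my script that I now suspect are wrong: true and false are not awk keywords (they are just uninitialized variables), the non-greedy +? quantifier is not part of POSIX extended regular expressions, ARGIND is specific to gawk, and the END block runs only after all input has been consumed, so it cannot restart reading.

```shell
# Hypothetical trimmed-down template in the format shown above.
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
{
"*source":"SourceForge",
"*schema":"sf0314",
"data":[
{
"requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = %{group_id}"
}
]
}
EOF

out=$(mktemp)
# The same file is passed twice: pass 1 collects definitions, pass 2 substitutes.
awk '
NR == FNR {
    # Pass 1: remember every "*name":"value" definition and print nothing.
    if (match($0, /"[*][A-Za-z0-9_]+":"[^"]*"/)) {
        pair  = substr($0, RSTART, RLENGTH)
        colon = index(pair, "\":\"")        # start of the quote-colon-quote separator
        params[substr(pair, 3, colon - 3)] = substr(pair, colon + 3, length(pair) - colon - 3)
    }
    next
}
{
    # Pass 2: replace each ${name} that was defined; leave unknown names as they are.
    line = ""
    rest = $0
    while (match(rest, /[$][{][A-Za-z0-9_]+[}]/)) {
        ref  = substr(rest, RSTART, RLENGTH)
        name = substr(rest, RSTART + 2, RLENGTH - 3)
        line = line substr(rest, 1, RSTART - 1) ((name in params) ? params[name] : ref)
        rest = substr(rest, RSTART + RLENGTH)
    }
    print line rest
}' "$tmpl" "$tmpl" > "$out"

result=$(cat "$out")
echo "$result"
rm -f "$tmpl" "$out"
```

In this sketch the %{group_id} external reference is passed through untouched, and the "*name" definitions stay in the output; removing them or renaming the keys would be a separate step.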
Any help will be much appreciated! Thanks!
UPDATE (based on G. Grothendieck's answer)
The following code (wrapped in a function and slightly modified from the original answer) behaves incorrectly: it unexpectedly outputs the values of all marked (with '_') configuration keys instead of only the referenced ones:
generateConfig <- function(configTemplate, configFile) {
  suppressPackageStartupMessages(suppressWarnings(library(tcltk)))
  if (!require(gsubfn)) install.packages('gsubfn')
  library(gsubfn)
  regexKeyValue <- '"_([^"]*)":"([^"]*)"'
  regexVariable <- "[$]{([[:alpha:]][[:alnum:].]*)}"
  cfgTmpl <- readLines(configTemplate)
  defns <- strapplyc(cfgTmpl, regexKeyValue, simplify = rbind)
  dict <- setNames(defns[, 2], defns[, 1])
  config <- gsubfn(regexVariable, dict, cfgTmpl)
  writeLines(config, con = configFile)
}
The function is called as follows:
if (updateNeeded()) {
  <...>
  generateConfig(SRDA_TEMPLATE, SRDA_CONFIG)
}
UPDATE 2 (per G. Grothendieck's request)
Function updateNeeded() checks the existence and modification time of both files and, based on that, decides whether the config file needs to be (re)generated (returns a boolean).
The following is the contents of the template configuration file (SRDA_TEMPLATE <- "./SourceForge.cfg.tmpl"):
{
"_source":"SourceForge",
"_action":"import",
"_schema":"sf0314",
"data":[
{
"indicatorName":"test1",
"indicatorDescription":"Test Indicator 1",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
},
{
"indicatorName":"test2",
"indicatorDescription":"Test Indicator 2",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
},
{
"indicatorName":"totalProjects",
"indicatorDescription":"Total number of unique projects",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
},
{
"indicatorName":"totalDevs",
"indicatorDescription":"Total number of developers per project",
"indicatorType":"numeric",
"resultType":"data.frame",
"requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = 78745"
}
]
}
The following is the contents of the auto-generated configuration file (SRDA_CONFIG <- "./SourceForge.cfg.json"):
{
"_source":"SourceForge",
"_action":"import",
"_schema":"sf0314",
"data":[
{
"indicatorName":"test1",
"indicatorDescription":"Test Indicator 1",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
},
{
"indicatorName":"test2",
"indicatorDescription":"Test Indicator 2",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
},
{
"indicatorName":"totalProjects",
"indicatorDescription":"Total number of unique projects",
"indicatorType":"numeric",
"resultType":"numeric",
"requestSQL":"SELECT COUNT(DISTINCT group_id) FROM SourceForge import sf0314.user_group"
},
{
"indicatorName":"totalDevs",
"indicatorDescription":"Total number of developers per project",
"indicatorType":"numeric",
"resultType":"data.frame",
"requestSQL":"SELECT COUNT(*) FROM SourceForge import sf0314.user_group WHERE group_id = 78745"
}
]
}
Notice SourceForge and import, unexpectedly populated before sf0314.
Help from the answer's author will be much appreciated!