Parsing a JSON-like configuration file using R or AWK

https://stackoverflow.com/questions/23294120

09-07-2023

Question

I need your help: I last worked with AWK many years ago and my knowledge is rusty. Although I have refreshed my memory by reading several guides, I'm sure that my code contains mistakes. Most related questions that I've read on SO deal with parsing standard JSON, so their advice does not apply to my case. The closest match is the accepted answer to the SO question "using awk sed to parse update puppet file". However, I'm trying to implement two-pass parsing, and I don't see it in that answer (or don't understand the answer well enough).

After considering other options (from R itself to m4 and various template engines in between), I thought about implementing the solution purely in R via the jsonlite and stringr packages, but that is not elegant. I've decided to write a short AWK script that parses my R project's data collection configuration files before they are read by my R code. Such a file is mostly JSON, but with some additions:

1) it contains embedded variables, i.e. parameters that refer to the values of JSON elements in the same file (which, for simplicity, I decided to place at the root of the JSON tree);

2) parameters are denoted by placing a star character ('*') immediately before the corresponding element's name.

Initially I planned two types of embedded variables, both of which appear in the example below: internal (references to JSON elements in the same file, format: ${var}) and external (user-supplied, format: %{var}). However, the mechanism and benefits of passing values for external parameters are still unclear to me, so for now I am focusing on parsing configuration files with internal variables only. Please disregard the external variables.

Example configuration file:

{
   "*source":"SourceForge",
   "*action":"import",
   "*schema":"sf0314",
   "data":[
      {
         "indicatorName":"test1",
         "indicatorDescription":"Test Indicator 1",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
      },
      {
         "indicatorName":"test2",
         "indicatorDescription":"Test Indicator 2",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * 
                       FROM sf1104.users a, sf1104.artifact b 
                       WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
      },
      {
         "indicatorName":"totalProjects",
         "indicatorDescription":"Total number of unique projects",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
      },
      {
         "indicatorName":"totalDevs",
         "indicatorDescription":"Total number of developers per project",
         "indicatorType":"numeric",
         "resultType":"data.frame",
         "requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = %{group_id}"
      }
   ]
}

AWK script:

#!/usr/bin/awk -f

BEGIN {
  first_pass = true;
  param = "\"\*[a-zA-Z^0-9]+?\"";
  regex = "\$\{[a-zA-Z^0-9]+?\}";
  params[""] = 0;
}

{
  if (first_pass)
    if (match($0, param)) {
      print(substr($0, RSTART, RLENGTH));
      params[param] = substr($0, RSTART, RLENGTH);
    }
  else
      gsub(regex, params[regex], $0);
}

END {
  if (first_pass) {
    ARGC++;
    ARGV[ARGIND++] = FILENAME;
    first_pass = false;
    nextfile;
  }
}

Any help will be much appreciated! Thanks!

UPDATE (based on G. Grothendieck's answer)

The following code (wrapped in a function and slightly modified from the original answer) behaves incorrectly: it outputs the values of all marked (with '_') configuration keys instead of only the referenced one:

generateConfig <- function(configTemplate, configFile) {

  suppressPackageStartupMessages(suppressWarnings(library(tcltk)))
  if (!require(gsubfn)) install.packages('gsubfn')
  library(gsubfn)

  regexKeyValue <- '"_([^"]*)":"([^"]*)"'
  regexVariable <- "[$]{([[:alpha:]][[:alnum:].]*)}"

  cfgTmpl <- readLines(configTemplate)

  defns <- strapplyc(cfgTmpl, regexKeyValue, simplify = rbind)
  dict <- setNames(defns[, 2], defns[, 1])
  config <- gsubfn(regexVariable, dict, cfgTmpl)

  writeLines(config, con = configFile)
}

The function is called as follows:

if (updateNeeded()) {
  <...>
  generateConfig(SRDA_TEMPLATE, SRDA_CONFIG)
}

UPDATE 2 (per G. Grothendieck's request)

The function updateNeeded() checks the existence and modification times of both files and, based on that, decides whether the configuration file needs to be (re)generated (it returns a boolean).

The following is the contents of the template configuration file (SRDA_TEMPLATE <- "./SourceForge.cfg.tmpl"):

{
   "_source":"SourceForge",
   "_action":"import",
   "_schema":"sf0314",
   "data":[
      {
         "indicatorName":"test1",
         "indicatorDescription":"Test Indicator 1",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
      },
      {
         "indicatorName":"test2",
         "indicatorDescription":"Test Indicator 2",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * 
                       FROM sf1104.users a, sf1104.artifact b 
                       WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
      },
      {
         "indicatorName":"totalProjects",
         "indicatorDescription":"Total number of unique projects",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
      },
      {
         "indicatorName":"totalDevs",
         "indicatorDescription":"Total number of developers per project",
         "indicatorType":"numeric",
         "resultType":"data.frame",
         "requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = 78745"
      }
   ]
}

The following is the contents of the auto-generated configuration file (SRDA_CONFIG <- "./SourceForge.cfg.json"):

{
   "_source":"SourceForge",
   "_action":"import",
   "_schema":"sf0314",
   "data":[
      {
         "indicatorName":"test1",
         "indicatorDescription":"Test Indicator 1",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
      },
      {
         "indicatorName":"test2",
         "indicatorDescription":"Test Indicator 2",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * 
                       FROM sf1104.users a, sf1104.artifact b 
                       WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
      },
      {
         "indicatorName":"totalProjects",
         "indicatorDescription":"Total number of unique projects",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM SourceForge import sf0314.user_group"
      },
      {
         "indicatorName":"totalDevs",
         "indicatorDescription":"Total number of developers per project",
         "indicatorType":"numeric",
         "resultType":"data.frame",
         "requestSQL":"SELECT COUNT(*) FROM SourceForge import sf0314.user_group WHERE group_id = 78745"
      }
   ]
}

Notice SourceForge and import, unexpectedly populated before sf0314.

Help by the answer's author will be much appreciated!

Solution

I am assuming the objective is to replace each occurrence of ${...} with the definition given on the star lines. The post indicates that you are looking at awk because an R solution was not elegant, but I think that may have been due to the approach taken in R. I am assuming an R solution is still acceptable if a different approach yields reasonably compact code.

Here config.json is the name of the input JSON file and config.out.json is the output file with the definitions substituted in.

We read in the file and use strapplyc to extract a two-column matrix of the definitions, defns. We rework this into a list, dict, whose values are the values of the variables and whose names are the names of the variables. Then we use gsubfn to insert the definitions using the dict list. Finally we write the result back out.

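library(gsubfn)

Lines <- readLines("config.json")

# Extract (name, value) pairs from the "*name":"value" definition lines
defns <- strapplyc(Lines, '"\\*([^"]*)":"([^"]*)"', simplify = rbind)
# Build a named list: names are variable names, values are their definitions
dict <- setNames(as.list(defns[, 2]), defns[, 1])
# Replace every ${name} reference with its definition from dict
Lines.out <- gsubfn("[$]{([[:alpha:]][[:alnum:].]*)}", dict, Lines)

writeLines(Lines.out, con = "config.out.json")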

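REVISED: dict should be a list rather than a named character vector, which is why the code wraps defns[, 2] in as.list(). Given a list, gsubfn looks up each captured variable name in it and substitutes the corresponding value.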

OTHER TIPS

I believe:

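#!/usr/bin/awk -f
# Note: the three-argument match() used below is a GNU awk (gawk) extension.

BEGIN {
  # Definition lines: "*name":"value" (group 1 = name, group 2 = value)
  param = "\"\\*([a-zA-Z]+)\":\"([^\"]*)\"";
  # References: ${name} (group 1 = name)
  regex = "\\${([a-zA-Z]+)}";
}

# First pass (first copy of the input file): collect the definitions
NR == FNR {
    if (match($0, param, a)) {
      params[a[1]] = a[2]
    }
    next
}

# Second pass: replace every ${...} on the line with the value of the
# first reference matched on that line
match($0, regex, a) {
  gsub(regex, params[a[1]], $0);
}
# Print every line of the second pass
1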

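does what you want (when run as awk -f file.awk input.conf input.conf) for your given input. The input file is named twice on purpose: NR == FNR is true only while the first copy is being read, so the first pass collects the definitions and the second pass performs the substitutions.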
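
The script above relies on the three-argument form of match(), which is a GNU awk extension, and its single gsub() call substitutes the value of the first reference for every ${...} on the line. A minimal sketch of a variant that sticks to POSIX awk features and resolves each reference separately might look like the following. The shell function name expand_config, the [A-Za-z0-9_]+ pattern for variable names, and the choice to leave unknown references untouched are assumptions of this sketch, not part of the original answer.

```shell
#!/bin/sh
# expand_config FILE: print FILE with every ${name} replaced by the value
# of the matching "*name":"value" definition (hypothetical helper name).
expand_config() {
  # The file is passed twice so that awk makes two passes over it.
  awk '
    # Pass 1: collect definitions such as "*schema":"sf0314".
    NR == FNR {
      if (match($0, /"[*][A-Za-z0-9_]+":"[^"]*"/)) {
        # Drop the leading "* and the trailing quote: name":"value
        def = substr($0, RSTART + 2, RLENGTH - 3)
        sep = index(def, "\":\"")
        params[substr(def, 1, sep - 1)] = substr(def, sep + 3)
      }
      next
    }
    # Pass 2: replace the references one at a time, left to right.
    {
      out = ""
      rest = $0
      while (match(rest, /[$][{][A-Za-z0-9_]+[}]/)) {
        ref = substr(rest, RSTART, RLENGTH)            # e.g. ${schema}
        name = substr(rest, RSTART + 2, RLENGTH - 3)   # e.g. schema
        # Assumption: references without a definition are kept as they are.
        out = out substr(rest, 1, RSTART - 1) ((name in params) ? params[name] : ref)
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }
  ' "$1" "$1"
}
```

With the file names from the question, it could be called as expand_config SourceForge.cfg.tmpl > SourceForge.cfg.json. Building the output with substr() instead of gsub() also avoids any special treatment of "&" in a replacement value, and external %{...} references pass through unchanged.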

Licensed under: CC-BY-SA with attribution
