R: replacing double escaped text

https://stackoverflow.com/questions/3177091

02-10-2019

Question

I'm gluing together a number of system calls using the Amazon Elastic Map Reduce command line tools. These commands return JSON text which has already been (partially?) escaped. Then when the system call turns it into an R text object (intern=T) it appears to get escaped again. I need to clean this up so it will parse with the rjson package.

I do the system call this way:

system("~/EMR/elastic-mapreduce --describe --jobflow j-2H9P770Z4B8GG", intern=T)

which returns:

 [1] "{"                                                                                             
 [2] "  \"JobFlows\": ["                                                                             
 [3] "    {"                                                                                         
 [4] "      \"LogUri\": \"s3n:\\/\\/emrlogs\\/\","                                                   
 [5] "      \"Name\": \"emrFromR\","                                                                 
 [6] "      \"BootstrapActions\": [" 
...

but the same command from the command line returns:

{
  "JobFlows": [
    {
      "LogUri": "s3n:\/\/emrlogs\/",
      "Name": "emrFromR",
      "BootstrapActions": [
        {
          "BootstrapActionConfig": {
...

If I try to run the results of the system call through rjson, I get this error:

Error: '\/' is an unrecognized escape in character string starting "s3n:\/"

I believe this is because of the double escaping in the s3n line. I'm struggling to get this text massaged into something that will parse.

It might be as simple as replacing "\\" with "\" but since I kinda struggle with regex and escaping, I can't get that done properly.

So how do I take a vector of strings and replace any occurrence of "\\" with "\"? (even to type this question I had to use three back slashes to represent two) Any other tips related to this specific use case?

Here's my code in more detail:

> library(rjson)
> emrJson <- paste(system("~/EMR/elastic-mapreduce --describe --jobflow j-2H9P770Z4B8GG", intern=T))
> 
>     parser <- newJSONParser()
>     for (i in 1:length(emrJson)){
+       parser$addData(emrJson[i])
+     }
> 
> parser$getObject()
Error: '\/' is an unrecognized escape in character string starting "s3n:\/"

and if you're itching to recreate the emrJson object, here's the dput() output:

> dput(emrJson)
c("{", "  \"JobFlows\": [", "    {", "      \"LogUri\": \"s3n:\\/\\/emrlogs\\/\",", 
"      \"Name\": \"emrFromR\",", "      \"BootstrapActions\": [", 
"        {", "          \"BootstrapActionConfig\": {", "            \"Name\": \"Bootstrap 0\",", 
"            \"ScriptBootstrapAction\": {", "              \"Path\": \"s3:\\/\\/rtmpfwblrx\\/bootstrap.sh\",", 
"              \"Args\": []", "            }", "          }", 
"        }", "      ],", "      \"ExecutionStatusDetail\": {", 
"        \"EndDateTime\": 1278124414.0,", "        \"CreationDateTime\": 1278123795.0,", 
"        \"LastStateChangeReason\": \"Steps completed\",", "        \"State\": \"COMPLETED\",", 
"        \"StartDateTime\": 1278124000.0,", "        \"ReadyDateTime\": 1278124237.0", 
"      },", "      \"Steps\": [", "        {", "          \"StepConfig\": {", 
"            \"ActionOnFailure\": \"CANCEL_AND_WAIT\",", "            \"Name\": \"Example Streaming Step\",", 
"            \"HadoopJarStep\": {", "              \"MainClass\": null,", 
"              \"Jar\": \"\\/home\\/hadoop\\/contrib\\/streaming\\/hadoop-0.18-streaming.jar\",", 
"              \"Args\": [", "                \"-input\",", "                \"s3n:\\/\\/rtmpfwblrx\\/stream.txt\",", 
"                \"-output\",", "                \"s3n:\\/\\/rtmpfwblrxout\\/\",", 
"                \"-mapper\",", "                \"s3n:\\/\\/rtmpfwblrx\\/mapper.R\",", 
"                \"-reducer\",", "                \"cat\",", 
"                \"-cacheFile\",", "                \"s3n:\\/\\/rtmpfwblrx\\/emrData.RData#emrData.RData\"", 
"              ],", "              \"Properties\": []", "            }", 
"          },", "          \"ExecutionStatusDetail\": {", "            \"EndDateTime\": 1278124322.0,", 
"            \"CreationDateTime\": 1278123795.0,", "            \"LastStateChangeReason\": null,", 
"            \"State\": \"COMPLETED\",", "            \"StartDateTime\": 1278124232.0", 
"          }", "        }", "      ],", "      \"JobFlowId\": \"j-2H9P770Z4B8GG\",", 
"      \"Instances\": {", "        \"Ec2KeyName\": \"JL 09282009\",", 
"        \"InstanceCount\": 2,", "        \"Placement\": {", 
"          \"AvailabilityZone\": \"us-east-1d\"", "        },", 
"        \"KeepJobFlowAliveWhenNoSteps\": false,", "        \"SlaveInstanceType\": \"m1.small\",", 
"        \"MasterInstanceType\": \"m1.small\",", "        \"MasterPublicDnsName\": \"ec2-174-129-70-89.compute-1.amazonaws.com\",", 
"        \"MasterInstanceId\": \"i-2147b84b\",", "        \"InstanceGroups\": null,", 
"        \"HadoopVersion\": \"0.18\"", "      }", "    }", "  ]", 
"}")

Solution

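The general rule is to use double the number of backslashes you think you need (can't find the source now): a backslash is escaped once for the R string literal and once more for the regular expression, so it takes four backslashes in the pattern to match one literal backslash.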

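library(rjson)

# "\\\\" is the regex \\ (one literal backslash); replacing it with ""
# turns the JSON escape \/ into a plain /
emrJson <- gsub("\\\\", "", emrJson)
parser <- newJSONParser()
for (i in seq_along(emrJson)) {
    parser$addData(emrJson[i])
}
parser$getObject()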

This worked here with your dput() output.
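
As far as I can tell, the gsub() call above ends up deleting every backslash in the text. That is fine for this output, but it would also strip escapes such as \" inside JSON string values. A narrower sketch, assuming the only problematic escape is \/ (as in the dput() output), replaces just that two-character sequence. The helper name unescape_slashes is hypothetical, and sample_json is a shortened stand-in for emrJson whose "Name" value is made up to show that other escapes survive:

```r
# Hypothetical helper: replace only the JSON escape \/ with a plain /.
# fixed = TRUE treats the pattern as a literal string, so "\\/" is just
# the two characters backslash + slash (no regex escaping layer).
unescape_slashes <- function(x) {
  gsub("\\/", "/", x, fixed = TRUE)
}

# Shortened stand-in for the emrJson vector from the question.
# The "Name" value is invented to include an escaped quote.
sample_json <- c(
  "{",
  "  \"LogUri\": \"s3n:\\/\\/emrlogs\\/\",",
  "  \"Name\": \"emr \\\"From\\\" R\"",
  "}"
)

cleaned <- unescape_slashes(sample_json)
cat(cleaned, sep = "\n")

# Optional: parse the result if rjson is installed.
# fromJSON() expects a single string, so collapse the lines first.
if (requireNamespace("rjson", quietly = TRUE)) {
  parsed <- rjson::fromJSON(paste(cleaned, collapse = "\n"))
  parsed$LogUri
}
```

Collapsing with paste(..., collapse = "\n") also avoids the line-by-line addData() loop, since the whole document goes to the parser in one call.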

OTHER TIPS

I'm not sure that it is double escaped. Remember that you need to use cat() to see what the string actually contains; printing it at the console shows an escaped representation of the string, not the string itself.
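
A quick way to check this, using the LogUri line from the dput() output: print() shows the escaped representation, while cat() and a backslash count show what is actually stored. The variable names x and n_backslashes are just for illustration:

```r
# One element of emrJson, written as an R string literal (as dput() shows it).
x <- "\"LogUri\": \"s3n:\\/\\/emrlogs\\/\","

print(x)       # escaped representation: quotes and backslashes appear escaped
cat(x, "\n")   # the actual characters stored in the string

# Count the literal backslashes in the string. With fixed = TRUE the
# pattern "\\" is a single backslash character, not a regex.
n_backslashes <- lengths(regmatches(x, gregexpr("\\", x, fixed = TRUE)))
n_backslashes  # 3: one per \/ in the JSON text, not 6
```

So the string already holds the same \/ sequences the command line prints, which suggests the problem is how the parser handles the \/ escape rather than a second layer of escaping.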

Licensed under: CC-BY-SA with attribution
