Question

I have a list of ugly-looking JSON objects in a text file, one per line. I would like to pretty-print them and send the results to a file.

My attempt to use the command-line python version of json.tool:

parallel python -mjson.tool < jsonList

However, something seems to go wrong: python's json.tool treats the incoming line as a file name to open rather than as JSON, and thus throws:

IOError: [Errno 2] No such file or directory: {line contents, which contain single quotes, spaces, double quotes}

How can I compel this to treat each line-separated object as a single argument to the module? Opening the file directly in python and processing it serially is an inefficient solution because the file is enormous. Attempting to do so pegs the CPU.

Solution

GNU Parallel will by default put the input on the command line as arguments. So what you are effectively running is:

python -mjson.tool \[\"cheese\",\ \{\"cake\":\[\"coke\",\ null,\ 160,\ 2\]\}\]

But what you want is:

echo \[\"cheese\",\ \{\"cake\":\[\"coke\",\ null,\ 160,\ 2\]\}\] | python -mjson.tool

GNU Parallel can do that with --pipe -N1:

parallel -N1 --pipe python -mjson.tool < jsonList
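
In effect, --pipe -N1 chops stdin into records of one line each and starts a fresh python -mjson.tool for every record, with that record on its stdin. A rough sketch of the per-record work (assuming each line of jsonList holds exactly one valid JSON value):

import json
import sys

# Roughly what each json.tool invocation does with the one-line record
# it receives on stdin from --pipe -N1: parse it, then pretty-print it.
record = sys.stdin.read()
parsed = json.loads(record)
sys.stdout.write(json.dumps(parsed, indent=4, sort_keys=True) + "\n")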

10-second installation:

wget -O - pi.dk/3 | bash

Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.

Other tips

Well, the json module already has something similar to what you have in mind.

>>> import json
>>>
>>> my_json = '["cheese", {"cake":["coke", null, 160, 2]}]'
>>> parsed = json.loads(my_json)
>>> print json.dumps(parsed, indent=4, sort_keys=True)
[
    "cheese", 
    {
        "cake": [
            "coke", 
            null, 
            160, 
            2
        ]
    }
]

And you can just read my_json from a text file by opening it in 'r' mode.
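
For the original problem, a serial sketch under the assumption that the file holds one JSON object per line (the file names jsonList and prettyOut are placeholders):

import json

# Serial sketch: pretty-print each line of jsonList into prettyOut.
# Assumes one JSON object per line; file names are placeholders.
with open("jsonList", "r") as infile, open("prettyOut", "w") as outfile:
    for line in infile:
        line = line.strip()
        if not line:
            continue
        parsed = json.loads(line)
        outfile.write(json.dumps(parsed, indent=4, sort_keys=True) + "\n")

This is the single-core version the asker found too slow for a huge file, but it shows the same json.loads/json.dumps round trip applied line by line.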

Two problems with my approach, which I eventually solved:

The default parallelization will spawn a new Python VM for each input, which is... slow. So slow.

The default json.tool does the naive thing, but somehow gets confused by the number of incoming arguments.

I wrote this:

import sys
import json

# Pretty-print every JSON object passed as a command-line argument.
for i in sys.argv[1:]:
    o = json.loads(i)
    json.dump(o, sys.stdout, indent=4, separators=(',', ': '))
    sys.stdout.write('\n')  # keep consecutive objects separated in the output

Then called it like this:

parallel -n 500 python fastProcess.py < filein > prettyfileout

I'm not quite sure of the optimal value of n, but the script is 4-5x faster in wall clock time than the naive implementation due to the ability to use multiple cores.
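
A possible variant (my sketch, not part of the original answer) reads the records from stdin instead of argv, so it can be combined with the accepted answer's --pipe and a larger block size, sidestepping command-line quoting entirely; the script name fastProcessStdin.py and the block size are illustrative:

import json
import sys

# Sketch: each GNU Parallel worker receives a block of newline-delimited
# JSON objects on stdin and pretty-prints them one after another.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    o = json.loads(line)
    sys.stdout.write(json.dumps(o, indent=4, separators=(',', ': ')) + "\n")

It could then be invoked with something like:

parallel --pipe -N500 python fastProcessStdin.py < filein > prettyfileout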

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow