Can I use JsonCpp to partially-validate JSON input?

https://stackoverflow.com/questions/9271075

29-04-2021

Question

I'm using JsonCpp to parse JSON in C++.

e.g.

Json::Reader r;
std::stringstream ss;
ss << "{\"name\": \"sample\"}";

Json::Value v;
assert(r.parse(ss, v));         // OK
assert(v["name"] == "sample");  // OK

But my actual input is a whole stream of JSON messages, that may arrive in chunks of any size; all I can do is to get JsonCpp to try to parse my input, character by character, eating up full JSON messages as we discover them:

Json::Reader r;
std::string input = "{\"name\": \"sample\"}{\"name\": \"aardvark\"}";

// Try ever-longer prefixes of the buffer until one parses as a full message.
for (size_t cursor = 1; cursor <= input.size(); cursor++) {
    std::stringstream ss;
    ss << input.substr(0, cursor);

    Json::Value v;
    if (r.parse(ss, v)) {
        std::cout << v["name"].asString() << " ";
        input.erase(0, cursor);
        cursor = 0;  // restart the scan on what is left of the buffer
    }
} // Output: sample aardvark

This is already a bit nasty, but it does get worse. I also need to be able to resync when part of an input is missing (for any reason).

Now it doesn't have to be lossless, but I want to prevent an input such as the following from potentially breaking the parser forever:

{"name": "samp{"name": "aardvark"}

Passing this input to JsonCpp will fail, but that problem won't go away as we receive more characters into the buffer; that second name is simply invalid directly after the " that precedes it; the buffer can never be completed to present valid JSON.

However, if I could be told that the fragment certainly becomes invalid as of the second n character, I could drop everything in the buffer up to that point, and then simply wait for the next { to consider the start of a new object, as a best-effort resync.

So, is there a way that I can ask JsonCpp to tell me whether an incomplete fragment of JSON has already guaranteed that the complete "object" will be syntactically invalid?

That is:

{"name": "sample"}   Valid        (Json::Reader::parse == true)
{"name": "sam        Incomplete   (Json::Reader::parse == false)
{"name": "sam"LOL    Invalid      (Json::Reader::parse == false)

I'd like to distinguish between the two fail states.

Can I use JsonCpp to achieve this, or am I going to have to write my own JSON "partial validator" by constructing a state machine that considers which characters are "valid" at each step through the input string? I'd rather not re-invent the wheel...

Solution

It depends on whether you actually control the packets (and thus the producer). If you do, the simplest way is to indicate the boundaries in a header:

+---+---+---+---+-----------------------
| 3 | 16|132|243|endofprevious"}{"name":...
+---+---+---+---+-----------------------

The header is simple:

- 3 indicates the number of boundaries
- 16, 132 and 243 indicate the positions of the boundaries, each of which corresponds to the opening brace (or bracket) of a new object (or list)

and then comes the buffer itself.

Upon receiving such a packet, the following entries can be parsed:

previous + current[0:16]
current[16:132]
current[132:243]

And current[243:] is saved for the next packet (though you can always attempt to parse it in case it's complete).

This way, the packets are auto-synchronizing, and there is no fuzzy detection, with all the failure cases it entails.

Note that there could be 0 boundaries in the packet. It simply implies that one object is big enough to span several packets, and you just need to accumulate for the moment.

I would recommend making the number representation "fixed" (for example, 4 bytes each) and settling on a byte order (that of your machine) so that the numbers convert to and from binary easily. I believe the overhead is fairly minimal (4 bytes + 4 bytes per entry, given that {"name":""} is already 11 bytes).
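As a rough illustration of the header scheme above, here is a minimal sketch. It assumes 4-byte unsigned integers in host byte order, as suggested. The names makePacket, splitPacket and readU32 are hypothetical helpers invented for this example and are not part of JsonCpp. Boundaries are offsets into the packet body (the bytes after the header).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Producer side (hypothetical): prefix the body with a count and the
// boundary offsets, each stored as 4 bytes in host byte order.
std::string makePacket(const std::vector<std::uint32_t>& boundaries,
                       const std::string& body) {
    std::string packet;
    const std::uint32_t count = static_cast<std::uint32_t>(boundaries.size());
    packet.append(reinterpret_cast<const char*>(&count), sizeof count);
    for (std::uint32_t b : boundaries)
        packet.append(reinterpret_cast<const char*>(&b), sizeof b);
    return packet + body;
}

// Reads one 4-byte unsigned integer (host byte order) at the given offset.
static std::uint32_t readU32(const std::string& packet, std::size_t offset) {
    assert(offset + sizeof(std::uint32_t) <= packet.size());
    std::uint32_t value;
    std::memcpy(&value, packet.data() + offset, sizeof value);
    return value;
}

// Consumer side (hypothetical): appends every completed message to
// `messages` and leaves the unfinished tail in `pending` for the next packet.
// Returns false if the header is malformed.
bool splitPacket(const std::string& packet, std::string& pending,
                 std::vector<std::string>& messages) {
    const std::size_t word = sizeof(std::uint32_t);
    if (packet.size() < word) return false;
    const std::uint32_t count = readU32(packet, 0);
    if (count > (packet.size() - word) / word) return false;
    const std::string body = packet.substr(word * (1 + count));

    std::size_t start = 0;
    for (std::uint32_t k = 0; k < count; ++k) {
        const std::size_t boundary = readU32(packet, word * (1 + k));
        if (boundary < start || boundary > body.size()) return false;
        // Everything before this boundary finishes the pending message.
        pending.append(body, start, boundary - start);
        if (!pending.empty()) messages.push_back(pending);
        pending.clear();
        start = boundary;
    }
    pending.append(body, start, std::string::npos);  // may be incomplete
    return true;
}
```

Each string placed in messages can then be handed to the JSON parser as a whole. A packet with zero boundaries simply extends pending.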

OTHER TIPS

Iterating through the buffer character-by-character and manually checking for:

- the presence of alphabetic characters
  - outside of a string (being careful that " can be escaped with \, though)
  - not part of null, true or false
  - not an e or E inside what looks like a numeric literal with exponent
- the presence of a digit outside of a string but immediately after a "

...is not all-encompassing, but I think it covers enough cases to fairly reliably break parsing at, or reasonably close to, the point of a message truncation.

It correctly accepts:

{"name": "samL
{"name": "sam0
{"name": "sam", 0
{"name": true

as valid JSON fragments, but catches:

{"name": "sam"L
{"name": "sam"0
{"name": "sam"true

as being unacceptable.
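The checks described above could be sketched roughly as follows. This is not the implementation referred to in this answer, only an illustration under some assumptions: firstInvalidIndex and literalLength are hypothetical names, any letter or digit directly after a closing " is treated as invalid (which is how "sam"true gets rejected), and number handling is deliberately loose.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// If the text at position i starts with null, true or false (or with a
// prefix of one that is cut off by the end of the buffer), returns the
// number of characters matched; otherwise returns 0.
static std::size_t literalLength(const std::string& buf, std::size_t i) {
    assert(i < buf.size());
    static const char* const kLiterals[] = {"null", "true", "false"};
    for (const char* text : kLiterals) {
        const std::string literal(text);
        const std::size_t n = std::min(buf.size() - i, literal.size());
        if (buf.compare(i, n, literal, 0, n) == 0) return n;
    }
    return 0;
}

// Returns the index of the first character that these heuristics consider
// impossible in valid JSON, or std::string::npos if the fragment might
// still be completed into something valid.
std::size_t firstInvalidIndex(const std::string& buf) {
    bool inString = false, escaped = false, afterString = false;
    std::size_t i = 0;
    while (i < buf.size()) {
        const unsigned char c = static_cast<unsigned char>(buf[i]);
        if (inString) {
            if (escaped) escaped = false;          // character after a backslash
            else if (c == '\\') escaped = true;
            else if (c == '"') { inString = false; afterString = true; }
            ++i;
            continue;
        }
        // A letter or digit directly after a closing quote is never valid.
        if (afterString && std::isalnum(c)) return i;
        afterString = false;
        if (c == '"') {
            inString = true;
            ++i;
        } else if (std::isalpha(c)) {
            const std::size_t n = literalLength(buf, i);
            if (n == 0) return i;                  // stray letter outside a string
            i += n;
        } else if (std::isdigit(c) || c == '-') {
            // Loosely consume a numeric literal; e/E only counts after a digit.
            bool sawDigit = false;
            for (; i < buf.size(); ++i) {
                const unsigned char d = static_cast<unsigned char>(buf[i]);
                const bool exponent = (d == 'e' || d == 'E') && sawDigit;
                const bool punct = d == '+' || d == '-' || d == '.';
                if (std::isdigit(d)) sawDigit = true;
                else if (!exponent && !punct) break;
            }
        } else {
            ++i;                                   // structural character or whitespace
        }
    }
    return std::string::npos;
}
```

For the broken input from the question, {"name": "samp{"name": "aardvark"}, this sketch reports the second n as the first invalid character, so everything before that point can be dropped.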

Consequently, the following inputs will all result in the complete trailing object being parsed successfully:

1. {"name": "samp{"name": "aardvark"}
   //            ^ ^
   //            A B    - B is point of failure.
   //                     Stripping leading `{` and scanning for the first
   //                      free `{` gets us to A. (*)
   {"name": "aardvark"}

2. {"name": "samp{"0": "abc"}
   //            ^ ^
   //            A B    - B is point of failure.
   //                     Stripping and scanning gets us to A.
   {"0": "abc"}

3. {"name":{ "samp{"0": "abc"}
   //      ^      ^ ^
   //      A      B C   - C is point of failure.
   //                     Stripping and scanning gets us to A.
   { "samp{"0": "abc"}
   //     ^ ^
   //     B C           - C is still point of failure.
   //                     Stripping and scanning gets us to B.
   {"0": "abc"}
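The "stripping and scanning" step in the three cases above might look like the sketch below. It assumes the plain "{" variant rather than a sentinel string, and resync is a hypothetical name chosen for this example. It would be called each time a point of failure is detected, until the buffer parses or runs empty.

```cpp
#include <cstddef>
#include <string>

// Drops the leading '{' and everything up to (but not including) the next
// '{', so that the buffer starts at the next candidate object. If there is
// no further '{', the buffer is emptied.
void resync(std::string& buffer) {
    const std::size_t next = buffer.find('{', 1);  // skip the leading '{'
    if (next == std::string::npos) {
        buffer.clear();
    } else {
        buffer.erase(0, next);
    }
}
```

Applied once to case 1 this leaves {"name": "aardvark"}; applied twice to case 3 it leaves {"0": "abc"}.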

My implementation passes some quite thorough unit tests. Still, I wonder whether the approach itself can be improved without exploding in complexity.

(*) Instead of looking for a leading "{", I actually have a sentinel string prepended to every message, which makes the "stripping and scanning" part even more reliable.

Just look at expat or other streaming XML parsers. The logic of JsonCpp should be similar; if it's not, ask the developers of the library to improve stream reading.

In other words, and from my point of view:

- If some of your network (not JSON) packets are lost, that is not the JSON parser's problem; use a more reliable protocol, or invent your own, and only then transfer JSON over it.
- If the JSON parser reports an error and the error occurred on the last parsed token (more data was expected but the stream had none), accumulate more data and try again (this task should be done by the library itself). Sometimes it may not report an error, though: for example, when you transfer 123456 and only 123 is received. This does not apply to your case, since you don't transfer primitive data in a single JSON packet.
- If the stream contains valid packets followed by partially received packets, a callback should be called for each valid packet.
- If the JSON parser reports errors and the JSON really is invalid, the stream should be closed and reopened if necessary.
