Question

I'm storing JSON objects in a database. Many—perhaps most—of these objects will be duplicates, so I would like to key them on something like a SHA hash to avoid creating unnecessary extra records.

The problem is, at the point where I want to write them to the database, I no longer have the JSON bytes—just the Foundation objects returned by NSJSONSerialization. Because NSDictionary doesn't make any guarantees about key order (and even if it did, I'm not sure that the server I'm getting the data from does), I can't be certain NSJSONSerialization will output each object's fields in the same order each time I call it. That means that the same object could have different digests, defeating my attempts to save space.

Is there an Objective-C JSON library that does always write the exact same JSON for equivalent objects, presumably by sorting the keys before writing them? I'm targeting iOS 7, but this is probably a Foundation-level concern.

Solution

Rather than trying to write my own JSON serializer, I decided to coax Apple's into doing what I want by wrapping the object in a proxy.

Usage:

NSData * JSONData = [NSJSONSerialization dataWithJSONObject:[jsonObject objectWithSortedKeys] options:0 error:&error];

Header:

#import <Foundation/Foundation.h>

@interface NSObject (sortedKeys)

/// Returns a proxy for the object in which all dictionary keys, including those of child objects at any level, will always be enumerated in sorted order.
- (id)objectWithSortedKeys;

@end

Code:

#import "NSObject+sortedKeys.h"

/// A CbxSortedKeyWrapper intercepts calls to methods like -allKeys, -objectEnumerator, -enumerateKeysAndObjectsUsingBlock:, etc. and makes them enumerate a sorted array of keys, thus ensuring that keys are enumerated in a stable order. It also replaces objects returned by any other methods (including, say, -objectForKey: or -objectAtIndex:) with wrapped versions of those objects, thereby ensuring that child objects are similarly sorted. There are a lot of flaws in this approach, but it works well enough for NSJSONSerialization.
@interface CbxSortedKeyWrapper: NSProxy

+ (id)sortedKeyWrapperForObject:(id)object;

@end

@implementation NSObject (sortedKeys)

- (id)objectWithSortedKeys {
    return [CbxSortedKeyWrapper sortedKeyWrapperForObject:self];
}

@end

@implementation CbxSortedKeyWrapper {
    id _representedObject;
    NSArray * _keys;
}


+ (id)sortedKeyWrapperForObject:(id)object {
    if(!object) {
        return nil;
    }

    CbxSortedKeyWrapper * wrapper = [self alloc];
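    // Copy when possible so later mutation of the original can't change the output.
    // Objects that don't adopt NSCopying (e.g. NSEnumerator) are wrapped as-is, since -copy would raise.
    wrapper->_representedObject = [object conformsToProtocol:@protocol(NSCopying)] ? [object copy] : object;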

    if([wrapper->_representedObject respondsToSelector:@selector(allKeys)]) {
        wrapper->_keys = [[wrapper->_representedObject allKeys] sortedArrayUsingSelector:@selector(compare:)];
    }

    return wrapper;
}

- (NSMethodSignature*)methodSignatureForSelector:(SEL)aSelector {
    return [_representedObject methodSignatureForSelector:aSelector];
}

- (void)forwardInvocation:(NSInvocation*)invocation {
    [invocation invokeWithTarget:_representedObject];

    BOOL returnsObject = invocation.methodSignature.methodReturnType[0] == '@';

    if(returnsObject) {
        __unsafe_unretained id out = nil;
        [invocation getReturnValue:&out];

        __unsafe_unretained id wrapper = [CbxSortedKeyWrapper sortedKeyWrapperForObject:out];
        [invocation setReturnValue:&wrapper];
    }
}

- (NSEnumerator *)keyEnumerator {
    return [_keys objectEnumerator];
}

- (NSEnumerator *)objectEnumerator {
    if(_keys) {
        return [[self allValues] objectEnumerator];
    }
    else {
        return [CbxSortedKeyWrapper sortedKeyWrapperForObject:[_representedObject objectEnumerator]];
    }
}

- (NSArray *)allKeys {
    return _keys;
}

- (NSArray *)allValues {
    return [CbxSortedKeyWrapper sortedKeyWrapperForObject:[_representedObject objectsForKeys:_keys notFoundMarker:[NSNull null]]];
}

- (void)enumerateKeysAndObjectsUsingBlock:(void (^)(id key, id obj, BOOL *stop))block {
    [_keys enumerateObjectsUsingBlock:^(id key, NSUInteger idx, BOOL *stop) {
        id obj = [CbxSortedKeyWrapper sortedKeyWrapperForObject:_representedObject[key]];
        block(key, obj, stop);
    }];
}

- (void)enumerateKeysAndObjectsWithOptions:(NSEnumerationOptions)opts usingBlock:(void (^)(id key, id obj, BOOL *stop))block {
    [_keys enumerateObjectsWithOptions:opts usingBlock:^(id key, NSUInteger idx, BOOL *stop) {
        id obj = [CbxSortedKeyWrapper sortedKeyWrapperForObject:_representedObject[key]];
        block(key, obj, stop);
    }];
}

- (void)enumerateObjectsUsingBlock:(void (^)(id obj, NSUInteger idx, BOOL *stop))block {
    [_representedObject enumerateObjectsUsingBlock:^(id obj, NSUInteger idx, BOOL * stop) {
        block([CbxSortedKeyWrapper sortedKeyWrapperForObject:obj], idx, stop);
    }];
}

- (void)enumerateObjectsWithOptions:(NSEnumerationOptions)opts usingBlock:(void (^)(id obj, NSUInteger idx, BOOL *stop))block {
    [_representedObject enumerateObjectsWithOptions:opts usingBlock:^(id obj, NSUInteger idx, BOOL * stop) {
        block([CbxSortedKeyWrapper sortedKeyWrapperForObject:obj], idx, stop);
    }];
}

- (void)enumerateObjectsAtIndexes:(NSIndexSet *)indexSet options:(NSEnumerationOptions)opts usingBlock:(void (^)(id obj, NSUInteger idx, BOOL *stop))block {
    [_representedObject enumerateObjectsAtIndexes:indexSet options:opts usingBlock:^(id obj, NSUInteger idx, BOOL * stop) {
        block([CbxSortedKeyWrapper sortedKeyWrapperForObject:obj], idx, stop);
    }];
}

- (NSUInteger)countByEnumeratingWithState:(NSFastEnumerationState *)state objects:(__unsafe_unretained id *)stackbuf count:(NSUInteger)len {
    NSUInteger count = [_keys countByEnumeratingWithState:state objects:stackbuf count:len];
    for(NSUInteger i = 0; i < count; i++) {
        stackbuf[i] = [CbxSortedKeyWrapper sortedKeyWrapperForObject:stackbuf[i]];
    }
    return count;
}

@end
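
To connect this back to the question's goal of keying records on a SHA hash, here is a minimal sketch of turning the serialized bytes into a digest. The helper name CbxHexSHA256 is hypothetical (it is not part of the answer's code), and SHA-256 via CommonCrypto is just one reasonable choice of digest.

```objc
#import <Foundation/Foundation.h>
#import <CommonCrypto/CommonDigest.h>

/// Hypothetical helper: returns the SHA-256 digest of `data` as a
/// lowercase hex string, suitable for use as a database key.
static NSString *CbxHexSHA256(NSData *data) {
    unsigned char digest[CC_SHA256_DIGEST_LENGTH];
    CC_SHA256(data.bytes, (CC_LONG)data.length, digest);

    NSMutableString *hex = [NSMutableString stringWithCapacity:CC_SHA256_DIGEST_LENGTH * 2];
    for (NSUInteger i = 0; i < CC_SHA256_DIGEST_LENGTH; i++) {
        [hex appendFormat:@"%02x", digest[i]];
    }
    return hex;
}
```

Passing the JSONData produced in the usage line above to this function should then give the same key for equivalent objects, provided the serializer's other output details stay stable.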

OTHER TIPS

First off, valid JSON text is not a good basis for a hash either: a given object can be represented by many different, equally valid JSON texts:

  • JSON is text, and its character encoding is Unicode. There are five Unicode encoding schemes: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE. Each scheme yields a different hash, even though the object is the same.

  • JSON may contain insignificant spaces and tabs (aka "pretty printed").

  • Any character in a JSON string can optionally be written as a Unicode escape, and the "solidus" character / may or may not be escaped.

  • Furthermore, the order of the members of a JSON object is not specified. Finally, the behavior for duplicate keys in a JSON object is undefined as well.

Thus, for any unique object there can be more than one valid JSON text representation, which makes JSON text inappropriate for creating a hash.
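
As a small illustration of the points above, the following sketch checks whether two JSON texts parse to equal Foundation objects while differing byte for byte. The function names are made up for this example.

```objc
#import <Foundation/Foundation.h>

/// Parses UTF-8 JSON text into Foundation objects; returns nil on failure.
static id ParseJSON(NSString *text) {
    NSData *data = [text dataUsingEncoding:NSUTF8StringEncoding];
    return [NSJSONSerialization JSONObjectWithData:data options:0 error:NULL];
}

/// Returns YES when both texts describe the same object but their
/// UTF-8 bytes differ, i.e. when hashing the raw text would produce
/// two different digests for one logical object.
static BOOL SameObjectDifferentBytes(NSString *first, NSString *second) {
    id a = ParseJSON(first);
    id b = ParseJSON(second);
    BOOL sameObject = (a != nil) && [a isEqual:b];

    NSData *firstBytes = [first dataUsingEncoding:NSUTF8StringEncoding];
    NSData *secondBytes = [second dataUsingEncoding:NSUTF8StringEncoding];
    BOOL sameBytes = [firstBytes isEqualToData:secondBytes];

    return sameObject && !sameBytes;
}
```

For instance, {"a":1,"b":"/"} and { "b" : "\/", "a" : 1 } differ in whitespace, member order and solidus escaping, yet describe the same object.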

A solution would require defining your own JSON parser and serializer, with properties that make the generated representation (whatever it actually is) suitable for hashing.

It might seem sufficient to use an "average" JSON parser/serializer: given valid JSON, we would create a representation and then serialize it back to a "canonical" JSON by setting options that produce a special form in which the keys are ordered.

However, this assumes that you use exactly the same parser/serializer to generate the hash for the lifetime of your database. That implies that its possibly undocumented internals and implementation details MUST NOT change, so that the generated variation of the JSON is always exactly the same (see above for the ways a JSON text can vary). If some implementation detail changes, for example a newer version starts escaping the "solidus" character, your database will break.

Unfortunately, NSJSONSerialization lacks that kind of documentation, and it also has no options for setting these properties (key ordering, for example) to produce a "special" JSON representation that would be suitable for creating a hash of the JSON object.
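
That describes the SDK targeted by the question (iOS 7). Later SDKs added a key-sorting option, NSJSONWritingSortedKeys, available from iOS 11 and macOS 10.13. The sketch below uses it where available; the function name is hypothetical, and the caveat about undocumented output details (escaping, number formatting) still applies.

```objc
#import <Foundation/Foundation.h>

/// Hypothetical helper: serializes `jsonObject` with sorted keys on
/// systems that support NSJSONWritingSortedKeys. Returns nil on older
/// systems, where a fallback such as a sorting proxy would be needed.
static NSData *CbxSortedJSONData(id jsonObject, NSError **error) {
    if (@available(iOS 11.0, macOS 10.13, *)) {
        return [NSJSONSerialization dataWithJSONObject:jsonObject
                                               options:NSJSONWritingSortedKeys
                                                 error:error];
    }
    return nil;
}
```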

You are left with looking for a third-party library that provides its source code, so that you have full control over the generated JSON variant and can make it suitable for hashing. I strongly discourage trying to implement your own parser/serializer, since it is not as easy as it looks at first glance.

To solve your problem ("canonicalize JSON"), you don't even need a JSON parser/serializer that generates a Foundation representation: any form of representation would suffice (e.g. C++ containers or any custom container), as long as it generates a canonical JSON (from any valid JSON) that fulfills your requirements.

I'm pretty sure there are a few third-party libraries that would suit your problem. I've implemented a JSON parser/serializer myself, with an Objective-C API based on a C++ implementation. It can very likely solve your problem, since it has many options to control the output (JPJSONWriter Options: JPJsonWriterSortKeys, JPJsonWriterEscapeSolidus). However, the library isn't that easy to adopt, since its source code is quite heavy (an Objective-C API, advanced C++ templates, and optimization for performance and a low memory footprint add up to a lot of source code).

If it helps: JPJson (my attempt)

JPJson separates the concept of parsing from the associated "semantic actions". A "semantic action" is, for example, a "Foundation Representation Generator". That is, you could possibly implement a "HashGenerator" class that creates a hash directly from the received input without building up a representation.
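
The following is not JPJson's API. It is only a rough sketch of the same idea applied to Foundation objects that have already been parsed: feed the object tree straight into a digest, visiting dictionary keys in sorted order, so no JSON text is produced at all. All names are hypothetical. It assumes NSString keys, uses host byte order for lengths, and does not distinguish JSON true/false from the numbers 1/0.

```objc
#import <Foundation/Foundation.h>
#import <CommonCrypto/CommonDigest.h>

/// Feeds a one-byte type tag, a 64-bit length, and the raw bytes into the digest.
static void CbxHashBytes(CC_SHA256_CTX *ctx, char tag, NSData *bytes) {
    uint64_t length = bytes.length;  // host byte order (an assumption)
    CC_SHA256_Update(ctx, &tag, 1);
    CC_SHA256_Update(ctx, &length, (CC_LONG)sizeof(length));
    CC_SHA256_Update(ctx, bytes.bytes, (CC_LONG)bytes.length);
}

static void CbxHashString(CC_SHA256_CTX *ctx, char tag, NSString *string) {
    CbxHashBytes(ctx, tag, [string dataUsingEncoding:NSUTF8StringEncoding]);
}

/// Recursively hashes a tree of NSDictionary, NSArray, NSString, NSNumber and NSNull.
static void CbxHashJSONObject(CC_SHA256_CTX *ctx, id object) {
    if ([object isKindOfClass:[NSDictionary class]]) {
        NSDictionary *dict = object;
        NSArray *keys = [[dict allKeys] sortedArrayUsingSelector:@selector(compare:)];
        CbxHashString(ctx, 'd', [@(keys.count) stringValue]);
        for (NSString *key in keys) {  // sorted, so insertion order is irrelevant
            CbxHashString(ctx, 'k', key);
            CbxHashJSONObject(ctx, dict[key]);
        }
    } else if ([object isKindOfClass:[NSArray class]]) {
        NSArray *array = object;
        CbxHashString(ctx, 'a', [@(array.count) stringValue]);
        for (id element in array) {    // array order is significant and preserved
            CbxHashJSONObject(ctx, element);
        }
    } else if ([object isKindOfClass:[NSString class]]) {
        CbxHashString(ctx, 's', object);
    } else if ([object isKindOfClass:[NSNumber class]]) {
        CbxHashString(ctx, 'n', [object stringValue]);
    } else {
        CbxHashBytes(ctx, 'z', [NSData data]);  // NSNull (or anything unexpected)
    }
}

/// Returns the 32-byte SHA-256 digest of the object tree.
static NSData *CbxDigestOfJSONObject(id object) {
    CC_SHA256_CTX ctx;
    CC_SHA256_Init(&ctx);
    CbxHashJSONObject(&ctx, object);
    unsigned char digest[CC_SHA256_DIGEST_LENGTH];
    CC_SHA256_Final(digest, &ctx);
    return [NSData dataWithBytes:digest length:sizeof(digest)];
}
```

Because the digest depends only on this code and not on any serializer's output format, it avoids the stability concern raised above, at the price of having to keep this hashing scheme itself unchanged for the lifetime of the database.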

Another possibility is Andrii Mamchur's JSON library, jsonlite, where the JsonLiteSerializer method serializeDictionary: can easily be modified to sort the keys before generating the output.

There are a couple more libraries as well.

Licensed under: CC-BY-SA with attribution