How a Merkle tree can work with Graph data, or a better data structure

https://softwareengineering.stackexchange.com/questions/387320

21-02-2021
|

Pergunta

I am looking at blockchain and trying to see how Merkle trees (or perhaps Merkle DAGs) can fit to a graph data structure with circular references.

For example, say I have this data model:

var profile = { setting1: 123, email: 'foo@bar.com' }
var user2 = {}
var user3 = {}
var msg1 = { body: 'foo' }
var msg2 = { body: 'bar' }
var group1 = { name: 'Hello' }
var group2 = { name: 'World' }
var user1 = { messages: [ msg1, msg2, ... ] }

profile.user = user
user.profile = profile

msg1.sender = user1
msg1.recipient = user2
msg2.sender = user1
msg2.recipient = user3

group1.members = [ user1, user3 ]
group2.members = [ user1, user2, user3 ]

user1.groups = [ group1, group2 ]
user2.groups = [ group2 ]
user3.groups = [ group1, group2 ]

This is highly cyclic, but it only has really 2 or 3 levels deep of cyclicness. In reality there could be extremely large and complicated cycles, but I don't want to overcomplicate it.

With the Merkle tree, you essentially have a bunch of blobs of data which you make the leafs of a tree, and compute hashes for each pair all the way up to the top part of the tree. But if one of your blobs changes, then you have to recalculate the whole path up to the top again. If you insert a node, you might have to recalculate even more. And this is just for basic blobs of data.

But if you have graph data like I've outlined above, I'm not sure what to do, or if there is a better data structure. So for example, if you change profile.email, then you would assume "profile has changed". But then I am unsure if we also say "user" has changed, because it is once removed. And likewise, perhaps even "msg1" has changed, or "group" has changed, since user1 is related to those through a few links. So basically, the idea of what is a "single entity" seems to break down in a graph, and I'm not sure what to do. Wondering if

Merkle trees just aren't good for this type of data structure.
If there is a way to make Merkle trees work here.
Or, if there is a better one or two data structures for this sort of data model that can accomplish what the Merkle tree does best (mainly validating that the data hasn't changed).

Solução

You should define more closely what “changed” exactly means in your problem domain. Right now you have all information in a cyclic graph. When any node changes, you claim that the entire graph changes. This may or may not be reasonable.

Instead, you might have a model where a user is not considered to have changed when other users join or leave a common group, or not consider a group changed when the value (but not the identity) of a student changes.

Such a separation might be implemented by breaking direct references to users and groups and giving each of these entities an unique ID instead. That is, we can separate the current value of one of these things from their identity. Your data can then be represented without cycles in a nice tree that can be used as a Merkle tree to ensure integrity:

{
  profile: { setting1: 123, email: "foo@example.com", user: "user/1" },
  user: {
    "user/1": { groups: ["group/1", "group/2" ], messages: ["msg/1", "msg/2"] },
    "user/2": { groups: ["group/2"] },
    "user/3": { groups: ["group/1", "group/2"] },
  },
  group: {
    "group/1": { name: "Hello", members: ["user/1", "user/3"] },
    "group/2": { name: "World", members: ["user/1", "user/2", "user/3"] },
  },
  msg: {
    "msg/1": { body: "foo", sender: "user/1", recipient: "user/2" },
    "msg/2": { body: "bar", sender: "user/1", recipient: "user/3" },
  },
}

This kind of modelling using the identity of other elements instead of their value may or may not be correct. It ensures that the relationships between elements are OK, but fails to ensure other properties. This kind of explicit link can also cause dead links or orphaned entities, which can only be detected by rebuilding and checking the original graph. This particular graph also has the difficulty that there are many bidirectional relationships. In a serialized form that can be used as a data structure in a Merkle tree these could in principle be elided because the reverse of a relationship is implied. For example, it might be sufficient if a message refers to a user and if groups refer to their members. In a Merkle tree this can be attractive because they limit which hashes have to be recalculated, but of course this also limits which changes are represented as a change in the top level hash.

Another alternative is to use a more ledger-ish approach and to hash not the resulting graph, but a list of operations leading up to that graph state. These steps are equivalent to a state because the state can always be rebuilt from the steps. However this only assigns a hash to the state of the entire graph, not to that of (unchanged) subgraphs.

So this really depends on what you want to use Merkle trees for, what a change represents on your problem domain, and what kinds of changes you want to be able to detect.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a softwareengineering.stackexchange