Distributed predicate computation on event stream

https://cs.stackexchange.com/questions/128965

29-09-2020
|

Question

My question is actually a request for papers, articles, texts or books on the problem that I'm trying to solve on my work.

I'm working on a program that computes a predicate value (true or false) for a given object in a distributed system in which there is a stream of events that can change the object's attributes and, consequentially, the predicate value. Whenever the predicate value changes, the program must send a notification about this change.

For example, consider that there is an object A which has an attribute called name and consider that there is a predicate P which is true when the object's name is equal to Jhon. Each event in the stream has a timestamp and a value for the attribute name. So consider the following sequence of events:

e1 = { name: Jhon, timestamp: 1 }
e2 = { name: Jhon, timestamp: 2 }
e3 = { name: Peter, timestamp: 3 }
e4 = { name: Doug, timestamp: 4 }
e5 = { name: Jhon, timestamp: 5 }

In this problem the events have a total order relation: If you have two events you always can say which one is the oldest of them.

Now, the events don't necessarily show up in the stream in the correct order according to its timestamp. Each event is unique to its timestamp, so there are no two or more events with the same timestamp for the same object. Also, the timestamps don't necessarily form a sequence that always increase by one: if we see e1 with timestamp 1 and e3 with timestamp 3, it doesn't imply the existence of e2 with timestamp 2. There is no guarantee that all events will be received or when they will be received. It's part of the problem that we only know about the existence of the events that we see in the stream.

The real scenario is even worse: there are multiple computers parallelly processing this stream of events. However, for simplicity, I'll go further in this example considering only one computer.

If the events arrive and are processed in the order described above, then the notifications sent should be:

P(A) = true when e1 arrives
P(A) = false when e3 arrives
P(A) = true when e5 arrives.

That is the correct sequence of notifications because it respects the timestamp order. Now, imagine that the computer receives the events in the following order:

e1, e5, e2, e4, e3

A naive algorithm which doesn't consider the event's timestamp would send an incorrect sequence of notifications:

P(A) = true when e1 arrives
P(A) = false when e4 arrives

The algorithm that I'm working on considers the timestamps and infers when a notification should have been sent but was not. So when e3 arrives it will notice that the notification P(A) = true for e5 was not sent. This feels a bit like reinventing the wheel, though I'm not aware of any reading about this problem. I would like some references to this problem or to something similar, like some papers dealing with this kind of problem.

The real problem is quite more complex since it involves storing the predicate $\times$ object state in a database that works as a shared state between the computers processing the stream and I'm talking about thousands of events arriving per second so it's not possible to keep all events stored in some database.

Is there any literature about the problem that I have described? if so, could you give me links to it?

I would like to see a paper or a text that explains an algorithm that solves this problem and it would be even better if such paper provides proofs about the algorithm (e.g. correctness).

If such paper doesn't exist (I actually think that is the case), I would accept an answer that describes an algorithm and provides an argument or a proof about its correctness.

For this algorithm to be correct, it should always send the correct sequence of notifications no matter what order that events arrives. And the algorithm shouldn't keep all the received events in memory, because the real problem deals with too many events to save in memory or to store in a DB. It would be reasonable to keep some events in memory, preferably a fixed amount.

Solution

Impossibility result #1: dropped events

The problem cannot be solved in general; there is no way to guarantee that your requirements will be met if some events are dropped (i.e., not received). Consider first this stream:

e1 = { name: Jhon, timestamp: 1 }
e2 = { name: Jhon, timestamp: 4 }

where the algorithm sees both events. Next, consider this stream:

e1' = { name: Jhon, timestamp: 1 }
e2' = { name: Pete, timestamp: 2 }
e3' = { name: Jhon, timestamp: 3 }
e4' = { name: Jhon, timestamp: 4 }

where the algorithm sees only the events e1',e4' (the other events are lost and never received). You might notice that what the algorithm sees in both cases is identical, so its outputs will be identical in both cases. However, the correct answer differs in these two cases, so there is no hope for an algorithm that always produces a correct output. (The correct response in the first case is to produce no notifications; the correct response in the second case is to produce two notifications, one to indicate that the predicate is false after receiving e2', and one to indicate that the predicate is true after receiving e3'.)

It is not clear how to adapt the requirements to deal with this situation. The only plausible solution I can see is to say that the notifications that are produced should depend only on the received events, not on the events that are sent. This is equivalent to specifying that events cannot be dropped.

Impossibility result #2: re-ordered events

You state that you must be able to handle re-ordered events, without storing all events in memory, and with arbitrary re-ordering. However, these requirements are incompatible: that is impossible to achieve. Consider a long sequence of events with timestamps 2,4,6,8,10,12,... At the end of the long sequence of events, if an event with an odd timestamp arrives, the only way to be sure you can handle it correctly is to store the entire history of past events (or past states of the object).

So, you're going to have to relax the requirement about re-ordering as well. Perhaps you're willing to store all events in memory forevermore. (If so, you have a solution.) Perhaps you are willing to impose a bound on re-ordering, e.g., no event will be delayed by more than 10 minutes. (If so, you only have to store history for the past 10 minutes, and everything older can be deleted.) Perhaps something else makes more sense in your particular situation.

But the one thing that is not an option is to impose all of the strong requirements stated in your question, and require an algorithm that is always correct.

I'm not aware of any literature on this and I don't particularly see any reason to expect there to be any. It's a very specific set of requirements, and it looks to me like the resulting task is either trivial or impossible to solve. Those usually aren't the kind of problems that tend to be studied in the literature. Perhaps you might be interested in persistent data structures, but that's just a fancy way of storing the entire history of events, which you said you want to do; and you don't need a fancy data structure to do that in your particular situation.

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange