Question

Problem

A streaming application should perform matching transitively, i.e. if A == B and B == C, then A == C.

Current Implementation

The application accepts domain objects in a streaming fashion and performs matching to filter out duplicates. A scoring engine determines equality by assigning a score to each pair of domain objects; two objects are considered equal if they score above a certain threshold. All equal objects are grouped, viz. a unique match group is created, and only the top-scoring object is considered for further processing.

Since the domain objects can arrive out of sequence and/or in random order, the desired transitive equality is not achieved. Consequently, similar (or almost similar) objects create different match groups, and duplicates are submitted to the downstream application, which creates unacceptable business scenarios.

For example, if the threshold score for equality is ZERO, i.e. any positive score makes two domain objects equal, and given the scores A & B = 50, A & C = -30, and B & C = 70, the following situations arise.

Situation 1
Considering the sequence A, followed by B and C:

  • A creates its own match group, as it is processed first with no other matching objects available.
  • B joins A's match group (positive score of 50 with A).
  • C joins the same match group (positive score of 70 with B).

[Note - The negative score between A & C is not considered, since B & C have a positive score.]

This is the desired outcome: all objects join the same match group.

Situation 2
Considering the sequence B, followed by C and A:

  • B creates its own match group, as it is processed first with no other matching objects available.
  • C joins B's match group (positive score of 70 with B).
  • A joins the same match group (positive score of 50 with B).

[Note - The negative score between A & C is not considered, since A & B have a positive score.]

This is the desired outcome: all objects join the same match group.

Situation 3
Considering the sequence C, followed by A and B:

  • C creates its own match group, as it is processed first with no other matching objects available.
  • A creates its own match group (score of -30 with C).
  • B joins C's match group (positive score of 70 with C).

[Note - Although B has a positive score with both A and C, it joins C's match group because its score with C (70) is higher than its score with A (50).]

This is not desired: two match groups are created, whereas all objects, being similar (or almost similar), should have joined the same match group. The downstream system receives two match groups where it expects only one, causing business problems.
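The order dependence can be reproduced with a small sketch of the greedy grouping described above. The scores and the zero threshold are taken from the question; everything else (names, representation) is invented for illustration:

```python
# Hypothetical sketch of the greedy streaming match-grouping described above.
SCORES = {frozenset("AB"): 50, frozenset("AC"): -30, frozenset("BC"): 70}
THRESHOLD = 0

def score(a, b):
    return SCORES[frozenset((a, b))]

def group_stream(objects):
    groups = []  # each match group is a list of objects
    for obj in objects:
        # pick the best-scoring existing group, judged by its best member
        best, best_score = None, THRESHOLD
        for g in groups:
            s = max(score(obj, member) for member in g)
            if s > best_score:
                best, best_score = g, s
        if best is None:
            groups.append([obj])   # no positive match: new match group
        else:
            best.append(obj)       # join the highest-scoring group
    return groups

print(len(group_stream("ABC")))  # → 1 (situation 1)
print(len(group_stream("BCA")))  # → 1 (situation 2)
print(len(group_stream("CAB")))  # → 2 (situation 3: the undesired split)
```

The same three objects produce one or two match groups depending purely on arrival order, which is exactly the problem.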

Current solution to overcome transitive inequality

In situation 3, the moment an object (B) has a positive score against more than one match group, it joins the match group with the highest score (C's) and triggers re-processing of the remaining match group(s) (A's) and all of their objects. When A is re-processed, it finds a positive match with B and joins the match group of B & C, so all three end up in the same match group.
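For what it's worth, this re-processing is effectively computing the transitive closure of the "positive score" relation. A disjoint-set (union-find) structure computes the same closure directly, by merging groups instead of re-processing their members; a minimal sketch, with all names invented and pairwise scoring assumed available:

```python
# Sketch: merge match groups instead of re-processing their members.
# parent maps each object to its group representative (union-find).
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb  # merge the two match groups

def process(obj, seen, score, threshold=0):
    # union with EVERY positive match, not just the best one:
    # this is what makes the result independent of arrival order.
    for other in seen:
        if score(obj, other) > threshold:
            union(obj, other)
    seen.append(obj)

SCORES = {frozenset("AB"): 50, frozenset("AC"): -30, frozenset("BC"): 70}
seen = []
for obj in "CAB":  # the problematic order from situation 3
    process(obj, seen, lambda a, b: SCORES[frozenset((a, b))])
print(find("A") == find("B") == find("C"))  # → True
```

The trade-off is that each object still needs to be scored against previously seen objects (or candidate representatives), so this addresses the ordering problem, not the throughput problem.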

Limitations / Challenges faced
Although the above solution (more of a workaround, to me) works for smaller sets of data (a couple of hundred, sometimes thousands), it falls apart for larger sets, say when millions of objects are processed within an hour. Re-processing occurs multiple times, reducing the overall throughput of the system. The solution is also not immune to race conditions: if a re-processed match group contains more than one object, similar objects can again end up in different match groups.

Expected outcome

An algorithm which ensures that similar (or near-similar) objects join the same match group even when processed out of order. Addressing the race condition would be an add-on, not a must-have for now.

EDIT 1 -
Similarity Score
Every domain object has a few properties which are considered for equality in a weighted fashion, i.e. property X has the highest weight while Y & Z have lower weights. Thus, if two domain objects have a similar X but not Y & Z, their similarity score will be higher than that of two objects with similar Y & Z but a different X.
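That weighting scheme can be sketched as follows; the weights, property names, and values are invented for illustration:

```python
# Hypothetical weighted similarity score: property X dominates Y and Z.
# A matching property contributes +weight, a mismatch contributes -weight.
WEIGHTS = {"X": 3, "Y": 1, "Z": 1}

def similarity(obj_a, obj_b):
    return sum(w if obj_a[p] == obj_b[p] else -w
               for p, w in WEIGHTS.items())

a = {"X": "acme", "Y": 1, "Z": "red"}
b = {"X": "acme", "Y": 2, "Z": "blue"}   # X matches, Y & Z differ
c = {"X": "apex", "Y": 1, "Z": "red"}    # Y & Z match, X differs
print(similarity(a, b))  # → 1  (matching X outweighs two mismatches)
print(similarity(a, c))  # → -1 (mismatching X outweighs two matches)
```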

P.S.: I have tried isolating the issue and describing it in its most minimal form. If required, feel free to comment and I can provide the relevant details.

Solution

Two Problems

You are attempting to:

  1. generate a set of categories based on data received in the stream.
  2. label each data point with a category.

The issue you are hitting is that each newly labelled data point changes the definition of the set of categories.

This leads to two inevitable conclusions:

  1. If the data is streamed in a different order, a different set of categories is generated.
  2. Some older pieces of data are mislabelled based on the updated categories informed by newer data.

Solutions:

  1. Fix your categories in advance.
  2. Batch process the stream.
  3. Shapes
  4. Not our problem

Fixing Categories

The problem you are facing is that categories change shape as new data is witnessed, or when the data is witnessed in a different order. If you need stable, order-independent groups, that simply cannot be allowed.

You need to pick a similarity function that does not chain (like lightning jumping from point to point) but instead acts like a bucket, so that data points (arriving in any order) always fall together.

This can be done by pre-determining ranges of values along some property that always fall toward the same bucket. The intersection of each of these pre-determined ranges for each of the checked properties is the bucket.

Say 10 <= sick leave days <= 20 && 21 <= years old <= 25 is one bucket. A data point at { sick leave = 10; years old = 25 } then falls in the same bucket as { sick leave = 20; years old = 21 }, but not the same bucket as { sick leave = 21; years old = 21 }.
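That check can be sketched as a pure bucket-key function. The range edges mirror the example above; everything else is invented:

```python
# Bucket key from fixed, pre-determined ranges: the same point always
# lands in the same bucket, regardless of which other points have been
# seen or in what order they arrived.
SICK_LEAVE_EDGES = [10, 21]   # ranges: <10, 10..20, >=21
YEARS_OLD_EDGES = [21, 26]    # ranges: <21, 21..25, >=26

def band(value, edges):
    # index of the fixed range the value falls in
    return sum(value >= e for e in edges)

def bucket(point):
    return (band(point["sick_leave"], SICK_LEAVE_EDGES),
            band(point["years_old"], YEARS_OLD_EDGES))

p1 = {"sick_leave": 10, "years_old": 25}
p2 = {"sick_leave": 20, "years_old": 21}
p3 = {"sick_leave": 21, "years_old": 21}
print(bucket(p1) == bucket(p2))  # → True  (same bucket)
print(bucket(p1) == bucket(p3))  # → False (sick leave crosses the 21 edge)
```

Because the key depends only on the point itself, grouping is trivially transitive and order independent.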

On the plus side, downstream software can guarantee what kinds of labels it will receive and can respond to them without further issues. The downside is that you will need to know in advance what those categories are.

Also, based on your requirements, there is a good chance that the problem is simply pushed downstream. The question then becomes: which categories are similar?


Batch the Stream

Don't try to solve both problems simultaneously.

Batch the stream using a time window, and process each window using two stages: Extract categories, and label data points.

The first stage processes that time window using a data-mining algorithm to produce a correct set of similarity categories. This ensures that nearby points are grouped if similar enough, but any sufficiently wide division separates points into novel categories.

It may be feasible to simply update the pre-existing set of categories given the new point data to maintain some sort of long-term stability in the face of somewhat noisy data.

The second stage uses the generated categories to label each data point in that window. The important distinction is that the categories are not updated while labelling. This way, out-of-order data points are labelled stably one way or the other.
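A sketch of the two-stage window processing, with simple threshold-based connected components standing in for the data-mining step (all names are invented, and the scores from the question are reused for the demo):

```python
# Stage 1: derive categories from a whole window (connected components
# under "score > threshold"). Stage 2: label points against the frozen
# categories, so labels do not depend on order within the window.
def extract_categories(window, score, threshold=0):
    categories = []  # each category is a list of member points
    for point in window:
        matches = [c for c in categories
                   if any(score(point, m) > threshold for m in c)]
        if not matches:
            categories.append([point])
        else:
            merged = [point]
            for c in matches:          # merging ALL matching categories
                merged.extend(c)       # keeps the result order independent
                categories.remove(c)
            categories.append(merged)
    return categories

def label_window(window, categories, score):
    # categories are frozen here: labelling never updates them
    def best(point):
        return max(range(len(categories)),
                   key=lambda i: max(score(point, m) for m in categories[i]))
    return {point: best(point) for point in window}

SCORES = {frozenset("AB"): 50, frozenset("AC"): -30, frozenset("BC"): 70}
def sc(a, b):
    return 100 if a == b else SCORES[frozenset((a, b))]

window = "CAB"  # the problematic order from the question
cats = extract_categories(window, sc)
print(len(cats))  # → 1: the whole window yields a single category
```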

The benefit is that you can communicate the category system and the correctly labelled points downstream, while still maintaining categorical sensitivity to changing circumstances.

The downside is that the sensitivity is granular, and there is a delay of at least the batching window. Also if updating the prior category model, the data-mining algorithm will need to finish within the batch window on average in order to keep up with the stream.


Shapes

If we step back and look at the problem, there is a multi-dimensional space and inside that space exist contiguous volumes which we call categories. Any point within that volume is strictly speaking a member of that category. Our job is to figure out those volumes without having all of the points in advance.

This means that volumes can merge, which means that downstream systems need to be told to treat data points labelled X as also labelled Y (and vice versa).

Instead of maintaining a single point, though, we will need to describe the surface of each volume, which may require many more points to be retained per volume. Over time, however, growth will slow: new points only update the volume if they change the volume's surface, which becomes rarer with each new point witnessed. Some new points will even simplify the description of the volume, replacing one or more previously identified points.

You should be able to optimise comparisons by using collision-detection techniques from video games, such as an axis-aligned bounding box around each volume, n-dimensional trees (such as an R-tree), or even sphere-to-triangle surface intersection checks.
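An axis-aligned bounding box is the cheapest of these: a coarse pre-filter, with real volumes needing finer tests. A sketch, reusing the ranges from the bucket example (all names invented):

```python
# Each category volume keeps an axis-aligned bounding box (AABB):
# per-dimension min/max. Growing the box with a new point and testing
# overlap are both O(dimensions).
def grow(box, point):
    lo, hi = box
    return ([min(l, p) for l, p in zip(lo, point)],
            [max(h, p) for h, p in zip(hi, point)])

def contains(box, point):
    lo, hi = box
    return all(l <= p <= h for l, p, h in zip(lo, point, hi))

def overlaps(a, b):
    # boxes overlap iff they overlap in every dimension;
    # overlapping category volumes are candidates for merging
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))

box = ([10, 21], [20, 25])            # sick-leave x years-old box
print(contains(box, [15, 23]))        # → True
box2 = grow(box, [22, 23])            # a new point stretches the box
print(overlaps(box2, ([21, 20], [30, 22])))  # → True
```

An R-tree over these boxes would then let each incoming point query only the nearby volumes instead of scoring against everything.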


Not our problem

The short of it is that you are dealing with noisy data, and while the system mitigates some (or most) of that noise, the output stream is still noisy and has to be handled accordingly.

This means accepting that the categories are in flux, and that older data may be labelled incorrectly with respect to the "current" set of categories. Downstream systems that cannot tolerate this are, in fact, broken by design.

You can (and should) of course provide appropriate support to the downstream systems. This might mean:

  • appropriate information about what the categories are,
  • how the categories have changed, such as joins and splits,
  • perhaps how fuzzy each category is,
  • or maybe annotating the data with how similar it is to several categories.