Question

I am trying to solve problem of preventing oversell of limited resources.

Consider resources (people) who are described by set of properties where each property belongs to different category (example properties from four categories: male, age 25-30, 2 children, interested in games).

Buyers want to allocate access to resources. Buyers can specify subset of categories and one property from each category (example: allocate 1000 males, age 25-30 or allocate 100 females, age 25-30, interested in music).

In my real life example I have 6m+ possible set of properties (profiles) where for each set of properties I know how many profiles exists.

My initial approach was to build a graph like one below:

alt text

and then traverse using edge weights, for instance validating if demand for 100 females, age2 can be satisfied:

  1. check if size(female, age2) < 100
  2. for each parent:
    1. check if size(parent) < 100 and go to 2.
  3. for each child:
    1. check if size(child) < 100 * weight(edge(node, child)) go to 1.

(above algorithm is simplified as does not prevent visiting same node multiple times)

It all works fine when graph is small, however when number of nodes and edges (dependencies) between nodes (profile universe groups) grows it does not scale very well.

Consider example:

  • large graph, 6m nodes, 20m+ edges
  • buyer wants to allocate 1000 males (and there are only males and females in gender category)

algorithm would start with top-level 'male' node which probably has 10m+ outgoing edges and 10m+ checks would be required (and probably each of those 10m outgoing edges has incoming edges which need to be checked as well).

I was trying to find different approach but failed. I was trying to google out existing solutions but seems like I am unable to even name problem properly. Any reference to what is this problem similar to would be good for me as a starting point.

Thanks for comments/help.

Two more graphs to present exponential growth of the graph: 3 categories alt text

4 categories alt text

Update

Regarding size, assuming 8 categories of properties where each category has: 2, 6, 6, 6, 6, 8, 1140, 150 values respectively then estimated number of profiles: 2*6^4*8*1140*150 ~= 3.5 * 10^9. Number of nodes in graph: at least 7 * 10^9, number of edges in graph: at least 140 * 10^9.

Update #2

Formula for number of nodes is:

$\sum_{i<n}\prod_{k<i \atop j_1, j_2, ..., j_k < n} s_{j_{1}} ... s_{j_{n}}$

where $n$ is number of categories and $s_x$ is size of category $x$.

So in my example there would be 11'169'108'657 nodes.

Update #3

As per @Raphael advice - I have reduced number of nodes and now formula is:

$\sum_{i<n-M}\prod_{k<i \atop j_1, j_2, ..., j_k < n} s_{j_{1}} ... s_{j_{n}}$

where $M<n$ and assumed that distribution of resources across smallest slices of universe is equal. At the same time removed lot of edges from graph.

Example of sub-graph size reduction: Example of sub-graph size reduction

Was it helpful?

Solution

So the data structure holding pointers for every possible combination of classifiers is huge. Sure, but why build it at all? Don't overengineer this!

Just store the profiles in a database and do one (linear time) filtering sweep for each query, i.e. select/count on demand. For a few millions of records, that should require no further preprocessing.

If the number of requests is large and/or you need really small response times, you can think about caching, or creating equivalence classes along some popular classifiers, or along classifiers with few large classes. Then, the linear sweep has to be done only on small lists.

For example, you can divide your database along gender and age (assuming these are included in most customer queries)

$\qquad \{m, w, o\} \times \{0..5, 10..15, \dots, 95..100\}$

where the values obviously depend on your data. Then, each query will require only few of these small lists, and can even parallelise if you store the individual chunks separately.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top