Statistic estimation of total nodes in a tree where edge traversal is expensive

https://stackoverflow.com/questions/23637410

21-07-2023
|

سؤال

I have a directed tree, and I would like to get its size. I have no information about its depth or distribution of nodes. There are two major obstacles:

1) The tree is very large (~billions of nodes)

2) Edge traversal is expensive.

Are there statistical methods I can use to get an estimate of its size (number of nodes) quickly and with bounded error? Unfortunately, googling just results in exact-count algorithms which would perform poorly given these restrictions.

Bonus

If I relax the constraint from tree to DAG (directed acyclic graph), can I get both its size and number of unique paths? E.g. for this DAG (every edge points down)

enter image description here

There are 19 nodes (size) and 23 paths (4 extra ones because the red edge gives 1 more path for its destination node and 3 more paths to the children of its destination node)

Things I've tried

For the tree case, I am thinking of something along the lines of:

amounts = []
def estimateHelper(node):
    amounts[node.depth].push(len(node.children))
    for each child in small random sample of node.children:
        estimateHelper(child)
def estimate(root):
    estimateHelper(root)
    reach = 0
    for (j = len(amounts) - 1; j >= 0; --j):
        avgChildrenPerNodeAtThisLevel = avg(amounts[j])
        reach = avgChildrenPerNodeAtThisLevel + avgChildrenPerNodeAtThisLevel * reach
    return reach

It essentially computes the "reach" at the deepest nodes of the tree, then propagates that back into the level above to find the reach at that level. It does that until it eventually finds the "reach" of the root of the tree. I'm not sure if I'm making any assumptions about a uniform distribution of nodes or not in the above algorithm. To reiterate, I don't know what kind of distribution a given tree will have.

Assuming it works, this also solves the "paths" for the DAG. Once you have all the "paths", I am thinking of using the inverse of the birthday paradox to figure out how many unique nodes there are. Birthday paradox answers "how many days (paths) do we need to choose until we hit a duplicate day with some probability given 365 unique days of the year". So we keep trying random paths (days) until we hit a duplicate node, we repeat that several times to find a probability for that event, and then we plug it into the birthday paradox to find the number of unique nodes (unique days in the year). Note though, that the birthday paradox also makes an assumption of uniformity.

None of this is very rigorous. What would be ideal is something that gives me an estimate with an error bound, and a paper that describes the algorithm with sufficient rigor. Any pointers in the right direction are very much appreciated.

المحلول

Knuth wrote a paper on this in the context of backtracking tree search: Estimating the efficiency of backtrack programs - e.g. http://www.ams.org/journals/mcom/1975-29-129/S0025-5718-1975-0373371-6/S0025-5718-1975-0373371-6.pdf. Search terms Knuth tree estimate also find papers referencing this such as ftp://ftp.cs.indiana.edu/pub/techreports/TR60.pdf and http://www.cs.ubc.ca/~hutter/EARG.shtml/earg/stack/WS06-11-005.pdf.

I have no idea how this would translate to DAGs in general, but - again in the context of tree search - you may be able to redefine DAGs as trees with the same number of vertices by adding constraints that disallow edges entering each vertex after he first. E.g. when choosing a subset of numbers one by one, require that they be listed in ascending order - then (1,3,8) has only the single ancestor (1,3).

Come to think of it, you could also define a tree in which each DAG path to an edge defined a different edge in the tree. Counting the number of edges in this might tell you something about the number of DAG paths.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow