Pruning / Retrieving Nodes from a Tree Based on Parent / Child / Sibling Relationships

StackOverflow https://stackoverflow.com/questions/23274474

  •  09-07-2023
  •  | 
  •  

Question

I have a tree structure with uniquely identified nodes, a root, and a bunch of branches. There are no restrictions on how many branches any node can sprout or how deep they can go.

What I'd like to do is, given the id of a node, efficiently return the ids of all the parents of that node, as well as all the siblings of each of those parents. What I'm hoping is that the answer to this specific question will use packages/functions designed for this type of manipulation so that I may learn about them and use them for future similar problems rather than have to write custom algorithms every time.

Here is the tree in child-parent form:

structure(list(child = 1:20, parent = c(NA, 1L, 2L, 3L, 1L, 5L, 6L, 5L, 8L, 1L, 1L, 11L, 12L, 11L, 14L, 14L, 1L, 17L, 18L, 19L)), .Names = c("child", "parent"), class = "data.frame", row.names = c(NA, -20L))

And here is a visual representation:

enter image description here

For the specific problem my target id is 13 (in red), and would like to get the ids of the parent nodes (in orange), as well those of the siblings of the parents (in yellow). I can definitely come up with an algorithm to do this (getting all parents is easy, and then siblings means getting each parent's parent's children), but I am really hoping there already exists a framework to do all this efficiently.

I looked briefly into rpart, tree, and ape but as far as I can tell from my brief review they are targeted at statistical classification of trees rather than manipulation / retrieval of nodes, and it seemed to me based on a somewhat perfunctory review that they don't offer the functionality I'm after (entirely likely I'm just missing it / misinterpreting it).

The other option seems to be the XML or xmlTree that would allow me to use some of the DOM manipulation tools to achieve my objectives, but it seems a bit heavy handed and perhaps not optimal.

Was it helpful?

Solution

I've found myself in similar situations numerous times, and I've usually just rolled my own solution. I agree that the XML packages are overkill for focused problems such as this. The iGraph package is nice, but beware. It is an interface to a C library and is zero-indexed (unlike R which is 1-indexed). I've used it successfully, but there always seems to be a painful debugging process in which I keep finding "unshifted" indexes. (e.g. foo[index] has to change to foo[index-1].)

Again, I'd probable just roll my own. Something like:

foo <- structure(list(child = 1:20, parent = c(NA, 1L, 2L, 3L, 1L, 5L, 6L, 5L, 8L, 1L, 1L, 11L, 12L, 11L, 14L, 14L, 1L, 17L, 18L, 19L)), .Names = c("child", "parent"), class = "data.frame", row.names = c(NA, -20L))

get_ancestry <- function(node, data=foo, ancestry=c()) {
    parent <- data$parent[data$child %in% node]
    if (is.na(parent)) {
        return(ancestry)
    } else {
        get_ancestry(parent, data, c(ancestry, parent))
    }
}

get_children <- function(nodes, data=foo) {
    unlist(lapply(nodes, function(x) {data$child[data$parent %in% x]}))
}

orange <- get_ancestry(13)
yellow <- get_children(orange)

OTHER TIPS

Thanks for the pointers, igraph is definitely what I'm looking for, although I'm a bit out of my depth there. With the help of the python igraph tutorial I was able to get:

g <- graph(t(as.matrix(df[-1, 2:1])))            # Make graph
plot(g)                                          # yes, this matches mine
parents <- unlist(                               # orange nodes  
  neighborhood(
    g, 
    order=ecount(g),                             # how many steps to go up, this should guarantee we get to root  
    nodes=13,                                    # starting point    
    mode="in"                                    # head in towards root
  )
)[-1L]
setdiff(                                         # yellow nodes  
  unique(                                        
    unlist(
      lapply(parents, neighborhood, graph=g, order=1, mode="out")
  ) ),
  c(parents, 13)
)

Which produces the desired answers. Unfortunately due to the return format of the neighborhood function a bit of post-processing is required at each step.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top