how to programmatically back-out, deduce, decompile, reverse-engineer the algorithm used to construct a variable in a data set

StackOverflow https://stackoverflow.com/questions/18195119

Question

i'm looking for some algorithm or program or function that will deduce how a variable was created, so long as i supply the other variables. i think computer programmers would call this "decompiling" and architects would call it "reverse-engineering" but i guess i don't know what statisticians would call it..or if there are accepted methods to do it.

let's say i've got a categorical column in a data.frame called newvar and i don't know exactly how it was constructed. but i do know what variables were used to create it..or at the very least i can provide an exhaustive set of variables that were used to create it -- even if not all of them were used.

# start with an example data set
x <- mtcars

# # # # # # # # # # # # # # # # # # # # # # # #
# pretend this block of code is a black box
x <-
    transform(
        x ,
        newvar =
            ifelse( mpg > 24 , 1 ,
            ifelse( cyl == 6 , 9 ,
            ifelse( hp > 120 , 4 ,
            ifelse( mpg > 22 , 7 , 2 ) ) ) )
    )
# end of unknown block of code
# # # # # # # # # # # # # # # # # # # # # # # #

# now knowing that `mtcars` has only 11 columns to choose from
names(x)

# how were these 11 columns used to construct `newvar`?
table( x$newvar )

# here's a start..
y <- data.frame( ftable( x[ , c( 'mpg' , 'cyl' , 'hp' , 'newvar' ) ] ) )
# ..combinations with any records
y[y[,5]!=0,]
# but that's not enough to back-out the construction

so i think you could back out the construction of newvar with a linear regression or with decision trees, but that will still require a bit of thinking and piecing together the coefficients to figure out exactly what happened inside the black box.

is there any algorithm available that guesses at the black box, so-to-speak? thanks!!

Was it helpful?

Solution

In general, no. And even applying a lot of knowledge about what could be going on, it is still (probably) no. Let me show you an example from your example. Adding knowledge of the "black box" that the output is discrete values and that they are derived based on thresholds of other values, a classification tree should be able to recover the criteria. So:

library("party")
tmp <- ctree(factor(newvar) ~ ., data=x, 
  controls=ctree_control(mincriterion=0, minsplit=2, minbucket=1))

I've set the control values to completely unreasonable values to force the algorithm to drive each bucket into containing only a single value. And even then it is not what you started with:

enter image description here

So with a simple example and adding more knowledge about the transformation, it can not be done, there is not really hope to be able to do it in the general case.

OTHER TIPS

Decompilation is something fundamentally different. It inspects the actual source code, and produces a program that produces the same result. It mostly works in languages that compile to an intermediate language that is essentially equivalent to the source code (minus comments, variable names and such). Since assembler can essentially 1:1 represent machine code, you can decompile fully compiled programs into assembler (but that does not imply you really understand what they do, or that they were written in assembler).

Now if you are trying to do this to data transformations, you can only hope to get this with a certain probability. It's like trying to decompile a program by looking at the output.

I believe the technique that will be most interesting to you is decision trees because they produce a human-interpretable result that can easily be turned into a number of primitive expressions (if this, then that).

Chances are, that the tree will still turn up different insights than what was used to construct the data.

Always remember that the constructs could be arbitrarily complex, including random.

You cannot expect a perfect result. Decision trees may be "as good as it gets", if you want readable results and want to avoid overfitting.

Recall overfitting: the expression could always be

if record == record1 then
  result = result1
else if record == record2 then
  result = result2
...
else if record == recordn then
  result = resultn
else
  result = random or fail badly

This will perfectly reconstruct the transformation. But it's entirely useless.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top