How do you find the definition of a function when all you have is a huge set of input/ouput pairs?

Question 1

So let's simplify. You want a function such that

f 1 = 10
f 2 = 3
f 3 = 8

There exists a formula for immediately finding a polynomial function which meets these demands. In particular

f x = 6 * x * x - 25 * x + 29

works. It turns out to be the case that if you have the graph of any function

{ (x_1, y_1), (x_2, y_2), ..., (x_i, y_i) }

you can immediately build a polynomial which exactly matches those inputs and outputs.

So, given that polynomials like this exist you're never going to solve your problem (finding a particular solution like floor(tan(x*x-3))) without enforcing more constraints. In particular, if you don't somehow outlaw or penalize polynomials then I'm always going to deliver them to you.

In general, what you'd like to do is (a) define a search space and (b) define a metric of fitness, also known as a loss function. If your search space is finite then you have yourself a solution immediately: rank every element of your search space according to your loss function and select randomly from the set of solutions which tie for best.

What it sounds like you're asking for is much harder though—if you're looking through the space of all possible programs then that space is unbelievably large. Searching it exhaustively is impossible unless we constrain ourselves heavily or accept approximation. Secondly, we must have very good understanding of your loss function and how it interacts with the search space as we'll want to make intelligent guesses to move forward through this vast space.

You mention genetic algorithms—they're often lauded for this kind of work and indeed they can be a method of driving search through a large space with an uncertain loss function, but they also fail as often as they succeed. Someone who is genuinely skilled at using genetic algorithms to solve problems will spend all of their time crafting the search space and the loss function to direct the algorithm toward meaningful answers.

Now this can be done for general programs if you're careful. In fact, this was the subject of last year's ICFP programming contest. In particular, search on this page for "Rules of the ICFP Contest 2013" to see the set up.

Question 2

I think feed forward neural network (FFNN) and genetic programming (GP) are good techniques for complicated function simulation. if you need function as polynomials use the GP otherwise FFNN is very simple and the matlab have a library for it.

Question 3

I think the "interpolation" don't get what I am asking. Maybe I was not clear enough, but fortunately I've managed to get a semi-satisfactory answer to my question using a brute-force search algorithm myself. Using only a list of input/output pairs, as presented in the question, I was able to recover the original function. The comments on this snippet should explain it:

import Control.Monad.Omega

{- First we define a simple evaluator for mathematical expressions -}
data A =    Add A A | Mul A A | Div A A | Sub A A | Pow A A |
            Sqrt A | Tan A | Sin A | Cos A |
            Num Float | X deriving (Show)
eval :: A -> Float -> Float
eval (Add a b) x = eval a x + eval b x
eval (Mul a b) x = eval a x * eval b x
eval (Div a b) x = eval a x / eval b x
eval (Sub a b) x = eval a x - eval b x
eval (Pow a b) x = eval a x ** eval b x
eval (Sqrt a) x = sqrt (eval a x)
eval (Tan a) x = tan (eval a x)
eval (Sin a) x = sin (eval a x)
eval (Cos a) x = cos (eval a x)
eval (Num a) x = a
eval X x = x

{- Now we enumerate all possible terms of that grammar -}
allTerms = do
  which <- each [1..15]
  if which == 1 then return X
  else if which == 2 then do { x <- allTerms; y <- allTerms; return (Add x y) }
  else if which == 3 then do { x <- allTerms; y <- allTerms; return (Mul x y) }
  else if which == 4 then do { x <- allTerms; y <- allTerms; return (Div x y) }
  else if which == 5 then do { x <- allTerms; y <- allTerms; return (Sub x y) }
  else if which == 6 then do { x <- allTerms; y <- allTerms; return (Pow x y) }
  else if which == 7 then do { x <- allTerms; y <- allTerms; return (Sqrt x) }
  else if which == 8 then do { x <- allTerms; y <- allTerms; return (Tan x) }
  else if which == 9 then do { x <- allTerms; y <- allTerms; return (Sin x) }
  else if which == 10 then do { x <- allTerms; y <- allTerms; return (Cos x) }
  else return (Num (which-10))

{- Then we create 20 input/output pairs of a random function -}
fun x = x+tan(x*x)
maps = let n=20 in zip [1..n] (map fun [1..n])

{- This tests a function in our language against a map of in/out pairs -}
check maps f = all test maps where
    test (a,b) = (eval f a) == b

{-  Naw lets see if a brute-force search can recover the original program
    from the list of input/output pairs alone! -}
main = print $ take 1 $ filter (check maps) (runOmega allTerms)

{-  Ouput: [Add X (Tan (Mul X X))]
    Yay! As much as there are infinite possible solutions,
    the first solution is actually our initial program.
-}

Question 4

One possible definition goes like this:

f 0 = 0
f 1 = 2
f 2 = 1
f 3 = -1
f 4 = 0
f 5 = 0
f 6 = -76
f 7 = -3
f 8 = 3
f 9 = -1
f 10 = -1
f 11 = -6
f 12 = -1
f 13 = -1
f 14 = 4
f 15 = -2
f 16 = -10
f 17 = 0
f 18 = 0
f 19 = -1
f 20 = 2
f 21 = 3
f 22 = 0
f 23 = 4
f 24 = 2
f 25 = -1
f 26 = 0
f 27 = 0
f 28 = -4
f 29 = -2
f 30 = -14