Question

I'm working on an OpenGL implementation of the Oculus Rift distortion shader. The shader works by taking the input texture coordinate (of a texture containing a previously rendered scene), transforming it using distortion coefficients, and then using the transformed coordinate to determine the fragment color.

I'd hoped to improve performance by pre-computing the distortion and storing it in a second texture, but the result is actually slower than the direct computation.

The direct calculation version looks basically like this:

float distortionFactor(vec2 point) {
    // lengthSquared() is a helper equivalent to dot(point, point)
    float rSq = lengthSquared(point);
    // Polynomial in the squared radius, with coefficients from the uniform K
    float factor = K[0] + K[1] * rSq + K[2] * rSq * rSq + K[3] * rSq * rSq * rSq;
    return factor;
}

void main()
{
    // Scale the lens-centered coordinate by the distortion factor,
    // then map it back into texture space
    vec2 distorted = vRiftTexCoord * distortionFactor(vRiftTexCoord);
    vec2 screenCentered = lensToScreen(distorted);
    vec2 texCoord = screenToTexture(screenCentered);
    // Paint anything that falls outside the source texture dark red
    vec2 clamped = clamp(texCoord, ZERO, ONE);
    if (!all(equal(texCoord, clamped))) {
        vFragColor = vec4(0.5, 0.0, 0.0, 1.0);
        return;
    }
    vFragColor = texture(Scene, texCoord);
}

where K is a vec4 that's passed in as a uniform.
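For reference, a minimal sketch of the declarations the shader above assumes (the names are the ones used in the code; the qualifiers and helper bodies are my guesses, and the two mapping helpers are only prototyped here):

in vec2 vRiftTexCoord;        // lens-centered input coordinate
out vec4 vFragColor;

uniform sampler2D Scene;      // the previously rendered scene
uniform vec4 K;               // distortion polynomial coefficients

const vec2 ZERO = vec2(0.0);
const vec2 ONE = vec2(1.0);

// Squared length, avoiding the sqrt inside length()
float lengthSquared(vec2 v) {
    return dot(v, v);
}

// Aspect-ratio and lens-offset math, defined elsewhere
vec2 lensToScreen(vec2 v);
vec2 screenToTexture(vec2 v);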

On the other hand, the displacement map lookup looks like this:

void main() {
    vec2 texCoord = vTexCoord;
    // The same offset map serves both eyes; mirror the lookup for one of them
    if (Mirror) {
        texCoord.x = 1.0 - texCoord.x;
    }
    // Dependent lookup: the scene coordinate comes from a texture fetch
    texCoord = texture(OffsetMap, texCoord).rg;
    vec2 clamped = clamp(texCoord, ZERO, ONE);
    if (!all(equal(texCoord, clamped))) {
        discard;
    }
    if (Mirror) {
        texCoord.x = 1.0 - texCoord.x;
    }
    FragColor = texture(Scene, texCoord);
}

There are a couple of other operations for correcting the aspect ratio and accounting for the lens offset, but they're pretty simple. Is it really reasonable to expect the direct computation to outperform a simple texture lookup?
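For completeness, the offset map can be baked in a one-time pass along these lines (a sketch, not the exact code: it reuses the helpers from the direct version and assumes an RG float render target):

void main() {
    // Run the direct computation once per texel and store the resulting
    // scene coordinate in the red/green channels of the offset map
    vec2 distorted = vRiftTexCoord * distortionFactor(vRiftTexCoord);
    vec2 texCoord = screenToTexture(lensToScreen(distorted));
    vFragColor = vec4(texCoord, 0.0, 1.0);
}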


Solution

GDDR memory has pretty high latency, and modern GPU architectures have plenty of number-crunching capability. It used to be the other way around: GPUs were so ill-equipped to do calculations that normalizing a vector was cheaper to do by fetching from a cube map.

Throw in the fact that you are not doing a regular texture lookup here, but rather a dependent lookup, and it comes as no surprise. Since the location you are fetching from depends on the result of another fetch, it is impossible to pre-fetch or efficiently cache the memory needed by your shader (an effective latency-hiding strategy). That is no "simple texture lookup."
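The contrast, in shader terms (illustrative only; the variable names are mine):

// Regular lookup: the coordinate is an interpolated input known before
// the shader runs, so the fetch can be issued early and its latency hidden
vec4 direct = texture(Scene, vTexCoord);

// Dependent lookup: the coordinate is itself the result of a fetch, so the
// second fetch cannot even be issued until the first one returns
vec2 fetched = texture(OffsetMap, vTexCoord).rg;
vec4 dependent = texture(Scene, fetched);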

What is more, in addition to doing a dependent texture lookup, your second shader also uses the discard keyword. On a lot of hardware this effectively eliminates the possibility of early depth testing.
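If discard is only there to mask out-of-range coordinates, writing a border color instead (as your direct version already does with its red debug color) sidesteps that; a sketch, assuming a solid border is acceptable:

if (!all(equal(texCoord, clamped))) {
    FragColor = vec4(0.0, 0.0, 0.0, 1.0);  // solid border instead of discard
    return;
}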

Honestly, I do not see why you would want to "optimize" the distortionFactor(...) function into a lookup. It works on the squared length, so you are not even paying for a sqrt, just a handful of multiplications and additions.

OTHER TIPS

Andon M. Coleman already explained what's going on. Essentially, memory bandwidth and, more importantly, memory latency are the main bottlenecks of modern GPUs; hence, on everything built from about 2007 to today, simple calculations are often way faster than a texture lookup.

In fact, memory access patterns have such a large impact on efficiency that slightly rearranging the access pattern and ensuring proper alignment can easily give performance boosts of a factor of 1000 (been there, done that; although that was CUDA programming). A dependent lookup is not necessarily a performance killer, though: if the dependent texture coordinate is monotonic with respect to the controlling texture coordinate, it is usually not so bad.


That being said, have you never heard of Horner's method? You can rewrite

float factor = K[0] + K[1] * rSq + K[2] * rSq * rSq + K[3] * rSq * rSq * rSq;

trivially to

float factor = K[0] + rSq * (K[1] + rSq * (K[2] + rSq * K[3]));

This saves you a couple of operations: three multiplications instead of six, and the nested form maps straight onto fused multiply-add instructions.
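On GLSL 4.00 and later you can even spell the fused multiply-adds out explicitly, although most compilers will generate them from the nested form anyway:

float factor = fma(rSq, fma(rSq, fma(rSq, K[3], K[2]), K[1]), K[0]);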

The GPU is massively parallel and can compute thousands of results per clock cycle, while memory reads are sequential. If the multiplications take, for example, 5 clocks, the GPU can calculate 1000 results in 5 clock cycles. If the data instead has to be read sequentially at, say, 10 values per clock cycle, it would take 100 clock cycles instead of 5 just to acquire the data. The numbers are made up, just to illustrate the point :)
