why polymorphism is so costly in haskell(GHC)?

Question 1

I don't understand how typeclasses work in GHC.

OK, consider this function:

linear :: Num x => x -> x -> x -> x
linear a b x = a*x + b

This takes three numbers as input, and returns a number as output. This function accepts any number type; it is polymorphic. How does GHC implement that? Well, essentially the compiler creates a "class dictionary" which holds all the class methods inside it (in this case, +, -, *, etc.) This dictionary becomes an extra, hidden argument to the function. Something like this:

data NumDict x =
  NumDict
  {
    method_add :: x -> x -> x,
    method_subtract :: x -> x -> x,
    method_multiply :: x -> x -> x,
    ...
  }

linear :: NumDict x -> x -> x -> x -> x
linear dict a b x = a `method_multiply dict` x `method_add dict` b

Whenever you call the function, the compiler automatically inserts the correct dictionary - unless the calling function is also polymorphic, in which case it will have received a dictionary itself, so just pass that along.

In truth, functions that lack polymorphism are typically faster not so much because of a lack of function look-ups, but because knowing the types allows additional optimisations to be done. For example, our polymorphic linear function will work on numbers, vectors, matricies, ratios, complex numbers, anything. Now, if the compiler knows that we want to use it on, say, Double, now all the operations become single machine-code instructions, all the operands can be passed in processor registers, and so on. All of which results in fantastically efficient code. Even if it's complex numbers with Double components, we can make it nice and efficient. If we have no idea what type we'll get, we can't do any of those optimisations... That's where most of the speed difference typically comes from.

For a tiny function like linear, it's highly likely it will be inlined every time it's called, resulting in no polymorphism overhead and a small amount of code duplication - rather like a C++ template. For a larger, more complex polymorphic function, there may be some cost. In general, the compiler decides this, not you - unless you want to start sprinkling pragmas around the place. ;-) Or, if you don't actually use any polymorphism, you can just give everything monomorphic type signatures...

Question 2

With polymorphic code, there is usually a tradeoff between code size and code speed. Either you produce a separate version of the same code for each type that it will operate on, which results in larger code, or you produce a single version that can operate on multiple types, which will be slower.

C++ implementations of templates choose in favor of increasing code speed at the cost of increasing code size. By default, GHC takes the opposite tradeoff. However, it is possible to get GHC to produce separate versions for different types using the SPECIALIZE and INLINABLE pragmas. This will result in polymorphic code that has speed similar to monomorphic code.

Question 3

I want to supplement Dirk's answer by saying that INLINABLE is usually recommended over SPECIALIZE. An INLINABLE annotation on a function guarantees that the module exports the original source code of the function so that it can be specialized at the point of usage. This usually removes the need to provide separate SPECIALIZE pragmas for every single use case.

Unlike INLINE, INLINABLE does not change GHC's optimization heuristics. It just says "Please export the source code".