Why there is a huge performance difference between these two codes in Python and Cython?

Question 1

Python is a language. CPython is an bytecode compiler and an interpreter for Python.

It will take some code:

for i in xrange(value):
    z = i**2
    if(i==1000000):
        print i
    if z < i:
        print "yes"

and give you "bytecode":

load the iterator into the for loop and loop its contents into i
load i, load 2, run binary power, store z
load i, load 1000000, compare
load i, print
load z, load i, compare
load 'yes', print
finish

In full:

  1           0 SETUP_LOOP              70 (to 73)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_NAME                1 (value)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                56 (to 72)
             16 STORE_NAME               2 (i)

  2          19 LOAD_NAME                2 (i)
             22 LOAD_CONST               0 (2)
             25 BINARY_POWER        
             26 STORE_NAME               3 (z)

  3          29 LOAD_NAME                2 (i)
             32 LOAD_CONST               1 (1000000)
             35 COMPARE_OP               2 (==)
             38 POP_JUMP_IF_FALSE       49

  4          41 LOAD_NAME                2 (i)
             44 PRINT_ITEM          
             45 PRINT_NEWLINE       
             46 JUMP_FORWARD             0 (to 49)

  5     >>   49 LOAD_NAME                3 (z)
             52 LOAD_NAME                2 (i)
             55 COMPARE_OP               0 (<)
             58 POP_JUMP_IF_FALSE       13

  6          61 LOAD_CONST               2 ('yes')
             64 PRINT_ITEM          
             65 PRINT_NEWLINE       
             66 JUMP_ABSOLUTE           13
             69 JUMP_ABSOLUTE           13

        >>   72 POP_BLOCK           
        >>   73 LOAD_CONST               3 (None)
             76 RETURN_VALUE

It's worth noting that in Python, an integer is an instance of the class int or long. This means that there is not only the number, but a pointer and another piece of informations saying what class it is at least. This makes a lot of overhead.

But it's also worth noting how xrange works.

xrange creates a class instance (LOAD_NAME (xrange), CALL_FUNCTION) that can be iterated over by the for. The for will (basically) delegate to a function call on the iterator's __iter__. There is a function call every loop.

Further, every time you want to get or set the variable z or i, it has to look in the locals dictionary. This is really slow.

Running pure Python-Code in Cython:

When you run it in Cython (the third example in your question), it compiles to C. But all this C does is tell the CPython virtual machine what to do.

CPython alone: a guy reading from a book, and merticulously carrying out its functions.
CPython with Cython: a guy shouting instructions to the guy who merticulously carries out its functions.

It might be a tiny bit faster, but the slow part is still that CPython is slowly doing the work.

Using cythonized code:

What happens when you cdef long long, then?

Cython knows that xrange is acting on a long long:
- It knows the loop is valid (so it doesn't have to check that you gave it a list or somesuch)
- It knows the loop won't overflow (because it's undefined if it does!)
- It can therefore turn it into a C loop (for (int index=0; index<copy_of_value; index++) { i = index; ... })
This avoids the int and long classes, which have a lot of indirection overhead and type checking
This avoids dictionary lookups. Things are always where you put them on the stack
For example i ** 2 is much simpler as the routine can be inlined (it's always a number, dude) and work directly on the integer and ignore overflow

So the result ends up being run mostly by C, and only goes to CPython for some cleanup stuff and the print calls.

Make sense?

Question 2

As I mentioned in my comment: Your third solution is slower/as-slow-as-the-python-version because it lacks the static typing features that allow Cython to speed up your code. When you declare a variable as long f.e., Cython does not need to construct an "expensive" Python-Object, but may rely on C-Code completely. I'm not a Cython nor a Python expert, but I guess Python's object construction is the main bottleneck.