Python: JIT for known bottlenecks

Question 1

What about using Cython instead? You could convert just this one function to cython syntax, which is than compiled directly to C or so. The syntax should be close enough to python itself, probably just add a few declarations of the right types.

Question 2

No, you cannot run part of a Python program in PyPy and other parts in another Python- It's more than just a JIT, it has a completely different representation of objects and many other internals.

If your sole concern is not wanting to make sure the rest of the program works with PyPy, rest assured: Virtually all pure Python code works with PyPy, with the sole exception of CPython implementation details. Those are obscure, it's quite hard to accidentially write code that relies on most of them, and the others (e.g. files not being closed automatically as promptly) won't break most programs. Just try running your whole program with PyPy.

If there are other issues with PyPy, you might want to translate only this function into C and call it using ctypes or cffi. The annoying part would be hooking it up with Python (e.g. via an extension module), which is what ctypes and cffi do for you. You don't need a complete arbitary precision integer library, you only need an array of bits with a few very simple operations: testing the least significant bit, left/right shifting, less-than, and bitwise XOR. Each of those is only one trivial loop. And if the naive C implementation is still a bottleneck, you can probably vectorize all of these loops. You could probably also optimize the shifts to avoid copying anything at all.

Question 3

Cython has already been mentioned a few times, but I'd like to point out another alternative for those who find this question in their search, who might not have the arbitrary-precision integer requirement: ShedSkin. It translates Python to C++ via type inference. In other words, unlike Cython, you don't add type declarations to Python; you just write Python. But you do have to make sure your variables don't change type within your program, otherwise the type cannot be inferred. Certain other of the more dynamic features of Python are also not supported, but this is usually not an issue for computation-intensive code.

As for the arbitrary-precision integer requirement: You will likely get much, much bigger speed improvements than the 2.5x you noted in your comment, for integers that fit in "native" (fixed-size) integers. The improvement is so drastic that if a good portion of your calculations can be done with native integers, it is probably worth having a (blazingly fast) native-integer version of your function just for that case, and only using the general version of your function for values that don't fit. (Not all problems can be broken into separate cases like this, but for those that can, it's worth at least checking out this approach.)

Question 4

I actually created a python package that does exactly what you are trying to accomplish. It implements numpy arrays over Galois fields. I was able to optimize it by JIT compiling with numba. It also supports arbitrarily-large integers, like you mentioned. https://github.com/mhostetter/galois

In [1]: import numpy as np                                                                     

In [2]: import galois                                                                          

In [3]: GF = galois.GF(2**100)                                                                 

In [4]: print(GF.properties)                                                                   
GF(2^100):
  characteristic: 2
  degree: 100
  order: 1267650600228229401496703205376
  irreducible_poly: Poly(x^100 + x^57 + x^56 + x^55 + x^52 + x^48 + x^47 + x^46 + x^45 + x^44 + x^43 + x^41 + x^37 + x^36 + x^35 + x^34 + x^31 + x^30 + x^27 + x^25 + x^24 + x^22 + x^20 + x^19 + x^16 + x^15 + x^11 + x^9 + x^8 + x^6 + x^5 + x^3 + 1, GF(2))
  is_primitive_poly: True
  primitive_element: GF(2, order=2^100)

In [5]: A = GF.Random((2,2)); A                                                                
Out[5]: 
GF([[853109427014456778157610146134, 957579797461316189986596054496],
    [619061399056446662243502059294, 762244553049699515226100523552]],
   order=2^100)

In [6]: B = GF.Random((2,2)); B                                                                
Out[6]: 
GF([[511709585928908961018234228239, 206374347029181859532039074035],
    [795530021671674913470994904012, 918203712488921499667394325749]],
   order=2^100)

In [7]: A + B                                                                                  
Out[7]: 
GF([[1005796869832339943233379227481, 1152773228746217240881872950547],
    [1097497412292991510222532386002, 160953539027946640002225213141]],
   order=2^100)

In [8]: A * B                                                                                  
Out[8]: 
GF([[296314095771552265037299061152, 688536035673482273067277820628],
    [1177970297984569800118703939222, 537328370564266356643331706738]],
   order=2^100)

In [9]: A / A                                                                                  
Out[9]: 
GF([[1, 1],
    [1, 1]], order=2^100)

# Fermat's Little Theorem
In [10]: A ** (GF.order - 1)                                                                   
Out[10]: 
GF([[1, 1],
    [1, 1]], order=2^100)

In [11]: A @ B                                                                                 
Out[11]: 
GF([[1027776659503691614378629238339, 470187664463292378234435322369],
    [86178777179053209582733631256, 172677144553521647820627674227]],
   order=2^100)

In [12]: np.linalg.inv(A) @ A                                                                  
Out[12]: 
GF([[1, 0],
    [0, 1]], order=2^100)