I haven't tried futures yet, but if you're using its thread-based executor (ThreadPoolExecutor rather than ProcessPoolExecutor), this probably applies: http://www.youtube.com/watch?v=ph374fJqFPE
In short, I/O-bound workloads thread well in CPython, but CPU-bound workloads do not, because the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. And if you mix I/O-bound and CPU-bound threads in the same process, that doesn't thread well either.
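If that's the problem, I'd suggest increasing the size of your work chunks (just squaring a number is too little work per task to outweigh the cost of dispatching it) and using multiprocessing. The multiprocessing module has a thread-like API, but it runs multiple processes, each with its own interpreter and its own memory; data is passed between them by pickling, and shared memory is opt-in. It tends to give looser coupling between program components than threading anyway.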
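Here's a rough sketch of what I mean by bigger chunks plus multiprocessing. I don't have your code, so the function names (sum_of_squares, chunked, parallel_sum_of_squares), the chunk size of 10000, and the input range are all made up for illustration; only multiprocessing.Pool and its map method come from the standard library. It assumes Python 3.3 or later, where Pool can be used in a with statement.

```python
import multiprocessing


def sum_of_squares(chunk):
    """CPU-bound work on a whole chunk, so one task does many squarings."""
    return sum(n * n for n in chunk)


def chunked(seq, size):
    """Split seq into consecutive slices of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def parallel_sum_of_squares(numbers, chunk_size=10000, processes=None):
    """Hypothetical helper: farm the chunks out to a pool of worker processes.

    processes=None lets Pool pick one worker per CPU.
    """
    chunks = chunked(numbers, chunk_size)
    with multiprocessing.Pool(processes=processes) as pool:
        # Each chunk is pickled once and sent to a worker; only one
        # integer per chunk comes back.
        return sum(pool.map(sum_of_squares, chunks))


if __name__ == "__main__":
    # The guard matters: on platforms that start workers by re-importing
    # this module, unguarded code would run again in every worker.
    numbers = list(range(100000))
    print(parallel_sum_of_squares(numbers))
```

The point is that each worker gets thousands of squarings per message instead of one, so the pickling and inter-process overhead is amortized. You'd want to tune the chunk size against your real workload rather than trust my number.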
That, or switch to Jython or IronPython; these reputedly thread well.