Interesting problem. Since I would expect better scaling to be possible, I investigated the performance with a small benchmark. In this test I compared the performance of single- and multi-threaded eig (multi-threading being delivered through MKL LAPACK/BLAS routines) with IPython-parallelized eig. To see what difference it makes, I varied the view type, the number of engines and the MKL threading, as well as the method of distributing the matrices over the engines.
Here are the results on an old AMD dual core system:
m_size=300, n_mat=64, repeat=3
+------------------------------------+----------------------+
| settings | speedup factor |
+--------+------+------+-------------+-----------+----------+
| func | neng | nmkl | view type | vs single | vs multi |
+--------+------+------+-------------+-----------+----------+
| ip_map | 2 | 1 | direct_view | 1.67 | 1.62 |
| ip_map | 2 | 1 | loadb_view | 1.60 | 1.55 |
| ip_map | 2 | 2 | direct_view | 1.59 | 1.54 |
| ip_map | 2 | 2 | loadb_view | 0.94 | 0.91 |
| ip_map | 4 | 1 | direct_view | 1.69 | 1.64 |
| ip_map | 4 | 1 | loadb_view | 1.61 | 1.57 |
| ip_map | 4 | 2 | direct_view | 1.15 | 1.12 |
| ip_map | 4 | 2 | loadb_view | 0.88 | 0.85 |
| parfor | 2 | 1 | direct_view | 0.81 | 0.79 |
| parfor | 2 | 1 | loadb_view | 1.61 | 1.56 |
| parfor | 2 | 2 | direct_view | 0.71 | 0.69 |
| parfor | 2 | 2 | loadb_view | 0.94 | 0.92 |
| parfor | 4 | 1 | direct_view | 0.41 | 0.40 |
| parfor | 4 | 1 | loadb_view | 1.62 | 1.58 |
| parfor | 4 | 2 | direct_view | 0.34 | 0.33 |
| parfor | 4 | 2 | loadb_view | 0.90 | 0.88 |
+--------+------+------+-------------+-----------+----------+
As you can see, the performance gain varies greatly with the settings used, with a maximum of 1.64 times that of regular multi-threaded eig. In these results the parfor function you used performs badly unless MKL threading is disabled on the engines (using view.apply_sync(mkl.set_num_threads, 1)).
Varying the matrix size also makes a noteworthy difference. Here is the speedup of ip_map on a direct_view with 4 engines and MKL threading disabled, versus regular multi-threaded eig:
n_mat=32, repeat=3
+--------+----------+
| m_size | vs multi |
+--------+----------+
| 50 | 0.78 |
| 100 | 1.44 |
| 150 | 1.71 |
| 200 | 1.75 |
| 300 | 1.68 |
| 400 | 1.60 |
| 500 | 1.57 |
+--------+----------+
Apparently for relatively small matrices there is a performance penalty; for intermediate sizes the speedup is largest, and for larger matrices it decreases again. If you could achieve a performance gain of 1.75, that would make using IPython.parallel worthwhile in my opinion.
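One plausible explanation for the small-matrix penalty is fixed per-task overhead: one eig costs roughly c*m**3 floating-point work, while shipping a matrix to an engine and collecting the result costs a roughly constant amount of time, so for small m the overhead dominates. A toy model of this (the constants c and h below are invented for illustration, not measured on my system):

```python
def modeled_speedup(m, p=4, c=1e-8, h=2e-3):
    """Toy speedup model for parallelizing independent eig calls.

    c*m**3 approximates the serial cost of one eig; h is a fixed
    per-run overhead for dispatch/serialization; p is the number of
    engines. All constants are made up for illustration only.
    """
    t_serial = c * m ** 3
    t_parallel = t_serial / p + h  # work divides over engines, overhead does not
    return t_serial / t_parallel
```

With these made-up numbers the model reproduces a penalty at m=50 and a gain at m=100; it cannot capture the decline for the largest matrices, which presumably comes from the multi-threaded baseline itself parallelizing better as m grows.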
I also did some tests earlier on an Intel dual-core laptop, but I got some funny results; apparently the laptop was overheating. On that system the speedups were generally a little lower, around 1.5-1.6 max.
Now I think the answer to your question should be: it depends. The performance gain depends on the hardware, the BLAS/LAPACK library, the problem size and the way IPython.parallel is deployed, among other things I may not be aware of. And last but not least, whether it's worth it also depends on how much of a performance gain you consider worthwhile.
The code that I used:
from __future__ import print_function
from numpy.random import rand
from IPython.parallel import Client
from mkl import set_num_threads
from timeit import default_timer as clock
from scipy.linalg import eig
from functools import partial
from itertools import product

eig = partial(eig, right=False)  # desired keyword arg as standard

class Bench(object):
    def __init__(self, m_size, n_mat, repeat=3):
        self.n_mat = n_mat
        self.matrix = rand(n_mat, m_size, m_size)
        self.repeat = repeat
        self.rc = Client()

    def map(self):
        # serial reference; list() forces evaluation (map is lazy on Python 3)
        results = list(map(eig, self.matrix))

    def ip_map(self):
        results = self.view.map_sync(eig, self.matrix)

    def parfor(self):
        results = {}
        for i in range(self.n_mat):
            results[i] = self.view.apply_async(eig, self.matrix[i, :, :])
        for i in range(self.n_mat):
            results[i] = results[i].get()

    def timer(self, func):
        t = clock()
        func()
        return clock() - t

    def run(self, func, n_engines, n_mkl, view_method):
        self.view = view_method(range(n_engines))
        self.view.apply_sync(set_num_threads, n_mkl)  # MKL threads on the engines
        set_num_threads(n_mkl)                        # MKL threads locally
        return min(self.timer(func) for _ in range(self.repeat))

    def run_all(self):
        funcs = self.ip_map, self.parfor
        n_engines = 2, 4
        n_mkls = 1, 2
        views = self.rc.direct_view, self.rc.load_balanced_view
        times = []
        for n_mkl in n_mkls:
            # serial baselines: single- and multi-threaded eig
            args = self.map, 0, n_mkl, views[0]
            times.append(self.run(*args))
        for args in product(funcs, n_engines, n_mkls, views):
            times.append(self.run(*args))
        return times
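For reference, the list returned by run_all can be turned into the speedup table above with a post-processing step like the following. This helper is not part of the original class, just a sketch: it assumes the first two entries are the single- and multi-threaded serial baselines and the remaining sixteen follow the order of product(funcs, n_engines, n_mkls, views).

```python
from itertools import product

def speedup_table(times):
    # times[0]: serial eig with 1 MKL thread; times[1]: with 2 MKL threads
    t_single, t_multi = times[0], times[1]
    settings = product(('ip_map', 'parfor'), (2, 4), (1, 2),
                       ('direct_view', 'loadb_view'))
    rows = []
    for (func, neng, nmkl, view), t in zip(settings, times[2:]):
        rows.append((func, neng, nmkl, view,
                     round(t_single / t, 2), round(t_multi / t, 2)))
    return rows
```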
Dunno if it matters, but to start the 4 IPython parallel engines I typed at the command line:
ipcluster start -n 4
Hope this helps :)