Question

Assume I have $P$ processors, each having a different vector $v_p$ of size $N$, $p=1, ..., P$. I learned in this question/answer that for computing the matrix-vector product $$w = (E\otimes I_N)v$$ in parallel, where $\otimes$ is the Kronecker product, $v = (v_1, ..., v_P)^T$ is the stacked vector consisting of all local vectors $v_p$, $E = (e_{pj})_{p,j}\in\mathbb{C}^{P\times P}$ some dense matrix and $I_N$ is the identity matrix of size $N\times N$, I can basically do $P$ parallel sums using binary trees.

$N$ is pretty large (say, $10^8$), in particular much larger than $P$ (which is only 10-100) and so large that $NP$ (the size of $w$ and/or all vectors $v_p$ together) does not fit into each processor's memory.

Depending on the application, the matrix $E$ could be a Fourier or DFT matrix, i.e. computing $$w_p = \sum_{j=1}^P e_{pj} v_j$$ should be possible using an FFT. For each of the components of the vector $v_j\in\mathbb{R}^N$, the FFT would not change. Do I still have to do $N$ FFTs or is there a "block-wise" FFT idea/implementation?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top