Implementing pmap from scratch. Why my implementation slow?

https://stackoverflow.com/questions/11828317

24-06-2021
|

Question

I'm new to Erlang, so for training I try to implement standard functions from scratch. I've tried to create parallel implementation of map/2 function from lists module. But my implementation works very slow. Could you point me, if I did any principal mistakes in my implementation:

enter image description here

-module( my_pmap ).
-export([ pmap/2 ]).
-export([ map/4, collect/3 ]).

map( F, Value, Indx, SenderPid ) ->
        SenderPid ! { Indx, F( Value ) }.

pmap( F, List ) ->
        CollectorPid = spawn_link( my_pmap, collect, [ length( List ), [], self() ] ),
        lists:foldl(
                fun( X, Indx ) ->
                        spawn_link( my_pmap, map, [ F, X, Indx, CollectorPid ] ),
                        Indx + 1
                end,
                1,
                List ),
        Mapped =
                receive
                        { collected, M } ->
                                M
                end,
        Sorted = lists:sort(
                        fun( { Indx1, _ }, { Indx2, _ } ) ->
                                Indx1 < Indx2
                        end,
                        Mapped ),
        [ Val || { _Indx, Val } <- Sorted ].

collect( 0, List, SenderPid ) ->
        SenderPid ! { collected, List };
collect( N, List, SenderPid ) when N > 0 ->
        receive
                Mapped ->
                        collect( N - 1, [ Mapped | List ], SenderPid )
        end.

And here is results of testing:

1> c(my_pmap).
{ok,my_pmap}
2> timer:tc( my_pmap, pmap, [ fun(X) -> X*X*X*X end, lists:seq( 1, 10000 ) ] ).
{137804,
 [1,16,81,256,625,1296,2401,4096,6561,10000,14641,20736,
  28561,38416,50625,65536,83521,104976,130321,160000,194481,
  234256,279841,331776,390625,456976,531441|...]}
3> timer:tc( lists, map, [ fun(X) -> X*X*X*X end, lists:seq( 1, 10000 ) ] ).   
{44136,
 [1,16,81,256,625,1296,2401,4096,6561,10000,14641,20736,
  28561,38416,50625,65536,83521,104976,130321,160000,194481,
  234256,279841,331776,390625,456976,531441|...]}

As you might have seen 0,137804 sec. vs. 0,044136 sec.

Thanks

La solution

The comments are correct. The problem is that spawning processes are cheap but it does have a cost. Multiplying A number three times is very fast and the overhead of spawning a new process kills your performance.

Partitioning the list into fragments and processing each fragment in a separate process will probably be faster. If you know you have 8 cores, you could try to split it in 8 fragments. Things like pmap can be implemented in Erlang, but it is not a strength of Erlang. A system like the Haskell GHC runtime has sparks which is a better tool for fine-grained parallelism like this. Also, multiplying like that is an obvious candidate for either SIMD instructions in SSE or a GPU. Erlang has no solution for this either, but again, GHC has accelerate and repa which are libraries for handling this situation.

On the other hand, you can get a good speedup in Erlang by simply using processes to handle a couple of fragments as hinted. Also note that parallel computation often performs badly at low N (like 10000) because of the communication overhead. You need way larger problems to reap the benefits.

Licencié sous: CC-BY-SA avec attribution

Non affilié à StackOverflow