Conduit: Multiple Stream Consumers

https://stackoverflow.com/questions/17931053

04-06-2022

Question

I am writing a program that counts the frequencies of n-grams in a corpus. I already have a function that consumes a stream of tokens and produces n-grams of a single order:

ngram :: Monad m => Int -> Conduit t m [t]
trigrams = ngram 3
countFreq :: (Ord t, Monad m) => Consumer [t] m (Map [t] Int)
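
For readers without these definitions at hand, here is a minimal list-based sketch of what the two pieces compute. The names ngramsOf and countFreqList are hypothetical stand-ins, not the conduit versions above, and n is assumed to be at least 1.

```haskell
import Data.List (tails)
import qualified Data.Map.Strict as Map

-- All contiguous windows of exactly n tokens (assumes n >= 1).
ngramsOf :: Int -> [t] -> [[t]]
ngramsOf n ts = [ g | g <- map (take n) (tails ts), length g == n ]

-- Count how often each n-gram occurs.
countFreqList :: Ord t => [[t]] -> Map.Map [t] Int
countFreqList = Map.fromListWith (+) . map (\g -> (g, 1))

main :: IO ()
main = print (Map.toList (countFreqList (ngramsOf 2 (words "a b a b"))))
-- prints [(["a","b"],2),(["b","a"],1)]
```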

At the moment I can connect only one stream consumer to a stream source:

tokens --- trigrams --- countFreq

How do I connect multiple stream consumers to the same stream source? I would like to have something like this:

           .--- unigrams --- countFreq
           |--- bigrams  --- countFreq
tokens ----|--- trigrams --- countFreq
           '--- ...      --- countFreq

Running the consumers in parallel would be a plus.

EDIT: Thanks to Petr, I came up with this solution:

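spawnMultiple orders = do
    chan <- atomically newBroadcastTMChan

    -- Duplicate the channel for every consumer before the producer starts:
    -- a copy only receives values written after it was created.
    chans   <- forM orders $ \_ -> atomically (dupTMChan chan)
    results <- forM orders $ \_ -> newEmptyMVar
    forM_ (zip3 chans results orders) $ \(chan', result, n) ->
        forkIO (sink chan' result n)

    -- Producer: tokenize the file and broadcast every token.
    forkIO . runResourceT $ sourceFile "test.txt"
                         $$ javascriptTokenizer
                         =$ sinkTMChan chan

    -- Block until every consumer has delivered its frequency map.
    forM results readMVar

    where
        -- Consumer: count the n-grams of order n read from its own copy.
        sink chan' result n = do
            freqs <- runResourceT $ sourceTMChan chan'
                                 $$ ngram n
                                 =$ frequencies
            putMVar result freqs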

Solution

I'm assuming you want all your sinks to receive all values.

I'd suggest:

Use newBroadcastTMChan from Control.Concurrent.STM.TMChan (stm-chans) to create a new broadcast channel.
Build a sink on this channel for your main producer using sinkTMChan from Data.Conduit.TMChan (stm-conduit).
For each consumer, use dupTMChan to create its own copy for reading, then start a new thread that reads this copy using sourceTMChan. Create the copies before the producer starts writing, because a copy only receives values written after it was created.
Collect the results from your threads.
Make sure your consumers can read the data as fast as it is produced. The channel is unbounded, so a slow consumer lets unread values pile up on the heap.

(I haven't tried it, let us know how it works.)
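
The steps above can be sketched without conduit, using only the stm and containers packages. This is an illustration under assumptions, not the stm-conduit API: it swaps in a plain broadcast TChan from Control.Concurrent.STM with Nothing as an end-of-stream marker in place of a closable TMChan, and countNGrams and countAllOrders are hypothetical names.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
  (TChan, atomically, dupTChan, newBroadcastTChanIO, readTChan, writeTChan)
import Control.Monad (forM, forM_)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for `ngram n =$ countFreq`: reads tokens until the
-- end-of-stream marker (Nothing) and counts every window of n tokens.
countNGrams :: Ord t => Int -> TChan (Maybe t) -> IO (Map.Map [t] Int)
countNGrams n chan = go [] Map.empty
  where
    go window freqs = do
      item <- atomically (readTChan chan)
      case item of
        Nothing  -> return freqs
        Just tok ->
          let -- keep at most the last n tokens
              window' = drop (length window + 1 - n) (window ++ [tok])
              freqs'  = if length window' == n
                          then Map.insertWith (+) window' 1 freqs
                          else freqs
          in freqs' `seq` go window' freqs'

-- One consumer thread per order, all fed from the same broadcast channel.
countAllOrders :: Ord t => [Int] -> [t] -> IO [Map.Map [t] Int]
countAllOrders orders tokens = do
  chan <- newBroadcastTChanIO
  -- Duplicate before producing: a copy only sees values written after it exists.
  workers <- forM orders $ \n -> do
    copy   <- atomically (dupTChan chan)
    result <- newEmptyMVar
    _ <- forkIO (countNGrams n copy >>= putMVar result)
    return result
  forM_ tokens (atomically . writeTChan chan . Just)
  atomically (writeTChan chan Nothing)   -- signal end of stream
  mapM takeMVar workers                  -- wait for every consumer

main :: IO ()
main = do
  [uni, bi] <- countAllOrders [1, 2] (words "a b a b")
  print (Map.toList uni)   -- [(["a"],2),(["b"],2)]
  print (Map.toList bi)    -- [(["a","b"],2),(["b","a"],1)]
```

The channel here is unbounded as well, so the caveat about slow consumers applies to this sketch too.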

Update: One way to collect the results is to create an MVar for each consumer thread. Each thread calls putMVar with its result once it has finished, and the main thread calls takeMVar on each of these MVars, thereby waiting for every thread to finish. For example, if vars is a list of your MVars, the main thread would run mapM takeMVar vars to collect all the results.
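
In isolation, the MVar pattern might look like the following sketch. It uses only base; spawn is a hypothetical helper name, and the doubled numbers stand in for real per-thread results.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- Run an action in its own thread and return an MVar that will hold its result.
spawn :: IO a -> IO (MVar a)
spawn action = do
  var <- newEmptyMVar
  _ <- forkIO (action >>= putMVar var)
  return var

main :: IO ()
main = do
  vars    <- mapM (spawn . return . (* 2)) [1, 2, 3 :: Int]
  results <- mapM takeMVar vars   -- blocks until every thread has finished
  print results                   -- [2,4,6]
```

Note that in this sketch a thread that dies with an exception never fills its MVar, so the main thread would not receive that result. Real code would probably want to catch exceptions in the worker and put an error value instead.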

Licensed under: CC-BY-SA with attribution
