Question

By stream splitting I mean the ability to:

  1. filter on the fly the stream content by a first function
  2. one part of the stream is processed by a second function
  3. the rest of the stream is processed by a third function
  4. the stream is never stored (on the fly)

An example is sometimes better than a long explanation. This command line uses tee and process substitution to split the stream:

$> cut -f2 file | tee >( grep "AB" | sort | ... ) | grep -v "AB" | tr A B | ...

In this example, the stream is split in two: the lines containing "AB" and the rest:

cut -f2 file ---->- line contains "AB" ->- sort ->- ...
             \--->- does not contain "AB" ->- tr A B ->- ...

But I do not like this stream-splitting technique, because the stream is first duplicated (by tee) and then filtered twice (by grep and grep -v).
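That duplication can be avoided by dispatching each line once. A minimal single-pass sketch in Python (the name split_stream and the handler names are illustrative, not part of the example above):

```python
def split_stream(lines, match, other):
    # Each line is routed exactly once: no duplication by tee,
    # no second filtering pass with grep -v.
    for line in lines:
        if "AB" in line:
            match(line)
        else:
            other(line)

# Mimic the pipeline: matching lines get sorted, the rest get A -> B.
matched, rest = [], []
split_stream(["xABy", "foo", "ABa"],
             match=matched.append,
             other=lambda line: rest.append(line.replace("A", "B")))
matched.sort()
# matched == ["ABa", "xABy"], rest == ["foo"]
```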

Therefore I wonder whether something like stream splitting is available in other languages such as Python, Perl, awk, C++, ...

I provide a more complex example below.


Complex bash stream splitting

counter.sh splits a stream into three sections (begin, middle and end). Within each section, the stream is split again to count the occurrences of the symbols <, | and >:

#!/bin/bash    
{
  {  tee >( sed -n '1,/^--$/!p' >&3 ) |
            sed -n '1,/^--$/p'        |
     tee >( echo "del at begin:  $(grep -c '<')"    >&4 ) |
     tee >( echo "add at begin:  $(grep -c '>')"    >&4 ) |
          { echo "chg at begin:  $(grep -c '|')"; } >&4
  }  3>&1 1>&2  |
  {  tee >( sed -n '/^--$/,/^--$/!p' >&3 ) |
            sed -n '/^--$/,/^--$/p'        |
     tee >( echo "del at end:    $(grep -c '<')"    >&4 ) |
     tee >( echo "add at end:    $(grep -c '>')"    >&4 ) |
          { echo "chg at end:    $(grep -c '|')"; } >&4
  }  3>&1 1>&2 |
     tee >( echo "del in middle: $(grep -c '<')"    >&4 ) |
     tee >( echo "add in middle: $(grep -c '>')"    >&4 ) |
            echo "chg in middle: $(grep -c '|')"; 
} 4>&1

This script is used to count the number of added/changed/deleted lines in sections begin/middle/end. The input of this script is a stream:

$> cat file-A
1
22
3
4
5
6
77
8

$> cat file-B
22
3
4
42
6
77
8
99

$> diff --side-by-side file-A file-B | egrep -1 '<|\||>' | ./counter.sh
del at begin:  1
add at begin:  0
chg at begin:  0
del at end:    0
add at end:    1
chg at end:    0
del in middle: 0
add in middle: 0
chg in middle: 1

How can such a counter.sh be implemented efficiently in other programming languages, without storing the data in a temporary buffer?


Answer

As noted by Lennart Regebro, I am over-thinking this question. Of course all of these languages can split input streams, as ysth answered. In pseudo code:

while input-stream
{
    case (begin section)
    {
        case (symbol <) delB++
        case (symbol |) chgB++
        case (symbol >) addB++
    }
    case (middle section)
    {
        case (symbol <) delM++
        case (symbol |) chgM++
        case (symbol >) addM++
    }
    case (ending section)
    {
        case (symbol <) delE++
        case (symbol |) chgE++
        case (symbol >) addE++
    }
}

PrintResult (delB, chgB, addB, delM, chgM, addM, delE, chgE, addE)
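A sketch of this pseudo code in Python (the section names and the `--` delimiter follow counter.sh above; testing `sym in line` is a simplification of properly parsing diff's side-by-side columns):

```python
from collections import Counter

SYMBOLS = {"<": "del", "|": "chg", ">": "add"}

def count_sections(lines):
    # One pass over the stream: a '--' line advances the current section,
    # every other line is counted in place. Nothing is buffered.
    counts = {name: Counter() for name in ("begin", "middle", "end")}
    sections = iter(("begin", "middle", "end"))
    section = next(sections)
    for line in lines:
        if line.rstrip("\n") == "--":
            section = next(sections, "end")
            continue
        for sym, name in SYMBOLS.items():
            if sym in line:
                counts[section][name] += 1
    return counts
```

Feeding it the `diff --side-by-side ... | egrep -1` stream from above would yield the same per-section totals as counter.sh.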

Conclusion: Stream splitting is better done in python/perl/awk/C++ than using tee + process substitution.


Solution

Any of the languages you mention are perfectly suitable for this.

In Perl, I would not use the diff command, I would just use Algorithm::Diff on the original files.
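The equivalent idea in Python is the standard difflib module, again diffing the original sequences instead of parsing diff output; a sketch using the file-A/file-B lines from above:

```python
import difflib

a = ["1", "22", "3", "4", "5", "6", "77", "8"]    # file-A
b = ["22", "3", "4", "42", "6", "77", "8", "99"]  # file-B

# Tally added/changed/deleted lines directly from the opcodes.
counts = {"del": 0, "chg": 0, "add": 0}
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
    if tag == "delete":
        counts["del"] += i2 - i1
    elif tag == "insert":
        counts["add"] += j2 - j1
    elif tag == "replace":
        counts["chg"] += max(i2 - i1, j2 - j1)
# counts == {"del": 1, "chg": 1, "add": 1}
```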

OTHER TIPS

Tee is just a C program using basic system calls; you can implement it in any language that provides access to the system libraries.

A google search for

tee in my favorite language

should find all the answers you need.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow