Question

I have many files containing:

  • data: numbers that I have to use/manipulate, formatted in a specific way, described below;

  • rows that I need to keep exactly as they are (the software's configuration uses these files).

Most of the time the files are huge (many millions of rows) and can't be handled fast enough with bash. I have written a script that checks each line to see whether it is data and writes it to another file (without any calculations), but it's very slow (thousands of rows per second).

The data is formatted like this:

text 
text
(
($data $data $data)
($data $data $data)
($data $data $data)
)
text
text
(
($data $data $data)
($data $data $data)
)
text
( text )
( text )
(text text)

I have to make another file, using $data, that should be the results of some operation with it.

The portions of the file that contain numbers can be recognized by this opening sequence:

(
(

and, likewise, this closing sequence at the end:

)
)

I have previously written a C++ program that performs the operation I want, but only for files containing nothing but columns of numbers. I don't know how to skip the text that must not be modified, or how to handle the way the data is formatted.

Where should I look to solve this problem smartly?

What is the best way to handle data files formatted in different ways and do math with them? Maybe Python?

Solution

Are you sure that the shell isn't fast enough? Maybe your bash just needs improving. :)

It appears that you want to print every line after a line with just a ( until you get to a closing ). So...

#!/usr/bin/ksh
print=0
while read
do
    if [[ "$REPLY" == ')' ]]
    then
        print=0
    elif [[ "$print" == 1 ]]
    then
        echo "${REPLY//[()]/}"
    elif [[ "$REPLY" == '(' ]]
    then
        print=1
    fi
done
exit 0

And, with your provided test data:

danny@machine:~$ ./test.sh < file
$data $data $data
$data $data $data
$data $data $data
$data $data $data
$data $data $data

I'll bet you'll find that to be roughly as fast as anything else you could write. If I were going to use this often, I'd be inclined to add several more error checks - but if your data is well-formed, this will work fine.

Alternatively, you could just use sed.

danny@machine:~$ sed -n '/^($/,/^)$/{/^[()]$/d;s/[()]//gp}' file
$data $data $data
$data $data $data
$data $data $data
$data $data $data
$data $data $data

Performance note (edit):

I was comparing the Python implementations below, so I thought I'd test these as well. The sed solution runs about as fast as the fastest Python implementation on the same data - less than one second (0.9 seconds) to filter ~80K lines. The bash version takes 42.5 seconds. However, just replacing #!/bin/bash with #!/usr/bin/ksh above (which is ksh93, on Ubuntu 13.10) and making no other changes to the script reduces the runtime to 10.5 seconds. Still slower than Python or sed, but that's part of why I hate scripting in bash.

I also updated both solutions to remove the opening and closing parens, to be more consistent with the other answers.

Other tips

Here is something that should perform well on huge data, using Python 3:

#!/usr/bin/python3

import mmap

fi = open('so23434490in.txt', 'rb')
m = mmap.mmap(fi.fileno(), 0, access=mmap.ACCESS_READ)
fo = open('so23434490out.txt', 'wb')
p2 = 0
while True:
    p1 = m.find(b'(\n(', p2)
    if p1 == -1:
        break
    p2 = m.find(b')\n)', p1)
    if p2 == -1:
        break # unmatched opening sequence!
    data = m[p1+3:p2]
    data = data.replace(b'(',b'').replace(b')',b'')

    # Now decide: either do some computation on that data in Python
    for line in data.split(b'\n'):
        if not line:
            continue  # skip the empty chunk after the trailing newline
        cols = list(map(float, line.split(b' ')))
        # perform some operation on cols

    # Or simply write out the data to use it as input for your C++ code
    fo.write(data)
    fo.write(b'\n')
fo.close()
m.close()
fi.close()

This uses mmap to map the file into memory. Then you can access it easily without having to worry about reading it in. It is also very efficient, since it avoids unnecessary copying (from the page cache to the application heap).
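For comparison, the same state machine as the shell and Perl answers can be written as a plain line-by-line Python filter. This is just a sketch - the function name and file names are mine, not part of the original answers:

```python
#!/usr/bin/python3
# Plain line-by-line version of the filter, for comparison with the
# mmap approach above.

def filter_data(lines):
    """Yield the rows between a bare '(' line and a bare ')' line,
    with the per-row parentheses stripped."""
    in_block = False
    for line in lines:
        stripped = line.rstrip('\n')
        if stripped == ')':
            in_block = False
        elif in_block:
            yield line.replace('(', '').replace(')', '')
        elif stripped == '(':
            in_block = True

# usage (file names are placeholders):
#   with open('so23434490in.txt') as fi, open('so23434490out.txt', 'w') as fo:
#       fo.writelines(filter_data(fi))
```

It avoids loading the whole file into memory, at the cost of doing the state checks in Python per line rather than letting mmap's find() scan in C.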

I guess we need a Perl solution, too.

#!/usr/bin/perl

my $p = 0;
while (<STDIN>) {
    if (/^\)\s*$/) {
        $p = 0;
    }
    elsif ($p) {
        s/[()]//g;
        print;
    }
    elsif (/^\(\s*$/) {
        $p = 1;
    }
}

On my system, this runs slightly slower than the fastest python implementation from above (while also doing the parenthesis removal), and about the same as

sed -n '/^($/,/^)$/{/^[()]$/d;s/[()]//gp}'
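For what it's worth, that sed range can also be mimicked in Python with a multiline regular expression. A sketch (the function name is mine):

```python
import re

def extract_data(text):
    """Collect everything between a line containing only '(' and a line
    containing only ')' - the same range sed's /^($/,/^)$/ selects -
    and strip the per-row parentheses."""
    blocks = re.findall(r'(?ms)^\($\n(.*?)^\)$', text)
    return ''.join(b.replace('(', '').replace(')', '') for b in blocks)
```

Note this reads the whole file as one string, so for the multi-million-row case the mmap or line-by-line versions are a better fit.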

Using C provides much better speed than bash/ksh or C++ (or Python, even though saying that stings). I created a text file of 18 million lines by duplicating the example text 1 million times. On my laptop, this C program processes the file in 1 second, while the Python version takes 5 seconds, and the bash version run under ksh (because it's faster than bash), with the edits mentioned in that answer's comments, takes 1 minute 20 seconds (i.e. 80 seconds). Note that this C program doesn't check for errors at all, except for a non-existent file. Here it is:

#include <string.h>
#include <stdio.h>

#define BUFSZ 1024
// I highly doubt there are lines longer than 1024 characters

int main()
{
    int is_area=0;
    char line[BUFSZ];
    FILE* f;
    if ((f = fopen("out.txt", "r")) != NULL)
    {
        while (fgets(line, BUFSZ, f))
        {
            if (line[0] == ')') is_area=0;
            else if (is_area) fputs(line, stdout); // NO NEWLINE!
            else if (strcmp(line, "(\n") == 0) is_area=1;
        }
    }
    else
    {
        fprintf(stderr, "THE SKY IS FALLING!!!\n");
        return 1;
    }
    return 0;
}

If the fact that it's completely unsafe freaks you out, here's a C++ version, which took 2 seconds:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;
// ^ FYI, the above is a bad idea, but I'm trying to preserve clarity

int main()
{
    ifstream in("out.txt");
    string line;
    bool is_area(false);
    while (getline(in, line))
    {
        if (!line.empty() && line[0] == ')') is_area = false;
        else if (is_area) cout << line << '\n';
        else if(line == "(") is_area = true;
    }
    return 0;
}

EDIT: As MvG pointed out in the comments, I wasn't benchmarking the Python version fairly. It doesn't take 24 seconds as I originally stated, but 5 instead.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow