Question

I have many files containing:

  • data: numbers that I have to use/manipulate, formatted in a specific way, described below;

  • rows that I need to keep exactly as they are (the software's configuration uses these files).

Most of the time the files are huge (many millions of rows) and can't be handled fast enough with bash. I have written a script that checks each line to see whether it is data and writes it to another file (without any calculations), but it's very slow (thousands of rows per second).

The data is formatted like this:

text 
text
(
($data $data $data)
($data $data $data)
($data $data $data)
)
text
text
(
($data $data $data)
($data $data $data)
)
text
( text )
( text )
(text text)

I have to make another file, using $data, that should be the results of some operation with it.

The portions of the file that contain numbers can be recognized by this opening sequence:

(
(

and, likewise, this closing sequence at the end:

)
)

I have previously written a C++ program that performs the operation I want, but only for files containing nothing but columns of numbers. I don't know how to skip the text that must not be modified, or how to handle the way the data is formatted.

Where should I look to solve this problem smartly?

What is the best way to handle data files formatted in different ways and do math with them? Maybe Python?

Solution

Are you sure that the shell isn't fast enough? Maybe your bash just needs improving. :)

It appears that you want to print every line after a line with just a ( until you get to a closing ). So...

#!/usr/bin/ksh
print=0
while read
do
    if [[ "$REPLY" == ')' ]]
    then
        print=0
    elif [[ "$print" == 1 ]]
    then
        echo "${REPLY//[()]/}"
    elif [[ "$REPLY" == '(' ]]
    then
        print=1
    fi
done
exit 0

And, with your provided test data:

danny@machine:~$ ./test.sh < file
$data $data $data
$data $data $data
$data $data $data
$data $data $data
$data $data $data

I'll bet you'll find that to be roughly as fast as anything else you could write. If I were going to use this often, I'd be inclined to add several more error checks - but if your data is well-formed, this will work fine.

Alternatively, you could just use sed.

danny@machine:~$ sed -n '/^($/,/^)$/{/^[()]$/d;s/[()]//gp}' file
$data $data $data
$data $data $data
$data $data $data
$data $data $data
$data $data $data

Performance note (edit):

I was comparing the Python implementations below, so I thought I'd test these as well. The sed solution runs about as fast as the fastest Python implementation on the same data - less than one second (0.9 seconds) to filter ~80K lines. The bash version takes 42.5 seconds. However, just replacing #!/bin/bash with #!/usr/bin/ksh above (which is ksh93, on Ubuntu 13.10) and making no other changes to the script reduces the runtime to 10.5 seconds. Still slower than Python or sed, but that's part of why I hate scripting in bash.

I also updated both solutions to remove the opening and closing parens, to be more consistent with the other answers.

Other tips

Here is something that should perform well on huge data, using Python 3:

#!/usr/bin/python3

import mmap

fi = open('so23434490in.txt', 'rb')
m = mmap.mmap(fi.fileno(), 0, access=mmap.ACCESS_READ)
fo = open('so23434490out.txt', 'wb')
p2 = 0
while True:
    p1 = m.find(b'(\n(', p2)
    if p1 == -1:
        break
    p2 = m.find(b')\n)', p1)
    if p2 == -1:
        break # unmatched opening sequence!
    data = m[p1+3:p2]
    data = data.replace(b'(',b'').replace(b')',b'')

    # Now decide: either do some computation on that data in Python
    for line in data.split(b'\n'):
        if not line:
            continue  # skip the empty chunk after the trailing newline
        cols = list(map(float, line.split(b' ')))
        # perform some operation on cols

    # Or simply write out the data to use it as input for your C++ code
    fo.write(data)
    fo.write(b'\n')
fo.close()
m.close()
fi.close()

This uses mmap to map the file into memory. Then you can access it easily without having to worry about reading it in. It is also very efficient, since it avoids unnecessary copying (from the page cache to the application heap).
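For comparison, the same state machine as the shell and Perl answers can be written as a plain line-by-line Python filter. This is just a sketch - the function name and file names are mine, not part of the original answers:

```python
#!/usr/bin/python3
# Plain line-by-line version of the filter, for comparison with the
# mmap approach above.

def filter_data(lines):
    """Yield the rows between a bare '(' line and a bare ')' line,
    with the per-row parentheses stripped."""
    in_block = False
    for line in lines:
        stripped = line.rstrip('\n')
        if stripped == ')':
            in_block = False
        elif in_block:
            yield line.replace('(', '').replace(')', '')
        elif stripped == '(':
            in_block = True

# usage (file names are placeholders):
#   with open('so23434490in.txt') as fi, open('so23434490out.txt', 'w') as fo:
#       fo.writelines(filter_data(fi))
```

It avoids loading the whole file into memory, at the cost of doing the state checks in Python per line rather than letting mmap's find() scan in C.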

I guess we need a Perl solution, too.

#!/usr/bin/perl

my $p = 0;
while (<STDIN>) {
    if (/^\)\s*$/) {
        $p = 0;
    }
    elsif ($p) {
        s/[()]//g;
        print;
    }
    elsif (/^\(\s*$/) {
        $p = 1;
    }
}

On my system, this runs slightly slower than the fastest python implementation from above (while also doing the parenthesis removal), and about the same as

sed -n '/^($/,/^)$/{/^[()]$/d;s/[()]//gp}'
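For what it's worth, that sed range can also be mimicked in Python with a multiline regular expression. A sketch (the function name is mine):

```python
import re

def extract_data(text):
    """Collect everything between a line containing only '(' and a line
    containing only ')' - the same range sed's /^($/,/^)$/ selects -
    and strip the per-row parentheses."""
    blocks = re.findall(r'(?ms)^\($\n(.*?)^\)$', text)
    return ''.join(b.replace('(', '').replace(')', '') for b in blocks)
```

Note this reads the whole file as one string, so for the multi-million-row case the mmap or line-by-line versions are a better fit.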

Using C provides much better speed than bash/ksh or C++ (or Python, even though saying that stings). I created a text file of 18 million lines by duplicating the example text 1 million times. On my laptop, this C program processes the file in 1 second, while the Python version takes 5 seconds, and the bash version run under ksh (because it's faster than bash), with the edits mentioned in that answer's comments, takes 1 minute 20 seconds (i.e. 80 seconds). Note that this C program doesn't check for errors at all, except for a non-existent file. Here it is:

#include <string.h>
#include <stdio.h>

#define BUFSZ 1024
// I highly doubt there are lines longer than 1024 characters

int main()
{
    int is_area=0;
    char line[BUFSZ];
    FILE* f;
    if ((f = fopen("out.txt", "r")) != NULL)
    {
        while (fgets(line, BUFSZ, f))
        {
            if (line[0] == ')') is_area=0;
            else if (is_area) fputs(line, stdout); // NO NEWLINE!
            else if (strcmp(line, "(\n") == 0) is_area=1;
        }
    }
    else
    {
        fprintf(stderr, "THE SKY IS FALLING!!!\n");
        return 1;
    }
    return 0;
}

If the fact that it's completely unsafe freaks you out, here's a C++ version, which took 2 seconds:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;
// ^ FYI, the above is a bad idea, but I'm trying to preserve clarity

int main()
{
    ifstream in("out.txt");
    string line;
    bool is_area(false);
    while (getline(in, line))
    {
        if (!line.empty() && line[0] == ')') is_area = false;
        else if (is_area) cout << line << '\n';
        else if(line == "(") is_area = true;
    }
    return 0;
}

EDIT: As MvG pointed out in the comments, I wasn't benchmarking the Python version fairly. It doesn't take 24 seconds as I originally stated, but 5 instead.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow