Question

I will be extracting certain pieces of data from log files, using regular expressions to filter them out. Initially I was going to do this with Python. I later started to think about the fastest way to perform this task, which led me to parallel programming. I remember hearing somewhere that Python can't be truly parallel. I'm by no means an expert programmer. I have been playing with Java for a bit, and I was considering parallel programming in Java for this. What would be the fastest way to perform this task?


Solution

Assuming you are talking about speed of execution and not speed of development, C might be hard to beat. But I don't think it would beat Java or even Python by much these days for this kind of work. In the standard Python implementation (CPython), the regex engine and much of the standard library are implemented in C, so Python here is largely a thin wrapper over C code, albeit with a more concise syntax and a richer library.

Ultimately, I think your program will be I/O bound. Adding more threads or processes probably won't get the data off the disk or network any faster.

To minimize the CPU time of your program (as opposed to the execution time of it), you should use the fastest regex library you can find, and simplify the log format and regexes as much as possible. A multiple pass approach to parsing the log lines might help, too, whereby you use something simple and fast to break up the log lines into phrases, and then pass the phrases to separate regexes that would collectively be simpler and faster than a single regex to parse the entire line.
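
As a rough sketch of that multi-pass idea in Python: the log format ("<date> <time> <LEVEL> <message>"), the IP pattern, and the function name below are all made up for illustration, so adapt them to your real logs. A cheap substring test discards most lines, a plain split isolates the message, and only then does a small precompiled regex run.

```python
import re

# Hypothetical pattern: a dotted-quad IP address somewhere in the message.
# Compiled once, outside the loop.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_error_ips(lines):
    """Return the first IP found in the message of each ERROR line."""
    results = []
    for line in lines:
        # Pass 1: cheap substring test, no regex involved.
        if " ERROR " not in line:
            continue
        # Pass 2: split the assumed "<date> <time> <LEVEL> <message>" layout.
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue
        # Pass 3: a small regex applied to the message only.
        match = IP_RE.search(parts[3])
        if match:
            results.append(match.group(0))
    return results


sample = [
    "2024-01-01 12:00:00 INFO started worker 10.0.0.1",
    "2024-01-01 12:00:01 ERROR connection refused from 192.168.1.7",
    "2024-01-01 12:00:02 ERROR disk full",
]
print(extract_error_ips(sample))  # ['192.168.1.7']
```

Whether this actually beats one larger regex depends on your data, so time both.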

To parallelize, I would have the main thread vend lines to a pool of line processors, each running on a separate thread, perhaps with n queues for the n threads. You might have lines processed out of order, so beware. And if one processing thread is still faster than disk or network I/O, then you are adding complexity for no real gain.
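
A minimal Python sketch of such a pool, using the standard concurrent.futures module. The "status=NNN" pattern and the function names are invented for the example. Note that map() returns results in input order, which sidesteps the out-of-order problem at the cost of holding results until earlier ones are ready.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pattern: pull a three-digit status code out of each line.
STATUS_RE = re.compile(r"status=(\d{3})")


def process_line(line):
    """Worker: return the status code as an int, or None if absent."""
    match = STATUS_RE.search(line)
    return int(match.group(1)) if match else None


def process_lines(lines, workers=4):
    """Main thread hands lines to a pool of workers and collects results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, even if workers
        # finish their lines out of order.
        return list(pool.map(process_line, lines))


sample = ["GET /a status=200", "no status here", "GET /b status=404"]
print(process_lines(sample))  # [200, None, 404]
```

In CPython the global interpreter lock keeps threads from running regex matching in parallel, so for CPU-bound work you would likely swap in ProcessPoolExecutor, which offers the same map() interface (pass a chunksize to reduce per-line overhead, and guard the entry point with `if __name__ == "__main__":`). In Java, threads do run in parallel, so a thread pool there is closer to what I describe above.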

Above all, when looking to improve speed, profile and measure before and after every change. You need to be able to prove that you are making the program faster.
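
For example, a before-and-after comparison can be as simple as the following sketch with Python's timeit module. The synthetic log lines, the pattern, and both function names are assumptions for illustration; run it against a sample of your real logs instead.

```python
import re
import timeit

# Hypothetical pattern: capture the millisecond figure on ERROR lines.
PATTERN = re.compile(r"ERROR .*?(\d+)ms")

# Synthetic data: mostly lines that should be skipped.
LINES = (["2024-01-01 INFO ok"] * 9000
         + ["2024-01-01 ERROR timeout after 350ms"] * 1000)


def regex_only(lines):
    """Variant A: run the regex on every line."""
    out = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            out.append(match.group(1))
    return out


def prefilter_then_regex(lines):
    """Variant B: cheap substring test first, regex only on candidates."""
    out = []
    for line in lines:
        if "ERROR" in line:
            match = PATTERN.search(line)
            if match:
                out.append(match.group(1))
    return out


# Check that the "optimization" did not change the results.
assert regex_only(LINES) == prefilter_then_regex(LINES)

# Time each variant over the same input; compare the totals.
t_a = timeit.timeit(lambda: regex_only(LINES), number=20)
t_b = timeit.timeit(lambda: prefilter_then_regex(LINES), number=20)
print(f"regex only:       {t_a:.3f}s")
print(f"prefilter + regex: {t_b:.3f}s")
```

The numbers will vary by machine and data. For a whole program, the standard cProfile module (`python -m cProfile -s cumulative script.py`) shows where the time actually goes, including how much is spent waiting on I/O versus matching.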

Licensed under: CC-BY-SA with attribution