Question

Any small database-processing job can easily be tackled with Python/Perl/... scripts that use libraries and even utilities from the language itself. However, when it comes to performance, people tend to reach for C/C++/low-level languages. The possibility of tailoring the code to one's needs seems to be what makes these languages so appealing for Big Data, whether it concerns memory management, parallelism, disk access, or even low-level optimizations (via assembly constructs at the C/C++ level).

Of course, such a set of benefits does not come without a cost: writing the code, and sometimes even reinventing the wheel, can be quite expensive/tiresome. Although there are lots of libraries available, people are inclined to write the code themselves whenever they need to guarantee performance. What prevents libraries from delivering performance guarantees when processing large databases?

For example, consider an enterprise that continuously crawls webpages and parses the collected data. For each sliding window, different data-mining algorithms are run on the extracted data. Why would the developers forgo the available libraries/frameworks (be it for crawling, text processing, or data mining)? Using things already implemented would not only ease the burden of coding the whole process, it would also save a lot of time.

In short:

  • what makes writing the code yourself a guarantee of performance?
  • why is it risky to rely on frameworks/libraries when you must ensure high performance?

Solution

Having played the rewriting game over and over myself (and still doing it), my immediate answer is adaptability.

While frameworks and libraries have a huge arsenal of (possibly intertwinable) routines for standard tasks, their very nature as frameworks often (always?) disallows shortcuts. Most frameworks have some sort of core infrastructure around which a core layer of basic functionality is implemented. More specific functionality makes use of the basic layer and is placed in a second layer around the core.

Now by shortcuts I mean going straight from a second-layer routine to another second-layer routine without using the core. A typical example (from my domain) would be timestamps: you have a timestamped data source of some kind. So far the job is simply to read the data off the wire and pass it to the core so your other code can feast on it.

Now your industry changes the default timestamp format for a very good reason (in my case they went from unix time to GPS time). Unless your framework is industry-specific it is very unlikely that they're willing to change the core representation of time, so you end up using a framework that almost does what you want. Every time you access your data you have to convert it to industry-time-format first, and every time you want it modified you have to convert it back to whatever the core deems appropriate. There is no way that you can hand over data straight from the source to a sink without double conversion.
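To make the double conversion concrete, here is a minimal sketch in C. The setup is my own assumption for illustration, not part of the original scenario: the framework core insists on Unix time, the wire carries GPS time, and the whole leap-second history is collapsed into a single constant for brevity.

/* Sketch only: the epoch offset between Unix (1970-01-01) and GPS (1980-01-06)
   and the GPS-UTC offset (18 s since 2017) are folded into plain constants. */
#include <stdint.h>
#include <stdio.h>

#define GPS_UNIX_EPOCH_DELTA 315964800LL  /* seconds from 1970-01-01 to 1980-01-06 */
#define GPS_UTC_LEAP_SECONDS 18LL         /* GPS-UTC offset, valid since 2017 */

/* wire -> core: conversion #1, forced on you by the core's time representation */
static int64_t core_from_wire(int64_t gps_seconds)
{
    return gps_seconds + GPS_UNIX_EPOCH_DELTA - GPS_UTC_LEAP_SECONDS;
}

/* core -> sink: conversion #2, just to get back what the source already had */
static int64_t wire_from_core(int64_t unix_seconds)
{
    return unix_seconds - GPS_UNIX_EPOCH_DELTA + GPS_UTC_LEAP_SECONDS;
}

int main(void)
{
    int64_t gps_in  = 1400000000LL;             /* sample GPS timestamp */
    int64_t core    = core_from_wire(gps_in);   /* stored in the core's format */
    int64_t gps_out = wire_from_core(core);     /* converted back for the consumer */
    printf("%lld -> %lld -> %lld\n",
           (long long)gps_in, (long long)core, (long long)gps_out);
    return 0;
}

Neither conversion adds any information; both exist only because the core insists on its own representation of time.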

This is where your hand-crafted framework will shine: it's just a minor change and you're back to modelling the real world, whereas all other (non-industry-specific) frameworks will now have a performance disadvantage.

Over time, the discrepancy between the real world and the model adds up. With an off-the-shelf framework you'd soon be facing questions like: how can I represent this in that, or how do I make routine X accept/produce Y?

So far this wasn't about C/C++. But if, for some reason, you can't change the framework, i.e. you do have to put up with double conversion of data to go from one end to another, then you'd typically employ something that minimises the additional overhead. In my case, a TAI->UTC or UTC->TAI converter is best left to raw C (or an FPGA). There is no elegance possible, no profound smart data structure that makes the problem trivial. It's just a boring switch statement, and why not use a language whose compilers are good at optimising exactly that?
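For illustration, such a converter really is just a boring branch over a leap-second table. Here is a sketch in C with an abbreviated table; the offsets and dates below are the published TAI-UTC values for recent years, but a real converter would carry the full history and a way to update it.

#include <stdint.h>
#include <stdio.h>

/* TAI-UTC offset in seconds for a given Unix timestamp (UTC).
   Abbreviated: only the most recent leap seconds are listed. */
static int32_t tai_utc_offset(int64_t utc_unix)
{
    if (utc_unix >= 1483228800LL) return 37;  /* since 2017-01-01 */
    if (utc_unix >= 1435708800LL) return 36;  /* since 2015-07-01 */
    if (utc_unix >= 1341100800LL) return 35;  /* since 2012-07-01 */
    if (utc_unix >= 1230768000LL) return 34;  /* since 2009-01-01 */
    return 33;                                /* earlier history omitted here */
}

static int64_t utc_to_tai(int64_t utc_unix)
{
    return utc_unix + tai_utc_offset(utc_unix);
}

int main(void)
{
    int64_t utc = 1500000000LL;  /* an instant in July 2017 */
    printf("UTC %lld -> TAI %lld\n", (long long)utc, (long long)utc_to_tai(utc));
    return 0;
}

There is nothing clever to optimise away here; the compiler turns the chain of comparisons into a handful of branches, which is exactly why a low-level language is a comfortable fit.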

OTHER TIPS

I don't think that everyone reaches for C/C++ when performance is an issue.

The advantage of writing low-level code is using fewer CPU cycles or, sometimes, less memory. But I'd note that higher-level languages can call down to lower-level languages, and do, to get some of this value. Python and JVM languages can do this.
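As a hedged sketch of that pattern (the file name, function name, and build command are mine, for illustration): a hot loop written in C and compiled into a shared library, which a higher-level language can then load, from Python for instance via ctypes.

/* sum.c -- build with something like: cc -O2 -shared -fPIC -o libsum.so sum.c */
#include <stddef.h>

/* Sum an array of doubles: the sort of tight inner loop worth keeping native. */
double sum_doubles(const double *values, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += values[i];
    }
    return total;
}

From Python this could be loaded with ctypes.CDLL("./libsum.so") and called on a buffer of doubles; numerical libraries follow essentially the same idea, only with far more care.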

The data scientist using, for example, scikit-learn on her desktop is already calling heavily optimized native routines to do the number crunching. There is no point in writing new code for speed.

In the distributed "big data" context, you are more typically bottlenecked on data movement: network transfer and I/O. Native code does not help with that. What helps is not writing the same code to run faster, but writing smarter code.

Higher-level languages are going to let you implement more sophisticated distributed algorithms in a given amount of developer time than C/C++. At scale, the smarter algorithm with better data movement will beat dumb native code.

It's also usually true that developer time, and bugs, cost far more than new hardware. A year of a senior developer's time might be $200K fully loaded; that sum also rents hundreds of servers' worth of computation time for a year. In most cases it may simply not make sense to bother optimizing rather than throwing more hardware at the problem.

I also don't quite follow the question's premise that writing the code yourself guarantees performance, or that using libraries somehow rules it out.

As we all know, in the digital world there are many ways to do the same work / get the expected results.

And the responsibilities/risks that come with the code rest on the developers' shoulders.

Here is a small but, I think, very useful example from the .NET world.

Many .NET developers use the built-in BinaryReader/BinaryWriter for their data serialization, for performance and to get control over the process.

This is the C# source code of one of the overloaded Write methods of the framework's built-in BinaryWriter class:

// Writes a boolean to this stream. A single byte is written to the stream
// with the value 0 representing false or the value 1 representing true.
// 
public virtual void Write(bool value) 
{
    // _buffer is a byte array declared in the ctor / init code of the class
    _buffer[0] = (byte) (value ? 1 : 0);

    // OutStream is the stream instance that BinaryWriter writes the value(s) into
    OutStream.WriteByte(_buffer[0]);
}

As you can see, this method could have been written without the extra assignment to the _buffer field:

public virtual void Write(bool value) 
{
    OutStream.WriteByte((byte) (value ? 1 : 0));
}

Without that assignment we would gain a few milliseconds. A few milliseconds can be accepted as "almost nothing", but what if there are many thousands of writes (e.g. in a server process)?

Let's suppose that "a few" is 2 milliseconds and "many thousands" is only 2,000. That means 4 seconds more processing time, and a response that returns 4 seconds later.

Staying with .NET: if you check the source code of the BCL (the .NET Base Class Library) on MSDN, you can see a lot of performance lost to developer decisions.

At many points in the BCL source you will see that the developer decided to use while() or foreach() loops where a faster for() loop could have been used.

These small gains add up to the total performance.

And if we return to the BinaryWriter.Write() method:

Actually, the extra assignment to _buffer is not a developer mistake. It is precisely a decision to "stay safe"!

Suppose we decide not to use _buffer and implement the second method instead. If we try to send many thousands of bytes over a wire (e.g. uploading/downloading BLOB or CLOB data) with the second method, it can commonly fail because of a lost connection, since we try to send all the data without any checking or control mechanism. When the connection is lost, neither the server nor the client knows whether the sent data was complete.

If the developer decides to "stay safe", then that normally means performance costs, depending on the "stay safe" mechanism(s) implemented.

But if the developer decides to "take the risk, gain the performance", that is not a fault either, though there will always be some debate about "risky" coding.

And as a small note: commercial library developers always try to stay safe, because they can't know where their code will be used.

From a programmer's perspective, frameworks rarely target performance as the highest priority. If your library is going to be widely leveraged, the things people are likely to value most are ease of use, flexibility, and reliability.

Performance is generally a selling point only for secondary, competing libraries: "library X is better because it's faster." Even then, those libraries will very frequently trade the most optimal solution for one that can be widely leveraged.

By using any framework you are inherently taking a risk that a faster solution exists. I might go so far as to say that a faster solution almost always exists.

Writing something yourself is not a guarantee of performance, but if you know what you are doing and have a fairly limited set of requirements it can help.

An example might be JSON parsing. There are a hundred libraries out there, for a variety of languages, that will turn JSON into an object you can reference and vice versa. I know of one implementation that does it all in CPU registers. It's measurably faster than all other parsers, but it is also very limited, and that limitation will vary based on which CPU you are working with.

Is building a high-performance, environment-specific JSON parser a good idea? I would leverage a respected library 99 times out of 100. In that one remaining instance, a few extra CPU cycles multiplied by a million iterations would make the development time worth it.

Licensed under: CC-BY-SA with attribution