Question

Given a scenario where you have consecutive setters or a series of events where an object is modified, can it be more performant to instead write code in a way where a new state is returned rather than a field within an object modified due to how a compiler will interpret the code?

e.g.

obj0 = new_object()
obj1 = modify_object(obj0)
obj2 = modify_object(obj1)

versus

obj0 = new_object()
obj0.modify()
obj0.modify()

Solution

Modern high-performance garbage collectors are built on the assumption that most objects die young, and most objects are never mutated. Your functional version fits into that assumption: the objects are never mutated, and they can be collected after just one line.

Most modern high-performance runtimes will not do any heap allocation for the code you posted, except for the last object: thanks to escape analysis, the first two objects can be allocated on the stack or optimized away entirely.

Therefore, it is highly likely that memory allocation performance will at least not be worse for the functional version. (Azul's GC could easily deal with 20 GB/s of temporary garbage objects in a benchmark ten years ago. In 2020, I would expect that to be even better.)

Your code is nice straight-line code that will be easy to optimize. It might be easier for a compiler to optimize because of the lack of mutation.

I am not sure it will be faster, but it most likely will not be slower.

However, as always with questions of the type "which of these two pieces of code runs faster", the best answer is: run both of them and see.

OTHER TIPS

It would be misleading to make a general statement about such a performance comparison: it simply depends on too many factors. Moreover, modern optimisers are really good at analysing the flow of execution.

As a preliminary remark, functional style does not necessarily mean better performance. A recursive version of a function may be beautiful to write, but without optimization it might be less efficient than an iterative version because of the call overhead. Modern global optimisers are able to analyse what the code is really doing and in some cases will generate almost identical code for both variants.

But we can dare to make at least one general statement: the functional style, with immutability, absence of side effects, and function composition/chaining, makes it easier to identify opportunities to parallelize the code. But be aware that no magic is to be expected here either: some very sequential algorithms will stay sequential even if expressed in a functional style.

At a fundamental level, code that creates new copies of objects is going to require at least as many instructions as code that modifies existing objects, so it will generally not be faster in absolute terms. The exception is parallel algorithms where lock contention can make some amount of copying desirable. In the best case (as in your example), an optimizing compiler will produce the same machine code for both approaches.

In practice, immutable objects can make it easier for the programmer to reason about side effects (or absence thereof) in the code, and this can indirectly improve performance since you can freely share object instances without worrying about them accidentally getting modified. However, this depends very much on your problem domain: if you are dealing with large amounts of data that constantly need to be updated, you are likely to incur a performance penalty if you create new copies every time.

If we're talking about types that aren't trivially copyable, like persistent data structures, even my best attempts to profile and tune them leave the immutable, persistent versions 2-3x slower for writes and around 15-50% slower for reads compared to mutable alternatives, depending on the data structure. For example, our persistent version of std::vector in C++ (a random-access sequence) is about 2x slower for writes, about 15% slower for sequential reads, and about twice as slow for random-access reads with a collection of small types like 32-bit int or SPFP float. For trivially copyable types, like PODs, it's probably (though not certainly) going to make little difference. It's best to measure and see, but I usually wouldn't be too concerned in advance if it's just pushing and popping trivially copyable types to/from the stack, until there's a measured reason to be.

I could see some cases where a pure function might (just might) outperform one that accepts a mutable pointer/reference, at least with C and C++ compilers, mainly because optimizers are conservative with respect to pointer/reference aliasing (at least without hints like restrict). That's mainly for trivially copyable types, though. With non-inlined functions that take two or more pointer/reference parameters to trivially copyable types, value/copy semantics can sometimes produce much more efficient machine code, since the compiler can assume the parameters don't alias each other (without optimizer hints telling it they won't).

That said, in spite of the single-threaded overhead in the non-trivial cases (which is not a trivial overhead as it manifests in one thread), I was more than able to make up for it through the increased thread safety of functions that were now pure, by being able to safely throw more threads at the problem than I ever could before. For example, I once dealt with something similar to the classic game loop, where the scene updates had to occur before rendering in a serial fashion. Once I switched to a more functional style and made those functions pure (using persistent data structures for cheap copies of the sections that were modified to produce new versions), inputting state and outputting new state instead of mutating shared state, I was able to build a parallel pipeline where the renderer could render the current frame while the update function simultaneously computed the next frame, without either system waiting on the other. That translated to enormous improvements in framerates in spite of the aforementioned single-threaded overhead, and without the doubled memory usage that double-buffered game engines tend to need to accomplish the same thing.

I had a case where we transformed a mutable serial set of operations to a parallel pipeline, accepting the read/write overhead mentioned above, and our stress test scene jumped from ~18 FPS to ~45 FPS on a quad-core machine, and that was in spite of both the rendering system and update systems using parallel loops and multithreading of their own with Intel's TBB internally prior to the changes. It would have been really difficult to achieve that, and especially safely, if we were dealing with functions that mutated the scene graph in-place instead of functionally inputting a former scene and outputting a new one.

Licensed under: CC-BY-SA with attribution