Question

I recently interviewed at Amazon. During a coding session, the interviewer asked why I declared a variable in a method. I explained my process and he challenged me to solve the same problem with fewer variables. For example (this wasn't from the interview), I started with Method A then improved it to Method B, by removing int s. He was pleased and said this would reduce memory usage by this method.

I understand the logic behind it, but my question is:

When is it appropriate to use Method A vs. Method B, and vice versa?

You can see that Method A is going to have higher memory usage, since int s is declared, but it only has to perform one calculation, i.e. a + b. On the other hand, Method B has lower memory usage, but has to perform two calculations, i.e. a + b twice. When do I use one technique over the other? Or, is one of the techniques always preferred over the other? What are things to consider when evaluating the two methods?

Method A:

private bool IsSumInRange(int a, int b)
{
    int s = a + b;

    if (s > 1000 || s < -1000) return false;
    else return true;
}

Method B:

private bool IsSumInRange(int a, int b)
{
    if (a + b > 1000 || a + b < -1000) return false;
    else return true;
}

Solution

Instead of speculating about what may or may not happen, let's just look, shall we? I'll have to use C++ since I don't have a C# compiler handy (though see the C# example from VisualMelon), but I'm sure the same principles apply regardless.

We'll include the two alternatives you encountered in the interview. We'll also include a version that uses abs as suggested by some of the answers.

#include <cstdlib>

bool IsSumInRangeWithVar(int a, int b)
{
    int s = a + b;

    if (s > 1000 || s < -1000) return false;
    else return true;
}

bool IsSumInRangeWithoutVar(int a, int b)
{
    if (a + b > 1000 || a + b < -1000) return false;
    else return true;
}

bool IsSumInRangeSuperOptimized(int a, int b) {
    return (abs(a + b) <= 1000);
}

Now compile it with no optimization whatsoever: g++ -c -o test.o test.cpp

Now we can see precisely what this generates: objdump -d test.o

0000000000000000 <_Z19IsSumInRangeWithVarii>:
   0:   55                      push   %rbp              # begin a call frame
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d ec                mov    %edi,-0x14(%rbp)  # save first argument (a) on stack
   7:   89 75 e8                mov    %esi,-0x18(%rbp)  # save b on stack
   a:   8b 55 ec                mov    -0x14(%rbp),%edx  # load a into edx
   d:   8b 45 e8                mov    -0x18(%rbp),%eax  # load b into eax
  10:   01 d0                   add    %edx,%eax         # add a and b
  12:   89 45 fc                mov    %eax,-0x4(%rbp)   # save result as s on stack
  15:   81 7d fc e8 03 00 00    cmpl   $0x3e8,-0x4(%rbp) # compare s to 1000
  1c:   7f 09                   jg     27                # jump to 27 if it's greater
  1e:   81 7d fc 18 fc ff ff    cmpl   $0xfffffc18,-0x4(%rbp) # compare s to -1000
  25:   7d 07                   jge    2e                # jump to 2e if it's greater or equal
  27:   b8 00 00 00 00          mov    $0x0,%eax         # put 0 (false) in eax, which will be the return value
  2c:   eb 05                   jmp    33 <_Z19IsSumInRangeWithVarii+0x33>
  2e:   b8 01 00 00 00          mov    $0x1,%eax         # put 1 (true) in eax
  33:   5d                      pop    %rbp
  34:   c3                      retq

0000000000000035 <_Z22IsSumInRangeWithoutVarii>:
  35:   55                      push   %rbp
  36:   48 89 e5                mov    %rsp,%rbp
  39:   89 7d fc                mov    %edi,-0x4(%rbp)
  3c:   89 75 f8                mov    %esi,-0x8(%rbp)
  3f:   8b 55 fc                mov    -0x4(%rbp),%edx
  42:   8b 45 f8                mov    -0x8(%rbp),%eax  # same as before
  45:   01 d0                   add    %edx,%eax
  # note: unlike other implementation, result is not saved
  47:   3d e8 03 00 00          cmp    $0x3e8,%eax      # compare to 1000
  4c:   7f 0f                   jg     5d <_Z22IsSumInRangeWithoutVarii+0x28>
  4e:   8b 55 fc                mov    -0x4(%rbp),%edx  # since s wasn't saved, load a and b from the stack again
  51:   8b 45 f8                mov    -0x8(%rbp),%eax
  54:   01 d0                   add    %edx,%eax
  56:   3d 18 fc ff ff          cmp    $0xfffffc18,%eax # compare to -1000
  5b:   7d 07                   jge    64 <_Z22IsSumInRangeWithoutVarii+0x2f>
  5d:   b8 00 00 00 00          mov    $0x0,%eax
  62:   eb 05                   jmp    69 <_Z22IsSumInRangeWithoutVarii+0x34>
  64:   b8 01 00 00 00          mov    $0x1,%eax
  69:   5d                      pop    %rbp
  6a:   c3                      retq

000000000000006b <_Z26IsSumInRangeSuperOptimizedii>:
  6b:   55                      push   %rbp
  6c:   48 89 e5                mov    %rsp,%rbp
  6f:   89 7d fc                mov    %edi,-0x4(%rbp)
  72:   89 75 f8                mov    %esi,-0x8(%rbp)
  75:   8b 55 fc                mov    -0x4(%rbp),%edx
  78:   8b 45 f8                mov    -0x8(%rbp),%eax
  7b:   01 d0                   add    %edx,%eax
  7d:   3d 18 fc ff ff          cmp    $0xfffffc18,%eax
  82:   7c 16                   jl     9a <_Z26IsSumInRangeSuperOptimizedii+0x2f>
  84:   8b 55 fc                mov    -0x4(%rbp),%edx
  87:   8b 45 f8                mov    -0x8(%rbp),%eax
  8a:   01 d0                   add    %edx,%eax
  8c:   3d e8 03 00 00          cmp    $0x3e8,%eax
  91:   7f 07                   jg     9a <_Z26IsSumInRangeSuperOptimizedii+0x2f>
  93:   b8 01 00 00 00          mov    $0x1,%eax
  98:   eb 05                   jmp    9f <_Z26IsSumInRangeSuperOptimizedii+0x34>
  9a:   b8 00 00 00 00          mov    $0x0,%eax
  9f:   5d                      pop    %rbp
  a0:   c3                      retq

We can see from the stack addresses (for example, the -0x4 in mov %edi,-0x4(%rbp) versus the -0x14 in mov %edi,-0x14(%rbp)) that IsSumInRangeWithVar() uses 16 extra bytes on the stack.

Because IsSumInRangeWithoutVar() allocates no space on the stack to store the intermediate value s it has to recalculate it, resulting in this implementation being 2 instructions longer.

Funny, IsSumInRangeSuperOptimized() looks a lot like IsSumInRangeWithoutVar(), except it compares to -1000 first, and 1000 second.

Now let's compile with only the most basic optimizations: g++ -O1 -c -o test.o test.cpp. The result:

0000000000000000 <_Z19IsSumInRangeWithVarii>:
   0:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
   7:   3d d0 07 00 00          cmp    $0x7d0,%eax
   c:   0f 96 c0                setbe  %al
   f:   c3                      retq

0000000000000010 <_Z22IsSumInRangeWithoutVarii>:
  10:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
  17:   3d d0 07 00 00          cmp    $0x7d0,%eax
  1c:   0f 96 c0                setbe  %al
  1f:   c3                      retq

0000000000000020 <_Z26IsSumInRangeSuperOptimizedii>:
  20:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
  27:   3d d0 07 00 00          cmp    $0x7d0,%eax
  2c:   0f 96 c0                setbe  %al
  2f:   c3                      retq

Would you look at that: each variant is identical. The compiler is able to do something quite clever: abs(a + b) <= 1000 is equivalent to a + b + 1000 <= 2000 when the comparison is unsigned, as the one behind setbe is, because a negative value of a + b + 1000 is then treated as a very large positive number. The lea instruction can actually perform all these additions in one instruction, and eliminate all the conditional branches.
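The range trick can be checked by hand. This is a minimal Python sketch, not the compiler's output: the function names and the MASK32 constant are assumptions made for illustration, and the mask only emulates a 32-bit register for inputs whose sum fits in 32 bits.

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit register (assumption for this sketch)

def in_range_two_compares(a, b):
    """Straightforward form: two signed comparisons."""
    s = a + b
    return -1000 <= s <= 1000

def in_range_unsigned_trick(a, b):
    """One unsigned comparison: shift the range to [0, 2000], then wrap to 32 bits.

    A sum below -1000 is still negative after adding 1000, so it wraps to a
    very large unsigned value and fails the <= 2000 test.
    """
    shifted = (a + b + 1000) & MASK32
    return shifted <= 2000

# Both forms agree on values around the two boundaries.
for a, b in [(0, 0), (500, 500), (500, 501), (-500, -500), (-500, -501), (42, 42)]:
    assert in_range_two_compares(a, b) == in_range_unsigned_trick(a, b)
```

One comparison replaces two, which is why the optimized assembly needs no conditional branch.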

To answer your question, almost always the thing to optimize for is not memory or speed, but readability. Reading code is a lot harder than writing it, and reading code that's been mangled to "optimize" it is a lot harder than reading code that's been written to be clear. More often than not, these "optimizations" have negligible, or as in this case exactly zero actual impact on performance.


Follow-up question: what changes when this code is in an interpreted language instead of a compiled one? Does the optimization matter then, or does it have the same result?

Let's measure! I've transcribed the examples to Python:

def IsSumInRangeWithVar(a, b):
    s = a + b
    if s > 1000 or s < -1000:
        return False
    else:
        return True

def IsSumInRangeWithoutVar(a, b):
    if a + b > 1000 or a + b < -1000:
        return False
    else:
        return True

def IsSumInRangeSuperOptimized(a, b):
    return abs(a + b) <= 1000

from dis import dis
print('IsSumInRangeWithVar')
dis(IsSumInRangeWithVar)

print('\nIsSumInRangeWithoutVar')
dis(IsSumInRangeWithoutVar)

print('\nIsSumInRangeSuperOptimized')
dis(IsSumInRangeSuperOptimized)

print('\nBenchmarking')
import timeit
print('IsSumInRangeWithVar: %fs' % (min(timeit.repeat(lambda: IsSumInRangeWithVar(42, 42), repeat=50, number=100000)),))
print('IsSumInRangeWithoutVar: %fs' % (min(timeit.repeat(lambda: IsSumInRangeWithoutVar(42, 42), repeat=50, number=100000)),))
print('IsSumInRangeSuperOptimized: %fs' % (min(timeit.repeat(lambda: IsSumInRangeSuperOptimized(42, 42), repeat=50, number=100000)),))

Run with Python 3.5.2, this produces the output:

IsSumInRangeWithVar
  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 STORE_FAST               2 (s)

  3          10 LOAD_FAST                2 (s)
             13 LOAD_CONST               1 (1000)
             16 COMPARE_OP               4 (>)
             19 POP_JUMP_IF_TRUE        34
             22 LOAD_FAST                2 (s)
             25 LOAD_CONST               4 (-1000)
             28 COMPARE_OP               0 (<)
             31 POP_JUMP_IF_FALSE       38

  4     >>   34 LOAD_CONST               2 (False)
             37 RETURN_VALUE

  6     >>   38 LOAD_CONST               3 (True)
             41 RETURN_VALUE
             42 LOAD_CONST               0 (None)
             45 RETURN_VALUE

IsSumInRangeWithoutVar
  9           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 LOAD_CONST               1 (1000)
             10 COMPARE_OP               4 (>)
             13 POP_JUMP_IF_TRUE        32
             16 LOAD_FAST                0 (a)
             19 LOAD_FAST                1 (b)
             22 BINARY_ADD
             23 LOAD_CONST               4 (-1000)
             26 COMPARE_OP               0 (<)
             29 POP_JUMP_IF_FALSE       36

 10     >>   32 LOAD_CONST               2 (False)
             35 RETURN_VALUE

 12     >>   36 LOAD_CONST               3 (True)
             39 RETURN_VALUE
             40 LOAD_CONST               0 (None)
             43 RETURN_VALUE

IsSumInRangeSuperOptimized
 15           0 LOAD_GLOBAL              0 (abs)
              3 LOAD_FAST                0 (a)
              6 LOAD_FAST                1 (b)
              9 BINARY_ADD
             10 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             13 LOAD_CONST               1 (1000)
             16 COMPARE_OP               1 (<=)
             19 RETURN_VALUE

Benchmarking
IsSumInRangeWithVar: 0.019361s
IsSumInRangeWithoutVar: 0.020917s
IsSumInRangeSuperOptimized: 0.020171s

Disassembly in Python isn't terribly interesting, since the bytecode "compiler" doesn't do much in the way of optimization.

The performance of the three functions is nearly identical. We might be tempted to go with IsSumInRangeWithVar() due to its marginal speed gain. I'll add, though, that as I was trying different parameters to timeit, sometimes IsSumInRangeSuperOptimized() came out fastest, so I suspect external factors are responsible for the difference, rather than any intrinsic advantage of any implementation.

If this is really performance critical code, an interpreted language is simply a very poor choice. Running the same program with pypy, I get:

IsSumInRangeWithVar: 0.000180s
IsSumInRangeWithoutVar: 0.001175s
IsSumInRangeSuperOptimized: 0.001306s

Just using pypy, which uses JIT compilation to eliminate a lot of the interpreter overhead, has yielded a performance improvement of 1 or 2 orders of magnitude. I was quite shocked to see IsSumInRangeWithVar() is an order of magnitude faster than the others. So I changed the order of the benchmarks and ran again:

IsSumInRangeSuperOptimized: 0.000191s
IsSumInRangeWithoutVar: 0.001174s
IsSumInRangeWithVar: 0.001265s

So it seems it's not actually anything about the implementation that makes it fast, but rather the order in which I do the benchmarking!
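One way to keep a fixed ordering from dominating the numbers is to shuffle the order on every round and keep the best time per function. The following is only a sketch under that assumption: benchmark_shuffled and the snake_case function names are made up for illustration, and it does not explain why the order mattered in the run above.

```python
import random
import timeit

def is_sum_in_range_with_var(a, b):
    s = a + b
    return not (s > 1000 or s < -1000)

def is_sum_in_range_without_var(a, b):
    return not (a + b > 1000 or a + b < -1000)

def benchmark_shuffled(funcs, rounds=5, number=10000):
    """Return the best time per function, using a fresh random order each round."""
    best = {f.__name__: float("inf") for f in funcs}
    for _ in range(rounds):
        order = list(funcs)
        random.shuffle(order)  # no function is always measured first
        for f in order:
            elapsed = timeit.timeit(lambda: f(42, 42), number=number)
            best[f.__name__] = min(best[f.__name__], elapsed)
    return best

results = benchmark_shuffled([is_sum_in_range_with_var, is_sum_in_range_without_var])
for name, seconds in sorted(results.items()):
    print("%s: %fs" % (name, seconds))
```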

I'd love to dig into this more deeply, because honestly I don't know why this happens. But I believe the point has been made: micro-optimizations like whether to declare an intermediate value as a variable or not are rarely relevant. With an interpreted language or highly optimized compiler, the first objective is still to write clear code.

If further optimization might be required, benchmark. Remember that the best optimizations come not from the little details but the bigger algorithmic picture: pypy is going to be an order of magnitude faster for repeated evaluation of the same function than cpython because it uses faster algorithms (JIT compiler vs interpretation) to evaluate the program. And there's the coded algorithm to consider as well: a search through a B-tree will be faster than a linked list.
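To make the algorithmic point concrete, here is a rough sketch that counts steps instead of timing anything. It swaps in simpler structures than the ones named above: a front-to-back scan stands in for walking a linked list, and a binary search over a sorted list stands in for a B-tree lookup. The function names are hypothetical.

```python
def linear_search_steps(sorted_values, target):
    """Scan from the front, as when walking a linked list; return (found, steps)."""
    steps = 0
    for value in sorted_values:
        steps += 1
        if value == target:
            return True, steps
    return False, steps

def binary_search_steps(sorted_values, target):
    """Halve the search interval on every step; return (found, steps)."""
    lo, hi, steps = 0, len(sorted_values), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_values) and sorted_values[lo] == target
    return found, steps

values = list(range(0, 200000, 2))  # 100,000 sorted even numbers
target = values[-1]                 # worst case for the linear scan
print(linear_search_steps(values, target))
print(binary_search_steps(values, target))
```

The gap between the two step counts grows with the size of the data, which no amount of variable shuffling inside either function can make up for.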

After ensuring you're using the right tools and algorithms for the job, be prepared to dive deep into the details of the system. The results can be very surprising, even for experienced developers, and this is why you must have a benchmark to quantify the changes.

OTHER TIPS

To answer the stated question:

When to optimize for memory vs performance speed for a method?

There are two things you have to establish:

  • What is limiting your application?
  • Where can I reclaim the most of that resource?

In order to answer the first question, you have to know what the performance requirements for your application are. If there are no performance requirements then there is no reason to optimize one way or the other. The performance requirements help you to get to the place of "good enough".

The method you provided wouldn't cause any performance issues on its own, but within a loop that processes a large amount of data, you may have to start thinking a little differently about how you are approaching the problem.

Detecting what is limiting the application

Start looking at the behavior of your application with a performance monitor. Keep an eye on CPU, disk, network, and memory usage while it's running. One or more items will be maxed out while everything else is moderately used (unless you hit the perfect balance, but that almost never happens).

When you need to look deeper, typically you would use a profiler. There are memory profilers and process profilers, and they measure different things. The act of profiling does have a significant performance impact, but you are instrumenting your code to find out what's wrong.

Let's say you see your CPU and disk usage peaked. You would first check for "hot spots", that is, code that either is called more often than the rest or takes a significantly larger percentage of the processing time.
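As an illustration of hunting for a hot spot with a process profiler, here is a minimal sketch using Python's standard cProfile and pstats modules. The workload function count_pairs_in_range is hypothetical and exists only so the profiler has something to measure.

```python
import cProfile
import io
import pstats

def is_sum_in_range(a, b):
    s = a + b
    return -1000 <= s <= 1000

def count_pairs_in_range(values):
    """Hypothetical workload: calls is_sum_in_range for every pair of values."""
    count = 0
    for a in values:
        for b in values:
            if is_sum_in_range(a, b):
                count += 1
    return count

profiler = cProfile.Profile()
profiler.enable()
total = count_pairs_in_range(range(-300, 300))
profiler.disable()

# Print the five entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The call counts in the report show that the small function runs once per pair, so the thing to reconsider is the nested loop that calls it, not the local variable inside it.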

If you can't find any hot spots, you would then start looking at memory. Perhaps you are creating more objects than necessary and your garbage collection is working overtime.

Reclaiming performance

Think critically. The following list of changes is in order of how much return on investment you'll get:

  • Architecture: look for communication choke points
  • Algorithm: the way you process data might need to change
  • Hot spots: minimizing how often you call the hot spot can yield a big bonus
  • Micro optimizations: it's not common, but sometimes you really do need to think of minor tweaks (like the example you provided), particularly if it is a hot spot in your code.

In situations like this, you have to apply the scientific method. Come up with a hypothesis, make the changes, and test it. If you meet your performance goals, you're done. If not, go to the next thing in the list.


Answering the question in bold:

When is it appropriate to use Method A vs. Method B, and vice versa?

Honestly, this is the last step in trying to deal with performance or memory problems. The impact of Method A vs. Method B will be really different depending on the language and platform (in some cases).

Just about any compiled language with a halfway decent optimizer will generate similar code with either of those structures. However, those assumptions don't necessarily remain true in proprietary and toy languages that don't have an optimizer.

Precisely which will have a better impact depends on whether sum is a stack variable or a heap variable. This is a language implementation choice. In C, C++ and Java for example, number primitives like an int are stack variables by default. Your code has no more memory impact by assigning to a stack variable than you would have with fully inlined code.

Other optimizations that you might find in C libraries (particularly older ones), such as deciding whether to copy a 2-dimensional array down first or across first, are platform-dependent. They require some knowledge of how the chipset you are targeting best optimizes memory access. There are subtle differences between architectures.

Bottom line is that optimization is a combination of art and science. It requires some critical thinking, as well as a degree of flexibility in how you approach the problem. Look for big things before you blame small things.

"this would reduce memory" - em, no. Even if this were true (which, for any decent compiler, it is not), the difference would most probably be negligible for any real world situation.

However, I would recommend using method A* (method A with a slight change):

private bool IsSumInRange(int a, int b)
{
    int sum = a + b;

    if (sum > 1000 || sum < -1000) return false;
    else return true;
    // (yes, the former statement could be cleaned up to
    // return abs(sum)<=1000;
    // but let's ignore this for a moment)
}

but for two completely different reasons:

  • by giving the variable s a descriptive name, the code becomes clearer

  • it avoids having the same summation logic twice in the code, so the code becomes more DRY, which means less error-prone when changes are made.

You can do better than both of those with

return (abs(a + b) > 1000);

Most processors (and hence compilers) can do abs() in a single operation. You not only have fewer sums, but also fewer comparisons, which are generally more computationally expensive. It also removes the branching, which is much worse on most processors because it stops pipelining being possible.

The interviewer, as other answers have said, is plant life and has no business conducting a technical interview.

That said, his question is valid. And the answer to when you optimise and how, is when you've proved it's necessary, and you've profiled it to prove exactly which parts need it. Knuth famously said that premature optimisation is the root of all evil, because it's too easy to try to gold-plate unimportant sections, or make changes (like your interviewer's) which have no effect, whilst missing the places which really do need it. Until you've got hard proof it's really necessary, clarity of code is the more important target.

Edit: FabioTurati correctly points out that this is the opposite logic sense to the original (my mistake!), and that this illustrates a further impact from Knuth's quote, where we risk breaking the code while we're trying to optimise it.

When is it appropriate to use Method A vs. Method B, and vice versa?

Hardware is cheap; programmers are expensive. So the cost of the time you two wasted on this question is probably far worse than either answer.

Regardless, most modern compilers would find a way to optimize the local variable into a register (instead of allocating stack space), so the methods are probably identical in terms of executable code. For this reason, most developers would pick the option that communicates the intention most clearly (see Writing really obvious code (ROC)). In my opinion, that would be Method A.

On the other hand, if this is purely an academic exercise, you can have the best of both worlds with Method C:

private bool IsSumInRange(int a, int b)
{
    a += b;
    return (a >= -1000 && a <= 1000);
}

I would optimize for readability. Method X:

private bool IsSumInRange(int number1, int number2)
{
    return IsValueInRange(number1+number2, -1000, 1000);
}

private bool IsValueInRange(int Value, int Lowerbound, int Upperbound)
{
    return  (Value >= Lowerbound && Value <= Upperbound);
}

Small methods that do just 1 thing but are easy to reason about.

(This is personal preference, I like positive testing instead of negative, your original code is actually testing whether the value is NOT outside the range.)

In short, I don't think the question has much relevance in current computing, but from a historical perspective it's an interesting thought exercise.

Your interviewer is likely a fan of The Mythical Man-Month. In the book, Fred Brooks makes the case that programmers will generally need two versions of key functions in their toolbox: a memory-optimized version and a cpu-optimized version. Fred based this on his experience leading the development of the IBM System/360 operating system, where machines might have as little as 8 kilobytes of RAM. In such machines, memory required for local variables in functions could potentially be important, especially if the compiler did not effectively optimize them away (or if code was written in assembly language directly).

In the current era, I think you would be hard pressed to find a system where the presence or absence of a local variable in a method would make a noticeable difference. For a variable to matter, the method would need to be recursive with deep recursion expected. Even then, it's likely that the stack depth would be exceeded, causing Stack Overflow exceptions, before the variable itself caused an issue. The only real scenario where it may be an issue is with very large arrays allocated on the stack in a recursive method. But that is also unlikely, as I think most developers would think twice about unnecessary copies of large arrays.

After the assignment s = a + b; the variables a and b are not used anymore. Therefore, no memory is used for s if you are not using a completely brain-damaged compiler; memory that was used anyway for a and b is re-used.

But optimising this function is utter nonsense. If you could save space, it would be maybe 8 bytes while the function is running (which is recovered when the function returns), so absolutely pointless. If you could save time, it would be single numbers of nanoseconds. Optimising this is a total waste of time.

Local value type variables are allocated on the stack or (more likely for such small pieces of code) use registers in the processor and never get to see any RAM. Either way they are short lived and nothing to worry about. You start considering memory use when you need to buffer or queue data elements in collections that are both potentially large and long lived.

Then it depends what you care about most for your application. Processing speed? Response time? Memory footprint? Maintainability? Consistency in design? All up to you.

As other answers have said, you need to think what you're optimising for.

In this example, I suspect that any decent compiler would generate equivalent code for both methods, so the decision would have no effect on the run time or memory!

What it does affect is the readability of the code.  (Code is for humans to read, not just computers.)  There's not too much difference between the two examples; when all other things are equal, I consider brevity to be a virtue, so I'd probably pick Method B.  But all other things are rarely equal, and in a more complex real-world case, it could have a big effect.

Things to consider:

  • Does the intermediate expression have any side-effects?  If it calls any impure functions or updates any variables, then of course duplicating it would be a matter of correctness, not just style.
  • How complex is the intermediate expression?  If it does lots of calculations and/or calls functions, then the compiler may not be able to optimise it, and so this would affect performance.  (Though, as Knuth said, “We should forget about small efficiencies, say about 97% of the time”.)
  • Does the intermediate variable have any meaning?  Could it be given a name that helps to explain what's going on?  A short but informative name could explain the code better, while a meaningless one is just visual noise.
  • How long is the intermediate expression?  If long, then duplicating it could make the code longer and harder to read (especially if it forces a line break); if not, the duplication could be shorter over all.
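The first consideration in the list, side effects, is easy to demonstrate. The sketch below uses a hypothetical make_sensor helper that hands out one reading per call; it is an assumption for illustration, standing in for any impure expression.

```python
def make_sensor(readings):
    """Hypothetical impure source: each call consumes and returns the next reading."""
    it = iter(readings)
    return lambda: next(it)

def in_range_with_var(read):
    s = read()  # evaluated exactly once
    return not (s > 1000 or s < -1000)

def in_range_duplicated(read):
    # read() runs a second time whenever the first test is false,
    # so the two comparisons may look at different values.
    return not (read() > 1000 or read() < -1000)

with_var = in_range_with_var(make_sensor([-2000, 0]))
duplicated = in_range_duplicated(make_sensor([-2000, 0]))
print(with_var, duplicated)
```

With the variable, the reading -2000 is correctly reported as out of range. With the duplicated expression, the second comparison sees the next reading (0), and the function reports the value as in range.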

As many of the answers have pointed out, attempting to tune this function with modern compilers won't make any difference. An optimizer can most likely figure out the best solution (up-vote to the answer that showed the assembler code to prove it!). You stated that the code in the interview was not exactly the code you were asked to compare, so perhaps the actual example makes a bit more sense.

But let's take another look at this question: this is an interview question. So the real issue is, how should you answer it assuming that you want to try and get the job?

Let's also assume that the interviewer does know what they are talking about and they are just trying to see what you know.

I would mention that, ignoring the optimizer, the first may create a temporary variable on the stack whereas the second wouldn't, but would perform the calculation twice. Therefore, the first uses more memory but is faster.

You could mention that, in any case, a calculation may require a temporary variable to store the result (so that it can be compared), so whether you name that variable or not might not make any difference.

I would then mention that in reality the code would be optimized and most likely equivalent machine code would be generated since all the variables are local. However, it does depend on what compiler you are using (it was not that long ago that I could get a useful performance improvement by declaring a local variable as "final" in Java).

You could mention that the stack in any case lives in its own memory page, so unless your extra variable caused the stack to overflow the page, it won't in reality allocate any more memory. If it does overflow it will want a whole new page though.

I would mention that a more realistic example might be the choice of whether to use a cache to hold the results of many computations or not and this would raise a question of cpu vs memory.
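A cache is the textbook case of spending memory to save CPU. As a small sketch, assuming Python's functools.lru_cache and a made-up expensive_score function standing in for a costly computation:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)  # the bound is arbitrary; it caps how much memory is traded away
def expensive_score(n):
    """Hypothetical stand-in for a costly computation."""
    return sum(i * i for i in range(n))

first = expensive_score(10000)
second = expensive_score(10000)  # served from the cache, not recomputed
info = expensive_score.cache_info()
print(info.hits, info.misses, info.currsize)
```

The maxsize argument is where the trade-off is made explicit: a larger cache avoids more recomputation but holds more results in memory.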

All this demonstrates that you know what you are talking about.

I would leave it to the end to say that it would be better to focus on readability instead. Although true in this case, in the interview context it may be interpreted as "I don't know about performance but my code reads like a Janet and John story".

What you should not do is trot out the usual bland statements about how code optimization is not necessary, don't optimize until you have profiled the code (this just indicates you can't see bad code for yourself), hardware costs less than programmers, and please, please, don't quote Knuth "premature blah blah ...".

Code performance is a genuine issue in a great many organisations and many organisations need programmers who understand it.

In particular, with organisations such as Amazon, some of the code has huge leverage. A code snippet may be deployed on thousands of servers or millions of devices and may be called billions of times a day every day of the year. There may be thousands of similar snippets. The difference between a bad algorithm and a good one can easily be a factor of a thousand. Do the numbers and multiply all this up: it makes a difference. The potential cost to the organisation of non-performing code can be very significant or even fatal if a system runs out of capacity.

Furthermore, many of these organisations work in a competitive environment. So you cannot just tell your customers to buy a bigger computer if your competitor's software already works ok on the hardware that they have or if the software runs on a mobile handset and it can't be upgraded. Some applications are particularly performance critical (games and mobile apps come to mind) and may live or die according to their responsiveness or speed.

I have personally, over two decades, worked on many projects where systems have failed or been unusable due to performance issues and I have been called in to optimize those systems, and in all cases it has been due to bad code written by programmers who didn't understand the impact of what they were writing. Furthermore, it is never one piece of code; it is always everywhere. When I turn up, it is way too late to start thinking about performance: the damage has been done.

Understanding code performance is a good skill to have in the same way as understanding code correctness and code style. It comes out of practice. Performance failures can be as bad as functional failures. If the system doesn't work, it doesn't work. Doesn't matter why. Similarly, performance and features that are never used are both bad.

So, if the interviewer asks you about performance, I would recommend trying to demonstrate as much knowledge as possible. If the question seems a bad one, politely point out why you think it would not be an issue in that case. Don't quote Knuth.

You should first optimize for correctness.

Your function fails for input values that are close to int.MaxValue:

int a = int.MaxValue - 200;
int b = int.MaxValue - 200;
bool inRange = test.IsSumInRangeA(a, b);

This returns true because the sum overflows to -402. The function also doesn't work for a = b = int.MinValue + 200 (the sum incorrectly wraps around to 400).
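Python integers do not overflow, so to reproduce the effect outside C# the wraparound has to be emulated. The sketch below does that under the assumption of unchecked 32-bit two's-complement arithmetic; wrap_int32 and is_sum_in_range_int32 are illustrative names, not part of the original code.

```python
INT_MAX = 2**31 - 1
INT_MIN = -2**31

def wrap_int32(value):
    """Reduce a Python int to the value a 32-bit two's-complement int would hold."""
    return (value - INT_MIN) % 2**32 + INT_MIN

def is_sum_in_range_int32(a, b):
    """Method A as it behaves with unchecked 32-bit arithmetic."""
    s = wrap_int32(a + b)
    return not (s > 1000 or s < -1000)

# Two large positive inputs: the wrapped sum lands inside the range.
print(wrap_int32((INT_MAX - 200) + (INT_MAX - 200)))
print(is_sum_in_range_int32(INT_MAX - 200, INT_MAX - 200))

# Two large negative inputs wrap around in the other direction.
print(wrap_int32((INT_MIN + 200) + (INT_MIN + 200)))
print(is_sum_in_range_int32(INT_MIN + 200, INT_MIN + 200))
```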

We won't know what the interviewer was looking for unless he or she chimes in, but "overflow is real".

In an interview situation, ask questions to clarify the scope of the problem: What are the allowed maximum and minimum input values? Once you have those, you can throw an exception if the caller submits values outside of the range. Or (in C#), you can use a checked {} section, which would throw an exception on overflow. Yes, it's more work and more complicated, but sometimes that's what it takes.
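The "throw instead of silently wrapping" idea can be sketched in Python as well. This only emulates the behaviour of a checked addition under an assumed 32-bit range; checked_add_int32 is a hypothetical helper, not a C# or Python library function.

```python
INT_MAX = 2**31 - 1
INT_MIN = -2**31

def checked_add_int32(a, b):
    """Add two values, raising instead of wrapping if the result leaves the 32-bit range."""
    result = a + b
    if result > INT_MAX or result < INT_MIN:
        raise OverflowError("sum %d does not fit in a 32-bit int" % result)
    return result

def is_sum_in_range_checked(a, b):
    s = checked_add_int32(a, b)
    return -1000 <= s <= 1000

try:
    is_sum_in_range_checked(INT_MAX - 200, INT_MAX - 200)
    outcome = "no error"
except OverflowError as exc:
    outcome = "rejected: %s" % exc
print(outcome)
```

The caller now gets an error for inputs the function cannot handle, rather than a plausible-looking wrong answer.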

Your question should have been: "Do I need to optimize this at all?".

Version A and B differ in one important detail that makes A preferable, but it is unrelated to optimization: You do not repeat code.

The actual "optimization" is called common subexpression elimination, which is what pretty much every compiler does. Some do this basic optimization even when optimizations are turned off. So that isn't truly an optimization (the generated code will almost certainly be exactly the same in every case).

But if it isn't an optimization, then why is it preferable? Alright, you don't repeat code, who cares!

Well first of all, you do not have the risk of accidentally getting half of the conditional clause wrong. But more importantly, someone reading this code can grok immediately what you're trying to do, instead of an if((((wtf||is||this||longexpression)))) experience. What the reader gets to see is if(one || theother), which is a good thing. Not rarely, it happens that you are that other person reading your own code three years later and thinking "WTF does this mean?". In that case it's always helpful if your code immediately communicates what the intent was. With a common subexpression being named properly, that's the case.
Also, if at any time in the future, you decide that e.g. you need to change a+b to a-b, you have to change one location, not two. And there's no risk of (again) getting the second one wrong by accident.

About your actual question, what you should optimize for: first of all, your code should be correct. This is absolutely the most important thing. Code that isn't correct is bad code, even more so if despite being incorrect it "works fine", or at least looks like it works fine. After that, code should be readable (readable by someone unfamiliar with it).
As for optimizing... one certainly shouldn't deliberately write anti-optimized code, and certainly I'm not saying you shouldn't spend a thought on the design before you start out (such as choosing the right algorithm for the problem, not the least efficient one).

But for most applications, most of the time, the performance that you get after running correct, readable code using a reasonable algorithm through an optimizing compiler is just fine, there's no real need to worry.

If that isn't the case, i.e. if the application's performance indeed doesn't meet the requirements, and only then, you should worry about doing such local optimizations as the one you attempted. Preferably, though, you would reconsider the top-level algorithm. If you call a function 500 times instead of 50,000 times because of a better algorithm, this has a larger impact than saving three clock cycles on a micro-optimization. If you don't stall for several hundred cycles on a random memory access all the time, this has a larger impact than doing a few cheap calculations extra, etc.

Optimization is a difficult matter (you can write entire books about it and never reach the end), and spending time on blindly optimizing some particular spot (without even knowing whether that's the bottleneck at all!) is usually wasted time. Without profiling, optimization is very hard to get right.

But as a rule of thumb, when you're flying blind and just need/want to do something, or as a general default strategy, I would suggest to optimize for "memory".
Optimizing for "memory" (in particular spatial locality and access patterns) usually yields a benefit because, unlike once upon a time when everything was "kinda the same", nowadays accessing RAM is among the most expensive things (short of reading from disk!) that you can in principle do. ALU operations, on the other hand, are cheap and getting faster every week. Memory bandwidth and latency don't improve nearly as fast. Good locality and good access patterns can easily make a 5x difference (20x in extreme, contrived examples) in runtime compared to bad access patterns in data-heavy applications. Be nice to your caches, and you will be a happy person.

To put the previous paragraph into perspective, consider what the different things that you can do cost you. Executing something like a+b takes (if not optimized out) one or two cycles, but the CPU can usually start several instructions per cycle, and can pipeline non-dependent instructions so more realistically it only costs you something around half a cycle or less. Ideally, if the compiler is good at scheduling, and depending on the situation, it might cost zero.
Fetching data ("memory") costs you either 4-5 cycles if you're lucky and it's in L1, or around 15 cycles if you are not so lucky (L2 hit). If the data isn't in the cache at all, it takes several hundred cycles. If your haphazard access pattern exceeds the TLB's capabilities (easy to do with only ~50 entries), add another few hundred cycles. If your haphazard access pattern actually causes a page fault, it costs you a few tens of thousands of cycles in the best case, and several million in the worst.
Now think about it, what's the thing you want to avoid most urgently?
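To make the access-pattern point concrete, here is a small C++ sketch with two functions that perform identical arithmetic on the same data and differ only in traversal order. The function names and the flat row-by-row layout are assumptions made for illustration. How large the difference is on a given machine has to be measured; the sketch only shows which loop order walks memory sequentially.

```cpp
#include <cstddef>
#include <vector>

// The matrix is stored row by row in one contiguous vector (rows * cols elements).

// Inner loop visits adjacent elements, so every cache line fetched is fully used.
long long SumRowMajor(const std::vector<int>& m, std::size_t rows, std::size_t cols)
{
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

// Same additions, but the inner loop strides by cols elements. For a large
// matrix, consecutive accesses tend to land on different cache lines.
long long SumColumnMajor(const std::vector<int>& m, std::size_t rows, std::size_t cols)
{
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

Both functions return the same sum and execute the same number of additions; only the order in which memory is touched changes.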

When to optimize for memory vs performance speed for a method?

After getting the functionality right first. Then selectively concern oneself with micro-optimizations.


As an interview question regarding optimizations, the code does provoke the usual discussion, yet it misses the higher-level goal: is the code functionally correct?

C, C++, and other languages treat int overflow in a + b as a problem. The result is not well defined; C calls it undefined behavior. It is not specified to "wrap", even though that is the common behavior.

bool IsSumInRange(int a, int b) {
    int s = a + b;  // Overflow possible
    if (s > 1000 || s < -1000) return false;
    else return true;
}

Such a function called IsSumInRange() would be expected to be well defined and perform correctly for all int values of a,b. The raw a + b is not. A C solution could use:

#include <limits.h>   /* INT_MAX, INT_MIN */
#include <stdbool.h>  /* bool */

#define N 1000
bool IsSumInRange_FullRange(int a, int b) {
  if (a >= 0) {
    if (b > INT_MAX - a) return false;
  } else {
    if (b < INT_MIN - a) return false;
  }
  int sum = a + b;
  if (sum > N || sum < -N) return false;
  else return true;
}

The above code could be optimized by using a wider integer type than int, if available, as below or distributing the sum > N, sum < -N tests within the if (a >= 0) logic. Yet such optimizations may not truly lead to "faster" emitted code given a smart compiler nor be worth the extra maintenance of being clever.

  long long sum = a;
  sum += b;

Even using abs(sum) is prone to problems when sum == INT_MIN, because -INT_MIN is not representable as an int.

What kind of compilers are we talking about, and what sort of "memory"? Because in your example, assuming a reasonable optimizer, the expression a+b needs to generally be stored in a register (a form of memory) prior to doing such arithmetic.

So if we're talking about a dumb compiler that encounters a+b twice, it's going to allocate more registers (memory) in your second example, because your first example might just store that expression once in a single register mapped to the local variable, but we're talking about very silly compilers at this point... unless you're working with another type of silly compiler that stack spills every single variable all over the place, in which case maybe the first one would cause it more grief to optimize than the second*.

I still want to scratch that and think the second one is likely to use more memory with a dumb compiler even if it's prone to stack spills, because it might end up allocating three registers for a+b and spill a and b more. If we're talking about the most primitive optimizer, then capturing a+b in s will probably "help" it use fewer registers/stack spills.

This is all extremely speculative in rather silly ways absent measurements/disassembly, and even in the worst-case scenarios this is not a "memory vs. performance" case (because even among the worst optimizers I can think of, we're not talking about anything but temporary memory like stack/register). It's purely a "performance" case at best, and with any reasonable optimizer the two are equivalent. And if one is not using a reasonable optimizer, why obsess about optimization so microscopic in nature, especially absent measurements? That's an assembly-level focus on instruction selection and register allocation, which I would never expect anyone looking to stay productive to have when using, say, an interpreter that stack spills everything.

When to optimize for memory vs performance speed for a method?

As for this question, if I can tackle it more broadly, I often don't find the two diametrically opposed. Especially if your access patterns are sequential, and given the speed of the CPU cache, a reduction in the number of bytes processed sequentially for non-trivial inputs often translates (up to a point) to plowing through that data faster. Of course there are breaking points: if the data is much, much smaller in exchange for way, way more instructions, it might be faster to process it sequentially in larger form in exchange for fewer instructions.

But I've found many devs tend to underestimate how much a reduction in memory use in these types of cases can translate to proportional reductions in time spent processing. It's very humanly intuitive to translate performance costs to instructions rather than memory access to the point of reaching for big LUTs in some vain attempt to speed up some small computations, only to find performance degraded with the additional memory access.

For sequential access through some huge array (not local scalar variables like in your example), I go by the rule that less memory to sequentially plow through translates to greater performance, especially when the resulting code is simpler than otherwise. I hold to that until my measurements and profiler tell me otherwise and it matters. In the same way, I assume that sequentially reading a smaller binary file on disk is faster than reading a bigger one (even if the smaller one requires some more instructions), until measurements show that the assumption no longer applies.
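A hedged C++ illustration of "less memory to plow through": the same loop summing the same logical values stored in a narrow or a wide element type. The assumption, labeled here and in the code, is hypothetical data known to fit in 0..255; SumAll is an illustrative name. Whether the narrow layout is actually faster in a given program is something only a profiler can confirm.

```cpp
#include <cstdint>
#include <vector>

// One loop, any element type. The work per element is the same; what differs
// between instantiations is how many bytes must be streamed from memory.
template <typename T>
long long SumAll(const std::vector<T>& values)
{
    long long total = 0;
    for (const T v : values) {
        total += v;
    }
    return total;
}

// Hypothetical data whose values are known to fit in 0..255:
//   std::vector<std::int64_t> uses 8 bytes per element,
//   std::vector<std::uint8_t> uses 1 byte per element,
// so the narrow form moves one eighth of the data for the same result.
```

For example, SumAll over a std::vector<std::uint8_t> and over a std::vector<std::int64_t> holding the same values returns the same total, while the former touches far fewer cache lines.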

Licensed under: CC-BY-SA with attribution