Question

I really love clean code and I always want to write my code in the best possible way. But there has always been one thing I didn't really understand:

When is it too much of "separation of concerns" regarding methods?

Let's say we have the following method:

def get_last_appearance_of_keyword(file, keyword):
    with open(file, 'r') as file:
        line_number = 0
        for line in file:
            if keyword in line:
                line_number = line
        return line_number

I think this method is just fine as it is. It's simple, easy to read, and it clearly does what its name says. But it's not really doing "just one thing": it opens the file and then searches it. That means I could split it even further (also considering the Single Responsibility Principle):

Variation B (well, this somehow makes sense: this way we can easily reuse the algorithm that finds the last appearance of a keyword in a text, yet it seems like "too much". I can't explain why, but I just "feel" it that way):

def get_last_appearance_of_keyword(file, keyword):
    with open(file, 'r') as text_from_file:
        line_number = find_last_appearance_of_keyword(text_from_file, keyword) 
    return line_number

def find_last_appearance_of_keyword(text, keyword):
    line_number = 0
    for line in text:
        if keyword in line:
            line_number = line
    return line_number

Variation C (this is just absurd in my opinion: we are wrapping one-liners in single-line methods, twice. But one could argue that the way we open a file may change in the future because of feature requests, and since we want to change it in just one place rather than many, we encapsulate it and separate our main function even further):

def get_last_appearance_of_keyword(file, keyword):
    text_from_file = get_text_from_file(file)
    line_number = find_last_appearance_of_keyword(text_from_file, keyword)
    return line_number 

def get_text_from_file(file):
    with open(file, 'r') as text:
        return text.readlines()  # returning the raw handle would yield a closed file

def find_last_appearance_of_keyword(text, keyword):
    line_number = 0
    for line in text:
        if check_if_keyword_in_string(line, keyword):
            line_number = line         
    return line_number

def check_if_keyword_in_string(text, keyword):
    if keyword in text:
        return True
    return False

So my question: what is the correct way of writing this code, and what makes the other approaches right or wrong? I was always taught separation, but never when it is simply too much. And how can I be sure, next time I'm coding, that it is "just right" and doesn't need more separation?


Solution

Your various examples of splitting concerns out into separate functions all suffer from the same issue: you are still hard-coding the file dependency into get_last_appearance_of_keyword. This makes the function hard to test, as it has to rely on a file existing in the file system when the test is run. That leads to brittle tests.

So I'd simply change your original function to:

def get_last_appearance_of_keyword(text, keyword):
    last_match = 0
    for line_number, line in enumerate(text, start=1):
        if keyword in line:
            last_match = line_number
    return last_match

Now you have a function that has just one responsibility: find the last occurrence of a keyword in some text. If that text is to come from a file, that becomes the caller's responsibility to deal with. When testing, you can then just pass in a block of text. When using it with runtime code, first the file is read, then this function is called. That is real separation of concerns.
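
For illustration, a minimal sketch of that composition (the file name and test data here are made up; the function is the one defined above):

# Runtime code: the caller owns the file concern.
with open('document.txt', 'r') as f:
    last = get_last_appearance_of_keyword(f, 'keyword')

# Test code: no file system involved, just pass lines in memory.
assert get_last_appearance_of_keyword(['no match', 'keyword here', 'still nothing'], 'keyword') == 2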

Other Tips

The single responsibility principle states that a class should take care of a single piece of functionality, and this functionality should be appropriately encapsulated inside.

What exactly does your method do? It gets the last appearance of a keyword. Every line inside the method works towards this goal and is not related to anything else, and the end result is just one thing. In other words: you don't have to split this method into anything else.

The main idea behind the principle is that, in the end, you should not do more than one thing. If, say, you opened the file and also left it open so other methods could use it, you would be doing two things. Or if you were to persist data as part of this method: again, two things.

Now, you could extract the "open file" line and make the method receive a file object to work with, but that is more a technical refactoring than trying to comply with the SRP.

This is a good example of over-engineering. Don't overthink it, or you'll end up with a bunch of one-line methods.

My take on it: It depends :-)

In my opinion, code should meet these goals, ordered by priority:

  1. Fulfil all requirements (i.e. it correctly does what it should)
  2. Be readable and easy to follow/understand
  3. Be easy to refactor
  4. Follow good coding practices/principles

For me, your original example meets all of these goals (except maybe correctness, because of the line_number = line issue already mentioned in the comments, but that's not the point here).

The thing is, SRP is not the only principle to follow. There is also You Aren’t Gonna Need It (YAGNI) (amongst many others). When the principles collide you need to balance them.

Your first example is perfectly readable, easy to refactor when you need to, but might not follow SRP as much as it could.

Each method in your third example is also perfectly readable but the whole thing is no longer as easy to understand because you have to plug all the pieces together in your mind. It does follow SRP though.

As you are not getting anything right now from splitting your method, don't do it, because you have an alternative that is easier to understand.

As your requirements change, you can refactor the method accordingly. In fact the "all in one thing" might be easier to refactor: Imagine you want to find the last line matching some arbitrary criterion. Now you just need to pass some predicate lambda function to evaluate if a line matches the criterion or not.

def get_last_match(file, predicate):
    with open(file, 'r') as f:
        last_match = 0
        for line_number, line in enumerate(f, start=1):
            if predicate(line):
                last_match = line_number
        return last_match

In your last example you need to pass the predicate 3 levels deep, i.e. modify 3 methods just to modify the behaviour of the last one.

Note that even splitting out the reading of the file (a refactoring that many seem to find useful, including me) might have unexpected consequences: you need to read the whole file into memory to pass it as a string to your method. If the files are large, that might not be what you want.
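
A sketch of how the split can be kept without the memory cost (lines_of is a made-up helper): hand the search a lazy iterable of lines instead of one big string.

def lines_of(path):
    # Generator: yields one line at a time, so the whole file is never in memory at once.
    with open(path, 'r') as f:
        for line in f:
            yield line

last = get_last_appearance_of_keyword(lines_of('very_large_file.txt'), 'keyword')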

Bottom line: principles should never be followed to the extreme without taking a step back and taking all the other factors into account.

Maybe "premature splitting of methods" could be considered a special case of premature optimization? ;-)

This is a balancing act in my mind, with no easy right or wrong answer. I'll just share my personal experiences here, including my own tendencies and mistakes throughout my career. YMMV considerably.

As a caveat, I work in areas that involve some very large-scale codebases (millions of LOC, sometimes decades of legacy). I also work in a particularly peculiar area where no amount of commenting or code clarity necessarily translates into any competent developer being able to understand the implementation (we can't necessarily take any decent developer and have him understand the implementation of a state-of-the-art fluid dynamics technique based on a paper published 6 months ago without him spending a good amount of time away from the code, specializing in this area). This generally means that only a few developers at most can effectively understand and maintain any particular part of the codebase. Further, it means that a lot of code can quickly become obsolete and due for deprecation, possibly wholesale replacement, given how rapidly the computer graphics industry moves, with yesterday's techniques becoming obsolete today.

Given my particular experiences, and perhaps combined with the peculiar nature of this industry, I no longer found it productive to take SoC, DRY, making function implementations as readable as possible, or even reusability to their utmost limits. Instead I favor YAGNI, decoupling, testability, writing tests, interface documentation (so we at least know how to use an interface even if the implementation requires too much specialized knowledge), and ultimately shipping the software.

Lego Blocks

I was actually prone to go in the total opposite direction earlier in my career. I got so excited by functional programming, policy class designs in Modern C++ Design, template metaprogramming, and so forth. In particular I was excited by the most compact and orthogonal designs, where you have all these little pieces of functionality (like "atoms") that you can combine together (to form "molecules") in seemingly infinite ways to get the desired results. It got me wanting to write almost everything as functions consisting of a few lines of code, and there's not necessarily anything inherently wrong with such short functions (they can still be very wide in applicability and clarify code), except that I was starting to go in the dogmatic direction of thinking my code had something wrong with it if any function spanned more than a few lines. I got some really neat toys and even some production code out of that type of code, but I was ignoring the clock: the hours and days and weeks rolling by.

In particular, while I admired the simplicity of each little "lego block" I created that I could combine in infinite ways, I ignored the amount of time and brainpower I was putting into piecing all these blocks together to form some elaborate "contraption". Moreover, in the rare but painful instances where something went wrong with the elaborate contraption, I willfully ignored the time I spent tracing through a seemingly endless array of function calls, analyzing each and every decentralized lego piece and subsets of their combinations, to figure out what went wrong, when the whole thing might have been a lot simpler if it wasn't made out of these "legos", if you will, and had just been written as a handful of meatier functions or a medium-weight class. This might also have been exacerbated by how I was using tools and languages not necessarily intended to be used this way (I wasn't using actual functional languages to implement these ideas, e.g.), or maybe I just wasn't a good enough developer to make it all work out in a reasonable time (I can accept that possibility).

Still, I came around full circle, and as deadlines forced me to become more conscious of time, I started to realize that my endeavors were teaching me more about what I was doing wrong than what I was doing right. I started to once again appreciate the meatier function and object/component here and there, and to see that there are more pragmatic ways to achieve a reasonable degree of SoC, as David Arno points out, by separating file input from string processing without necessarily decomposing the string processing down to the most granular level imaginable.

Meatier Functions

And even more so, I started to be okay with some code duplication, even some logical duplication (I'm not saying copy-and-paste coding; everything I'm talking about is finding "balance"), provided that the function is not prone to repeated changes, is documented in terms of its usage, and most of all is well-tested, to make sure its functionality matches what it's documented to do and stays that way. I started to realize that reusability is largely tied to reliability.

I've come to realize that even the meatiest function, as long as it is still singular enough in concern not to be too narrowly applicable or too clunky to use and test, is preferable to the most decomposed and flexible combo of functions that lacks this quality, even if it duplicates some logic in distant functions elsewhere in the codebase, provided it's well-tested and reliable and the tests reasonably ensure it remains that way. So these days I like some of the meatier stuff well enough, if it's reliable.

It also seems to me that, most of the time, it's cheaper to realize in hindsight that you Are gonna need something and add it, provided your code is at least receptive to new additions without cascading hellfire, than to code all sorts of things when you aren't gonna need them and then face the temptation of removing it all when it starts to become a real PITA to maintain.

So that's what I've learned; those are the lessons I've deemed most necessary for me to personally learn in hindsight in this context, and as a caveat they should be taken with a grain of salt. YMMV. But hopefully they might be of some value to you in finding the right sort of balance to ship products that make your users happy within a reasonable amount of time, and to maintain them effectively.

The problem you are encountering is that you are not factoring your functions into their most reduced form. Take a look at the following (I'm not a Python programmer, so cut me some slack):

def lines_from_file(file):
    with open(file, 'r') as text:
        line_number = 1
        lines = []
        for line in text:
            lines.append((line_number, line.strip()))
            line_number += 1
    return lines

def filter(l, func):  # note: this shadows Python's built-in filter
    new_l = []
    for x in l:
        if func(x):
            new_l.append(x)
    return new_l

def contains(needle):
    return lambda haystack: needle in haystack

def last(l):
    length = len(l)
    if length > 0:
        return l[length - 1]
    else:
        return None

Each of the above functions does something entirely different, and I believe you would have a difficult time factoring those functions any further. We can combine those functions to accomplish the task at hand.

lines = lines_from_file('./test_file')
filtered = filter(lines, lambda x : contains('some value')(x[1]))
line = last(filtered)
if line is not None:
    print(line[0])

The above lines of code could easily be put together into a single function to perform exactly what you are looking to do. The way to really separate concerns is to break complex operations down into their most factored form. Once you have a group of well factored functions, you may begin piecing them together to solve the more complex problem. One nice thing about well factored functions, is that they are often reusable outside of the context of the current problem at hand.
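
As a side note, Python's standard building blocks can express the same pipeline in one function; a hypothetical rewrite (not the answerer's code) might look like:

def last_line_number_containing(path, needle):
    # Read, number, filter, and take the last match, using only built-ins.
    with open(path, 'r') as f:
        matches = [n for n, line in enumerate(f, start=1) if needle in line]
    return matches[-1] if matches else None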

I'd say that indeed there's never too much separation of concerns. But there can be functions that you only use once, and do not even test separately. They can be safely inlined, keeping the separation from seeping into the outer namespace.

Your example literally doesn't need check_if_keyword_in_string, because the string class already provides an implementation: keyword in line is enough. But if you plan to swap implementations, e.g. for one using a Boyer-Moore search, or one allowing a lazy search in a generator, then it would make sense.

Your find_last_appearance_of_keyword could be more general and find the last appearance of an item in a sequence. For that, you can use an existing implementation or make a reusable one. It could also take a different filter, so that you could search for a regex, or for case-insensitive matches, etc.
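
A sketch of that generalization, with made-up names: the search takes any predicate, so regex or case-insensitive matching is just a different argument.

import re

def last_index_where(lines, predicate):
    # Return the number of the last line satisfying predicate, or None.
    last = None
    for number, line in enumerate(lines, start=1):
        if predicate(line):
            last = number
    return last

lines = ['foo', 'the KEY line', 'bar']
last_index_where(lines, lambda line: 'key' in line)          # plain substring: None
last_index_where(lines, lambda line: 'key' in line.lower())  # case-insensitive: 2
last_index_where(lines, re.compile(r'KEY\b').search)         # regex: 2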

Usually anything that deals with I/O deserves a separate function, so get_text_from_file may be a good idea if you want to handle various special cases outright. It may not be if you rely on an outer IOError handler for that.
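
For instance, a sketch of such an I/O function; the fallback behaviour here is an assumption, not something the answer prescribes:

def get_text_from_file(filename):
    # Assumed policy: treat a missing or unreadable file as empty text.
    try:
        with open(filename, 'r') as f:
            return f.readlines()
    except IOError:
        return []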

Even line-counting may be a separate concern if in the future you need to support continuation lines (e.g. with \) and would need the logical line number, or if you need to ignore comment lines without breaking the line numbering.
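
A sketch of that line-numbering concern as its own generator; the backslash-joining rule is an assumption for illustration:

def logical_lines(physical_lines):
    # Yield (logical_line_number, text), joining physical lines that end with a backslash.
    number, buffer = 0, ''
    for raw in physical_lines:
        stripped = raw.rstrip('\n')
        if stripped.endswith('\\'):
            buffer += stripped[:-1]
            continue
        number += 1
        yield number, buffer + stripped
        buffer = ''
    if buffer:
        yield number + 1, buffer  # unterminated continuation at end of input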

Consider:

def get_last_appearance_of_keyword(filename, keyword):
    with open(filename) as f:  # File-opening concern.
        numbered_lines = enumerate(f, start=1)  # Line-numbering concern.
        last_line = None  # Also a concern! Some e.g. prefer -1.
        for line_number, line in numbered_lines:  # The searching concern.
            if keyword in line: # The matching concern, applied.
                last_line = line_number
    # Here the file closes; an I/O concern again.
    return last_line

See how you may want to split your code when you consider some concerns that may change in the future, or just because you notice how the same code can be reused elsewhere.

This is something to be aware of when you write the original short and sweet function. Even if you don't yet need the concerns separated into functions, keep them as separate as practical. It not only helps you evolve the code later, but also helps you better understand the code immediately and make fewer mistakes.

When is it “too much” separation? Never. You can't have too much separation.

Your last example is pretty good, but you could maybe simplify the for loop with something like text.GetLines(i => i.containsKeyword).
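
In Python terms (a hypothetical equivalent, since GetLines is not a real API), a generator expression plus max can do the same:

lines = ['a', 'keyword b', 'c']
keyword = 'keyword'
last = max((n for n, line in enumerate(lines, start=1) if keyword in line), default=0)  # 2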

Practical version: stop when it works; separate more when it breaks.

Licensed under: CC-BY-SA with attribution
Not affiliated with softwareengineering.stackexchange