Should I refactor large functions that mostly consist of one regex? [closed]

https://softwareengineering.stackexchange.com/questions/231427

02-10-2020
|

Pergunta

I just wrote a function that spans approximately 100 lines. Hearing that, you are probably tempted to tell me about single responsibilities and urge me to refactor. This is my gut instinct as well, but here is the issue: The function does one thing. It performs a complex string manipulation, and the function body consists mostly of one verbose regex, broken up into many lines that are documented. If I broke up the regex into multiple functions, I feel like I would actually lose readability, since I am effectively switching languages, and won't be able to take advantage of some features regexes offer. Here now is my question:

When it comes to string manipulation with regular expressions, are large function bodies still an anti-pattern? It seems like named capture groups serve a very similar purpose to functions. By the way, I have tests for every flow through the regex.

Solução

What you're encountering is the cognitive dissonance that comes from listening to people who favor slavish adherence to guidelines under the guise of "best practices" over reasoned decision making.

You've clearly done your homework:

The function's purpose is understood.
The workings of its implementation is understood (i.e., readable).
There are full-coverage tests of the implementation.
Those tests pass, meaning you believe the implementation to be correct.

If any of those points weren't true, I'd be first in line to say that your function needs work. So there's one vote for leaving the code as-is.

The second vote comes from looking at your options and what you get (and lose) from each:

Refactor. This gains you compliance with someone's idea of how long a function should be and sacrifices readability.
Do nothing. This maintains existing readability and sacrifices compliance with with someone's idea of how long a function should be.

This decision comes down to which you value more: readability or length. I fall into the camp that believes length is nice but readability is important and will take the latter over the former any day week.

Bottom line: if it isn't broken, don't fix it.

Outras dicas

Honestly, your function may "do one thing", but as you stated by yourself

I could start breaking up the regex into multiple functions,

which means your reg ex code does a lot of things. And I guess it could be broken down into smaller, individually testable units. However, if this is a good idea is not easy to answer, (especially without seeing the actual code). And the correct answer may be neither "yes" or "no", but "not yet, but next time you have to change something in that reg exp".

but feel like I would actually lose readability that way, since I am effectively switching languages

And this is the core point - you have a piece of code written in reg ex language. This language does not provide any good means of abstraction in itself (and I don't consider "named capture groups" as a replacement for functions). So refactoring "in the reg ex language" is not really possible, and interweaving smaller reg exps with the host language may not actually improve readability (at least, you feel so, but you have doubts, otherwise your would not have posted the question). So here is my advice

show your code to another advanced developer (maybe on https://codereview.stackexchange.com/) to make sure others think about readability the way you do. Be open to the idea that others may not find a 100 line reg exp as readable as you. Sometimes the notion of "its not easily breakable into smaller pieces" can be overcome just by a second pair of eyes.
observe the actual evolvability - does your shiny reg exp still look so good when new requirements arrive and you have to implement and test them? As long as your reg exp works, I would not touch it, but whenever something has to be changed, I would reconsider if it was really a good idea to put everyhing into this one big block - and (seriously!) rethink if splitting into smaller pieces would not be a better option.
observe the maintainability - can you effectively debug the reg exp in the current form very well? Especially after you have to change something, and now your tests tell you that something is wrong, do you have a reg exp debugger helping you to find the root cause? If debugging gets hard, that would also be a occasion to reconsider your design.

Sometimes a longer function that does one thing is the most appropriate way to handle a unit of work. You can easily get into very long functions when you start dealing with querying a database (using your favorite query language). To make a function (or method) more readable while limiting it to its stated purpose is what I would consider the most desirable outcome of a function.

Length is an arbitrary "standard" when it comes to code size. Where a 100 line function in C# may be considered longish, it would be tiny in some versions of assembly. I have seen some SQL queries that were well into the 200 lines of code range that returned one very complicated set of data for a report.

Fully working code, that is as simple as you can reasonably make it is the goal.

Don't change it just because it is long.

You could always break up the regex into sub-regexes, and gradually compose the final expression. This could aid comprehension for a very large pattern, particularly if the same sub-pattern is repeated many times. For example in Perl;

my $start_re = qr/(?:\w+\.\w+)/;
my $middle_re = qr/(?:DOG)|(?:CAT)/;
my $end_re = qr/ => \d+/;

my $final_re = $start_re . $middle_re . $end_re;
# or: 
# my $final_re = qr/${start_re}${middle_re}${end_re}/

I would say break it if it is breakable. from maintainability point of view and maybe resuability it makes sense to break it, but of course you have to consider natural of your function and how you get input and what it is going to return.

I remember I was working on parsing streaming chunked data into objects, so what I did basically was I divided it into two main parts, one was building complete unit of String out of encoded text and in second part parsing those units into data dictionary and organize them (could be random property for different object) and than updating or creating objects.

Also I could break each main part into several smaller and more specific functions so at end I had 5 different functions to do the whole thing and I could reuse some of the functions in different place.

One thing that you may or may not have considered is to write a small parser in the language you are using instead of using a regex in that language. This may be easier to read, test, and maintain.

Giant regexes are a bad choice in most cases. In my experience, they are often used because the developer is not familiar with parsing (see Thomas Eding's answer).

Anyway, let's assume you want to stick to a regex-based solution.

As I don't know the actual code, I will examine the two possible scenarios:

The regex is simple (a lot of literal matching and few alternatives)

In this case the advanced features offered by a single regex are not indispensable. This means you will likely benefit from splitting it.
The regex is complex (a lot of alternatives)

In this case you cannot realistically have full test coverage, because you probably have millions of possibile flows. So, in order to test it, you need to split it.

I might lack imagination, but I can't think of any real-world situation where a 100-line regex is a good solution.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a softwareengineering.stackexchange