Question

I am writing a small academic research project about extremely long functions. Obviously, I am not looking for examples of bad programming, but for examples of functions 100, 200 or 600 lines long that make sense.

I will be investigating the Linux kernel source using a script written for a Master's degree at the Hebrew University, which measures different parameters like number of lines of code, function complexity (measured by MCC) and other goodies. By the way, it's a neat study about code analysis, and recommended reading material.

I am interested to hear whether you can think of any good reason why a function should be exceptionally long. I'll be looking into C, but examples and arguments from any language would be of great use.


Solution

I may catch flak for this, but readability. A highly serial, but independent execution that could be broken up into N function calls (of functions that are used nowhere else) doesn't really benefit from decomposition. Unless you count meeting an arbitrary maximum on function length as a benefit.

I'd rather scroll through N function sized blocks of code in order than navigate the whole file, hitting N functions.

OTHER TIPS

Lots of values in a switch statement?

Anything generated from other sources, e.g. a finite state machine from a parser generator or similar. If it's not intended for human consumption, aesthetic or maintainability concerns are irrelevant.
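
To give a flavour of what such generated code looks like, here is a tiny hand-written imitation of the style (the states and the accepts_number function are invented for illustration; real generators such as flex or bison emit far larger functions of the same shape):

#include <stdio.h>

/* Hypothetical miniature of generator-style output: one big switch,
 * one case per state, no attempt at human-friendly structure.
 * Accepts unsigned integers with an optional '.' fraction. */
static int accepts_number(const char *s)
{
    int state = 0;
    for (;; ++s) {
        char c = *s;
        switch (state) {
        case 0:                              /* start */
            if (c >= '0' && c <= '9') { state = 1; break; }
            return 0;
        case 1:                              /* integer part */
            if (c >= '0' && c <= '9') break;
            if (c == '.') { state = 2; break; }
            return c == '\0';
        case 2:                              /* first fraction digit required */
            if (c >= '0' && c <= '9') { state = 3; break; }
            return 0;
        case 3:                              /* fraction part */
            if (c >= '0' && c <= '9') break;
            return c == '\0';
        }
    }
}

int main(void)
{
    printf("%d %d %d\n", accepts_number("42"), accepts_number("3.14"), accepts_number("x"));
    return 0;
}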

Functions can get longer over time, especially if they are modified by many sets of developers.

Case in point: I recently (a year or two ago) refactored some legacy image processing code from around 2001 that contained a few several-thousand-line functions. Not a few several-thousand-line files - a few several-thousand-line functions.

Over the years so much functionality was added to them, without anyone really putting in the effort to refactor them properly.

Read the chapter in McConnell's Code Complete about subroutines; it has guidelines and pointers on when you should break things into functions. If you have some algorithm where those rules don't apply, that may be a good reason for having a long function.

Generated code can contain very, very long functions.

The only long ones I've coded recently are cases where making them smaller wouldn't achieve much, or would make the code less readable. The notion that a function over a certain length is somehow intrinsically bad is simply blind dogma. Like any blindly applied dogma, it relieves the follower of the need to actually think about what applies in any given case...

Recent examples...

Parsing and validating a config file with a simple name=value structure into an array, converting each value as I find it: this is one massive switch statement, one case per config option. Why? I could have split it into lots of calls to trivial 5-6-line functions, but that would add about 20 private members to my class, none of them reused anywhere else. Factoring it into smaller chunks just didn't add enough value to be worth it, so it's been the same ever since the prototype. If I want another option, I add another case.
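
A minimal sketch of that pattern, translated into C (so the "switch on the option name" becomes an if/else chain on strcmp; the struct config fields and option names here are invented for illustration, not the real ones):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical config struct; the fields are invented for the sketch. */
struct config {
    int  port;
    int  timeout_ms;
    int  verbose;
    char logfile[256];
};

/* One long, linear parser: one branch per option, converting each value
 * as it is found. Adding an option means adding one more branch here. */
static int parse_config_line(struct config *cfg, const char *line)
{
    char name[64], value[256];

    if (sscanf(line, " %63[^= ] = %255s", name, value) != 2)
        return -1;                        /* malformed line */

    if (strcmp(name, "port") == 0)
        cfg->port = atoi(value);
    else if (strcmp(name, "timeout_ms") == 0)
        cfg->timeout_ms = atoi(value);
    else if (strcmp(name, "verbose") == 0)
        cfg->verbose = (strcmp(value, "true") == 0);
    else if (strcmp(name, "logfile") == 0)
        snprintf(cfg->logfile, sizeof cfg->logfile, "%s", value);
    else
        return -1;                        /* unknown option */
    return 0;
}

int main(void)
{
    struct config cfg = {0};
    parse_config_line(&cfg, "port = 8080");
    parse_config_line(&cfg, "verbose = true");
    printf("port=%d verbose=%d\n", cfg.port, cfg.verbose);
    return 0;
}

The cost of a new option is one obvious branch in one place, rather than one more tiny private function.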

Another case is the client/server communication code in the same app and its client. There are lots of calls to read/write, any of which can fail, in which case I bail and return false. So the function is basically linear and has bail points (if failed, return) after almost every call. Again, there's nothing to gain by making it smaller and no way to really make it any smaller.
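
A hedged sketch of that shape (the send_header/send_length/send_payload/recv_ack helpers are invented stand-ins for the real read/write calls):

#include <stdbool.h>
#include <stdio.h>

/* Invented I/O helpers standing in for the real read/write calls;
 * each returns false on failure. */
static bool send_header(int fd)                 { (void)fd; return true; }
static bool send_length(int fd, int n)          { (void)fd; (void)n; return true; }
static bool send_payload(int fd, const char *p) { (void)fd; (void)p; return true; }
static bool recv_ack(int fd)                    { (void)fd; return true; }

/* Strictly linear, with a bail point after almost every call;
 * there is nothing reusable to factor out. */
static bool send_request(int fd, const char *payload, int len)
{
    if (!send_header(fd))           return false;
    if (!send_length(fd, len))      return false;
    if (!send_payload(fd, payload)) return false;
    if (!recv_ack(fd))              return false;
    /* ...the real function continues like this for many more steps... */
    return true;
}

int main(void)
{
    printf("%s\n", send_request(3, "ping", 4) ? "ok" : "failed");
    return 0;
}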

I should also add that most of my functions are a couple of "screenfuls", and in more complex areas I strive to keep them to one "screenful", simply because I can then look at the whole function at once. That's fine for functions that are basically linear in nature and don't have lots of complex looping or conditions going on, so the flow stays simple. As a final note, I prefer to apply cost-benefit reasoning when deciding which code to refactor, and prioritise accordingly. It helps avoid the perpetually half-finished project.

Sometimes I find myself writing a flat file (for use by third parties) which entails headers, trailers, and detail records that are all linked. It's easier to have a long function for the purpose of computing summaries than it is to devise some scheme to pass values back and forth through lots of small functions.
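
For illustration, a minimal sketch of that idea (the record layout and field names are invented): keeping header, detail and trailer output in one function lets the running totals needed by the trailer live in plain local variables.

#include <stdio.h>

/* Invented detail record standing in for the real data. */
struct detail {
    double amount;
    int    units;
};

static void write_flat_file(FILE *out, const struct detail *recs, int n)
{
    double total_amount = 0.0;
    int    total_units  = 0;

    fprintf(out, "HDR|DAILY-EXTRACT\n");                 /* header record */

    for (int i = 0; i < n; ++i) {                        /* detail records */
        fprintf(out, "DTL|%.2f|%d\n", recs[i].amount, recs[i].units);
        total_amount += recs[i].amount;
        total_units  += recs[i].units;
    }

    /* trailer record carries the summaries computed above */
    fprintf(out, "TRL|%d|%.2f|%d\n", n, total_amount, total_units);
}

int main(void)
{
    struct detail recs[] = { { 10.50, 2 }, { 4.25, 1 } };
    write_flat_file(stdout, recs, 2);
    return 0;
}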

One point that I think has a bearing is that different languages and tools have different lexical scoping associated with functions.

For example, Java allows you to suppress warnings with an annotation. It may be desirable to limit the scope of the annotation, so you keep the function short for that purpose. In another language, breaking that section out into its own function might be completely arbitrary.

Controversial: In JavaScript, I tend to only create functions for the purpose of reusing code. If a snippet is only executed in one place, I find it burdensome to jump around the file(s) following the spaghetti of function references. I think closures facilitate and therefore reinforce longer [parent] functions. Since JS is an interpreted language and the actual code gets sent over the wire, it's good to keep the length of the code small--creating matching declarations and references doesn't help (this could be considered a premature optimization). A function has to get pretty long in JS before I decide to chop it up for the express purpose of "keeping functions short".

Again in JS, sometimes the entire 'class' is technically a function with many enclosed sub-functions but there are tools to help deal with it.

On the other hand in JS, variables have scope for the length of the function and so that's a factor that may limit the length of a given function.

The very long functions I come across are not written in C, so you'll have to decide whether this applies to your research or not. What I have in mind are some PowerBuilder functions that are several hundred lines long, being so for the following reasons:

  • They were written over 10 years ago, by people who at the time did not have coding standards in mind.
  • The development environment makes it a bit harder to create functions. Hardly a good excuse, but it's one of those little things that sometimes discourages you from working properly, and I guess someone just got lazy.
  • The functions have evolved over time, adding both code and complexity.
  • The functions contain huge loops, each iteration possibly handling different kinds of data in a different way. Using tens(!) of local variables, some member variables and some globals, they have become extremely complex.
  • Being that old and ugly, no one dares refactor them into smaller parts. With so many special cases handled in them, breaking them apart is asking for trouble.

This is yet another place where obviously bad programming practices meet reality. While any first-year CS student could say those beasts are bad, no one would spend any money on making them look prettier (given that, at least for now, they still deliver).

By far the most common I see/write are long switch statements, or if/else semi-switch statements for types that can't be used in the language's switch statements (already mentioned a few times). Generated code is an interesting case, but I'm focusing on human-written code here. Looking at my current project, the only truly long function not covered above (296 LOC/650 LOT) is some cowboy code I'm using as an early evaluation of the output of a code generator I plan to use in the future. I'll definitely be refactoring it, which removes it from this list.
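
As a concrete (and invented) C example of the if/else "semi-switch": C cannot switch on strings, so a string-keyed dispatcher naturally grows into one long chain of strcmp branches, one per command.

#include <stdio.h>
#include <string.h>

/* The command names are made up; real chains of this kind run to dozens of branches. */
static int dispatch(const char *cmd)
{
    if      (strcmp(cmd, "start") == 0)   printf("starting\n");
    else if (strcmp(cmd, "stop") == 0)    printf("stopping\n");
    else if (strcmp(cmd, "status") == 0)  printf("status: ok\n");
    else if (strcmp(cmd, "reload") == 0)  printf("reloading\n");
    /* ...many more branches in the real thing... */
    else {
        fprintf(stderr, "unknown command: %s\n", cmd);
        return -1;
    }
    return 0;
}

int main(void)
{
    dispatch("start");
    dispatch("frobnicate");
    return 0;
}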

Many years ago, I was working on some scientific computing software that had a long function in it. The method used a large number of local variables, and every attempt to refactor it showed up as a measurable slowdown in profiling. Even a 1% improvement in this section of code saved hours of computation time, so the function stayed long. I've learned a great deal since then, so I can't speak to how I'd handle the situation today.

Speed:

  • Calling a function means pushing onto the stack, then jumping, then setting up the callee's stack frame, then jumping back. If you pass parameters to the function, you usually have several more pushes.

Consider a loop:

for (i = 0; i < n; ++i)
        func1();

Inside a loop, all those pushes and jumps can become a factor.

This was largely addressed with the introduction of inline functions in C99 (and, unofficially, with compiler extensions before that), but some code written earlier, or written with compatibility in mind, may have been long for that reason.

Also, inlining has its flaws; some are described at the Inline Functions link.

Edit:

As an example of how a call to a function can make a program slower:

#include <stdio.h>

static void
do_printf(void)
{
        printf("hi");
}

int
main(void)
{
        int i = 0;

        for (i = 0; i < 1000; ++i)
                do_printf();
        return 0;
}

This produces (GCC 4.2.4):

        .
        .
        jmp     .L4
.L5:
        call    do_printf
        addl    $1, -8(%ebp)
.L4:
        cmpl    $999, -8(%ebp)
        jle     .L5

        .
        .
do_printf:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    $.LC0, (%esp)
        call    printf
        leave
        ret

against:

#include <stdio.h>

int
main(void)
{
        int i = 0;

        for (i = 0; i < 1000; ++i)
                printf("hi");
        return 0;
}

or against:

static inline void __attribute__((always_inline)) // This is GCC specific!
do_printf(void)
{
        printf("hi");
}

Both produce (GCC 4.2.4):

        jmp     .L2
.L3:
        movl    $.LC0, (%esp)
        call    printf
        addl    $1, -8(%ebp)
.L2:
        cmpl    $999, -8(%ebp)
        jle     .L3

Which is faster.

XML parsing code often has reams of escape character processing in one setup function.
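
A hedged sketch of what that tends to look like (this only covers the five predefined XML entities; real parsers also handle numeric character references and encodings, which is where such functions balloon):

#include <stdio.h>

/* Map each special character to its predefined XML entity; pass the rest through. */
static void write_escaped(FILE *out, const char *s)
{
    for (; *s; ++s) {
        switch (*s) {
        case '<':  fputs("&lt;", out);   break;
        case '>':  fputs("&gt;", out);   break;
        case '&':  fputs("&amp;", out);  break;
        case '"':  fputs("&quot;", out); break;
        case '\'': fputs("&apos;", out); break;
        default:   fputc(*s, out);       break;
        }
    }
}

int main(void)
{
    write_escaped(stdout, "a < b && c > \"d\"\n");
    return 0;
}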

The functions I deal with (not write) become long because they are expanded and expanded, and no one spends the time to refactor them. People just keep adding logic to them with no thought to the big picture.

I deal with a lot of cut-n-paste development...

So, for the paper, one aspect to look at is poor maintenance plan/cycle, etc.

A few ideas not explicitly mentioned yet:

  • repetitive tasks, e.g. a function that reads a database table with 190 columns and has to output them as a flat file (assuming the columns need to be treated individually, so a simple loop over all columns won't do). Of course you could create 19 functions, each outputting 10 columns, but that wouldn't make the program any better; see the sketch after this list.
  • complicated, verbose APIs, like Oracle's OCI. When seemingly simple actions require large amounts of code, it's hard to break it down into small functions that make any sense.
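
A hedged sketch of the first point (the struct row fields and formats are invented; the real table had 190 columns): each column needs its own width and conversion, so the body becomes one long run of near-identical but not identical lines.

#include <stdio.h>

/* Invented record type standing in for a wide database row. */
struct row {
    int    id;
    char   name[32];
    double balance;
    char   currency[4];
    /* ...in the real case, roughly 190 fields... */
};

static void write_row(FILE *out, const struct row *r)
{
    fprintf(out, "%010d",   r->id);        /* col 1: zero-padded id    */
    fprintf(out, "%-32s",   r->name);      /* col 2: left-aligned name */
    fprintf(out, "%015.2f", r->balance);   /* col 3: fixed-point       */
    fprintf(out, "%-3s",    r->currency);  /* col 4: ISO currency      */
    /* ...and so on for every remaining column... */
    fputc('\n', out);
}

int main(void)
{
    struct row r = { 42, "ACME", 1234.50, "EUR" };
    write_row(stdout, &r);
    return 0;
}
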
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow