Problem

I'm still too inexperienced to write high-quality code, so I read books addressing the issue, such as Clean Code by Robert C. Martin, and keep examining the code of well-known libraries to improve my skills.

Although many open source libraries have been maintained for years, which suggests they are doing something right, I found the code in many of them to be far from the principles these books advocate for writing clean code, e.g. methods containing hundreds of lines of code.

So my question is: Are the principles of clean code too restrictive, so that we can do without them in many libraries like these? If not, how are huge libraries maintained without following many of these principles?

I'd appreciate any brief clarification. I apologize if the question seems silly coming from a newbie.

EDIT

Check this example in the Butterknife library, one of the best-known libraries in the Android community.


Solution

Good answer here already, but let me say a word about your Butterknife example: though I have no idea what the code does, at first glance it does not look really unmaintainable to me. Variable and method names seem to be chosen deliberately, the code is properly indented and formatted, it has some comments, and the long methods at least show some block structure.

Yes, it in no way follows Uncle Bob's "clean code" rules, and some of the methods are surely too long (probably the whole class is, too). But looking at the code I still see enough structure that those blocks could easily be "cleaned up" by extracting them into methods of their own (with a low risk of introducing bugs when using refactoring tools).
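
To make that concrete, here is an invented before/after sketch (not actual Butterknife code, just the pattern described above, with all names made up):

    import java.util.List;

    class ReportGenerator {

        // Before: one long method whose structure is only visible as
        // comment-delimited blocks.
        String generateReport(List<String> rows) {
            // validate input
            if (rows == null || rows.isEmpty()) {
                throw new IllegalArgumentException("no rows");
            }
            // format rows
            StringBuilder sb = new StringBuilder();
            for (String row : rows) {
                sb.append(row.trim()).append('\n');
            }
            // append summary
            sb.append("total: ").append(rows.size());
            return sb.toString();
        }

        // After: each block extracted into a method of its own; a
        // refactoring tool does the mechanical work almost risk-free.
        String generateReportCleaned(List<String> rows) {
            validateRows(rows);
            StringBuilder sb = formatRows(rows);
            appendSummary(sb, rows);
            return sb.toString();
        }

        private void validateRows(List<String> rows) {
            if (rows == null || rows.isEmpty()) {
                throw new IllegalArgumentException("no rows");
            }
        }

        private StringBuilder formatRows(List<String> rows) {
            StringBuilder sb = new StringBuilder();
            for (String row : rows) {
                sb.append(row.trim()).append('\n');
            }
            return sb;
        }

        private void appendSummary(StringBuilder sb, List<String> rows) {
            sb.append("total: ").append(rows.size());
        }
    }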

The real problem with such code is that adding one block after another works to some degree, sometimes for years. But every day the code gets a little harder to evolve, and it takes a little longer to modify and test. And when you really have to change something that cannot be solved by "adding another block", but requires restructuring, you will wish someone had started cleaning up the code earlier.

Other tips

The principles stated in "Clean Code" are not always generally agreed upon. Most of it is common sense, but some of the author's opinions are rather controversial and not shared by everybody.

In particular, the preference for short methods is not agreed on by everybody. If the code in a longer method is not repeated elsewhere, extracting some of it into a separate method (so you get multiple shorter methods) increases overall complexity, since these methods are now visible to other methods which should not care about them. So it is a trade-off, not an objective improvement.

The advice in the book is also (like all advice) geared towards a particular type of software: Enterprise applications. Other kinds of software like games or operating systems have different constraints than enterprise software, so different patterns and design principles are in play.

The language is also a factor: Clean Code assumes Java or a similar language. If you use C or Lisp, a lot of the advice does not apply.

In short, the book is a single person's opinions about a particular class of software. It will not apply everywhere.

As for open source projects, code quality ranges from abysmal to brilliant. After all, anyone can publish their code as open source. But if you look at a mature and successful open source project with multiple contributors, you can be fairly sure they have consciously settled on a style that works for them. If this style is in contradiction to some opinion or guideline, then (to put it bluntly) it is the guideline that is wrong or irrelevant, since working code trumps opinions.

Summary

As JacquesB writes, not everybody agrees with Robert C. Martin's "Clean Code".

The open source projects that you found to be "violating" the principles you expected are likely to simply have other principles.

My perspective

I happen to oversee several code bases that adhere very much to Robert C. Martin's principles. However, I do not really claim that they are right; I can only say they work well for us - and that "us" is in fact a combination of at least

  • the scope and architecture of our products,
  • the target market / customer expectations,
  • how long the products are maintained,
  • the development methodology we use,
  • the organizational structure of our company and
  • our developers' habits, opinions, and past experience.

Basically, this boils down to: each team (be it a company, a department or an open source project) is unique. They will have different priorities and different viewpoints, and of course they will make different tradeoffs. These tradeoffs, and the code style they result in, are largely a matter of taste and cannot be proven "wrong" or "right". The teams can only say "we do this because it works for us" or "we should change this because it doesn't work for us".

That said, I believe that to be able to successfully maintain large codebases over years, each team should agree on a set of code conventions they think are suitable for the aspects given above. That may mean adopting practices by Robert C. Martin, by another author, or inventing their own; it may mean writing them down formally or documenting them "by example". But they should exist.

Example

Consider the practice of "splitting code from a long method into several private methods".

Robert C. Martin says that this style allows for limiting the contents of each method to one level of abstraction - as a simplified example, a public method would probably only consist of calls to private methods like verifyInput(...), loadDataFromHardDisk(...), transformDataToJson(...) and finally sendJsonToClient(...), and these methods would have the implementation details.
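
A minimal sketch of that style, using the method names from the paragraph above (the bodies are invented placeholders, not code from any real project):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Base64;

    class DataEndpoint {

        // Top level: one level of abstraction; reads like a table of contents.
        void handleRequest(String path) throws Exception {
            verifyInput(path);
            byte[] data = loadDataFromHardDisk(path);
            String json = transformDataToJson(data);
            sendJsonToClient(json);
        }

        private void verifyInput(String path) {
            if (path == null || path.isBlank()) {
                throw new IllegalArgumentException("empty path");
            }
        }

        private byte[] loadDataFromHardDisk(String path) throws Exception {
            return Files.readAllBytes(Path.of(path));
        }

        private String transformDataToJson(byte[] data) {
            // Placeholder; real code would use a JSON library.
            return "{\"payload\":\"" + Base64.getEncoder().encodeToString(data) + "\"}";
        }

        private void sendJsonToClient(String json) {
            System.out.println(json); // placeholder for a network write
        }
    }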

  • Some people like this because readers can get a quick overview of the high-level steps and can choose which details they want to read about.
  • Some people dislike it because when you want to know all the details, you have to jump around in the class to follow the execution flow (this is what JacquesB likely refers to when he writes about adding complexity).

The lesson is: all of them are right, because they are entitled to have an opinion.

Many open source libraries do in fact suffer from objectively poor coding practices, and are maintained with difficulty by a small group of long-term contributors who can cope with the poor readability because they are very familiar with the parts of the code they maintain most frequently. Refactoring code to improve readability after the fact is often a Herculean effort: everyone needs to be on the same page, it's not fun, and it doesn't pay because no new features get implemented.

As others have said, any book about clean code stating anything at all necessarily contains advice that is not universally agreed upon. In particular, almost any rule can be followed with excessive zeal, replacing a readability problem with another one.

Personally, I avoid creating named functions if I don't have a good name for them, and a good name has to be short and describe faithfully what the function does to the outside world. This also ties in with trying to have as few function arguments as possible and no globally writable data. Trying to cut a very complex function down into smaller functions often results in very long argument lists when the function was genuinely complex. Creating and maintaining readable code is an exercise in balancing mutually conflicting common-sense rules. Reading books is good, but only experience will teach you how to spot false complexity, which is where the real readability gains are made.
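
As a hedged illustration of that argument-list problem (entirely invented code): after carving the complex middle out of a long pricing calculation, the helper's parameters end up enumerating every local it touched; the complexity moved, it didn't disappear.

    class Pricing {
        // Six parameters: exactly the state the "complex" part depended on.
        private static double adjustedTotal(double subtotal, double taxRate,
                                            double discount, boolean isMember,
                                            int itemCount, double shippingBase) {
            double shipping = itemCount > 3 ? 0.0 : shippingBase;
            double memberFactor = isMember ? 0.95 : 1.0;
            return (subtotal - discount) * (1 + taxRate) * memberFactor + shipping;
        }

        public static void main(String[] args) {
            System.out.println(adjustedTotal(100.0, 0.20, 10.0, true, 2, 4.99));
        }
    }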

Most open source projects are badly managed. There are obviously exceptions to that, but you will find a lot of junk in the open-source world.

This is not a critique of all the project owners/managers whose projects I am talking about, it is simply a matter of time used. These people have better things to do with their time, like their actual paying job.

In the beginning the code is the work of one person and is probably small. And small code doesn't need to be clean. Or rather, the effort needed to make the code clean is larger than the benefit.

As time goes by, the code becomes more and more a pile of patches from many different people. The patch writers feel no ownership of the code; they just want this one feature added or this one bug fixed in the easiest way possible.

The owner does not have the time to clean things up and nobody else cares.

And the code is getting big. And ugly.

As it gets harder and harder to find your way around the code, people start adding features in the wrong place. And instead of fixing bugs, they add workarounds elsewhere in the code.

At this point it isn't just that people don't care; they no longer dare to clean up, since they are afraid of breaking things.

I have heard people describe code bases as "cruel and unusual punishment".

My personal experiences aren't quite that bad, but I have seen a few very odd things.

It seems to me you are asking: how does this stuff even work if nobody is doing what they are supposed to be doing? And if it does work, then why are we supposed to be doing these things?

The answer, IMHO, is that it works "good enough", also known as the "worse is better" philosophy. Basically, despite the rocky history between open source and Bill Gates, both camps de facto adopted the same idea: most people care about features, not bugs.

This of course also leads us to "normalization of deviance", which leads to situations like Heartbleed where, precisely as if to answer your question, a massive, overgrown spaghetti pile of open source code called OpenSSL went "uncleaned" for something like ten years, winding up with a massive security flaw affecting billions of people.

The solution was to invent a whole new system called LibreSSL, which was going to use clean-ish code, and of course almost nobody uses it.

So how are huge badly coded open source projects maintained? The answer is in the question. A lot of them aren't maintained in a clean state. They are patched randomly by thousands of different people to cover use cases on various strange machines and situations the developers will never have access to test on. The code works "good enough" until it doesn't, when everyone panics and decides to throw money at the problem.

So why should you bother doing something 'the right way' if nobody else is?

The answer is you shouldn't. You either do or you don't, and the world keeps turning regardless, because human nature doesn't change on the scale of a human lifetime. Personally, I only try to write clean code because I like the way it feels to do it.

What constitutes good code depends on the context, and the classic books guiding you on that are, if not too old to discuss open source, at least part of a tradition waging a never-ending war against bad in-house codebases. So it's easy to overlook the fact that libraries have completely different aims, and they're written accordingly. Consider the following issues, in no particular order:

  • When I import a library, or from a library, I'm probably not enough of an expert in its internal structure to know exactly which tiny fraction of its toolkit I need for whatever I'm working on, unless I'm copying what a Stack Exchange answer told me to do. So I start typing from A import (if it's in Python, say) and see what comes up. But that means what I see listed needs to reflect the logical tasks I'll need to borrow, and that's what has to be in the codebase. Countless helper methods that make it shorter will just confuse me.
  • Libraries are there for the most inexpert programmer trying to use some algorithm most people have only vaguely heard of. They need external documentation, and that needs to precisely mirror the code, which it can't do if we keep refactoring everything to make short-method and do-one-thing adherents happy.
  • Every library method people borrow could break code the world over with disastrous consequences if it's taken down or even renamed. Sure, I wish sklearn would correct the typo in Calinski-Harabasz, but that could cause another left-pad incident. In fact, in my experience the biggest problem with library evolution is when they try too hard to adopt some good-code new "improvement" to how they structure everything.
  • In-house, comments are largely a necessary evil at best, for all manner of reasons I needn't regurgitate (although those points do exaggerate somewhat). A good comment says why the code works, not how. But libraries know their readers are competent programmers who couldn't, say, write-linear-algebra their way out of a paper bag. In other words, everything needs commenting re: why it works! (OK, that's another exaggeration.) So that's why you see: signature line, 100-line comment block, one line of code that could literally have gone on the signature line (language permitting, of course). There's a sketch of this shape after the list.
  • Let's say you update something on GitHub and wait to see whether your code will be accepted. It must be clear why your code change works. I know from experience that refactoring to leave the campsite cleaner as part of a functional commit often means a lot of line-saving, rearrangement, and renaming, which makes your unpaid reviewer's job harder and causes other problems mentioned above.
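
To illustrate the "signature, long comment block, one line of code" shape from the comments point above, here is an invented Java sketch (the method and its numerical caveats are made up for the example):

    class Stats {
        /**
         * Returns the sample variance of xs.
         *
         * Why it works this way: we divide by (n - 1) rather than n
         * (Bessel's correction) because the mean is itself estimated from
         * the sample, which would otherwise bias the variance downward.
         * A two-pass computation is used instead of the one-pass
         * sum-of-squares identity, because subtracting two large, nearly
         * equal sums loses precision in floating point. (A real library
         * comment would go on much longer, citing references and edge cases.)
         */
        static double sampleVariance(double[] xs) {
            double m = java.util.Arrays.stream(xs).average().orElse(0.0);
            return java.util.Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum() / (xs.length - 1);
        }
    }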

I'm sure people with more experience than me can mention other points.

There are already a lot of good answers - I want to give the perspective of an open source maintainer.

My perspective

I'm a maintainer of a lot of such projects with less than great code. Sometimes I am even prevented from improving such code because of compatibility concerns, since the libraries are downloaded millions of times every week.

It does make maintaining harder. As a Node.js core member, there are parts of the code I'm afraid to touch, but there is a lot of work to do regardless, and people use the platform successfully and enjoy it. The most important thing is that it works.

On readable code

When you say:

I found the code in many of them to be far from the principles these books advocate for writing clean code, e.g. methods containing hundreds of lines of code.

Lines of code are not a great measure of how readable code is. In the study I linked to, the Linux kernel was analyzed, and a survey of programmers found "regular" code (basically, code that people expect) and consistent code to be more understandable than "clean" code. This also aligns with my personal experience.

Some open source projects aren't too welcoming

Linus "famously" said that Linux shouldn't have a built-in debugger because people using debuggers aren't good enough to work on Linux and he doesn't want to attract more of them.

Personally I absolutely disagree with his stance there - but it is also something people do.

Open source software does not necessarily mean that multiple authors are involved. When software (or a unit of software) is written by a single author, long functions appear frequently.

This comes from the nature of the development process: a simple method gets extended over time as new features are added and bugs are fixed.

Long methods severely reduce new authors' understanding of the functionality. However, with a single author this is rarely a problem, and so it tends to be overlooked. Another characteristic of open source is that a lot of software is not actively developed, so there is no refactoring work that would, for example, split complex methods into multiple simple ones.

You haven't shown any examples, but from my understanding this is also often connected to the development language. Some language communities enforce strict linting rules from the beginning and heavy unit testing (or even TDD). Both linting and unit tests usually prevent this issue (it's hard to unit test complex/long methods).
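
For example, in the Java ecosystem a Checkstyle rule can fail the build once a method grows past a length limit; the threshold below is made up for the example, not a recommendation:

    <!-- Illustrative Checkstyle configuration: reject overly long methods. -->
    <module name="Checker">
      <module name="TreeWalker">
        <module name="MethodLength">
          <property name="max" value="60"/>
        </module>
      </module>
    </module>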

In general, it's harder to make code clean if software is developed by a single author and other contributors are only fixing small issues.

License: CC-BY-SA with attribution
Not affiliated with softwareengineering.stackexchange