Is the Java Stream API intended to replace loops?

https://softwareengineering.stackexchange.com/questions/410624

11-03-2021
|

Question

I mean the question in the sense of: Should the occurrence of simple loops on collections in code in Java 8 and higher be regarded as code smell (except in justified exceptions)?

When it came to Java 8, I assumed that it would be good to treat everything with Stream API now wherever possible. I thought, especially when I use parallelStream() wherever I know that order doesn't matter, this gives the JVM the ability to optimize the execution of my code.

Teammates think differently here. They think lambda transforms are hard to read and we don't use streams much now. I agree that streams are hard to read if the code formatter forces you to write them like this:

return projects.parallelStream().filter(lambda -> lambda.getGeneratorSource() != null)
        .flatMap(lambda -> lambda.getFolders().prallelStream().map(mu -> Pair.of(mu, lambda.getGeneratorSource())))
        .filter(lambda -> !lambda.getLeft().equals(lambda.getRight())).map(Pair::getLeft)
        .filter(lambda -> lambda.getDerivative().isPresent() || lambda.getDpi().isPresent()
                || lambda.getImageScale().isPresent() || lambda.getImageSize().isPresent())
        .findAny().isPresent();

I would rather prefer to write them like this:

return projects.parallelStream()

// skip all projects that don’t declare a generator source
.filter(λ ->
    λ.getGeneratorSource() != null
)

/* We need to remove the folders which are the source folders of their
 * project. To do so, we create pairs of each folder with the source
 * folder … */
.flatMap(λ ->
    λ.getFolders().stream()
    .map(μ ->
        Pair.of(μ, λ.getGeneratorSource()) // Pair<Folder, Folder>
    )
)
// … and drop all folders that ARE the source folders
.filter(λ ->
    !λ.getLeft().equals(λ.getRight())
)
/* For the further processing, we only need the folders, so we can unbox
 * them now */
.map(Pair::getLeft)

// only look for folders that declare a method to generate images
.filter(λ ->
    λ.getDerivative().isPresent() || λ.getDpi().isPresent() ||
    λ.getImageScale().isPresent() || λ.getImageSize().isPresent()
)

// return whether there is any
.findAny().isPresent();

Yes, return is at the top, and it looks quite differently than:

for (Project project : projects) {
    if (Objects.nonNull(project.getGeneratorSource())) {
        continue;
    }
    for (Folder folder : project.getFolders()) {
        if (folder.equals(project.getGeneratorSource())) {
            continue;
        } else if (folder.getDerivative().isPresent() || folder.getDpi().isPresent()
                || folder.getImageScale().isPresent() || folder.getImageSize().isPresent()) {
            return true;
        }
    }
}
return false;

But isn’t that just a matter of habit? Therefore my question is more a question of assessment:

Should Stream be something basic to use (should Java beginners learn for loops at all/first, or should they first learn to use streams), or is that too much premature optimization, and one should use streams rather after it is shown that a code makes a bottleneck.

(Less opinion-based: How do modern Java teaching books (are there still books in IT training?) handle this?)

Edit 1: To prevent the question from losing focus: My question is: Is Stream is intended as a basic programming paradigm every Java programmer should use often, or as a performance feature for senior software enhancers, that should be avoided? You can also look at it internally: Can the JVM decide whether to use a stream construction or process the objects sequentially (e.g. may it only do that if it detects a hotspot?), or does it have to set up complicated parallelization logic in the background each time, with multiple threads, so that this could produce a lot of overhead? In the second case that would be a clear argument to me against using Stream all and everywhere.

No correct solution

OTHER TIPS

A reason to prefer Streams over for is that for does everything. You get to use different names for different operations, rather than having to recognise a pattern spread across tens of lines.

The problem with both versions is doing everything in one method. You are missing functions like

Stream<T> <T> notNull(Stream<T> objects) { 
    return objects.filter(o -> o != null); 
}

Stream<Pair<Folder, Folder>> getFolderPairs(Project project) { 
    return project.getFolders().stream()
        .map(folder ->
            Pair.of(folder, project.getGeneratorSource())
        );
}

boolean isNotSourceFolder(Pair<Folder, Folder> folders) {
    return !folders.getLeft().equals(folders.getRight());
}

Stream<Folder> subfoldersFor(Stream<Project> projects) { 
    return notNull(projects)
        .flatMap(getFolderPairs)
        .filter(isNotSourceFolder)
        .map(Pair::getLeft)
}

boolean canGenerateImages(Folder folder) { 
    return (folder.getDerivative().isPresent() 
         || folder.getDpi().isPresent() 
         || folder.getImageScale().isPresent() 
         || folder.getImageSize().isPresent()) 
}

Note that I use sensible names for the parameters of my lambda expressions, rather than lambda everywhere.

Using those, your original function becomes much simpler, even with a formatter that wants things together.

return subfoldersFor(projects.parallelStream())
    .anyMatch(canGenerateImages);

Is Stream is intended as a basic programming paradigm every Java programmer should use often, or as a performance feature for senior software enhancers, that should be avoided?

I think there's another option that you are not seeing. The primary benefit of Streams (in my humble opinion) is that it's a paradigm for concurrent and parallel algorithms that is much easier to get right.

That is, it's not intended to replace loops in general (although this is a somewhat popular opinion.) But it's not really intended only for experts either. It's intended to allow average developers implement approaches that in the past, only experts could manage.

To this point, a lot of people will use parallel stream to implement multi-threading. This can be fine and it might even lead to an optimal solution but it's actually much less sophisticated than the existing concurrency libraries. And I don't mean just that the code is simpler to write. I mean that the capabilities of the concurrency libraries offer a vast array of design options beyond what the streams API provides. The problem is that you need to have a really strong understanding of concurrency, the Java memory model, and an extensive library.

So ultimately, I don't see streams as an advanced option but one designed for beginners.

Having said that, there are a lot of functional capabilities in the streams API that are better than their for-loop equivalent. Mapping and filtering are much cleaner and provide much more reuse options than e.g. if statements buried in a loop. Unfortunately, Java doesn't make streams and loops work well together by default. There's a reason for that but it tends to lead people on a path of 'streams or bust' and IMO, reduces the readability of code in many circumstances.

Streams are a fine addition to the Java ecosystem, but they cannot replace the traditional for loop.

Limited access to enclosing variables

The main limitation is that code inside lambdas cannot fully access variables of the enclosing method. And that cannot be overcome as lambdas are in fact shorthands for instances of anonymous functional classes. Code in a lambda expression technically does not reside in the enclosing scope, but in a completely different class.

Readability

I hav come to associate streams-based code with poor readability, and that drawback mainly comes from the habit of chaining all stream operations into one, huge expression.

In traditional, non-streams programming, we regard such a chaining of many method calls a code smell, but with streams we happily (?) write expressions occupying lines and lines of code.

Why don't we split these expressions into intellegible parts, giving the intermediate streams meaningful names as local variables, with the benefit of seeing the stream type in the variable declarations? Or group together a few streams operations into a method with a meaningful name? There's nothing forcing us to write streams expressions as single, monolithic blocks.

The reason probably is that most tutorials lead us into that direction, never showing streams expressions broken into manageable pieces.

Debugging

And then there's debugging. Streams-based code is much harder to debug:

The actual execution is triggered by the final collecting step of the operations chain, so e.g. single-stepping through the code is virtually impossible.
Stack traces don't resemble the code structure. There are lots of synthetic method calls between the enclosing method and the business logic inside the stream operations. This not only affects debugging sessions, but also production-time error reporting. The highly-appreciated stack trace logging loses a lot of its value.

Performance

And finally, we have a performance issue. Even with the best optimizers, the overhead of chaining together lots of distinct functional-interface instances has its price. It might not be significant in most situations, but it's there.

Conclusion

While the readability issue can be overcome by adopting a different coding style, there remain enough drawbacks, so streams can't adequately replace traditional for loops.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange