Guidelines for applying DRY in Haskell function definitions

https://stackoverflow.com/questions/828061

06-07-2019
|

Question

I have a question about whether or not a specific way of applying of the DRY principle is considered a good practice in Haskell.I'm going to present an example, and then ask whether the approach I'm taking is considered good Haskell style. In a nutshell, the question is this: when you have a long formula, and then you find yourself needing to repeat some small subsets of that formula elsewhere, do you always put that repeated subset of the formula into a variable so you can stay DRY? Why or why not?

The Example: Imagine we're taking a string of digits, and converting that string into its corresponding Int value. (BTW, this is an exercise from "Real World Haskell").

Here's a solution that works except that it ignores edge cases:

asInt_fold string = fst (foldr helper (0,0) string)
  where
    helper char (sum,place) = (newValue, newPlace)
      where 
        newValue = (10 ^ place) * (digitToInt char) + sum
        newPlace = place + 1

It uses foldr, and the accumulator is a tuple of the next place value and the sum so far.

So far so good. Now, when I went to implement the edge case checks, I found that I needed little portions of the "newValue" formula in different places to check for errors. For example, on my machine, there would be an Int overflow if the input was larger than (2^31 - 1), so the max value I could handle is 2,147,483,647. Therefore, I put in 2 checks:

If the place value 9 (the billions place) and the digit value is > 2, there's an error.
If sum + (10 ^ place) * (digitToInt char) > maxInt, there's an error.

Those 2 checks caused me to repeat part of the formula, so I introduced the following new variables:

digitValue = digitToInt char
newPlaceComponent = (10^place) * digitValue

The reason I introduced those variables is merely an automatic application of the DRY principle: I found myself repeating those portions of the formula, so I defined them once and only once.

However, I wonder if this is considered good Haskell style. There are obvious advantages, but I see disadvantages as well. It definitely makes the code longer, whereas much of the Haskell code I've seen is pretty terse.

So, do you consider this good Haskell style, and do you follow this practice, or not? Why / why not?

And for what it's worth, here's my final solution that deals with a number of edge cases and therefore has quite a large where block. You can see how large the block became due to my application of the DRY principle.

Thanks.

asInt_fold "" = error "You can't be giving me an empty string now"
asInt_fold "-" = error "I need a little more than just a dash"
asInt_fold string | isInfixOf "." string = error "I can't handle decimal points"
asInt_fold ('-':xs) = -1 * (asInt_fold xs) 
asInt_fold string = fst (foldr helper (0,0) string)
  where
    helper char (sum,place) | place == 9 && digitValue > 2 = throwMaxIntError
               | maxInt - sum < newPlaceComponent      = throwMaxIntError
                   | otherwise                             = (newValue, newPlace)
            where
              digitValue =  (digitToInt char)
              placeMultiplier = (10 ^ place)
              newPlaceComponent = placeMultiplier * digitValue
              newValue = newPlaceComponent + sum
              newPlace = place + 1
              maxInt = 2147483647
              throwMaxIntError = 
                        error "The value is larger than max, which is 2147483647"

Solution

As noted by bdonlan, your algorithm could be cleaner---it's especially useful that the language itself detects overflow. As for your code itself and the style, I think the main tradeoff is that each new name imposes a small cognitive burden on the reader. When to name an intermediate result becomes a judgment call.

I personally would not have chosen to name placeMultiplier, as I think the intent of place ^ 10 is much clearer. And I would look for maxInt in the Prelude, as you run the risk of being terribly wrong if run on 64-bit hardware. Otherwise, the only thing I find objectionable in your code are the redundant parentheses. So what you have is an acceptable style.

(My credentials: At this point I have written on the order of 10,000 to 20,000 lines of Haskell code, and I have read perhaps two or three times that. I also have ten times that much experience with the ML family of languages, which require the programmer to make similar decisions.)

OTHER TIPS

DRY is just as good of a principle in Haskell as it is anywhere else :) A lot of the reason behind the terseness you speak of in haskell is that many idioms are lifted out into libraries, and that often those examples you look at have been considered very carefully to make them terse :)

For example, here's an alternate way to implement your digit-to-string algorithm:

asInt_fold ('-':n) = negate (asInt_fold n)
asInt_fold "" = error "Need some actual digits!"
asInt_fold str = foldl' step 0 str
    where
        step _ x
            | x < '0' || x > '9'
            = error "Bad character somewhere!"
        step sum dig =
            case sum * 10 + digitToInt dig of
                n | n < 0 -> error "Overflow!"
                n -> n

A few things to note:

We detect overflow when it happens, not by deciding arbitrary-ish limits on what digits we allow. This signifigantly simplifies the overflow detection logic - and makes it work on any integer type from Int8 to Integer [as long as overflow results in wraparound, doesn't occur, or results in an assertion from the addition operator itself]
By using a different fold, we don't need two seperate states.
No repeating ourselves, even without going out of our way to lift things out - it falls naturally out of re-stating what we're trying to say.

Now, it's not always possible to just reword the algorithm and make the duplication go away, but it's always useful to take a step back and reconsider how you've been thinking about the problem :)

I think the way you've done it makes sense.

You should certainly always break repeated computations out into separately defined values if avoiding repeated computation is important, but in this case that doesn't look necessary. Nevertheless, the broken out values have easy to understand names, so they make your code easier to follow. I don't think the fact that your code is a bit longer as a result is a bad thing.

BTW, instead hardcoding the maximum Int, you can use (maxBound :: Int) which avoids the risk of you making a mistake or another implementation with a different maximum Int breaking your code.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow