How should I store “unknown” and “missing” values in a variable, while still retaining the difference between “unknown” and “missing”?

softwareengineering.stackexchange https://softwareengineering.stackexchange.com/questions/376845

Question

Consider this an "academic" question. I have been wondering about avoiding NULLs from time to time, and this is an example where I can't come up with a satisfactory solution.


Let's assume I store measurements where on occasions the measurement is known to be impossible (or missing). I would like to store that "empty" value in a variable while avoiding NULL. Other times the value could be unknown. So, having the measurements for a certain time-frame, a query about a measurement within that time period could return 3 kinds of responses:

  • The actual measurement at that time (for example, any numerical value including 0)
  • A "missing"/"empty" value (i.e., a measurement was done, and the value is known to be empty at that point).
  • An unknown value (i.e., no measurement has been done at that point. It could be empty, but it could also be any other value).

Important Clarification:

Assuming you had a function get_measurement() returning one of "empty", "unknown" and a value of type "integer". Having a numerical value implies that certain operations can be done on the return value (multiplication, division, ...) but using such operations on NULLs will crash the application if not caught.

I would like to be able to write code, avoiding NULL checks, for example (pseudocode):

>>> value = get_measurement()  # returns `2`
>>> print(value * 2)
4

>>> value = get_measurement()  # returns `Empty()`
>>> print(value * 2)
Empty()

>>> value = get_measurement()  # returns `Unknown()`
>>> print(value * 2)
Unknown()

Note that none of the print statements caused exceptions (as no NULLs were used). So the empty & unknown values would propagate as necessary and the check whether a value is actually "unknown" or "empty" could be delayed until really necessary (like storing/serialising the value somewhere).


Side-Note: The reason I'd like to avoid NULLs, is primarily a brain-teaser. If I want to get stuff done I'm not opposed to using NULLs, but I found that avoiding them can make code a lot more robust in some cases.

Solution

The common way to do this, at least in functional languages, is to use a discriminated union. Such a value is exactly one of: a valid int, a value that denotes "missing", or a value that denotes "unknown". In F#, it might look something like:

type Measurement =
    | Reading of value : int
    | Missing
    | Unknown of value : RawData

A Measurement value will then be a Reading, with an int value, or a Missing, or an Unknown with the raw data as value (if required).

However, if you aren't using a language that supports discriminated unions or their equivalent, this pattern isn't likely to be of much use to you. In that case you could, e.g., use a class with an enum field that denotes which of the three cases holds the correct data.
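As a rough illustration of that fallback, here is a minimal Python sketch of a class with an enum field. The names `Kind`, `Measurement` and `describe` are made up for this example, and `None` is used only internally as the "no reading" placeholder.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    READING = "reading"
    MISSING = "missing"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Measurement:
    kind: Kind
    value: Optional[int] = None  # only meaningful when kind is READING

    def __post_init__(self) -> None:
        # Reject inconsistent combinations up front.
        if (self.kind is Kind.READING) != (self.value is not None):
            raise ValueError("value must be set if and only if kind is READING")


def describe(m: Measurement) -> str:
    # Consumers branch on the enum tag rather than on a null check.
    if m.kind is Kind.READING:
        return f"reading of {m.value}"
    return m.kind.value
```

With this shape, `describe(Measurement(Kind.READING, 0))` gives "reading of 0", so a genuine zero stays distinguishable from both non-values.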

Other tips

If you do not already know what a monad is, today would be a great day to learn. I have a gentle introduction for OO programmers here:

https://ericlippert.com/2013/02/21/monads-part-one/

Your scenario is a small extension to the "maybe monad", also known as Nullable<T> in C# and Optional<T> in other languages.

Let's suppose you have an abstract type to represent the monad:

abstract class Measurement<T> { ... }

and then three subclasses:

sealed class Unknown<T> : Measurement<T> { ... a singleton ...}
sealed class Empty<T> : Measurement<T> { ... a singleton ... }
sealed class Actual<T> : Measurement<T> { ... a wrapper around a T ...}

We need an implementation of Bind:

abstract class Measurement<T>
{
  public Measurement<R> Bind<R>(Func<T, Measurement<R>> f)
  {
    // Unknown and Empty propagate unchanged; only an Actual value is passed to f.
    if (this is Unknown<T>) return Unknown<R>.Singleton;
    if (this is Empty<T>) return Empty<R>.Singleton;
    if (this is Actual<T>) return f(((Actual<T>)this).Value);
    throw ...
  }
}

From this you can write this simplified version of Bind:

public Measurement<R> Bind<R>(Func<T, R> f)
{
  return this.Bind(a => new Actual<R>(f(a)));
}

And now you're done. You have a Measurement<int> in hand. You want to double it:

Measurement<int> m = whatever;
Measurement<int> doubled = m.Bind(a => a * 2);
Measurement<string> asString = m.Bind(a => a.ToString());

And follow the logic; if m is Empty<int> then asString is Empty<String>, excellent.

Similarly, if we have

Measurement<int> First()

and

Measurement<double> Second(int i);

then we can combine two measurements:

Measurement<double> d = First().Bind(Second);

and again, if First() is Empty<int> then d is Empty<double> and so on.

The key step is to get the bind operation correct. Think hard about it.
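For readers who prefer to see the whole thing in one runnable piece, here is a hedged Python sketch of the same idea. It is not a translation of any library: `bind` and `map` are this sketch's names for the two Bind overloads above, and `map` stands in for the "simplified" one.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class Measurement(Generic[T]):
    def bind(self, f: "Callable[[T], Measurement[R]]") -> "Measurement[R]":
        # Unknown and Empty short-circuit; only Actual calls f.
        if isinstance(self, Actual):
            return f(self.value)
        return self  # type: ignore[return-value]

    def map(self, f: Callable[[T], R]) -> "Measurement[R]":
        # The "simplified Bind": wrap a plain function's result in Actual.
        return self.bind(lambda a: Actual(f(a)))


class Unknown(Measurement[T]):
    def __repr__(self) -> str:
        return "Unknown()"


class Empty(Measurement[T]):
    def __repr__(self) -> str:
        return "Empty()"


class Actual(Measurement[T]):
    def __init__(self, value: T) -> None:
        self.value = value

    def __repr__(self) -> str:
        return f"Actual({self.value!r})"


print(Actual(2).map(lambda a: a * 2))  # Actual(4)
print(Empty().map(lambda a: a * 2))    # Empty()
print(Unknown().map(str))              # Unknown()
```

All of the case analysis lives in `bind`; callers never test for Empty or Unknown until they actually need to unwrap the value.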

I think that in this case a variation on a Null Object Pattern would be useful:

public class Measurement
{
    private int value;
    private bool isUnknown = false;
    private bool isMissing = false;

    private Measurement() { }
    public Measurement(int value) { this.value = value; }

    public int Value {
        get {
            if (!isUnknown && !isMissing)
            {
                return this.value;
            }
            throw new SomeException("...");
        }                   
    }

    public static readonly Measurement Unknown = new Measurement
    {
        isUnknown = true
    };

    public static readonly Measurement Missing = new Measurement
    {
        isMissing = true
    };
}

You can turn it into a struct, override Equals/GetHashCode/ToString, add implicit conversions from or to int, and if you want NaN-like behavior you can also implement your own arithmetic operators so that, e.g., Measurement.Unknown * 2 == Measurement.Unknown.
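The NaN-like arithmetic mentioned above could look roughly like this in Python. This is only a sketch: the `state` strings and the rule that "unknown" wins over "missing" when both meet in one expression are assumptions of the example, not something the pattern dictates.

```python
import operator


class Measurement:
    """An int reading, or one of two sentinels that absorb arithmetic."""

    def __init__(self, value: int = 0, state: str = "value") -> None:
        self.value = value
        self.state = state  # "value", "missing" or "unknown"

    def _combine(self, other, op):
        if not isinstance(other, Measurement):
            other = Measurement(other)
        states = (self.state, other.state)
        # Assumption: "unknown" dominates "missing", which dominates readings.
        if "unknown" in states:
            return UNKNOWN
        if "missing" in states:
            return MISSING
        return Measurement(op(self.value, other.value))

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    __rmul__ = __mul__  # multiplication is commutative here

    def __add__(self, other):
        return self._combine(other, operator.add)

    __radd__ = __add__

    def __repr__(self) -> str:
        if self.state == "value":
            return f"Measurement({self.value})"
        return self.state.capitalize() + "()"


UNKNOWN = Measurement(state="unknown")
MISSING = Measurement(state="missing")
```

With this in place, `MISSING * 2` evaluates to the `MISSING` singleton and `2 * UNKNOWN` to `UNKNOWN`, without an explicit check at the call site.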

That said, C#'s Nullable<int> implements all that, with the only caveat being that you can't differentiate between different types of nulls. I'm not a Java person, but my understanding is that Java's OptionalInt is similar, and other languages likely have their own facilities to represent an Optional type.

If you literally MUST use an integer, then there is only one possible solution: use some of the possible values as 'magic numbers' that mean 'missing' and 'unknown',

e.g. 2,147,483,647 and 2,147,483,646.

If you just need the int for 'real' measurements, then create a more complicated data structure

class Measurement {
    public bool IsEmpty;
    public bool IsKnown;
    private int _value; // backing field for Value
    public int Value {
        get {
            if (!IsEmpty && IsKnown) return _value;
            throw new Exception("NaN");
        }
    }
}

Important Clarification:

You can achieve the maths requirement by overloading the operators for the class:

public static Measurement operator+ (Measurement a, Measurement b) {
    if(a.IsEmpty) { return b; }
    ...etc
}

If your variables are floating-point numbers, IEEE754 (the floating point number standard which is supported by most modern processors and languages) has your back: it is a little-known feature, but the standard defines not one, but a whole family of NaN (not-a-number) values, which can be used for arbitrary application-defined meanings. In single-precision floats, for instance, you have 22 free bits that you can use to distinguish between 2^{22} types of invalid values.

Normally, programming interfaces expose only one of them (e.g., Numpy's nan); I don't know if there is a built-in way to generate the others other than explicit bit manipulation, but it's just a matter of writing a couple of low-level routines. (You will also need one to tell them apart, because, by design, a == b always returns false when one of them is a NaN.)

Using them is better than reinventing your own "magic number" to signal invalid data, because they propagate correctly and signal invalid-ness: for instance, you don't risk shooting yourself in the foot if you use an average() function and forget to check for your special values.

The only risk is libraries not supporting them correctly, since they are quite an obscure feature: for instance, a serialization library may 'flatten' them all to the same nan (which looks equivalent to it for most purposes).
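As a sketch of the "couple of low-level routines" mentioned above, here is one way to build and read tagged NaNs in Python. Python floats are double precision, so this example has 51 payload bits rather than the 22 of single precision. `make_nan` and `nan_payload` are names invented for this example, and whether a payload survives arithmetic is left to the hardware, so the sketch only relies on the bit-level round trip.

```python
import math
import struct

_QUIET_NAN = 0x7FF8_0000_0000_0000     # exponent all ones + quiet bit
_PAYLOAD_MASK = 0x0007_FFFF_FFFF_FFFF  # the 51 low mantissa bits


def make_nan(payload: int) -> float:
    """Build a quiet NaN whose mantissa carries an application-defined code."""
    if not 0 <= payload <= _PAYLOAD_MASK:
        raise ValueError("payload does not fit in 51 bits")
    bits = _QUIET_NAN | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def nan_payload(x: float) -> int:
    """Read the code back; == cannot be used because NaN != NaN."""
    if not math.isnan(x):
        raise ValueError("not a NaN")
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & _PAYLOAD_MASK


MISSING = make_nan(1)  # hypothetical codes chosen for this example
UNKNOWN = make_nan(2)
```

Both values still behave as ordinary NaNs (`math.isnan(MISSING * 2)` is true), and `nan_payload` tells them apart where `==` cannot.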

Following on David Arno's answer, you can do something like a discriminated union in OOP, and in an object-functional style such as that afforded by Scala, by Java 8 functional types, or a Java FP library such as Vavr or Fugue it feels fairly natural to write something like:

var value = Measurement.of(2);
out.println(value.map(x -> x * 2));

var empty = Measurement.empty();
out.println(empty.map(x -> x * 2));

var unknown = Measurement.unknown();
out.println(unknown.map(x -> x * 2));

printing

Value(4)
Empty()
Unknown()

(Full implementation as a gist.)

An FP language or library provides other tools like Try (a.k.a. Maybe) (an object that contains either a value, or an error) and Either (an object that contains either a success value or a failure value) that could also be used here.

The ideal solution to your problem is going to hinge on why you care about the difference between a known failure and a known unreliable measurement, and what downstream processes you want to support. Note, 'downstream processes' for this case does not exclude human operators or fellow developers.

Simply coming up with a "second flavor" of null doesn't give the downstream set of processes enough information for deriving a reasonable set of behaviors.

If you are relying instead on contextual assumptions about the source of bad behaviors being made by downstream code, I'd call that bad architecture.

If you know enough to distinguish between a reason for failure and a failure without a known reason, and that information is going to inform future behaviors, you should be communicating that knowledge downstream, or handling it inline.

Some patterns for handling this:

  • Sum types
  • Discriminated unions
  • Objects or structs containing an enum representing the result of the operation and a field for the result
  • Magic strings or magic numbers that are impossible to achieve via normal operation
  • Exceptions, in languages in which this use is idiomatic
  • Realizing that there isn't actually any value in differentiating between these two scenarios and just using null

If I were concerned with "getting something done" rather than an elegant solution, the quick and dirty hack would be to simply use the strings "unknown", "missing", and 'string representation of my numeric value', which would then be converted from a string and used as needed. Implemented quicker than writing this, and in at least some circumstances, entirely adequate. (I'm now forming a betting pool on the number of downvotes...)

The gist of the question seems to be "How do I return two unrelated pieces of information from a method which returns a single int? I never want to check my return values, and nulls are bad, don't use them."

Let's look at what you are wanting to pass. You are passing either an int, or a non-int rationale for why you can't give the int. The question asserts that there will only be two reasons, but anyone who has ever made an enum knows that any list will grow. Scope to specify other rationales just makes sense.

Initially, then, this looks like it might be a good case for throwing an exception.

When you want to tell the caller something special which isn't in the return type, exceptions are often the appropriate system: exceptions are not just for error states, and allow you to return a lot of context and rationale to explain why you just can't int today.

And this is the ONLY system which allows you to return guaranteed-valid ints, and guarantee that every int operator and method that takes ints can accept the return value of this method without ever needing to check for invalid values like null, or magic values.

But exceptions are really only a valid solution if, as the name implies, this is an exceptional case, not the normal course of business.

And a try/catch and handler is just as much boilerplate as a null check, which was what was objected to in the first place.

And if the caller doesn't contain the try/catch, then the caller's caller has to, and so on up.


A naive second pass is to say "It's a measurement. Negative distance measurements are unlikely." So for some measurement Y, you can just have consts for

  • -1=unknown,
  • -2=impossible to measure,
  • -3=refused to answer,
  • -4=known but confidential,
  • -5=varies depending on moon phase, see table 5a,
  • -6=four-dimensional, measurements given in title,
  • -7=file system read error,
  • -8=reserved for future use,
  • -9=square/cubic so Y is same as X,
  • -10=is a monitor screen so not using X,Y measurements: use X as the screen diagonal,
  • -11=wrote the measurements down on the back of a receipt and it was laundered into illegibility but I think it was either 5 or 17,
  • -12=... you get the idea.

This is the way it is done in a lot of old C systems, and even in modern systems where there is a genuine constraint to int, and you can't wrap it to a struct or monad of some type.
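A small Python sketch of the in-band constants, using only the first two entries from the list above (the names `UNKNOWN`, `IMPOSSIBLE_TO_MEASURE`, `is_valid` and `doubled` are made up for the example), shows both the approach and its main hazard:

```python
UNKNOWN = -1
IMPOSSIBLE_TO_MEASURE = -2


def is_valid(measurement: int) -> bool:
    # Assumption: real measurements are never negative.
    return measurement >= 0


def doubled(measurement: int) -> int:
    # The sentinel has to be checked before any arithmetic,
    # otherwise it is silently treated as a number.
    if not is_valid(measurement):
        return measurement
    return measurement * 2
```

Without the check, `UNKNOWN * 2` evaluates to `-2`, which is exactly `IMPOSSIBLE_TO_MEASURE`: the arithmetic quietly turns one reason into another.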

If the measurements can be negative, then you just make your data type larger (e.g. long int) and have the magic values be higher than the range of the int, and ideally begin with some value that will show up clearly in a debugger.

There are good reasons to have them as a separate variable, rather than just having magic numbers, though. For example, strict typing, maintainability, and conforming to expectations.


In our third attempt, then, we look at cases where it is the normal course of business to have non-int values. For example, if a collection of these values may contain multiple non-integer entries. This means an exception handler may be the wrong approach.

In that case, it looks like a good case for a structure which passes the int and the rationale. Again, this rationale can just be a const like the above, but instead of holding both in the same int, you store them as distinct parts of a structure. Initially, we have the rule that if the rationale is set, the int will not be set. But we are no longer tied to this rule; we can provide rationales for valid numbers too, if need be.

Either way, every time you call it, you still need boilerplate, to test the rationale to see if the int is valid, then pull out and use the int part if the rationale lets us.

This is where you need to investigate your reasoning behind "don't use null".

Like exceptions, null is meant to signify an exceptional state.

If a caller is calling this method and ignoring the "rationale" part of the structure completely, expecting a number without any error handling, and it gets a zero, then it'll handle the zero as a number, and be wrong. If it gets a magic number, it'll treat that as a number, and be wrong. But if it gets a null, it'll fall over, as it damn well should do.

So every time you call this method you must put in checks for its return value, however you handle the invalid values, whether in-band or out of band, try/catch, checking the struct for a "rationale" component, checking the int for a magic number, or checking an int for a null...

The alternative, to handle multiplication of an output which might contain an invalid int and a rationale like "My dog ate this measurement", is to overload the multiplication operator for that structure.

...And then overload every other operator in your application that might get applied to this data.

...And then overload all methods that might take ints.

...And all of those overloads will need to still contain checks for invalid ints, just so that you can treat the return type of this one method as if it were always a valid int at the point when you are calling it.

So the original premise is false in various ways:

  1. If you have invalid values, you can't avoid checking for those invalid values at any point in the code where you're handling the values.
  2. If you're returning anything other than an int, you're not returning an int, so you can't treat it like an int. Operator overloading lets you pretend to, but that's just pretend.
  3. An int with magic numbers (including NULL, NAN, Inf...) is no longer really an int, it's a poor-man's struct.
  4. Avoiding nulls will not make code more robust, it will just hide the problems with ints, or move them into a complex exception-handling structure.

I don't understand the premise of your question, but here's the face value answer. For Missing or Empty, you could do math.nan (Not a Number). You can perform any mathematical operations on math.nan and it will remain math.nan.

You can use None (Python's null) for an unknown value. You shouldn't be manipulating an unknown value anyways, and some languages (Python is not one of them) have special null operators so that the operation is only performed if the value is nonnull, otherwise the value remains null.

Other languages have guard clauses (like Swift or Ruby), and Ruby has a conditional early return.
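A minimal Python sketch of that split, assuming measurements are floats (`math.nan` is a float, so integer readings would need converting). The `double` helper is just an example operation:

```python
import math
from typing import Optional

MISSING = math.nan  # propagates through arithmetic on its own
UNKNOWN = None      # must be checked before use


def double(value: Optional[float]) -> Optional[float]:
    if value is None:  # unknown stays unknown
        return None
    return value * 2   # nan * 2 is still nan


print(double(2.0))      # 4.0
print(double(MISSING))  # nan
print(double(UNKNOWN))  # None
```

Note that only the unknown case needs an explicit check; the missing case rides along through the arithmetic.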

I've seen this solved in Python in a few different ways:

  • with a wrapper data structure, since numerical information is usually about an entity and has a measurement time. The wrapper can override magic methods like __mul__ so that no exceptions are raised when your Unknown or Missing values come up. Numpy and pandas might have such capability in them.
  • with a sentinel value (like your Unknown or -1/-2) and an if statement
  • with a separate boolean flag
  • with a lazy data structure: your function performs some operation on the structure and returns; the outermost function that needs the actual result evaluates the lazy data structure
  • with a lazy pipeline of operations: similar to the previous one, but this one can be used on a set of data or a database

How the value is stored in memory is dependent on the language and implementation details. I think what you mean is how the object should behave to the programmer. (This is how I read the question, tell me if I'm wrong.)

You've proposed an answer to that in your question already: use your own class that accepts any mathematical operation and returns itself without raising an exception. You say you want this because you want to avoid null checks.

Solution 1: don't avoid null checks

Missing can be represented as math.nan
Unknown can be represented as None

If you have more than one value, you can filter() to only apply the operation on values that aren't Unknown or Missing, or whatever values you want to ignore for the function.

I can't imagine a scenario where you need a null-check on a function that acts on a single scalar. In that case, it's good to force null-checks.


Solution 2: use a decorator that catches exceptions

In this case, Missing could raise MissingException and Unknown could raise UnknownException when operations are performed on it.

@suppressUnknown(value=Unknown) # if an UnknownException is raised, return this value instead
@suppressMissing(value=Missing)
def sigmoid(value):
    ...

The advantage of this approach is that the properties of Missing and Unknown are only suppressed when you explicitly ask for them to be suppressed. Another advantage is that this approach is self-documenting: every function shows whether or not it expects an Unknown or a Missing, and how it handles them.

When a function that doesn't expect a Missing gets a Missing, the function will raise immediately, showing you exactly where the error occurred instead of silently failing and propagating a Missing up the call chain. The same goes for Unknown.

sigmoid can still call sin, even though it doesn't expect a Missing or Unknown, since sigmoid's decorator will catch the exception.
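One possible way to flesh out Solution 2 is sketched below. The decorator and exception names come from the text above; everything else (the `_Raising` sentinel class, the `_suppress` helper, and the particular set of operators that raise) is an assumption made for this example.

```python
import functools
import math


class MissingException(Exception):
    pass


class UnknownException(Exception):
    pass


class _Raising:
    """Sentinel that raises as soon as arithmetic is attempted on it."""

    def __init__(self, name, exc):
        self._name, self._exc = name, exc

    def _fail(self, *args):
        raise self._exc(self._name)

    # Only a handful of operations are covered in this sketch.
    __add__ = __radd__ = __mul__ = __rmul__ = __neg__ = __float__ = _fail

    def __repr__(self):
        return self._name


Missing = _Raising("Missing", MissingException)
Unknown = _Raising("Unknown", UnknownException)


def _suppress(exc):
    def factory(value):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                try:
                    return func(*args, **kwargs)
                except exc:
                    return value  # explicit, per-function fallback
            return wrapper
        return decorator
    return factory


suppressMissing = _suppress(MissingException)
suppressUnknown = _suppress(UnknownException)


@suppressUnknown(value=Unknown)
@suppressMissing(value=Missing)
def sigmoid(value):
    return 1 / (1 + math.exp(-value))
```

Here `sigmoid(Missing)` returns `Missing`, while an undecorated call such as `math.sin(Missing)` raises `MissingException` at the point of use.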

Suppose you are fetching the number of CPUs in a server. If the server is switched off, or has been scrapped, that value simply doesn't exist. It is a measurement that does not make any sense (maybe "missing"/"empty" are not the best terms), but the value is "known" to be nonsensical. If the server exists, but the process of fetching the value crashes, measuring it is valid but fails, resulting in an "unknown" value.

Both of these sound like error conditions, so I would judge that the best option here is to simply have get_measurement() throw both of these as exceptions immediately (such as DataSourceUnavailableException or SpectacularFailureToGetDataException, respectively). Then, if any of these issues occur, the data-gathering code can react to it immediately (such as by trying again in the latter case), and get_measurement() only has to return an int in the case that it can successfully get the data from the data source - and you know that the int is valid.
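A hedged Python sketch of that control flow follows. The two exception names are taken from the paragraph above; `read_with_retry`, its `attempts` parameter and the `get_measurement` callable passed in are hypothetical.

```python
class DataSourceUnavailableException(Exception):
    """The thing being measured does not exist (the "missing" case)."""


class SpectacularFailureToGetDataException(Exception):
    """The measurement was attempted but failed (the "unknown" case)."""


def read_with_retry(get_measurement, attempts=3):
    """Return a plain int, retrying only the failures that might be transient."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return get_measurement()
        except SpectacularFailureToGetDataException:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide
        # DataSourceUnavailableException is deliberately not caught here:
        # retrying cannot make a scrapped server exist again.
```

Whatever `read_with_retry` returns is a valid int; both non-value cases leave the function as exceptions.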

If your situation doesn't support exceptions or can't make much use of them, then a good alternative is to use error codes, perhaps returned through a separate output to get_measurement(). This is the idiomatic pattern in C, where the actual output is stored in an input pointer and an error code is passed back as the return value.

The given answers are fine, but still do not reflect the hierarchical relation between value, empty and unknown.

  • Highest comes unknown.
  • Then before using a value first empty must be clarified.
  • Last comes the value to calculate with.

Ugly (for its failing abstraction), but fully operational would be (in Java):

Optional<Optional<Integer>> unknowableValue;

unknowableValue.ifPresent(emptiableValue -> ...);
Optional<Integer> emptiableValue = unknowableValue.orElse(Optional.empty());

emptiableValue.ifPresent(value -> ...);
int value = emptiableValue.orElse(0);

Here functional languages with a nice type system are better.

In fact: The empty/missing and unknown* non-values seem rather part of some process state, some production pipeline. Like Excel spread sheet cells with formulas referencing other cells. There one would think of maybe storing contextual lambdas. Changing a cell would re-evaluate all recursively dependent cells.

In that case an int value would be gotten by an int supplier. An empty value would give an int supplier throwing an empty exception, or evaluating to empty (recursively upwards). Your main formula would connect all values and possibly also return an empty (value/exception). An unknown value would disable evaluation by throwing an exception.

Values probably would be observable, like a java bound property, notifying listeners on change.

In short: The recurring pattern of needing values with additional states empty and unknown seems to indicate that a more spread sheet like bound properties data model might be better.

Yes, the concept of multiple different NA types exists in some languages; more so in statistical ones, where it's more meaningful (viz. the huge distinction between Missing-At-Random, Missing-Completely-At-Random, Missing-Not-At-Random).

  • if we're only measuring widget lengths, then it's not crucial to distinguish between 'sensor failure' or 'power cut' or 'network failure' (although 'numerical overflow' does convey information)

  • but in e.g. data mining or a survey, asking respondents for e.g. their income or HIV status, a result of 'Unknown' is distinct to 'Decline to answer', and you can see that our prior assumptions about how to impute the latter will tend to be different to the former. So languages like SAS support multiple different NA types; the R language doesn't but users very often have to hack around that; NAs at different points in a pipeline can be used to denote very different things.

  • there's also the case where we have multiple NA variables for a single entry ("multiple imputation"). Example: if I don't know any of a person's age, zipcode, education level or income, it's harder to impute their income.

As to how you represent different NA types in general-purpose languages that don't support them, generally people hack up things like floating-point-NaN (requires converting integers), enums or sentinels (e.g. 999 or -1000) for integer or categorical values. Usually there isn't a very clean answer, sorry.

R has built-in missing value support. https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17

Edit: because I was downvoted I'm going to explain a bit.

If you are going to deal with statistics, I recommend using a statistics language such as R, because R is written by statisticians for statisticians. Missing values are such a big topic that whole semester-long courses are taught on them, and there are big books about nothing but missing values.

You can mark your missing data however you want, with a dot or "missing" or whatever. In R you can define what you mean by missing. You don't need to convert them.

Normal way to define missing value is to mark them as NA.

x <- c(1, 2, NA, 4, "")

Then you can see what values are missing;

is.na(x)

And then the result will be;

FALSE FALSE  TRUE FALSE FALSE

As you can see, "" is not missing. You can treat "" as unknown, and NA is missing.

Is there a reason that the functionality of the * operator cannot be altered instead?

Most of the answers involve a lookup value of some sort, but it might just be easier to amend the mathematical operator in this case.

You would then be able to have similar empty()/unknown() functionality across your entire project.

Licensed under: CC-BY-SA with attribution