Python: sort function breaks in the presence of nan

https://stackoverflow.com/questions/4240050

27-09-2019
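Question

sorted([2, float('nan'), 1]) returns [2, nan, 1]

(At least on the ActiveState Python 3.1 implementation.)

I understand that nan is a weird object, so I wouldn't be surprised if it showed up in random places in the sort result. But it also messes up the sorting of the non-nan numbers in the container, which is really unexpected.

I asked a related question about max, and based on that I understand why sort works this way. But should this be considered a bug?

The documentation just says "Return a new sorted list [...]" without specifying any details.

EDIT: I now agree that this doesn't violate the IEEE standard. However, I think it's a bug from any common-sense point of view. Even Microsoft, which isn't known for admitting its mistakes often, has recognized this as a bug and fixed it in the latest version: http://connect.microsoft.com/visualstudio/feedback/details/363379/bug-in-list-double-sort-in-list-which-contains-double-nan.

Anyway, I ended up following @khachik's answer: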

import math
sorted(list_, key=lambda x: float('-inf') if math.isnan(x) else x)

I suspect this incurs a performance hit compared to a language that does it by default, but at least it works (barring any bugs I introduced).

Solution

The previous answers are useful, but perhaps not clear regarding the root of the problem.

In any language, sort applies a given ordering, defined by a comparison function or in some other way, over the domain of the input values. For example, less-than, a.k.a. operator <, could be used throughout if and only if less than defines a suitable ordering over the input values.

But this is specifically NOT true for floating point values and less-than: "NaN is unordered: it is not equal to, greater than, or less than anything, including itself." (Clear prose from GNU C manual, but applies to all modern IEEE754 based floating point)
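A quick way to see this in Python is the minimal sketch below; the variable names are only illustrative.

```python
nan = float('nan')

# Every ordered comparison involving NaN is False, in either direction.
comparisons = [nan < 1.0, nan > 1.0, 1.0 < nan, 1.0 > nan, nan <= nan, nan >= nan]
print(comparisons)   # [False, False, False, False, False, False]

# NaN is not even equal to itself.
print(nan == nan)    # False
print(nan != nan)    # True
```

Because `<` never reports that a NaN is out of place, a sort built on `<` has no information it can use to position the NaN or the values around it.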

So the possible solutions are:

remove the NaNs first, making the input domain well defined via < (or the other sorting function being used)

define a custom comparison function (a.k.a. predicate) that does define an ordering for NaN, such as less than any number, or greater than any number.

Either approach can be used, in any language.
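Both options can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a list of floats); the helper name `nan_last` is hypothetical, and placing NaN after every number is just one of the possible orderings.

```python
import math

values = [2.0, float('nan'), 1.0, 3.0]

# Option 1: drop the NaNs, then sort the remaining, well-ordered values.
without_nan = sorted(v for v in values if not math.isnan(v))
print(without_nan)          # [1.0, 2.0, 3.0]

# Option 2: keep the NaNs but give them a defined position.
# The key (is_nan, value) ranks every NaN after every number; for NaN entries
# the second element is replaced by 0.0 so NaN itself is never compared.
def nan_last(v):
    return (True, 0.0) if math.isnan(v) else (False, v)

with_nan_last = sorted(values, key=nan_last)
print(with_nan_last)        # [1.0, 2.0, 3.0, nan]
```

Swapping `True` and `False` in the key would place the NaNs first instead.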

In practice, for Python, I would prefer to remove the NaNs, provided that either you don't need the fastest possible performance or dropping NaNs is the desired behavior in context.

Otherwise you could pass a suitable comparison function via the cmp argument in older Python versions, or wrap it with functools.cmp_to_key() in newer ones. That is naturally a bit more awkward than removing the NaNs first, and care is required when writing the comparison function to avoid making performance worse.

Other tips

The problem is that there's no correct order if the list contains a NaN, since a sequence a1, a2, a3, ..., an is sorted if a1 <= a2 <= a3 <= ... <= an. If any of these values is a NaN, the sorted property breaks, since for all a, a <= NaN and NaN <= a are both false.
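The definition can be checked directly. The sketch below uses a hypothetical helper, `is_sorted`, that tests every adjacent pair with `<=`, and tries all arrangements of a small list containing a NaN.

```python
from itertools import permutations

def is_sorted(seq):
    # True when every adjacent pair satisfies a <= b.
    return all(a <= b for a, b in zip(seq, seq[1:]))

nan = float('nan')

print(is_sorted([1, 2, 3]))   # True

# No arrangement of a list containing NaN satisfies the definition,
# because every pair that includes the NaN fails the <= test.
results = [is_sorted(list(p)) for p in permutations([1, nan, 2])]
print(any(results))           # False
```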

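I'm not sure whether it's a bug, but a workaround could be the following (Python 3; in Python 2 the comparison function could be passed directly as the cmp argument):

import math
from functools import cmp_to_key

def nan_first(x, y):
    # NaN sorts before every number; two NaNs compare equal
    if math.isnan(x) or math.isnan(y):
        return math.isnan(y) - math.isnan(x)
    return (x > y) - (x < y)  # stands in for Python 2's cmp(x, y)

sorted((2, 1, float('nan')), key=cmp_to_key(nan_first))

which results in:

[nan, 1, 2]

Alternatively, remove the NaNs before sorting or doing anything else.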

IEEE 754 is the standard that defines floating point operations in this instance. It defines any comparison in which at least one operand is a NaN as unordered, so <, <=, > and >= all evaluate to false. Hence, this is not a bug. You need to deal with the NaNs before operating on your array.

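Assuming you want to keep the NaNs and order them as the lowest "values", here is a workaround that handles non-unique NaN objects, the unique numpy NaN, and both numerical and non-numerical objects:

import numpy as np

def is_nan(x):
    # x != x is true only for NaN; the identity test catches np.nan directly
    return (x is np.nan or x != x)

list_ = [2, float('nan'), 'z', 1, 'a', np.nan, 4, float('nan')]
sorted(list_, key = lambda x : float('-inf') if is_nan(x) else x)
# Python 2 result: [nan, nan, nan, 1, 2, 4, 'a', 'z']
# (Python 3 raises TypeError here because str and numbers cannot be compared)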
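On Python 3, strings and numbers cannot be compared with `<`, so sorting the mixed list above raises a TypeError. One possible adaptation, offered only as a sketch, is a key that never compares across types. The helper `mixed_key` and its ranking (NaN first, then numbers, then strings) are assumptions chosen to reproduce the ordering shown above, and plain `float('nan')` is used in place of `np.nan` to keep the example free of dependencies.

```python
def is_nan(x):
    # NaN is the only value here that is not equal to itself, so this test
    # also works for strings, where math.isnan would raise a TypeError.
    return x != x

def mixed_key(x):
    # Hypothetical ranking: 0 = NaN, 1 = numbers, 2 = strings.
    # Values are only ever compared with others of the same rank.
    if is_nan(x):
        return (0, 0)
    if isinstance(x, str):
        return (2, x)
    return (1, x)

list_ = [2, float('nan'), 'z', 1, 'a', float('nan'), 4, float('nan')]
result = sorted(list_, key=mixed_key)
print(result)   # [nan, nan, nan, 1, 2, 4, 'a', 'z']
```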

License: CC-BY-SA with attribution

Not affiliated with StackOverflow