Why is the MS SQL BIGINT type implicitly mapped to the Python float64 type, and what is the best way to handle it? [closed]
Problem
The Python integer type has unlimited precision, so it is more than capable of holding an MS SQL BIGINT (64-bit) value. Still, it is implicitly mapped to the Python float64 type when passed to an external script.
This can cause serious calculation errors for large integers.
So why is it mapped to float64?
My guess is:
R was added via the Extensibility architecture before Python, and R has fixed-precision (32-bit) integers, so it can't hold BIGINT values. Perhaps this is a compatibility decision.
What is the best practice to ensure precise calculations?
A simple but working idea: pass the BIGINT values as strings, then parse them as ints.
I know there is only a slim chance of this causing a problem in practice, but it is still good to know.
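As a quick plain-Python sanity check (no SQL Server required), a string round trip preserves the exact value, while a float64 round trip does not:

```python
a = 2**55 + 1

# Passing the value through a string keeps full precision.
assert int(str(a)) == a

# Passing it through float64 rounds to the nearest representable
# double, silently dropping the trailing +1.
assert int(float(a)) == 2**55
assert int(float(a)) != a
```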
How can it be a problem?
I wrote a simple example to demonstrate it:
CREATE TABLE #test (
big_integer BIGINT
);
INSERT INTO #test
(big_integer)
VALUES
(36028797018963968),
(36028797018963968 + 1);
EXECUTE sp_execute_external_script
@language = N'Python',
@input_data_1 = N'SELECT big_integer FROM #test',
@script = N'
print(InputDataSet.dtypes)
OutputDataSet = InputDataSet
'
Executing this code on SQL Server 2019 will give you the result of:
|   | (No column name)  |
|---|-------------------|
| 1 | 36028797018963970 |
| 2 | 36028797018963970 |
and because of the print(InputDataSet.dtypes)
statement, we can see the following message:
...
STDOUT message(s) from external script:
big_integer float64
dtype: object
...
So we got a floating-point rounding error. For big enough integers, the magnitude of this error is greater than 1, which is the root of the problem.
Teaching floating-point arithmetic is out of the scope of this question, but here are some good materials if you don't understand what happened:
Simple example - Stack Overflow.
Floating Point Numbers - Computerphile
I also share a small IPython sample if you want to experiment with this (which is not a substitute for learning the theory behind it):
In [16]: import numpy as np
In [17]: a = 2**55
In [18]: a
Out[18]: 36028797018963968
In [19]: float(a) == float(a + 1)
Out[19]: True
In [20]: float(a)
Out[20]: 3.602879701896397e+16
In [21]: float(a + 1)
Out[21]: 3.602879701896397e+16
In [22]: np.nextafter(float(a), np.inf)
Out[22]: 3.6028797018963976e+16
Note
To run my example T-SQL, some conditions must be met.
Solution
Here is the best solution I found:
CREATE TABLE #test (
big_integer BIGINT
);
INSERT INTO #test
(big_integer)
VALUES
(36028797018963968),
(36028797018963968 + 1);
CREATE TABLE #out (
big_integer BIGINT
);
INSERT INTO #out
EXECUTE sp_execute_external_script
@language = N'Python',
@input_data_1 = N'SELECT CAST(big_integer AS VARCHAR(20)) AS big_integer FROM #test',
@script = N'
import numpy as np
print(InputDataSet)
InputDataSet["big_integer"] = InputDataSet["big_integer"].astype(np.int64)
InputDataSet["big_integer"] = InputDataSet["big_integer"] + 1
InputDataSet["big_integer"] = InputDataSet["big_integer"].astype(str)
OutputDataSet = InputDataSet
';
SELECT big_integer FROM #out;
I did what I proposed in the question:

- Cast the big_integer column to VARCHAR(20), which is the maximum length of the string representation of a 64-bit signed integer:

  In [34]: len(str(-2**63))
  Out[34]: 20
  In [35]: len(str(2**63-1))
  Out[35]: 19

- Cast it back to the numpy.int64 type in the external script.
- Made a simple calculation: incremented all values in the column.
- Cast the column back to string, still in Python. This step is also necessary, because the implicit type conversion works both ways.
- Inserted the values into the big_integer column of the #out table, which also has the BIGINT type, so the returned strings were implicitly cast back to BIGINT.
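The round trip inside the external script can be sketched with standalone pandas (hypothetical column data, no SQL Server needed) to confirm that no precision is lost:

```python
import numpy as np
import pandas as pd

# Simulate the input the script receives: bigints arriving as strings.
InputDataSet = pd.DataFrame(
    {"big_integer": ["36028797018963968", "36028797018963969"]}
)

# Parse to exact 64-bit integers, do the arithmetic, serialize back.
InputDataSet["big_integer"] = InputDataSet["big_integer"].astype(np.int64)
InputDataSet["big_integer"] = InputDataSet["big_integer"] + 1
InputDataSet["big_integer"] = InputDataSet["big_integer"].astype(str)

print(InputDataSet["big_integer"].tolist())
# ['36028797018963969', '36028797018963970']
```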
Note
It would be a rare event where you need to handle this problem. The integer's absolute value must be bigger than 2^53, beyond which the distance between two consecutive float64 values is greater than 1, so not every integer is exactly representable:
In [50]: def float_distance(x):
...: x_float = float(x)
...: x_next_float = np.nextafter(x_float, np.inf)
...: x_float_diff = x_next_float - x_float
...: return(x, x_float, x_next_float, x_float_diff)
In [51]: float_distance(2**52)
Out[51]: (4503599627370496, 4503599627370496.0, 4503599627370497.0, 1.0)
In [52]: float_distance(2**53)
Out[52]: (9007199254740992, 9007199254740992.0, 9007199254740994.0, 2.0)
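The same spacing can be checked with the standard library alone (Python 3.9+; math.ulp returns the gap between a float and the next representable one):

```python
import math

assert math.ulp(float(2**52)) == 1.0  # integers up to 2**53 still fit exactly
assert math.ulp(float(2**53)) == 2.0  # beyond this, odd integers are lost
assert math.ulp(float(2**55)) == 8.0  # the gap keeps doubling
```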
I guess this may occur if you store the results of high-throughput scientific research in physics or in bioinformatics.