Why is the MS SQL BIGINT type implicitly mapped to the Python float64 type, and what is the best way to handle it? [closed]
Problem
The Python integer type has unlimited precision, so it is more than capable of holding an MS SQL BIGINT (64-bit) value. Still, it is implicitly mapped to the Python float64 type when passed to an external script.
This can cause serious calculation errors for large integers.
So why is it mapped to float64?
My guess is:
R was added via the Extensibility architecture before Python, and R has fixed-precision (32-bit) integers, so it can't hold BIGINT values. Perhaps this is a compatibility decision.
What is the best practice to ensure precise calculations?
A simple but working idea: pass the BIGINT values as strings, then parse them as ints.
I know there is only a slim chance of this causing a problem in practice, but it is still good to know.
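As a quick plain-Python sanity check (no SQL Server required), a string round trip preserves the exact value, while a float64 round trip does not:

```python
a = 2**55 + 1

# Passing the value through a string keeps full precision.
assert int(str(a)) == a

# Passing it through float64 rounds to the nearest representable
# double, silently dropping the trailing +1.
assert int(float(a)) == 2**55
assert int(float(a)) != a
```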
How can it be a problem?
I wrote a simple example to demonstrate it:
CREATE TABLE #test (
big_integer BIGINT
);
INSERT INTO #test
(big_integer)
VALUES
(36028797018963968),
(36028797018963968 + 1);
EXECUTE sp_execute_external_script
@language = N'Python',
@input_data_1 = N'SELECT big_integer FROM #test',
@script = N'
print(InputDataSet.dtypes)
OutputDataSet = InputDataSet
'
Executing this code on SQL Server 2019 will give you the result of:
|   | (No column name)  |
|---|-------------------|
| 1 | 36028797018963970 |
| 2 | 36028797018963970 |
and because of the print(InputDataSet.dtypes)
statement, we can see the following message:
...
STDOUT message(s) from external script:
big_integer float64
dtype: object
...
So we got a floating-point rounding error. For big enough integers, the magnitude of this error is greater than 1, which is the root of the problem.
Teaching floating-point arithmetic is out of the scope of this question, but here are some good materials if you don't understand what happened:
Simple example - Stack Overflow.
Floating Point Numbers - Computerphile
I also share a small IPython sample if you want to experiment with this (which is not a substitute for learning the theory behind it):
In [16]: import numpy as np
In [17]: a = 2**55
In [18]: a
Out[18]: 36028797018963968
In [19]: float(a) == float(a + 1)
Out[19]: True
In [20]: float(a)
Out[20]: 3.602879701896397e+16
In [21]: float(a + 1)
Out[21]: 3.602879701896397e+16
In [22]: np.nextafter(float(a), np.inf)
Out[22]: 3.6028797018963976e+16
Note
To run my example T-SQL, some conditions must be met.
Solution
Here is the best solution I found:
CREATE TABLE #test (
big_integer BIGINT
);
INSERT INTO #test
(big_integer)
VALUES
(36028797018963968),
(36028797018963968 + 1);
CREATE TABLE #out (
big_integer BIGINT
);
INSERT INTO #out
EXECUTE sp_execute_external_script
@language = N'Python',
@input_data_1 = N'SELECT CAST(big_integer AS VARCHAR(20)) AS big_integer FROM #test',
@script = N'
import numpy as np
print(InputDataSet)
InputDataSet["big_integer"] = InputDataSet["big_integer"].astype(np.int64)
InputDataSet["big_integer"] = InputDataSet["big_integer"] + 1
InputDataSet["big_integer"] = InputDataSet["big_integer"].astype(str)
OutputDataSet = InputDataSet
';
SELECT big_integer FROM #out;
I did what I proposed in the question:

- Cast the big_integer column to VARCHAR(20), which is the maximum length of the string representation of a 64-bit signed integer:

  In [34]: len(str(-2**63))
  Out[34]: 20
  In [35]: len(str(2**63-1))
  Out[35]: 19

- Cast it back to the numpy.int64 type in the external script.
- Made a simple calculation: incremented all values in the column.
- Cast the column back to string, still in Python. This step is also necessary, because the implicit type conversion works both ways.
- Inserted the values into the big_integer column of the #out table, which also has the BIGINT type, so the returned strings were implicitly cast back to BIGINT.
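The round trip inside the external script can be sketched with standalone pandas (hypothetical column data, no SQL Server needed) to confirm that no precision is lost:

```python
import numpy as np
import pandas as pd

# Simulate the input the script receives: bigints arriving as strings.
InputDataSet = pd.DataFrame(
    {"big_integer": ["36028797018963968", "36028797018963969"]}
)

# Parse to exact 64-bit integers, do the arithmetic, serialize back.
InputDataSet["big_integer"] = InputDataSet["big_integer"].astype(np.int64)
InputDataSet["big_integer"] = InputDataSet["big_integer"] + 1
InputDataSet["big_integer"] = InputDataSet["big_integer"].astype(str)

print(InputDataSet["big_integer"].tolist())
# ['36028797018963969', '36028797018963970']
```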
Note
It would be a rare event where you need to handle this problem. The integer's absolute value must be bigger than 2^53, beyond which the distance between two consecutive float64 values is greater than 1, so not every integer is exactly representable:
In [50]: def float_distance(x):
...: x_float = float(x)
...: x_next_float = np.nextafter(x_float, np.inf)
...: x_float_diff = x_next_float - x_float
...: return(x, x_float, x_next_float, x_float_diff)
In [51]: float_distance(2**52)
Out[51]: (4503599627370496, 4503599627370496.0, 4503599627370497.0, 1.0)
In [52]: float_distance(2**53)
Out[52]: (9007199254740992, 9007199254740992.0, 9007199254740994.0, 2.0)
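The same spacing can be checked with the standard library alone (Python 3.9+; math.ulp returns the gap between a float and the next representable one):

```python
import math

assert math.ulp(float(2**52)) == 1.0  # integers up to 2**53 still fit exactly
assert math.ulp(float(2**53)) == 2.0  # beyond this, odd integers are lost
assert math.ulp(float(2**55)) == 8.0  # the gap keeps doubling
```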
I guess this may occur if you store the results of high-throughput scientific research in physics or in bioinformatics.