truncation issue with pd.read_csv

https://stackoverflow.com/questions/23534775

17-07-2023
|

Domanda

I would like to seek guidance on remedial steps for an issue i noticed in pandas.read_csv routine. When i store a long integer into a file using pd.to_csv, it stores the data fine - but when i read it back using pd.read_csv, it messes with the last 3 digits. When i try to save it back again using to_csv (without any edits), the numbers in resulting CSV file is different from the original CSV file. I've illustrated the problem below (notice how 4321113141090630389 becomes 4321113141090630400 and 4321583677327450765 becomes 4321583677327450880):

original CSV file created by pd.to_csv:

grep -e 321583677327450 -e 321113141090630 orig.piece 
orig.piece:1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4321583677327450765
orig.piece:5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4321113141090630389

import pandas as pd
import numpy as np

orig = pd.read_csv('orig.piece')
orig.dtypes
Unnamed: 0 int64
aa object
act float64
...
...
s_act float64
dtype: object

>orig['s_act'].head(6)
0 NaN
1 4.321584e+18
2 4.321974e+18
3 4.321494e+18
4 4.321283e+18
5 4.321113e+18
Name: s_act, dtype: float64

>orig['s_act'].fillna(0).astype(int).head(6)
0 0
1 4321583677327450880
2 4321973950881710336
3 4321493786516159488
4 4321282586859217408
5 4321113141090630400

>orig.to_csv('convert.piece')

grep -e 321583677327450 -e 321113141090630 orig.piece convert.piece
orig.piece:1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4321583677327450765
orig.piece:5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4321113141090630389
convert.piece:1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4.321583677327451e+18
convert.piece:5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4.3211131410906304e+18

could you please help me understand why read_csv jumbles the last three digits? Its not even a rounding issue, the digits are totally different (like 4321583677327450765 becomes 4321583677327450880 above) Is it because of the scientific notation that comes in the way - how can we disable it and let pandas treat this data as jus object/string or plan integer/float?

Soluzione

It's floating point error. Because the s_act column has a missing value (pandas doesn't have integer missing values), it reads in s_act with dtype=float (dtypes are defined at the column level in pandas). So you're basically see the following:

>>> x = 4321113141090630389
>>> float(x)
4.32111314109063e+18
>>> int(float(x))
4321113141090630144

In terms of a solution you could change the dtype of s_act to a string when you read it in (the resulting dtype will be oject). For, example:

data = """
id,val,x
1,4321113141090630389,4
2,,5
3,200,4
"""

df = pd.read_csv(StringIO(data),header=True,dtype={'val':str})
print df

   id                  val  x
0   1  4321113141090630389  4
1   2                  NaN  5
2   3                  200  4

print df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 3 columns):
id     3 non-null int64
val    2 non-null object
x      3 non-null int64

df['val'] = df['val'].fillna(0).astype(int)
print df

   id                  val  x
0   1  4321113141090630389  4
1   2                    0  5
2   3                  200  4

Altri suggerimenti

It's a problem with excel reading large numbers. One option is to change the format of the numbers by adding spaces. In this case I'm adding a space between every 5 numbers.

def spaces_in_string(val):    
    try:
        return (' ').join(re.findall('.{1,5}',val))
    except:
        return val

df['col'] = df['col'].apply(spaces_in_string)

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow