Database to dataframes and getting information on filling of columns

https://stackoverflow.com/questions/23171574

06-07-2023

Question

I'm trying to get some metadata from my pandas DataFrames: for every table in a database, I want to know how many rows have data in each column. The code below gives me:

PandasError: DataFrame constructor not properly called!

But I don't know why. It seems to choke on a table that has no data at all, but I don't see why that should be a problem...

import numpy as np
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("mysql+mysqldb://root:123@127.0.0.1/%s" % db)
meta = sqlalchemy.MetaData()
meta.reflect(engine)
tables = meta.tables.keys() # Fetches all table names
cnx = engine.raw_connection() # Raw connection is needed.

df = pd.read_sql('SELECT * FROM offending_table', cnx )
df = df.applymap(lambda x: np.nan if x == "" else x) # turn every "" into NaN

count = df.count()

table = pd.DataFrame(count, columns=['CellsWithData'])
table

The complete error message is:

offending_table
---------------------------------------------------------------------------
PandasError                               Traceback (most recent call last)
<ipython-input-367-f33bb79a6773> in <module>()
     14     count = df.count()
     15 
---> 16     table = pd.DataFrame(count, columns=['CellsWithData'])
     17     if len(all_tables) == 0:
     18         all_tables = table

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    271                                          copy=False)
    272             else:
--> 273                 raise PandasError('DataFrame constructor not properly called!')
    274 
    275         NDFrame.__init__(self, mgr, fastpath=True)

PandasError: DataFrame constructor not properly called!

The table that triggers this message has a few columns, none of which contain any data. The df that gets created looks like:

name           NaN
principal_id   NaN
diagram_id     NaN
version        NaN
definition     NaN

And when I do:

df.count()

I get:

Is that the expected behaviour?

Solution

It appears that the applymap is the culprit here :-)

When the read_sql query returns an empty result set, you will get an empty dataframe. E.g.:

In [2]: df = pd.DataFrame(columns=list('ABC'))

In [3]: df
Out[3]:
Empty DataFrame
Columns: [A, B, C]
Index: []

When you then call applymap on this empty dataframe, it is apparently converted to a Series, and count then returns a single number instead of one value per column:

In [10]: df2 = df.applymap(lambda x: np.nan if x == "" else x)

In [11]: df2
Out[11]:
A   NaN
B   NaN
C   NaN
dtype: float64

In [12]: df2.count()
Out[12]: 0

while doing the count directly on the empty dataframe gives the desired output:

In [13]: df.count()
Out[13]:
A    0
B    0
C    0
dtype: int64
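
To see why this difference matters for the pd.DataFrame(count, columns=[...]) call in the question, here is a minimal sketch. It assumes a recent pandas release, where the constructor raises ValueError rather than the PandasError shown in the traceback, and it uses Series.to_frame as one possible way to build the one-column table; the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for the empty result set of read_sql.
empty = pd.DataFrame(columns=list("ABC"))

# count() on a DataFrame returns a Series with one entry per column,
# which converts cleanly into a one-column table.
per_column = empty.count()
table = per_column.to_frame("CellsWithData")
print(table)

# A bare scalar, with no index supplied, is rejected by the constructor.
# Assumption: recent pandas raises ValueError here; the old release in
# the question raised PandasError instead.
try:
    pd.DataFrame(0, columns=["CellsWithData"])
    scalar_accepted = True
except ValueError as exc:
    scalar_accepted = False
    print("constructor refused scalar:", exc)
```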

I don't know exactly why applymap does this (or if it is a bug), but a simple solution for now would be a quick check before the applymap, so that it only runs when the dataframe actually has rows:

if len(df):
   df = df.applymap(lambda x: np.nan if x == "" else x)

The reason the scalar count is a problem is that the DataFrame constructor does not accept a bare scalar as input data, so pd.DataFrame(count, columns=['CellsWithData']) raises the error.
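
As an alternative to guarding the applymap call, the empty strings can be replaced with DataFrame.replace, which keeps the DataFrame shape even when there are no rows. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database in place of the MySQL one, and the helper cells_with_data, the table filled_table, and the sample rows are hypothetical.

```python
import sqlite3

import numpy as np
import pandas as pd


def cells_with_data(df):
    """Count non-empty cells per column, also for a DataFrame without rows."""
    # replace() returns a DataFrame even when there are no rows,
    # so count() still yields one value per column.
    return df.replace("", np.nan).count().to_frame("CellsWithData")


# Hypothetical in-memory database standing in for the MySQL one.
cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE filled_table (name TEXT, version TEXT)")
cnx.execute("CREATE TABLE offending_table (name TEXT, version TEXT)")
cnx.executemany(
    "INSERT INTO filled_table VALUES (?, ?)",
    [("a", "1"), ("b", ""), ("c", None)],
)

summaries = {}
for table_name in ("filled_table", "offending_table"):
    df = pd.read_sql('SELECT * FROM "%s"' % table_name, cnx)
    summaries[table_name] = cells_with_data(df)

# One row per (table, column) pair.
result = pd.concat(summaries, names=["table", "column"])
print(result)
```

Because the helper always returns a DataFrame with one row per column, the per-table results can be concatenated without special-casing tables that have no rows.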

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow