Question

I have a pandas DataFrame, df_test. It contains a column 'size' which represents size in bytes. I've calculated KB, MB, and GB using the following code:

import locale
import pandas as pd

locale.setlocale(locale.LC_ALL, '')

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format_string("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format_string("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format_string("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')

df_test


             dir       size       size_kb   size_mb size_gb
0  /Users/uname1     994933      971.6 KB    0.9 MB  0.0 GB
1  /Users/uname2  109338711  106,776.1 KB  104.3 MB  0.1 GB

[2 rows x 5 columns]

I've run this over 120,000 rows, and it takes about 2.97 seconds per column * 3 = ~9 seconds total according to %timeit.

Is there any way I can make this faster? For example, instead of returning one column at a time from apply and running it three times, can I return all three columns in one pass to insert back into the original DataFrame?

The other questions I've found all want to take multiple values and return a single value. I want to take a single value and return multiple columns.


Solution

You can return a Series from the applied function that contains the new data, avoiding the need to iterate three times. Passing axis=1 to apply runs the function sizes on each row of the DataFrame, returning a Series to add to a new DataFrame. This Series, s, contains the new values as well as the original data.

def sizes(s):
    s['size_kb'] = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    s['size_mb'] = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    s['size_gb'] = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return s

df_test = df_test.apply(sizes, axis=1)

OTHER TIPS

Using apply together with zip is about 3 times faster than the Series approach.

def sizes(s):
    return locale.format_string("%.1f", s / 1024.0, grouping=True) + ' KB', \
        locale.format_string("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
        locale.format_string("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'

df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))

Test results are:

Separate df.apply(): 

    100 loops, best of 3: 1.43 ms per loop

Return Series: 

    100 loops, best of 3: 2.61 ms per loop

Return tuple:

    1000 loops, best of 3: 819 µs per loop

Some of the current replies work fine, but I want to offer another, maybe more "pandified", option. This works for me with pandas 0.23 (I am not sure whether it works in earlier versions):

import locale
import pandas as pd

locale.setlocale(locale.LC_ALL, '')

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

def sizes(s):
    a = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c

df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")

Notice that the trick is in the result_type parameter of apply, which expands the result into a DataFrame that can be directly assigned to new/old columns.

Really cool answers! Thanks Jesse and jaumebonet! Just some observations regarding:

  • zip(* ...
  • ... result_type="expand")

Although expand is kind of more elegant (pandified), zip(*...) is at least 2x faster. In the simple example below, I got 4x faster.

import pandas as pd

dat = [ [i, 10*i] for i in range(1000)]

df = pd.DataFrame(dat, columns = ["a","b"])

def add_and_sub(row):
    add = row["a"] + row["b"]
    sub = row["a"] - row["b"]
    return add, sub

df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))
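For reference, the relative speed can be checked with timeit; the snippet below is a sketch (timings vary by machine, so no numbers are claimed here), and it also verifies that both routes produce the same values:

```python
import timeit

import pandas as pd

dat = [[i, 10 * i] for i in range(1000)]
df = pd.DataFrame(dat, columns=["a", "b"])

def add_and_sub(row):
    return row["a"] + row["b"], row["a"] - row["b"]

# Time the result_type="expand" route.
t_expand = timeit.timeit(
    lambda: df.apply(add_and_sub, axis=1, result_type="expand"), number=20
)

# Time the zip(*...) route.
t_zip = timeit.timeit(
    lambda: list(zip(*df.apply(add_and_sub, axis=1))), number=20
)

print(f"expand: {t_expand:.4f}s, zip: {t_zip:.4f}s")

# Sanity check: both approaches yield identical columns.
expanded = df.apply(add_and_sub, axis=1, result_type="expand")
zipped_add, zipped_sub = zip(*df.apply(add_and_sub, axis=1))
assert list(expanded[0]) == list(zipped_add)
assert list(expanded[1]) == list(zipped_sub)
```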

Just another readable way. This code adds three new columns and their values, returning a Series, without using extra parameters in the apply call.

def sizes(s):
    val_kb = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    val_mb = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    val_gb = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return pd.Series([val_kb, val_mb, val_gb], index=['size_kb', 'size_mb', 'size_gb'])

df[['size_kb', 'size_mb', 'size_gb']] = df.apply(sizes, axis=1)

A general example from: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)

#    foo  bar
# 0    1    2
# 1    1    2
# 2    1    2

The performance of the top answers varies significantly, and Jesse & famaral42 have already discussed this, but it is worth sharing a fair comparison between the top answers, and elaborating on a subtle but important detail of Jesse's answer: the argument passed to the function also affects performance.

(Python 3.7.4, Pandas 1.0.3)

import pandas as pd
import locale
import timeit


def create_new_df_test():
    df_test = pd.DataFrame([
      {'dir': '/Users/uname1', 'size': 994933},
      {'dir': '/Users/uname2', 'size': 109338711},
    ])
    return df_test


def sizes_pass_series_return_series(series):
    series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return series


def sizes_pass_series_return_tuple(series):
    a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c


def sizes_pass_value_return_tuple(value):
    a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c

Here are the results:

# 1 - Accepted (Nels11 Answer) - (pass series, return series):
9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice how returning tuples is the fastest method, but what is passed in as an argument also affects the performance. The difference in the code is subtle, but the performance improvement is significant.

Test #4 (passing in a single value) is twice as fast as test #3 (passing in a series), even though the operation performed is ostensibly identical.

But there's more...

# 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.

Here is the code for running the tests:

# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('\nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)

print('\nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")

print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))

print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))

A fairly fast way to do this with apply and a lambda: just return the multiple values as a list and then use to_list().

import time

import pandas as pd

dat = [ [i, 10*i] for i in range(100000)]

df = pd.DataFrame(dat, columns = ["a","b"])

def add_and_div(x):
    add = x + 3
    div = x / 3
    return [add, div]

start = time.time()
df[['c','d']] = df['a'].apply(lambda x: add_and_div(x)).to_list()
end = time.time()

print(end-start) # output: 0.27606

Simple and easy:

def func(item_df):
    return [1, 'Label 1'] if item_df['col_0'] > 0 else [0, 'Label 0']

my_df[['col_1', 'col_2']] = my_df.apply(func, axis=1, result_type='expand')

I believe pandas 1.1 breaks the behavior suggested in the top answer here.

import pandas as pd
def test_func(row):
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)

The above code run on pandas 1.1.0 returns:

   a  b   c  d
0  1  i  1i  2
1  1  i  1i  2
2  1  i  1i  2

While in pandas 1.0.5 it returned:

   a   b    c  d
0  1   i   1i  2
1  2   j   2j  3
2  3   k   3k  4

Which I think is what you'd expect.

I am not sure how the release notes explain this behavior; however, as explained here, avoiding mutation of the original rows by copying them restores the old behavior, i.e.:

def test_func(row):
    row = row.copy()   #  <---- Avoid mutating the original reference
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

Generally, to return multiple values, this is what I do:

import numpy as np
import pandas as pd

def gimmeMultiple(group):
    x1 = 1
    x2 = 2
    return np.array([[x1, x2]])

def gimmeMultipleDf(group):
    x1 = 1
    x2 = 2
    return pd.DataFrame(np.array([[x1, x2]]), columns=['x1', 'x2'])

df['size'].astype(int).apply(gimmeMultiple)
df['size'].astype(int).apply(gimmeMultipleDf)

Returning a DataFrame definitely has its perks, but it is sometimes not required. You can look at what apply() returns and play a bit with the functions ;)

You can go 40+ times faster than the top answers here if you do the math in numpy instead, adapting the top two approaches from @Rocky K's answer. The main difference is running on an actual df of 120,000 rows. Numpy is far faster at math when you apply your functions array-wise (instead of value-wise). The best version is by far the third one, because it does the math in numpy. Also notice that it calculates 1024 ** 2 and 1024 ** 3 only once each, instead of once per row, saving 240,000 calculations. Here are the timings on my machine:

Tuples (pass value, return tuple then zip, new columns dont exist):
Runtime: 10.935037851333618 

Tuples (pass value, return tuple then zip, new columns exist):
Runtime: 11.120025157928467 

Use numpy for math portions:
Runtime: 0.24799370765686035

Here is the script I used (adapted from Rocky K) to calculate these times:

import numpy as np
import pandas as pd
import locale
import time

size = np.random.random(120000) * 1000000000
data = pd.DataFrame({'Size': size})

def sizes_pass_value_return_tuple(value):
    a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c

print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df1 = data.copy()
start = time.time()
df1['size_kb'],  df1['size_mb'], df1['size_gb'] = zip(*df1['Size'].apply(sizes_pass_value_return_tuple))
end = time.time()
print('Runtime:', end - start, '\n')

print('Tuples (pass value, return tuple then zip, new columns exist):')
df2 = data.copy()
start = time.time()
df2 = pd.concat([df2, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
df2['size_kb'],  df2['size_mb'], df2['size_gb'] = zip(*df2['Size'].apply(sizes_pass_value_return_tuple))
end = time.time()
print('Runtime:', end - start, '\n')

print('Use numpy for math portions:')
df3 = data.copy()
start = time.time()
df3['size_kb'] = (df3.Size.values / 1024).round(1)
df3['size_kb'] = df3.size_kb.astype(str) + ' KB'
df3['size_mb'] = (df3.Size.values / 1024 ** 2).round(1)
df3['size_mb'] = df3.size_mb.astype(str) + ' MB'
df3['size_gb'] = (df3.Size.values / 1024 ** 3).round(1)
df3['size_gb'] = df3.size_gb.astype(str) + ' GB'
end = time.time()
print('Runtime:', end - start, '\n')

It gives a new dataframe with two columns from the original one.

import pandas as pd
df = ...
df_with_two_columns = df.apply(lambda row: pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']), axis=1)

I wanted to use apply on groupby. I tried to use what you suggested here. It definitely helped me on the way but not all the way.

Adding result_type='expand' did not work (because I use apply on a Series and not on a DataFrame?), and with zip(*...) I lose the index.

If anyone else comes here with the same issue, here is how I (finally) solved it:

dfg = df.groupby(by=['Column1','Column2']).Column3.apply(myfunc)
dfres = pd.DataFrame()
dfres['a'], dfres['b'], dfres['c'] = (dfg.apply(lambda x: x[0]), dfg.apply(lambda x: x[1]), dfg.apply(lambda x: x[2])) 

Or if you know a better way. Tell me.

And do let me know if this is too much out of scope for this discussion.
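A possibly simpler alternative for the groupby case, sketched under the assumption that myfunc returns a fixed-length tuple per group (the data and the myfunc below are hypothetical stand-ins): build the result DataFrame directly from the Series of tuples, reusing the grouped index so it is not lost.

```python
import pandas as pd

# Toy stand-in for the setup above; myfunc here is a hypothetical
# aggregator that returns a 3-tuple (min, max, mean) per group.
df = pd.DataFrame({
    'Column1': ['x', 'x', 'y', 'y'],
    'Column2': [1, 1, 2, 2],
    'Column3': [10.0, 20.0, 30.0, 50.0],
})

def myfunc(col):
    return col.min(), col.max(), col.mean()

dfg = df.groupby(by=['Column1', 'Column2']).Column3.apply(myfunc)

# dfg is a Series of tuples; expand it while keeping the group index.
dfres = pd.DataFrame(dfg.tolist(), index=dfg.index, columns=['a', 'b', 'c'])
print(dfres)
```

pd.DataFrame accepts the list of tuples row-wise, and passing index=dfg.index keeps the group keys as a MultiIndex, so nothing is lost in the expansion.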

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow