Question

When plotting columns of a dataframe with pandas, e.g.

  df.boxplot()

the automatic adjustment of the y-axis can lead to a large amount of unused space in the plot. I wonder if this is because the dataframe has points that exceed the boxplot whiskers (but for some reason the outliers aren't displayed). If that is the case, what would be a good way to automatically adjust ylim so that there isn't so much empty space in the plot?

Solution

I think a combination of the seaborn style and the way matplotlib draws boxplots is hiding your outliers here.

If I generate some skewed data

import seaborn as sns
import pandas as pd
import numpy as np

x = pd.DataFrame(np.random.lognormal(size=(100, 6)),
             columns=list("abcdef"))

And then use the boxplot method on the dataframe, I see something similar

x.boxplot()

But if you change the symbol used to plot outliers, you get

x.boxplot(sym="k.")

Alternatively, you can use the seaborn boxplot function, which does the same thing but with some nice aesthetics:

sns.boxplot(data=x)

OTHER TIPS

Building on eumiro's answer in this SO post (I just extend it to pandas data frames), you could do the following:

import numpy as np
import pandas as pd

def reject_outliers(df, col_name, m=2):
    """ Returns data frame without outliers in the col_name column """
    return df[np.abs(df[col_name] - df[col_name].mean()) < m * df[col_name].std()]

# Create fake data
N = 10
df = pd.DataFrame(dict(a=np.random.rand(N), b=np.random.rand(N)))
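# Add one extreme value to column "b" (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, pd.DataFrame([dict(a=0.1, b=10)])], ignore_index=True)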

# Strip outliers from the "b" column
df = reject_outliers(df, "b")
bp = df.boxplot()

The argument m is the cutoff in standard deviations: rows whose value in col_name lies m or more standard deviations from that column's mean are dropped.
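If you would rather not drop rows at all, another option (not part of the answer above, just a sketch) is to leave the data alone and derive the y-limits from the whisker ends. The helper name whisker_ylim and its pad parameter are made up for illustration, and it assumes the default 1.5 * IQR whisker rule and numeric columns only:

```python
import numpy as np
import pandas as pd

def whisker_ylim(df, whis=1.5, pad=0.05):
    """Hypothetical helper: y-limits spanning every column's whiskers, plus padding."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Values outside the fences become NaN and are skipped by min()/max()
    inside = df.where((df >= q1 - whis * iqr) & (df <= q3 + whis * iqr))
    low, high = inside.min().min(), inside.max().max()
    margin = pad * (high - low)
    return low - margin, high + margin

# Small example frame: column "b" contains one extreme value
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "b": [2.0, 3.0, 4.0, 5.0, 100.0]})
ylim = whisker_ylim(df)
print(ylim)
```

To apply it, pass the result to set_ylim on the axes returned by the plot, e.g. ax = df.boxplot() followed by ax.set_ylim(*ylim). Note that any outliers beyond those limits are then cut off from view rather than removed from the data.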

EDIT:

Why do the whiskers not include the maximum outliers in the first place?

There are several types of boxplots, as described on Wikipedia. The pandas boxplot method calls matplotlib's boxplot. According to the matplotlib documentation, the argument whis (1.5 by default) "Defines the length of the whiskers as a function of the inner quartile range", i.e. the interquartile range. So by design the whiskers won't cover the entire range of the data.
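To make the whisker rule concrete, here is a rough numpy-only sketch of how the whisker ends come out under the default setting. The function name whisker_limits is made up for this example, and it assumes the usual rule that a whisker stops at the most extreme data point within whis * IQR of the box:

```python
import numpy as np

def whisker_limits(values, whis=1.5):
    """Hypothetical helper: (low, high) whisker ends under the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo_fence = q1 - whis * iqr
    hi_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme data points that lie inside the fences
    inside = values[(values >= lo_fence) & (values <= hi_fence)]
    return inside.min(), inside.max()

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])
low, high = whisker_limits(data)
# Anything beyond the whisker ends would be drawn as an outlier (flier)
outliers = data[(data < low) | (data > high)]
print(low, high, outliers)
```

Here the value 50 falls outside the upper fence, so the upper whisker ends at 9 and 50 is treated as an outlier. The axis still has to stretch to include it, which is where the empty space comes from when the outlier markers are not visible.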

Licensed under: CC-BY-SA with attribution