Question

I have been struggling with how to make a Pareto Chart in R using the ggplot2 package. In many cases when making a bar chart or histogram we want items sorted by the X axis. In a Pareto Chart we want the items ordered descending by the value in the Y axis. Is there a way to get ggplot to plot items ordered by the value in the Y axis? I tried sorting the data frame first but it seems ggplot reorders them.

Example:

val <- read.csv("http://www.cerebralmastication.com/wp-content/uploads/2009/11/val.txt")
val<-with(val, val[order(-Value), ])
p <- ggplot(val)
p + geom_bar(aes(State, Value, fill=variable), stat = "identity", position="dodge") + scale_fill_brewer(palette = "Set1")

The data frame val is sorted, but the output looks like this:

[Image: bar chart of the sorted data (source: cerebralmastication.com)]

Hadley correctly pointed out that this produces a much better graphic for showing actuals vs. predicted:

ggplot(val, aes(State, Value)) + geom_bar(data = subset(val, variable == "estimate"), stat = "identity", fill = "grey70") + geom_crossbar(data = subset(val, variable == "actual"), aes(ymin = Value, ymax = Value))

which returns:

[Image: estimates as grey bars with actuals as crossbars (source: cerebralmastication.com)]

But it's still not a Pareto Chart. Any tips?

Solution

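ggplot2 orders the bars by the order of the factor's levels, not by the row order of the data frame. Set the levels of State in decreasing order of Value; unique() is needed because each state appears once per variable, and factor levels must not repeat.

val$State <- factor(val$State, levels = unique(val$State[order(-val$Value)]))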
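
To see the effect in isolation, here is a minimal sketch on a made-up data frame (the name toy and its values are assumptions, not the val data from the question). It only inspects the factor levels, since that is what ggplot2 consults when it lays out a discrete x axis.

```r
# Hypothetical data: two rows per state, mirroring the question's layout
toy <- data.frame(
  State    = c("AL", "AK", "AZ", "AL", "AK", "AZ"),
  variable = rep(c("estimate", "actual"), each = 3),
  Value    = c(0.2, 0.5, 0.3, 0.25, 0.45, 0.35),
  stringsAsFactors = FALSE
)

# Default factor(): levels are sorted alphabetically
default_levels <- levels(factor(toy$State))

# Levels taken from the rows sorted by decreasing Value;
# unique() keeps only the first occurrence of each state
ordered_levels <- unique(toy$State[order(-toy$Value)])
toy$State <- factor(toy$State, levels = ordered_levels)

print(default_levels)      # alphabetical order
print(levels(toy$State))   # order of decreasing Value
```

Sorting the rows alone changes neither result; only the levels argument does.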

OTHER TIPS

Subset and sort your data:

valact <- subset(val, variable=='actual')
valsort <- valact[ order(-valact[,"Value"]),]

From there it's just a standard barplot() with a very manual cumulative function on top:

op <- par(mar=c(3,3,3,3))
# barplot() returns the x positions of the bar midpoints
bp <- barplot(valsort[, "Value"], ylab="", xlab="", ylim=c(0,1),
              names.arg=as.character(valsort[,"State"]), main="How's that?")
# cumulative share of the total, drawn at the bar midpoints
lines(bp, cumsum(valsort[,"Value"])/sum(valsort[,"Value"]), col='red')
axis(4)
box()
par(op)

which should look like this

[Image: base-graphics bar plot with a red cumulative line (source: eddelbuettel.com)]

and it doesn't even need the overplotting trick as lines() happily annotates the initial plot.
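
The line lands on the bars because barplot() returns the x coordinate of each bar's midpoint, and lines() can reuse those coordinates. A minimal self-contained sketch with made-up values (counts and its contents are assumptions):

```r
# Hypothetical values, already sorted in decreasing order
counts <- c(A = 50, B = 30, C = 15, D = 5)

# barplot() draws the bars and returns the x coordinate of each bar's midpoint
mids <- barplot(counts / sum(counts), ylim = c(0, 1))

# Cumulative share of the total, drawn at those midpoints on the same 0..1 scale
cum_share <- cumsum(counts) / sum(counts)
lines(mids, cum_share, col = "red", type = "b")
```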

A traditional Pareto chart in ggplot2:

Developed after reading Cano, E. L., Moguerza, J. M., & Redchuk, A. (2012). Six Sigma with R. (G. Robert, K. Hornik, & G. Parmigiani, Eds.) Springer.

library(ggplot2);library(grid)

counts  <- c(80, 27, 66, 94, 33)
defects <- c("price code", "schedule date", "supplier code", "contact num.", "part num.")
dat <- data.frame(count = counts, defect = defects, stringsAsFactors=FALSE )
dat <- dat[order(dat$count, decreasing=TRUE),]
dat$defect <- factor(dat$defect, levels=dat$defect)
dat$cum <- cumsum(dat$count)
count.sum<-sum(dat$count)
dat$cum_perc<-100*dat$cum/count.sum

p1<-ggplot(dat, aes(x=defect, y=cum_perc, group=1))
p1<-p1 + geom_point(aes(colour=defect), size=4) + geom_path()

p1<-p1+ ggtitle('Pareto Chart')+ theme(axis.ticks.x = element_blank(), axis.title.x = element_blank(),axis.text.x = element_blank())
p1<-p1+theme(legend.position="none")

p2<-ggplot(dat, aes(x=defect, y=count,colour=defect, fill=defect))
p2<- p2 + geom_bar(stat="identity")

p2<-p2+theme(legend.position="none")

plot.new()
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 1)))
print(p1, vp = viewport(layout.pos.row = 1,layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 2,layout.pos.col = 1))
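
If a single panel is preferred over two stacked viewports, one possible variant (not taken from the book cited above) rescales the cumulative percentage onto the count axis and labels it with a secondary axis. This is only a sketch, assuming a ggplot2 version that provides geom_col() and sec_axis(); the scale factor k is an illustrative name.

```r
counts  <- c(80, 27, 66, 94, 33)
defects <- c("price code", "schedule date", "supplier code", "contact num.", "part num.")
dat <- data.frame(count = counts, defect = defects, stringsAsFactors = FALSE)
dat <- dat[order(dat$count, decreasing = TRUE), ]
dat$defect   <- factor(dat$defect, levels = dat$defect)
dat$cum_perc <- 100 * cumsum(dat$count) / sum(dat$count)

# Scale factor so that 100% lines up with the total count on the left axis
k <- sum(dat$count) / 100

# Guarded so the data steps above still run where ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x = defect)) +
    geom_col(aes(y = count), fill = "grey70") +
    geom_line(aes(y = cum_perc * k, group = 1)) +
    geom_point(aes(y = cum_perc * k)) +
    # the secondary axis undoes the scaling, so it reads in percent
    scale_y_continuous(sec.axis = sec_axis(~ . / k, name = "Cumulative %")) +
    ggtitle("Pareto Chart")
  print(p)
}
```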

With a simple example:

 > data
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10 
0.29056 0.23833 0.11003 0.05549 0.04678 0.03788 0.02770 0.02323 0.02211 0.01925 

barplot(data) does things correctly

the ggplot equivalent "should be": qplot(x=names(data), y=data, geom='bar')

But that incorrectly reorders/sorts the bars alphabetically... because that's how levels(factor(names(data))) would be ordered.

Solution: qplot(x=factor(names(data), levels=names(data)), y=data, geom='bar')

Phew!
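
For reference, the same fix can be written with ggplot() directly, which may be preferable because qplot() is deprecated in recent ggplot2 releases. This is a sketch using the values printed above; the names pc_var, df, component and variance are illustrative, and geom_col() assumes a reasonably current ggplot2.

```r
# Values copied from the printed vector above; pc_var is an illustrative name
pc_var <- c(PC1 = 0.29056, PC2 = 0.23833, PC3 = 0.11003, PC4 = 0.05549,
            PC5 = 0.04678, PC6 = 0.03788, PC7 = 0.02770, PC8 = 0.02323,
            PC9 = 0.02211, PC10 = 0.01925)

# Default level order is alphabetical, which puts "PC10" right after "PC1"
alphabetical <- levels(factor(names(pc_var)))

# Explicit levels keep the original order of the vector
df <- data.frame(
  component = factor(names(pc_var), levels = names(pc_var)),
  variance  = unname(pc_var)
)

# Guarded so the data steps above still run where ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(component, variance)) + geom_col()
  print(p)
}
```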

Also, see the package qcc which has a function pareto.chart(). Looks like it uses base graphics too, so start your bounty for a ggplot2-solution :-)

To simplify things, let's just consider only the estimates.

estimates <- subset(val, variable == "estimate")

First we reorder the factor levels, so that States are plotted in decreasing order of Value.

estimates$State <- with(estimates, reorder(State, -Value))
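
What reorder() does here can be checked on a tiny made-up example (State and Value below are hypothetical stand-ins for the columns of estimates):

```r
# Hypothetical values for three states
State <- factor(c("AL", "AK", "AZ"))
Value <- c(0.2, 0.5, 0.3)

print(levels(State))      # default: alphabetical

# reorder() sorts the levels by increasing score, so negating Value
# puts the state with the largest Value first
by_value <- reorder(State, -Value)
print(levels(by_value))
```

Only the levels change; the order of the elements in the vector stays the same, which is why the data frame still has to be sorted separately before taking a cumulative sum.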

Similarly, we reorder the dataset and calculate a cumulative value.

estimates <- estimates[order(estimates$Value, decreasing = TRUE),]
estimates$cumulative <- cumsum(estimates$Value)

Now we are ready to draw the plot. The trick to get a line and bar on the same axes is to convert the State variable (a factor) to be numeric.

p <- ggplot(estimates, aes(State, Value)) + 
  geom_bar(stat = "identity") +
  geom_line(aes(as.numeric(State), cumulative))
p

As mentioned in the question, trying to draw two Pareto plots of two variable groups right next to each other isn't very easy. You'd probably be better off using faceting if you want multiple Pareto plots.
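
One possible way to facet, sketched on made-up data: because a factor has a single level order shared by all panels, this workaround builds one x key per (variable, State) pair, orders the keys within each panel, and strips the prefix again for the axis labels. The names toy, key and cum_share, and the "__" separator, are assumptions for illustration.

```r
# Hypothetical data in the same long layout as val
toy <- data.frame(
  State    = rep(c("AL", "AK", "AZ"), times = 2),
  variable = rep(c("estimate", "actual"), each = 3),
  Value    = c(0.2, 0.5, 0.3, 0.4, 0.1, 0.5),
  stringsAsFactors = FALSE
)

# Sort by panel, then by decreasing Value within the panel
toy <- toy[order(toy$variable, -toy$Value), ]

# One unique key per (panel, state), with levels in the sorted row order
toy$key <- paste(toy$variable, toy$State, sep = "__")
toy$key <- factor(toy$key, levels = toy$key)

# Cumulative share within each panel
toy$cum_share <- ave(toy$Value, toy$variable,
                     FUN = function(v) cumsum(v) / sum(v))

# Guarded so the data steps above still run where ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(key, Value)) +
    geom_col() +
    geom_line(aes(y = cum_share, group = 1)) +
    facet_wrap(~ variable, scales = "free_x") +
    # show only the state part of each key on the axis
    scale_x_discrete(labels = function(k) sub("^.*__", "", k))
  print(p)
}
```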

freqplot = function(x, by = NULL, right = FALSE)
{
  if(is.null(by)) stop('A value for "by" must be specified.')
  # class boundaries, spaced "by" apart, starting at min(x)
  breaks = seq(min(x), max(x), by = by)

  # relative frequency of each class
  table = table(cut(x, breaks = breaks, right = right))
  table = table/sum(table)

  intervs = factor(names(table), levels = names(table))
  freq = as.numeric(table)
  acum = as.numeric(cumsum(table))

  # rescale a vector linearly so that it spans 0 to 1
  normalize.vec = function(x){
    (x - min(x))/(max(x) - min(x))
  }

  dados = data.frame(classe = intervs, freq = freq, acum = acum, acum_norm = normalize.vec(acum))
  # bars for the class frequencies, points and a line for the rescaled cumulative frequency
  p = ggplot(dados) + 
    geom_bar(aes(classe, freq, fill = classe), stat = 'identity') +
    geom_point(aes(classe, acum_norm, group = '1'), shape = I(1), size = I(3), colour = 'gray20') +
    geom_line(aes(classe, acum_norm, group = '1'), colour = I('gray20'))

  p
}
Licensed under: CC-BY-SA with attribution