Question

I have a dataset of SQL queries, containing start and finish timestamps as well as the specific query that was run. I would like to use this information to visualize the number of concurrent queries executing at any given time.

That is to say with a dataset like such:

StartTime,StopTime,SQLStatement
10/11/2012 08:17:18.060,10/11/2012 08:17:19.221,DELETE FROM foo WHERE bar = ?;
10/11/2012 08:17:19.036,10/11/2012 08:17:19.911,SELECT * FROM users WHERE user_id = ?;
10/11/2012 08:17:19.661,10/11/2012 08:17:19.810,SELECT * FROM users WHERE user_id = ?;
...

I would like to produce a plot where the horizontal axis represents time, and the vertical axis represents the instantaneous number of queries running. In the above example, there is one query running at 08:17:19.000 and two queries running at 08:17:19.100.

My naive approach is to iterate over the ~1 million discrete time values, counting the rows that have StartTime and StopTime spanning each value. This is slow and won't scale particularly well--and it seems like R would have a better way to do it!

A 15-minute sample of this data contains ~170,000 rows with millisecond accuracy--methods that could reasonably scale to a day's worth of data would be appreciated, but any solution is welcome!

Solution

My simplistic approach

# Millisecond precision for deltas
options(digits.secs=3)

# Load file
log.file <- read.csv("data/raw.csv", header=TRUE)

# Parse timestamps; format matches the 10/11/2012 08:17:18.060 sample
# (assumed month/day/year)
log.file$StartTime <- as.POSIXct(log.file$StartTime,
                                 format="%m/%d/%Y %H:%M:%OS")
log.file$StopTime <- as.POSIXct(log.file$StopTime,
                                format="%m/%d/%Y %H:%M:%OS")

# One-second grid spanning the logged interval (renamed to avoid masking base::range)
times <- seq(min(log.file$StartTime), max(log.file$StopTime), by=1)

# Count the queries whose interval spans the given timestamp
getsum <- function(tstamp) {
  nrow(subset(log.file, StartTime<=tstamp & StopTime>=tstamp))
}

dset <- data.frame(TIME=times, COUNT=sapply(times, getsum))

plot(dset$TIME, dset$COUNT)
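
The per-timestamp subset() scan above is roughly O(rows × grid points), which is why it crawls at ~170,000 rows. If this counting approach is kept, the scan can be vectorized: a query is running at time t exactly when StartTime <= t and StopTime >= t, so the count at t is the number of starts at or before t minus the number of stops strictly before t. A minimal sketch using base R's findInterval(), assuming the log.file and times objects from above:

# Sorted event times as numeric seconds (findInterval() needs sorted input)
starts <- sort(as.numeric(log.file$StartTime))
stops  <- sort(as.numeric(log.file$StopTime))

# Running at t  <=>  StartTime <= t and StopTime >= t, so:
# count(t) = #(starts <= t) - #(stops < t)
counts <- findInterval(as.numeric(times), starts) -
          findInterval(as.numeric(times), stops, left.open=TRUE)

dset <- data.frame(TIME=times, COUNT=counts)
plot(dset$TIME, dset$COUNT)

Each findInterval() call is a binary search over the sorted vector, so this costs O((rows + grid points) · log rows) rather than a full scan per grid point.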

A better solution

While this doesn't exactly meet the criteria I laid out earlier (i.e., it doesn't fill in the time gaps), it does give me a graph close to what I'm looking for; a sketch of one way to fill the gaps follows the listing below. However, if there is a solution that hits the two points I asked for earlier, I'm still interested.

# Millisecond precision for deltas
options(digits.secs=3)

# Load file
log.file <- read.csv("data/raw.csv", header=TRUE)

# Parse timestamps (assumed month/day/year, as in the sample data)
log.file$StartTime <- as.POSIXct(log.file$StartTime,
                                 format="%m/%d/%Y %H:%M:%OS")
log.file$StopTime <- as.POSIXct(log.file$StopTime,
                                format="%m/%d/%Y %H:%M:%OS")

# Create data frames for query start times and query stop times
# and give them +1 and -1 values
queries.start <- data.frame(Time=log.file$StartTime, Value=1)
queries.stop <- data.frame(Time=log.file$StopTime, Value=-1)

# Merge the two together and order by time
queries.both <- rbind(queries.start, queries.stop)
queries.both <- queries.both[with(queries.both, order(Time)), ]

# Create a cumulative sum data frame, and plot the results
queries.sum <- data.frame(Time=queries.both$Time, Queries=cumsum(queries.both$Value))
plot(queries.sum, type="l")
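
To also fill in the time gaps mentioned above, the cumulative-sum step function can be sampled onto a regular grid. A minimal sketch, assuming the queries.sum data frame from the listing above (its Time column is already sorted):

# Regular one-second grid spanning the logged interval
tgrid <- seq(min(queries.sum$Time), max(queries.sum$Time), by=1)

# For each grid point, locate the last event at or before it; the
# cumulative sum at that event is the number of queries running then
idx <- findInterval(as.numeric(tgrid), as.numeric(queries.sum$Time))
filled <- data.frame(Time=tgrid, Queries=queries.sum$Queries[idx])

plot(filled, type="l")

Because the grid starts at the first event, every grid point maps to a valid index, and tied event times resolve to the last event at that instant, so simultaneous starts and stops are counted correctly.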