Question

I have network traffic data in the following for for each hour of a ten day period as follows in a R dataset.

   Day   Hour         Volume          Category
    0    00            100            P2P
    0    00            50             email
    0    00            200            gaming
    0    00            200            video
    0    00            150            web
    0    00            120            P2P
    0    00            180            web
    0    00            80             email
    ....
    0    01            150            P2P
    0    01            200            P2P
    0    01             50            Web
    ...
    ...
    10   23            100            web
    10   23            200            email
    10   23            300            gaming
    10   23            300            gaming

As seen there are repetition of Category within a single hour also. I need to calculate the volatility and the peak hour to average hour ratios of these different application categories.

Volatility: Standard deviation of hourly volumes divided by hourly average.

Peak hour to avg. hour ratio: Ratio of volume of the maximum hour to the vol. of the average hour for that application.

So how do I aggregate and calculate these two statistics for each category? I am new to R and don't have much knowledge of how to aggregate and get the averages as mentioned.

So, the final result would look something like this where first the volume for each category is aggregated on a single 24 hour period by summing the volume and then calculating the two statistics

Category    Volatility      Peak to Avg. Ratio
Web            0.55            1.5
P2P            0.30            2.1
email          0.6             1.7
gaming         0.4             2.9

Edit: plyr got me as far as this.

stats = ddply(
    .data = my_data
    , .variables = .( Hour , Category)
    , .fun = function(x){
        to_return = data.frame(
            volatility = sd((x$Volume)/mean(x$Volume))
            , pa_ratio = max(x$Volume)/mean(x$Volume)
        )
        return( to_return )
    }
)

But this is not what I was hoping for. I want the statistics per Category where all the hours of the days are aggregated first into 24 hours by summing the volumes and then the volatility and PA ratio calculated. Any suggestions for improvement?

Was it helpful?

Solution

You'd need to do it in two stages (using the plyr package): First, as you pointed out, there can be multiple Day-Hour combos for the same category, so we first aggregate, for each category, its totals within each Hour, regardless of the day:

df1 <- ddply( df, .(Hour, Category), summarise, Volume = sum(Volume))

Then you get your stats:

> ddply(df1, .(Category), summarise,
+            Volatility = sd(Volume)/mean(Volume),
+            PeakToAvg = max(Volume)/mean(Volume) )

  Category Volatility PeakToAvg
1      P2P  0.3225399  1.228070
2      Web         NA  1.000000
3    email  0.2999847  1.212121
4   gaming  0.7071068  1.500000
5    video         NA  1.000000
6      web  0.7564398  1.534884
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top