How to calculate most frequent value combinations

https://datascience.stackexchange.com/questions/6492

16-10-2019

Question

I have the following CSV data:

shot_id,round_id,hole,shotType,clubType,desiredShape,lineDirection,shotQuality,note
48,2,1,tee,driver,straight,straight,good,
49,2,1,approach,iron,straight,right,bad,
50,2,1,approach,wedge,straight,straight,bad,
51,2,1,approach,wedge,straight,straight,bad,
52,2,1,putt,putter,straight,straight,good,
53,2,1,putt,putter,straight,straight,good,
54,2,2,tee,driver,draw,straight,good,
55,2,2,approach,iron,draw,straight,good,
56,2,2,putt,putter,straight,straight,good,
57,2,2,putt,putter,straight,straight,good,
58,2,3,tee,driver,draw,straight,good,
59,2,3,approach,iron,straight,right,good,
60,2,3,chip,wedge,straight,straight,good,
61,2,3,putt,putter,straight,straight,good,
62,2,4,tee,iron,straight,straight,good,
63,2,4,putt,putter,straight,straight,good,
64,2,4,putt,putter,straight,straight,good,
65,2,5,tee,driver,straight,left,good,
66,2,5,approach,wedge,straight,straight,good,
67,2,5,putt,putter,straight,straight,bad,
68,2,5,putt,putter,straight,straight,good,
69,2,6,tee,driver,draw,straight,bad,
70,2,6,approach,hybrid,draw,straight,good,
71,2,6,putt,putter,straight,straight,good,
72,2,6,putt,putter,straight,straight,good,
73,2,7,tee,driver,straight,straight,good,
74,2,7,approach,wood,fade,straight,good,
75,2,7,approach,wedge,straight,straight,bad,long
76,2,7,putt,putter,straight,straight,good,
77,2,7,putt,putter,straight,straight,good,
78,2,8,tee,iron,straight,right,bad,
79,2,8,approach,wedge,straight,straight,good,
80,2,8,putt,putter,straight,straight,bad,
81,2,9,tee,driver,straight,straight,good,
82,2,9,approach,iron,straight,straight,good,
83,2,9,approach,wedge,straight,straight,bad,
84,2,9,putt,putter,straight,straight,good,
85,2,9,putt,putter,straight,straight,good,
86,2,10,tee,driver,straight,left,good,
87,2,10,approach,iron,straight,left,good,
88,2,10,chip,wedge,straight,straight,good,
89,2,10,putt,putter,straight,straight,good,
90,2,10,putt,putter,straight,straight,good,
91,2,11,tee,driver,draw,straight,good,
92,2,11,approach,iron,draw,straight,good,
93,2,11,putt,putter,straight,straight,good,
94,2,11,putt,putter,straight,straight,good,
95,2,12,tee,iron,draw,straight,good,
96,2,12,putt,putter,straight,straight,good,
97,2,12,putt,putter,straight,straight,good,
98,2,13,tee,driver,draw,straight,good,
99,2,13,approach,wood,straight,straight,bad,topped
100,2,13,putt,putter,straight,straight,good,
101,2,13,putt,putter,straight,straight,good,
102,2,14,tee,driver,draw,straight,good,
103,2,14,approach,wood,straight,straight,bad,
104,2,14,approach,iron,draw,straight,good,
105,2,14,approach,wedge,straight,straight,bad,
106,2,14,putt,putter,straight,straight,bad,
107,2,14,putt,putter,straight,straight,good,
108,2,15,tee,iron,draw,right,bad,
109,2,15,approach,wedge,straight,straight,good,
110,2,15,putt,putter,straight,straight,good,
111,2,15,putt,putter,straight,straight,good,
112,2,16,tee,driver,draw,right,good,
113,2,16,approach,iron,straight,left,bad,
114,2,16,approach,wedge,straight,left,bad,
115,2,16,putt,putter,straight,straight,good,
116,2,17,tee,driver,straight,straight,good,
117,2,17,approach,wood,straight,right,bad,
118,2,17,approach,wedge,straight,straight,good,
119,2,17,putt,putter,straight,straight,good,
120,2,17,putt,putter,straight,straight,good,
121,2,18,tee,driver,fade,right,bad,
122,2,18,approach,wedge,straight,straight,good,
123,2,18,approach,wedge,straight,straight,good,
124,2,18,putt,putter,straight,straight,good,
125,2,18,putt,putter,straight,straight,good,

I would like to identify which combinations of values occur most frequently.

Club types: driver, wood, iron, wedge, putter
Shot types: tee, approach, chip, putt
Line directions: left, center, right
Shot qualities: good, bad, neutral

Ideally, I'd be able to identify a sweet-spot (no pun intended) combination such as: "driver" + "tee" + "straight" + "good"

I intend only to measure this for a static dataset, not for any future values or prediction. So, my thought is that this is probably a clustering / k-means problem. Is that correct?

If so, how would I begin doing a k-means analysis with these types of values in R?

If it isn't a k-means problem, then what is it?

Solution

If I understand your question, you want to know which combination is most frequent, or how frequent a combination is relative to others. For a static dataset, simple counting is enough: the approach below tallies how often each unique combination of all five columns occurs.

The plyr package has a nifty utility for grouping unique combinations of columns in a data.frame. We specify the names of the columns we want to group by, and then a function to apply to each of those combinations. In this case, we specify the columns associated with your golf shot qualities and use the function nrow, which counts the number of rows in each subset of the full data.frame whose values in those columns are identical.

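# You need this library for the ddply() function
library(plyr)

# Assumes the CSV above has already been read into a data.frame named golf,
# e.g. golf <- read.csv("golf.csv")  (the file name is hypothetical)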

# These are the columns that determine a unique situation (change this if you need)
qualities <- c("shotType","clubType","desiredShape","lineDirection","shotQuality")

# The call to ddply() actually gives us what we want, which is the number 
# of times that combination is present in the dataset
countedCombos <- ddply(golf,qualities,nrow)

# To be nice, let's give that newly added column a meaningful name
names(countedCombos) <- c(qualities,"count")

# Finally, you probably want to order it (decreasing, in this case)
countedCombos <- countedCombos[with(countedCombos, order(-count)),]

Now inspect the result. The final column holds the count for each unique combination of the columns you provided to ddply:

head(countedCombos)
   shotType clubType desiredShape lineDirection shotQuality count
16     putt   putter     straight      straight        good    30
10 approach    wedge     straight      straight        good     6
9  approach    wedge     straight      straight         bad     5
19      tee   driver         draw      straight        good     5
22      tee   driver     straight      straight        good     4
2  approach     iron         draw      straight        good     3
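
The counting step does not strictly need plyr. As a rough, self-contained sketch (not part of the original answer), base R's aggregate() can produce the same kind of table. Here it runs on only the first ten rows of the CSV above, and the share column, which expresses each count as a fraction of all shots, is an illustrative addition:

```r
# First ten shots from the CSV above, inlined so the sketch is self-contained
golf <- read.csv(text = "shot_id,round_id,hole,shotType,clubType,desiredShape,lineDirection,shotQuality,note
48,2,1,tee,driver,straight,straight,good,
49,2,1,approach,iron,straight,right,bad,
50,2,1,approach,wedge,straight,straight,bad,
51,2,1,approach,wedge,straight,straight,bad,
52,2,1,putt,putter,straight,straight,good,
53,2,1,putt,putter,straight,straight,good,
54,2,2,tee,driver,draw,straight,good,
55,2,2,approach,iron,draw,straight,good,
56,2,2,putt,putter,straight,straight,good,
57,2,2,putt,putter,straight,straight,good,", stringsAsFactors = FALSE)

qualities <- c("shotType", "clubType", "desiredShape", "lineDirection", "shotQuality")

# Count rows per unique combination: add a column of ones and sum it per group
counted <- aggregate(count ~ ., data = cbind(golf[qualities], count = 1), FUN = sum)

# Relative frequency of each combination (illustrative extra column)
counted$share <- counted$count / sum(counted$count)

# Most frequent combinations first
counted <- counted[order(-counted$count), ]
print(counted)
```

On the full dataset the same calls should give the counts shown above, with share showing how frequent each combination is relative to the others.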

To see the results for a particular cross-section (say, for example, the driver clubType):

countedCombos[which(countedCombos$clubType=="driver"),]
   shotType clubType desiredShape lineDirection shotQuality count
19      tee   driver         draw      straight        good     5
22      tee   driver     straight      straight        good     4
21      tee   driver     straight          left        good     2
17      tee   driver         draw         right        good     1
18      tee   driver         draw      straight         bad     1
20      tee   driver         fade         right         bad     1
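
Cross-sections can also combine several conditions. The following is a small sketch, assuming a countedCombos table shaped like the driver rows printed above (they are re-entered by hand here so the example runs on its own); the name goodDrives is made up for illustration:

```r
# Driver rows copied from the countedCombos output above
countedCombos <- data.frame(
  shotType      = rep("tee", 6),
  clubType      = rep("driver", 6),
  desiredShape  = c("draw", "straight", "straight", "draw", "draw", "fade"),
  lineDirection = c("straight", "straight", "left", "right", "straight", "right"),
  shotQuality   = c("good", "good", "good", "good", "bad", "bad"),
  count         = c(5, 4, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Combine conditions with &; subset() keeps only rows where both are TRUE
goodDrives <- subset(countedCombos, clubType == "driver" & shotQuality == "good")
print(goodDrives)

# Total number of good driver shots across those combinations
print(sum(goodDrives$count))
```

Summing the count column of a cross-section gives the number of shots it covers, rather than the number of distinct combinations.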

As a bonus, you can dig into these results with ddply again. For example, to look at the share of "good" versus "bad" shotQuality for each shotType and clubType, sum the count column within each group so that every combination is weighted by the number of shots it represents:

shotPerformance <- ddply(countedCombos,c("shotType","clubType"),
    function(x){
        # Sum the per-combination counts so the totals are shots, not distinct combinations
        total <- sum(x$count)
        good <- sum(x$count[x$shotQuality=="good"])
        bad <- sum(x$count[x$shotQuality=="bad"])
        c(total,good,bad,good/(good+bad))
    }
 )
# ddply() returns the grouping columns first, followed by the four computed values
names(shotPerformance)<-c("shotType","clubType","total","good","bad","goodPct")

This gives you a new breakdown computed from the counts for a character field (shotQuality) and shows how you can build custom functions for ddply. Of course, you can still order the result whichever way you want, too.

head(shotPerformance)
  shotType clubType total good bad   goodPct
1 approach   hybrid     1    1   0 1.0000000
2 approach     iron     8    6   2 0.7500000
3 approach    wedge    12    6   6 0.5000000
4 approach     wood     4    1   3 0.2500000
5     chip    wedge     2    2   0 1.0000000
6     putt   putter    33   30   3 0.9090909
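
A good/bad breakdown like this can be cross-checked against the raw shots with a contingency table. The sketch below is an assumption-laden illustration rather than part of the original answer: it uses base R's xtabs() and prop.table() on just the clubType and shotQuality values of the first ten shots (ids 48 to 57) in the CSV, and the names shots, tab, and goodShare are made up for the example:

```r
# clubType and shotQuality for the first ten shots (ids 48-57) in the CSV above
shots <- data.frame(
  clubType    = c("driver", "iron", "wedge", "wedge", "putter",
                  "putter", "driver", "iron", "putter", "putter"),
  shotQuality = c("good", "bad", "bad", "bad", "good",
                  "good", "good", "good", "good", "good"),
  stringsAsFactors = FALSE
)

# Contingency table: rows are club types, columns are shot qualities
tab <- xtabs(~ clubType + shotQuality, data = shots)
print(tab)

# Row-wise proportions: share of good and bad shots within each club type
goodShare <- prop.table(tab, margin = 1)
print(round(goodShare, 2))
```

Because the table is built from individual shots, each cell counts shots directly, which makes it a convenient sanity check on any grouped summary.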

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange