How to calculate most frequent value combinations
16-10-2019
Question
I have the following CSV data:
shot_id,round_id,hole,shotType,clubType,desiredShape,lineDirection,shotQuality,note
48,2,1,tee,driver,straight,straight,good,
49,2,1,approach,iron,straight,right,bad,
50,2,1,approach,wedge,straight,straight,bad,
51,2,1,approach,wedge,straight,straight,bad,
52,2,1,putt,putter,straight,straight,good,
53,2,1,putt,putter,straight,straight,good,
54,2,2,tee,driver,draw,straight,good,
55,2,2,approach,iron,draw,straight,good,
56,2,2,putt,putter,straight,straight,good,
57,2,2,putt,putter,straight,straight,good,
58,2,3,tee,driver,draw,straight,good,
59,2,3,approach,iron,straight,right,good,
60,2,3,chip,wedge,straight,straight,good,
61,2,3,putt,putter,straight,straight,good,
62,2,4,tee,iron,straight,straight,good,
63,2,4,putt,putter,straight,straight,good,
64,2,4,putt,putter,straight,straight,good,
65,2,5,tee,driver,straight,left,good,
66,2,5,approach,wedge,straight,straight,good,
67,2,5,putt,putter,straight,straight,bad,
68,2,5,putt,putter,straight,straight,good,
69,2,6,tee,driver,draw,straight,bad,
70,2,6,approach,hybrid,draw,straight,good,
71,2,6,putt,putter,straight,straight,good,
72,2,6,putt,putter,straight,straight,good,
73,2,7,tee,driver,straight,straight,good,
74,2,7,approach,wood,fade,straight,good,
75,2,7,approach,wedge,straight,straight,bad,long
76,2,7,putt,putter,straight,straight,good,
77,2,7,putt,putter,straight,straight,good,
78,2,8,tee,iron,straight,right,bad,
79,2,8,approach,wedge,straight,straight,good,
80,2,8,putt,putter,straight,straight,bad,
81,2,9,tee,driver,straight,straight,good,
82,2,9,approach,iron,straight,straight,good,
83,2,9,approach,wedge,straight,straight,bad,
84,2,9,putt,putter,straight,straight,good,
85,2,9,putt,putter,straight,straight,good,
86,2,10,tee,driver,straight,left,good,
87,2,10,approach,iron,straight,left,good,
88,2,10,chip,wedge,straight,straight,good,
89,2,10,putt,putter,straight,straight,good,
90,2,10,putt,putter,straight,straight,good,
91,2,11,tee,driver,draw,straight,good,
92,2,11,approach,iron,draw,straight,good,
93,2,11,putt,putter,straight,straight,good,
94,2,11,putt,putter,straight,straight,good,
95,2,12,tee,iron,draw,straight,good,
96,2,12,putt,putter,straight,straight,good,
97,2,12,putt,putter,straight,straight,good,
98,2,13,tee,driver,draw,straight,good,
99,2,13,approach,wood,straight,straight,bad,topped
100,2,13,putt,putter,straight,straight,good,
101,2,13,putt,putter,straight,straight,good,
102,2,14,tee,driver,draw,straight,good,
103,2,14,approach,wood,straight,straight,bad,
104,2,14,approach,iron,draw,straight,good,
105,2,14,approach,wedge,straight,straight,bad,
106,2,14,putt,putter,straight,straight,bad,
107,2,14,putt,putter,straight,straight,good,
108,2,15,tee,iron,draw,right,bad,
109,2,15,approach,wedge,straight,straight,good,
110,2,15,putt,putter,straight,straight,good,
111,2,15,putt,putter,straight,straight,good,
112,2,16,tee,driver,draw,right,good,
113,2,16,approach,iron,straight,left,bad,
114,2,16,approach,wedge,straight,left,bad,
115,2,16,putt,putter,straight,straight,good,
116,2,17,tee,driver,straight,straight,good,
117,2,17,approach,wood,straight,right,bad,
118,2,17,approach,wedge,straight,straight,good,
119,2,17,putt,putter,straight,straight,good,
120,2,17,putt,putter,straight,straight,good,
121,2,18,tee,driver,fade,right,bad,
122,2,18,approach,wedge,straight,straight,good,
123,2,18,approach,wedge,straight,straight,good,
124,2,18,putt,putter,straight,straight,good,
125,2,18,putt,putter,straight,straight,good,
I would like to identify which combinations of values occur most frequently.
- club types: driver, wood, iron, wedge, putter
- shot types: tee, approach, chip, putt
- line directions: left, center, right
- shot qualities: good, bad, neutral
Ideally, I'd be able to identify a sweet spot (no pun intended) combination, such as "driver" + "tee" + "straight" + "good".
I intend only to measure this for a static dataset, not for any future values or prediction. So, my thought is that this is probably a clustering / k-means problem. Is that correct?
If so, how would I begin a k-means analysis with these types of values in R?
If it isn't a k-means problem, then what is it?
Solution
If I understand your question, you want to know which combination is most frequent, or how frequent one combination is relative to the others. That is a counting (frequency table) problem rather than a clustering one, so k-means is not needed. The method below works on a static dataset and tallies every unique combination of all five columns.
The plyr package has a nifty utility for grouping unique combinations of columns in a data.frame. We specify the names of the columns we want to group by, and then a function to apply to each of those combinations. In this case, we specify the columns associated with your golf shot qualities and use the function nrow, which counts the number of rows in every subset of the large data.frame for which those columns are identical.
# Read the CSV from the question into a data.frame named golf
# ("golf.csv" is a placeholder file name; use your own path)
golf <- read.csv("golf.csv")
# You need this library for the ddply() function
library(plyr)
# These are the columns that determine a unique situation (change this if you need)
qualities <- c("shotType", "clubType", "desiredShape", "lineDirection", "shotQuality")
# The call to ddply() actually gives us what we want, which is the number
# of times that combination is present in the dataset
countedCombos <- ddply(golf, qualities, nrow)
# To be nice, let's give that newly added column a meaningful name
names(countedCombos) <- c(qualities, "count")
# Finally, you probably want to order it (decreasing, in this case)
countedCombos <- countedCombos[order(-countedCombos$count), ]
Now check out your product. The final column has the count associated with each unique combination of the columns you provided to ddply:
head(countedCombos)
   shotType clubType desiredShape lineDirection shotQuality count
16     putt   putter     straight      straight        good    30
10 approach    wedge     straight      straight        good     6
9  approach    wedge     straight      straight         bad     5
19      tee   driver         draw      straight        good     5
22      tee   driver     straight      straight        good     4
2  approach     iron         draw      straight        good     3
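If plyr is not available, the same tally can be sketched in base R with aggregate(). This is only an illustration under stated assumptions: it inlines the first ten shots from the question's CSV (holes 1 and 2) so that it runs on its own, and the helper column count is added purely for the summation.

```r
# First ten shots from the question's CSV, inlined so the sketch is self-contained
golf <- read.csv(text = "shot_id,round_id,hole,shotType,clubType,desiredShape,lineDirection,shotQuality,note
48,2,1,tee,driver,straight,straight,good,
49,2,1,approach,iron,straight,right,bad,
50,2,1,approach,wedge,straight,straight,bad,
51,2,1,approach,wedge,straight,straight,bad,
52,2,1,putt,putter,straight,straight,good,
53,2,1,putt,putter,straight,straight,good,
54,2,2,tee,driver,draw,straight,good,
55,2,2,approach,iron,draw,straight,good,
56,2,2,putt,putter,straight,straight,good,
57,2,2,putt,putter,straight,straight,good,", stringsAsFactors = FALSE)

qualities <- c("shotType", "clubType", "desiredShape", "lineDirection", "shotQuality")

# Give every shot a weight of one, then sum the weights within each combination
golf$count <- 1L
countedCombos <- aggregate(count ~ ., data = golf[c(qualities, "count")], FUN = sum)

# Most frequent combinations first
countedCombos <- countedCombos[order(-countedCombos$count), ]
print(countedCombos)
```

Running it on the full dataset only requires replacing the inlined text with your own read.csv() call.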
To see the results for a particular cross-section (for example, the driver clubType):
countedCombos[which(countedCombos$clubType=="driver"),]
   shotType clubType desiredShape lineDirection shotQuality count
19      tee   driver         draw      straight        good     5
22      tee   driver     straight      straight        good     4
21      tee   driver     straight          left        good     2
17      tee   driver         draw         right        good     1
18      tee   driver         draw      straight         bad     1
20      tee   driver         fade         right         bad     1
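A related question is which combination is the most frequent for each club. Here is a minimal base R sketch, assuming a data.frame shaped like countedCombos. To stay self-contained it hard-codes the six rows from the head(countedCombos) output shown earlier, so the result covers only those rows; topPerClub is a name introduced here for illustration.

```r
# The six most frequent combinations, copied from the head(countedCombos) output
countedCombos <- data.frame(
  shotType      = c("putt", "approach", "approach", "tee", "tee", "approach"),
  clubType      = c("putter", "wedge", "wedge", "driver", "driver", "iron"),
  desiredShape  = c("straight", "straight", "straight", "draw", "straight", "draw"),
  lineDirection = rep("straight", 6),
  shotQuality   = c("good", "good", "bad", "good", "good", "good"),
  count         = c(30, 6, 5, 5, 4, 3),
  stringsAsFactors = FALSE
)

# Sort by count (descending), then keep the first row seen for each clubType
sorted <- countedCombos[order(-countedCombos$count), ]
topPerClub <- sorted[!duplicated(sorted$clubType), ]
print(topPerClub)
```

Because duplicated() keeps the first occurrence, sorting before filtering is what makes the surviving row the most frequent one for each club.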
As a bonus, you can dig into these results with ddply again. For example, if you wanted to look at the proportion of "good" versus "bad" shotQuality for each shotType and clubType:
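shotPerformance <- ddply(countedCombos, c("shotType", "clubType"),
  function(x) {
    # Each row of x is one distinct combination from countedCombos
    total <- length(x$shotQuality)
    good  <- length(which(x$shotQuality == "good"))
    bad   <- length(which(x$shotQuality == "bad"))
    c(total, good, bad, good / (good + bad))
  }
)
# The grouping columns come first, followed by the four computed values
names(shotPerformance) <- c("shotType", "clubType", "total", "good", "bad", "goodPct")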
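This gives you a new breakdown of some math performed on a character field (shotQuality) and shows you how you can build custom functions for ddply. Note that each row of countedCombos is one distinct combination, so total, good, and bad here count combinations rather than individual shots. Of course, you can still order these whichever way you want, too.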
head(shotPerformance)
  shotType clubType total good bad   goodPct
1 approach   hybrid     1    1   0 1.0000000
2 approach     iron     6    4   2 0.6666667
3 approach    wedge     3    1   2 0.3333333
4 approach     wood     3    1   2 0.3333333
5     chip    wedge     1    1   0 1.0000000
6     putt   putter     2    1   1 0.5000000
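For per-shot figures rather than per-combination figures, the same summary can be computed from the raw rows instead of from countedCombos. What follows is a minimal base R sketch, not the answer's original method. It assumes the first ten shots from the question's CSV as sample input, and the helper columns total and good are introduced only for this example.

```r
# First ten shots from the question's CSV, inlined so the sketch is self-contained
golf <- read.csv(text = "shot_id,round_id,hole,shotType,clubType,desiredShape,lineDirection,shotQuality,note
48,2,1,tee,driver,straight,straight,good,
49,2,1,approach,iron,straight,right,bad,
50,2,1,approach,wedge,straight,straight,bad,
51,2,1,approach,wedge,straight,straight,bad,
52,2,1,putt,putter,straight,straight,good,
53,2,1,putt,putter,straight,straight,good,
54,2,2,tee,driver,draw,straight,good,
55,2,2,approach,iron,draw,straight,good,
56,2,2,putt,putter,straight,straight,good,
57,2,2,putt,putter,straight,straight,good,", stringsAsFactors = FALSE)

# One unit per shot, and one unit per good shot
golf$total <- 1L
golf$good  <- as.integer(golf$shotQuality == "good")

# Sum both helper columns within each shotType / clubType pair
shotPerformance <- aggregate(cbind(total, good) ~ shotType + clubType,
                             data = golf, FUN = sum)
shotPerformance$bad     <- shotPerformance$total - shotPerformance$good
shotPerformance$goodPct <- shotPerformance$good / shotPerformance$total
print(shotPerformance)
```

Here bad is derived as total minus good, which holds for this sample because it contains only "good" and "bad" values; a dataset that also uses "neutral" would need a separate count.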