Question

I am starting to play around with data mining / machine learning and I am stuck on a problem that is probably easy.

So I have a report that lists the url and the number of visits a person did. So a combination of ip and url result in an amount of visits.

Now I want to run the k-means clustering algorithm on this so I thought I could approach it like this:

This is my data:

url      ip    visits

abc.be   123   5
abc.be/a 123   2
abc.be/b 123   2
abc.be/b 321   4

And I would turn it into a feature vector/matrix like so:

abc.be  abc.be/a   abc.be/b   visits
   1       0          0          5
   0       1          0          2
   0       0          1          2
   0       0          1          4

But I am stuck on how to transform my data set to a feature matrix. Any help would be appreciated.

Solution

I don't understand what you mean by:

"So I have a report that lists the url and the number of visits a person did. So a combination of ip and url result in an amount of visits."

Assuming that you equate an IP with a user, and you wish to cluster users by their URL visitation frequencies, your matrix, M, would have

  • One row per IP (user)
  • One column for each URL that you are tracking (your features)
  • Each entry in M is the number of visits to a given URL by a particular IP (0 if that IP never visited the URL)

Given these assumptions, and your report, M would be:

    abc.be  abc.be/a  abc.be/b
123   5        2         2
321   0        0         4
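
As a minimal sketch of that transformation, assuming the report is available as a list of (url, ip, visits) tuples, the matrix can be built in plain Python. The names `report` and `build_matrix` are hypothetical and only for illustration; the values are the four rows from the question.

```python
from collections import defaultdict

# Rows of the report from the question: (url, ip, visits).
report = [
    ("abc.be", "123", 5),
    ("abc.be/a", "123", 2),
    ("abc.be/b", "123", 2),
    ("abc.be/b", "321", 4),
]


def build_matrix(rows):
    """Return (ips, urls, M) where M[i][j] is the visits of ips[i] to urls[j]."""
    counts = defaultdict(int)
    for url, ip, visits in rows:
        # Sum visits in case the same (ip, url) pair appears more than once.
        counts[(ip, url)] += visits

    ips = sorted({ip for _, ip, _ in rows})      # one row per IP (user)
    urls = sorted({url for url, _, _ in rows})   # one column per URL (feature)

    # Pairs that never occur in the report get 0 visits.
    M = [[counts.get((ip, url), 0) for url in urls] for ip in ips]
    return ips, urls, M


ips, urls, M = build_matrix(report)
for ip, row in zip(ips, M):
    print(ip, row)
```

Each row of `M` is then one user's feature vector and can be passed to a k-means implementation. If the report is already in a pandas DataFrame, `pivot_table(index="ip", columns="url", values="visits", aggfunc="sum", fill_value=0)` should give the same table.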
Licensed under: CC-BY-SA with attribution