Question

I'm using Google Analytics in my mobile app to see how different users use the app. I draw a path for each user based on the pages they move between. Given a list of paths for, say, 100 users, how do I go about clustering the users? Which algorithm should I use? By the way, I'm thinking of using the scikit-learn package for the implementation.

My dataset (in CSV) would look like this:

DeviceID,Pageid,Time_spent_on_Page,Transition
ABC,Page1,3s,1->2
ABC,Page2,2s,2->4
ABC,Page4,1s,4->1

So the path here is 1->2->4->1, where 1, 2 and 4 are Pageids.
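As a minimal sketch of how rows like these turn into a path, assuming the CSV has exactly the four columns shown and that each device's rows are already in chronological order (the function and variable names are only illustrative):

```python
import csv
import io
from collections import defaultdict

# Sample rows mirroring the layout in the question (assumed to be time-ordered).
raw = """DeviceID,Pageid,Time_spent_on_Page,Transition
ABC,Page1,3s,1->2
ABC,Page2,2s,2->4
ABC,Page4,1s,4->1
"""

def build_paths(csv_text):
    """Return {device_id: [page, page, ...]} rebuilt from the Transition column."""
    paths = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text), skipinitialspace=True):
        src, dst = (p.strip() for p in row["Transition"].split("->"))
        path = paths[row["DeviceID"].strip()]
        if not path:
            path.append(src)  # first row for this device: record the starting page
        path.append(dst)
    return dict(paths)

paths = build_paths(raw)
print("->".join(paths["ABC"]))
```

If the real export is not sorted by time, the rows would need to be ordered per device before the path is rebuilt.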


Solution

@Shagun's answer is actually right; I'm just expanding on it.

There are 2 different approaches to your problem:

Graph Approach

  • As stated in @Shagun's answer, you have a weighted directed graph and you want to cluster the paths. I mention it again because it's important to know that your problem is not a graph clustering or community detection problem, where the vertices themselves are clustered!
  • Construct a graph in networkx from the last two columns of the data; you can add time spent as the edge weight and the users who passed through that link as an edge attribute. You'll then have different features for clustering: the set of all vertices an individual ever visited in the graph; the total, mean and standard deviation of time spent; shortest-path distribution parameters; and so on. These can be used for clustering user behavior.
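A rough sketch of the graph approach, assuming the rows have already been parsed into `(device_id, source_page, target_page, seconds)` tuples. The sample records and the `user_features` helper are hypothetical, and only a few of the features mentioned above are computed:

```python
import statistics
import networkx as nx

# Hypothetical parsed records: (device_id, source_page, target_page, seconds_on_source)
records = [
    ("ABC", "1", "2", 3.0),
    ("ABC", "2", "4", 2.0),
    ("ABC", "4", "1", 1.0),
    ("XYZ", "1", "2", 5.0),
    ("XYZ", "2", "1", 4.0),
]

# Directed graph: edge weight accumulates time spent, "users" collects who used the edge.
G = nx.DiGraph()
for user, src, dst, secs in records:
    if G.has_edge(src, dst):
        G[src][dst]["weight"] += secs
        G[src][dst]["users"].add(user)
    else:
        G.add_edge(src, dst, weight=secs, users={user})

def user_features(user):
    """Per-user summary: pages visited plus total, mean and std of time spent."""
    times = [secs for u, _, _, secs in records if u == user]
    pages = {p for u, s, d, _ in records if u == user for p in (s, d)}
    return {
        "n_pages": len(pages),
        "total_time": sum(times),
        "mean_time": statistics.mean(times),
        "std_time": statistics.pstdev(times),
    }

features = {u: user_features(u) for u in ("ABC", "XYZ")}
```

Each entry of `features` is one row describing one user, which is the shape a clustering algorithm expects.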

Standard Data

  • All of the above can be done by reading the data efficiently into a matrix. If you treat each edge for a given user as a single row (i.e. you'll have M×N rows, where M is the number of users, 100 if you stick with your example, and N is the number of edges) and add properties as columns, you'll probably be able to cluster behaviors. If a user passed an edge n times, add a count column with value n in the row corresponding to that user and that edge, and do the same for time spent, etc. Starting and ending edges are also informative. Be careful: node names are categorical variables.
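The row-per-(user, edge) layout described above could be sketched like this, using only the standard library. The sample transitions and the column names `count` and `time_spent` are assumptions for illustration:

```python
from collections import defaultdict

# Hypothetical transitions: (device_id, edge, seconds_spent)
transitions = [
    ("ABC", "1->2", 3.0),
    ("ABC", "2->4", 2.0),
    ("ABC", "1->2", 1.0),
    ("XYZ", "1->2", 5.0),
]

# Aggregate into one cell per (user, edge) holding a count and the total time.
agg = defaultdict(lambda: {"count": 0, "time_spent": 0.0})
for user, edge, secs in transitions:
    cell = agg[(user, edge)]
    cell["count"] += 1
    cell["time_spent"] += secs

users = sorted({u for u, _, _ in transitions})
edges = sorted({e for _, e, _ in transitions})

# M x N rows: every user paired with every edge; edges a user never took get zeros.
rows = [
    {"user": u, "edge": e, **agg.get((u, e), {"count": 0, "time_spent": 0.0})}
    for u in users
    for e in edges
]

for row in rows:
    print(row)
```

Because edge names are categorical, they would still need to be encoded (for example one column per edge) before being passed to a distance-based clustering algorithm.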

As for clustering algorithms, a quick look at scikit-learn will turn up plenty of options.
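As one possible starting point (not the only reasonable choice), here is a small sketch that one-hot style encodes per-user edge counts with `DictVectorizer` and groups users with `KMeans`. The user IDs, counts and the choice of two clusters are all made up for the example:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# Hypothetical per-user edge counts; each edge name becomes its own column.
user_edge_counts = {
    "u1": {"1->2": 3, "2->4": 2},
    "u2": {"1->2": 2, "2->4": 3},
    "u3": {"5->6": 4, "6->5": 4},
    "u4": {"5->6": 5, "6->5": 4},
}

user_ids = list(user_edge_counts)

# DictVectorizer turns the categorical edge names into numeric columns (missing edges = 0).
vec = DictVectorizer(sparse=False)
X = vec.fit_transform([user_edge_counts[u] for u in user_ids])

# Two clusters is an assumption; in practice the number would have to be tuned.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

clusters = dict(zip(user_ids, labels))
print(clusters)
```

The cluster numbers themselves are arbitrary; what matters is which users end up sharing a label. Other estimators in `sklearn.cluster`, such as `AgglomerativeClustering` or `DBSCAN`, accept the same feature matrix.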

Hope it helped. Good Luck :)

Other tips

I have not worked with such a dataset myself, but I think you can model this problem as a graph where the pages form the nodes and the directed edges are based on transitions. Add weights to the nodes based on the time spent on them and then use graph clustering algorithms. If you choose this approach, you can use the networkx library in Python for graph-based analysis.

Edit: We can use the information about the different possible paths and how frequently they are used to classify the users. Take the Google search app as an example. Suppose I want to search for images. One option is to use the image search option, make the query and reach the results page. The other is to make the query first and then switch to the image option after getting the results. In both cases I end up at the same page. I can use this information to classify my users. Now, there can be quite a lot of possible paths, so which ones do I consider? The graph can be leveraged here, along with the information about how you want to classify your users. Modeling it as a graph looks very intuitive to me, as it lends itself to the concept of a path.
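To make the path-frequency idea concrete, here is a small sketch that counts how often each user takes each complete path and picks the most frequent one. The page names and sessions are invented to mirror the image search example:

```python
from collections import Counter

# Hypothetical sessions: each tuple is the ordered list of pages in one visit.
sessions = {
    "u1": [
        ("home", "image_tab", "query", "results"),
        ("home", "image_tab", "query", "results"),
    ],
    "u2": [
        ("home", "query", "image_tab", "results"),
    ],
}

def path_profile(user_sessions):
    """Return the relative frequency of each distinct path for one user."""
    counts = Counter(user_sessions)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}

profiles = {u: path_profile(s) for u, s in sessions.items()}

# The most frequent path per user is a simple label for how they reach the results page.
dominant = {u: max(p, key=p.get) for u, p in profiles.items()}

for user, path in dominant.items():
    print(user, "->".join(path))
```

Both users end on the same page, but their dominant paths differ, which is the signal that can be used to separate them.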

Licensed under: CC-BY-SA with attribution