Question

I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out-of-the-ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from USA on Chrome. So if she suddenly logs into my web service from the Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.

I am using CouchDB (specifically with Cloudant) already, and I see people often saying here and there online that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However I am at a complete loss for where to start. I have not found much in terms of documentation regarding relatively simple tracking of outlying events for a web service, let alone storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.

Any insight would be much appreciated. Thanks!

Solution

You're correct in assuming that this is a problem well suited to machine learning, and scikit-learn is my preferred library for these types of problems. Don't worry about the specifics (CouchDB, Cloudant) for now; let's get your problem into a state where it can be solved.

If we can assume that variations in log-in details (time, location, user-agent etc.) for a given user are low, then any large variation from this would trigger your alert. This is where the 'outlier' detection that @Robert McGibbon suggested comes into play.

For example, squeeze each log-in detail into one dimension, and then create a log-in detail vector for each user (there is significant room for improving this digest of log-in information):

  • log-in time (modulo 24 hrs)
  • location (maybe an array of integer locations, each integer representing a different country)
  • user-agent (a similar array of integer user-agents)

and so on. Every time a user logs in, create this detail array and store it. Once you have accumulated a large set of training data you can try running some ML routines.
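
A minimal sketch of the digest described above, assuming each log-in arrives as a dict with the hypothetical keys `timestamp`, `country` and `user_agent`. The helper names (`code_for`, `encode_login`) and the integer codebooks are made up for illustration and are not part of any library.

```python
from datetime import datetime

# Hypothetical codebooks: each new category gets the next free integer
# the first time it is seen.
country_codes = {}
agent_codes = {}

def code_for(codebook, value):
    """Return a stable integer code for value, assigning one if needed."""
    if value not in codebook:
        codebook[value] = len(codebook)
    return codebook[value]

def encode_login(login):
    """Turn one log-in record into [hour_of_day, country_code, agent_code]."""
    ts = login["timestamp"]
    hour = ts.hour + ts.minute / 60.0  # log-in time, modulo 24 hours
    return [
        hour,
        code_for(country_codes, login["country"]),
        code_for(agent_codes, login["user_agent"]),
    ]

login = {
    "timestamp": datetime(2013, 5, 1, 11, 18),
    "country": "USA",
    "user_agent": "Chrome",
}
vector = encode_login(login)
```

Note that integer codes impose an arbitrary ordering on categories (country 1 is not "closer" to country 2 than to country 9), which is one of the places where this digest could be improved.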

So, we have a user and a set of log-in data corresponding to successful log-ins (a training set). We can now train a Support Vector Machine to recognise this user's log-in pattern:

from sklearn import svm

# training data [[11.0, 2, 2], [11.3, 2, 2] ... etc]
train_data = my_training_data()

# create and fit the model
clf = svm.OneClassSVM()
clf.fit(train_data)

and then, every time a new log-in event occurs, create a single log-in detail array and pass it to the SVM:

# predict() expects a 2-D array and returns -1 for outliers, +1 for inliers
if clf.predict([log_in_data])[0] == -1:
    fire_alert_event()
else:
    # log-in is not dissimilar to previous attempts
    print('log in ok')

If the SVM finds the new data point to be significantly different from its training set then it will fire the alarm.

My two pence: once you've got hold of a good training set, there are many more ML techniques that may be better suited to your task (they may be faster, more accurate, etc.), but creating your training sets and then training the routines will be the most significant challenge.

There are many exciting things to try! If you know you have bad log-in attempts, you can add these to the training sets by using a more complex SVM which you train with good and bad log-ins. Instead of using an array of disparate 'location' values, you could find the Euclidean distance between different log-ins and use that! This sounds like great fun, good luck!
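
The idea of training on both good and bad log-ins could be sketched with `svm.SVC`, a two-class SVM in scikit-learn. The data below is invented for illustration, using the same `[hour, location, user-agent]` layout as above; the labels `1` (good) and `-1` (bad) are my own convention, chosen to mirror what `OneClassSVM` returns.

```python
from sklearn import svm

# Hypothetical labelled log-ins: [hour, location code, user-agent code]
good = [[11.0, 2, 2], [11.3, 2, 2], [10.8, 2, 2], [11.5, 2, 2]]
bad = [[3.0, 7, 5], [2.5, 7, 5], [3.4, 7, 5], [2.9, 7, 5]]

X = good + bad
y = [1] * len(good) + [-1] * len(bad)  # 1 = good log-in, -1 = known bad

# Two-class SVM with an RBF kernel, trained on both kinds of log-in
clf = svm.SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)

# predict() takes a 2-D array, one row per log-in to score
looks_good = clf.predict([[11.1, 2, 2]])[0]
looks_bad = clf.predict([[3.1, 7, 5]])[0]
```

In practice known-bad log-ins are usually far rarer than good ones, so the class balance would need some thought (for example via the `class_weight` parameter).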

OTHER TIPS

I also thought the approach using svm.OneClassSVM from sklearn was going to produce a good outlier detector. However, I put together some representative data based upon the example in the question and it simply could not detect an outlier. I swept the nu and gamma parameters from 0.01 to 0.99 and found no satisfactory SVM predictor.

My theory is that because the samples have categorical data (cities, states, countries, web browsers) the SVM algorithm is not the right approach. (I did, BTW, first convert the data into binary feature vectors with the DictVectorizer.fit_transform method).
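
For reference, the conversion to binary feature vectors looks roughly like this. The records are invented from the example in the question; only `DictVectorizer` itself comes from scikit-learn.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical log-in records with purely categorical fields
logins = [
    {"country": "USA", "browser": "Chrome"},
    {"country": "USA", "browser": "Chrome"},
    {"country": "Ukraine", "browser": "Firefox"},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(logins)

# vec.vocabulary_ maps each "field=value" name to its column index in X
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
```

Each distinct `field=value` pair becomes its own 0/1 column, so every row has exactly one 1 per field.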

I believe @sullivanmatt is on the right track when he suggests using a Bayesian classifier. Bayesian classifiers are used for supervised learning but, at least on the surface, this problem was cast as an unsupervised learning problem, i.e. we don't know a priori which observations are normal and which are outliers.

Because the outliers you want to detect are very rare in the stream of web site visits, I believe you could train the Bayesian classifier by labeling every observation in your training set as a positive/normal observation. The classifier should predict that true normal observations have higher probability simply because the majority of the observations really are normal. A true outlier should stand out as receiving a low predicted probability.
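
One way to sketch this idea is shown below. It is not a scikit-learn classifier: a classifier trained on a single label has nothing to discriminate between, so this hand-rolled stand-in scores each log-in by its naive-Bayes-style likelihood instead (the product of smoothed per-feature frequencies). The `LoginProfile` class, the smoothing scheme and the example data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

class LoginProfile:
    """Per-user frequency model treating features as independent."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # Laplace smoothing constant
        self.counts = defaultdict(Counter)  # feature name -> value counts
        self.total = 0                      # number of log-ins observed

    def observe(self, login):
        """Record one log-in, assumed to be normal."""
        for feature, value in login.items():
            self.counts[feature][value] += 1
        self.total += 1

    def probability(self, login):
        """Product of smoothed P(feature=value) over all features."""
        p = 1.0
        for feature, value in login.items():
            seen = self.counts[feature]
            # One extra slot reserves probability mass for unseen values
            vocab = len(seen) + 1
            p *= (seen[value] + self.alpha) / (self.total + self.alpha * vocab)
        return p

profile = LoginProfile()
for _ in range(50):
    profile.observe({"country": "USA", "browser": "Chrome"})

p_usual = profile.probability({"country": "USA", "browser": "Chrome"})
p_odd = profile.probability({"country": "Ukraine", "browser": "Firefox"})
```

Here `p_usual` comes out close to 1 and `p_odd` is several orders of magnitude smaller, so an alert could fire whenever the probability drops below some threshold tuned on real data.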

If you're trying to investigate anomalies in user behaviour over time, I'd recommend looking at time-series anomaly detectors. With this approach you'll be able to statistically and automatically pick out new, potentially suspicious, emerging patterns and abnormal events.

http://www.autonlab.org/tutorials/biosurv.html and http://web.engr.oregonstate.edu/~wong/workshops/icml2006/slides/agarwal.ppt explain some techniques based on machine learning. In this case you can use scikit-learn, a very powerful Python library that contains many ML algorithms.
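
As a very simple illustration of the time-series idea (a rolling z-score, which is my own stand-in and not one of the methods from the linked material), the sketch below flags any day whose log-in count is far from the mean of the preceding week. The function name, window size, threshold and counts are all hypothetical.

```python
from statistics import mean, stdev

def zscore_anomalies(counts, window=7, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # A flat history gives sigma == 0, so no z-score can be computed
        if sigma > 0 and abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily log-in counts for one user, with a spike on day 8
daily_logins = [20, 22, 19, 21, 20, 23, 21, 22, 95, 20]
anomalies = zscore_anomalies(daily_logins)
```

With this data only the spike at index 8 is flagged. A real detector would also have to deal with seasonality (weekdays versus weekends) and with spikes contaminating the history window.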

Licensed under: CC-BY-SA with attribution