Question

I got the following interesting task:

You are given a list of 1 million 16-digit numbers (say, credit card numbers), of which 990,000 are purely random numbers generated by a computer system and 10,000 were created manually by fraudsters. The numbers are labeled as genuine or fraud. Build an algorithm to predict the non-random numbers.

My approach so far is a bit brute-force: looking at the non-random numbers to find patterns (such as repeated digits like 22222, or sequences like 01234).

I wonder if there's a ready-made algorithm or tool for this kind of task. I imagine it is quite common in the fraud analytics community.

Thanks.


Solution

First off, if you know they're credit card numbers, use the Luhn algorithm, a quick checksum for validating credit card numbers.
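As a minimal sketch of that check (the function name `luhn_valid` is my own, not from any library): starting from the rightmost digit, every second digit is doubled, 9 is subtracted from any result above 9, and the total must be divisible by 10. Keep in mind that a purely random 16-digit number still passes about one time in ten, so on its own this only helps if the genuine numbers are known to carry valid check digits.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    # Reject anything that is not a non-empty string of ASCII digits.
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111111111111111"))  # a well-known test card number
    print(luhn_valid("4111111111111112"))  # last digit changed
```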

However, if they are simply 16-digit integers, there are a couple of approaches you can use. It is hard to tell whether an individual number came from a random source (the number 1111111111111111 is just as likely as any other specific number to come out of a random number generator). Your repeated digits and patterns are very reminiscent of the concept of Kolmogorov complexity (see links below). You could try looking for patterns with this brute-force method, but I suspect it would be quite inaccurate, because humans might actually tend to avoid putting repeated digits and sequences in these numbers!
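Kolmogorov complexity itself is not computable, but compressed length is a common stand-in. Here is a rough sketch using Python's standard `zlib` module; the helper name `compressed_length` is my own. On strings of only 16 characters the fixed compression overhead dominates, so treat this as a very crude signal that separates only the most obvious patterns.

```python
import zlib


def compressed_length(number: str) -> int:
    """Crude complexity proxy: size in bytes of the zlib-compressed digits."""
    return len(zlib.compress(number.encode("ascii"), 9))


if __name__ == "__main__":
    # A highly repetitive string should compress to fewer bytes
    # than a string with no obvious structure.
    for n in ("1111111111111111", "4929381750263847"):
        print(n, compressed_length(n))
```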

Instead, I suggest focusing on the way people generate numbers. You can treat human input as a very poor random number generator. If you don't have another dataset, I recommend simply making a list of human-entered "random" numbers yourself. Then you can use machine learning to train a classifier that distinguishes purely random numbers (those without the 'human-like' attributes your algorithm has learned to recognize) from human-generated ones. As features for the statistical classifier, Kolmogorov complexity could be one, the frequency of digits another (truly random numbers should have roughly uniform digit frequencies; see also Benford's law on Wikipedia for how digit distributions are used to spot fraud), and the number of repeating digits a third. Humans might try to avoid repeating digits in order to look non-random, so let your classifier do the work!
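To make the feature idea concrete, here is a small sketch that turns one digit string into a few numeric features. The function names and the particular features chosen are my own illustration, not a standard recipe, and it assumes non-empty strings of digits. The resulting dictionaries, together with the genuine/fraud labels, could then be fed to whatever classifier you prefer.

```python
from collections import Counter


def longest_run(number: str) -> int:
    """Length of the longest run of identical consecutive digits."""
    best = run = 1
    for prev, cur in zip(number, number[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def sequential_steps(number: str) -> int:
    """Count adjacent digit pairs that differ by exactly 1 (e.g. 34 or 43)."""
    return sum(1 for a, b in zip(number, number[1:])
               if abs(int(a) - int(b)) == 1)


def features(number: str) -> dict:
    """Hypothetical feature set for one non-empty digit string."""
    counts = Counter(number)
    return {
        "distinct_digits": len(counts),
        # Share of the most frequent digit; near 0.1-0.25 for random input.
        "max_digit_share": max(counts.values()) / len(number),
        "longest_run": longest_run(number),
        "sequential_steps": sequential_steps(number),
    }


if __name__ == "__main__":
    for n in ("1111111111111111", "0123456789012345", "4929381750263847"):
        print(n, features(n))
```

Because the classifier learns from the labels, it does not matter in advance whether fraudsters overuse or avoid repeats; either tendency shows up as a shift in these features.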

From my personal experience, tough problems like this are a textbook case for machine learning algorithms and statistical classifiers.

Hope this helps!

Links:

Kolmogorov Complexity
Complexity calculator

Licensed under: CC-BY-SA with attribution