Question

Traditional software metrics deal with the quality of software. I'm looking for metrics that can be used to identify developers by their code, in the same way that plagiarism software and stylometry can be used to identify authors by their writing style. I can imagine that certain existing metrics can be used here as well, such as comment ratio. I can also imagine metrics that would be irrelevant from a quality point of view, such as the (over)use of certain methods or design patterns, the average length of variable names, etc.

I'm interested either in a pointer to a collection of such metrics or studies, or individual metrics. They may be language-agnostic or related to a language or programming paradigm.

I want to use it to understand and analyze different coding styles, not to detect plagiarism.


Solution

I see there are already a couple of studies that looked into this. They might help.

  1. Kothari, J., Shevertalov, M., Stehle, E., Mancoridis, S., "A probabilistic approach to source code authorship identification", In Proceedings of the International Conference on Information Technology, pp.243-248, IEEE, 2007.

    Available online here

    Quoting from the abstract:

    We begin by computing a set of metrics to build profiles for a population of known authors using code samples that are verified to be authentic. We then compute metrics on unidentified source code to determine the closest matching profile. [...] In our case study we are able to determine authorship with greater than 70% accuracy in choosing the single nearest match and greater than 90% accuracy in choosing the top three ordered nearest matches.

  2. Shevertalov, M., Kothari, J., Stehle, E., Mancoridis, S., "On the use of discretized source code metrics for author identification", In Proceedings of the 1st International Symposium on Search Based Software Engineering, pp.69-78, IEEE, 2009.

    Available online here. This is a follow-up to the previous study.

  3. Lange, R., Mancoridis, S., "Using code metric histograms and genetic algorithms to perform author identification for software forensics", In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp.2082-2089, ACM, 2007.

    Available online here

    This is also related to the first reference (common author), and discusses the metrics in more detail. Again quoting from the abstract:

    Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics.
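To make the idea in these abstracts concrete (build a profile per known author, then find the closest profile for unidentified code), here is a minimal sketch. It is not the method from any of the papers: the single metric (line length), the bin sizes, the L1 distance, and all function names are assumptions chosen for illustration. The papers use their own metric sets and selection techniques.

```python
from collections import Counter


def line_length_histogram(source, bin_width=20, bins=6):
    """Normalized histogram of non-blank line lengths (an illustrative metric)."""
    counts = Counter(
        min(len(line) // bin_width, bins - 1)  # last bin collects all longer lines
        for line in source.splitlines()
        if line.strip()
    )
    total = sum(counts.values())
    return [counts.get(i, 0) / total if total else 0.0 for i in range(bins)]


def histogram_distance(h1, h2):
    """L1 distance between two histograms (assumed here, not taken from the papers)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def rank_authors(profiles, unknown_hist):
    """Return author names ordered from closest to farthest profile."""
    return sorted(
        profiles,
        key=lambda name: histogram_distance(profiles[name], unknown_hist),
    )
```

Here `profiles` would map each known author to a histogram computed from their verified code. Taking the first element of `rank_authors(...)` gives a single nearest match, and taking the first three gives a "top three" list in the spirit of the first abstract. A real study would combine many metrics, not one.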

You can also use Google Scholar for other references, and for finding other papers based on the ones above (using the "cited by" option).

Other tips

If you're looking for potential metrics, you might try reviewing some coding standards. Since these dictate a particular style, it follows that the things they talk about (spacing, placement of braces, identifier lengths, mandatory comments, etc.) are things that might be used to identify developers from their code.
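As a rough illustration of how such coding-standard items could be turned into numbers, the sketch below computes three style metrics for C-like source text. The function name, the keyword list, and the heuristics are assumptions for the example. It only recognizes `//` line comments and does not handle block comments or string literals, so treat it as a starting point, not a reliable measurement tool.

```python
import re

# Small, deliberately incomplete keyword list (an assumption for this sketch).
KEYWORDS = {
    "if", "else", "for", "while", "return", "int", "void",
    "class", "public", "private", "static", "new",
}


def style_metrics(source):
    """Compute a few crude style metrics for C-like source code."""
    lines = [line for line in source.splitlines() if line.strip()]

    # Fraction of non-blank lines that are pure // comments.
    comment_lines = sum(1 for line in lines if line.strip().startswith("//"))

    # Of the lines containing an opening brace, how many hold only the brace?
    brace_lines = [line for line in lines if "{" in line]
    own_line = sum(1 for line in brace_lines if line.strip() == "{")

    # Drop // comments, then collect identifier-like tokens that are not keywords.
    code = re.sub(r"//.*", "", source)
    idents = [t for t in re.findall(r"[A-Za-z_]\w*", code) if t not in KEYWORDS]

    return {
        "comment_ratio": comment_lines / len(lines) if lines else 0.0,
        "brace_on_own_line": own_line / len(brace_lines) if brace_lines else 0.0,
        "mean_identifier_length": (
            sum(len(t) for t in idents) / len(idents) if idents else 0.0
        ),
    }
```

Running `style_metrics` on samples from several developers and comparing the resulting dictionaries gives a quick feel for which of these features actually vary between people in a given code base.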

Also, if you're interested in .NET code, you might find NDepend to be a useful tool. It enables you to run queries against a code base, and supports 82 metrics.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow