Question

Does anyone have a tool or a recommended practice for finding a piece of code that is similar to some other code?

Often I write a function or a code fragment and remember that I have already written something like it before. I would like to reuse the previous implementation, but a plain text search does not reveal anything, because I did not use exactly the same variable names.

Similar code fragments lead to unnecessary code duplication, but with a large code base it is impossible to keep all of the code in one's head. Are there any tools that analyze the code and mark fragments or functions that are "similar" in terms of functionality?

Consider the following examples:

  float xDistance = 0, zDistance = 0;
  if (camPos.X()<xgMin) xDistance = xgMin-camPos.X();
  if (camPos.X()>xgMax) xDistance = camPos.X()-xgMax;
  if (camPos.Z()<zgMin) zDistance = zgMin-camPos.Z();
  if (camPos.Z()>zgMax) zDistance = camPos.Z()-zgMax;
  float dist = sqrt(xDistance*xDistance+zDistance*zDistance);

and

  float distX = 0, distZ = 0;
  if (cPos.X()<xgMin) distX = xgMin-cPos.X();
  if (cPos.X()>xgMax) distX = cPos.X()-xgMax;
  if (cPos.Z()<zgMin) distZ = zgMin-cPos.Z();
  if (cPos.Z()>zgMax) distZ = cPos.Z()-zgMax;
  float dist = sqrt(distX*distX +distZ*distZ);

It seems to me this has already been asked and answered several times:

https://stackoverflow.com/questions/204177/what-tool-to-find-code-duplicates-in-c-projects

How to detect code duplication during development?

I suggest closing as duplicate here.


Actually, I think it is a more general search problem, like: how do I find out whether a question has already been asked on StackOverflow?


Solution

You can use Simian. It is a tool that detects duplicate code in Java, C#, C++, XML, and many other formats (even plain text files). It also integrates nicely with a tool like CruiseControl.

OTHER TIPS

Our CloneDR finds duplicate code, both exact copies and near-misses, across large source systems, parameterized by language syntax. It supports Java, C#, COBOL, C++, PHP, Python and many other languages.

It accepts a number of parameters to define "What is a clone?", including:
a) similarity threshold, controlling how similar two blocks of code must be to be declared clones (typically 95% is good)
b) minimum clone size in lines (3 tends to be a good choice)
c) number of parameters, i.e. distinct changes to the text (5 tends to be a good choice)
With these settings, it tends to find 10-15% redundant code in virtually everything it processes.

Line-oriented clone detection tools such as Simian can't find cloned code that has been reformatted, but CloneDR will. They may tell that two blocks of code match, but they usually don't show you exactly how they match or where the differences are; CloneDR will. They don't suggest how to abstract the cloned code; CloneDR will.

By virtue of having weaker matching algorithms, they tend to produce more false positives; when you get 5000 clones reported across a million lines, the number of false positives matters a lot.
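To make "abstracting the cloned code" concrete, here is a hand-written sketch of how the two fragments from the question could be folded into one shared helper. It is not the output of any of the tools mentioned here; the function names distanceToInterval and distanceToRectXZ are made up for illustration, and the code assumes each minimum is not greater than its maximum.

```cpp
#include <cassert>
#include <cmath>

// Distance from v to the closed interval [lo, hi]; 0 when v lies inside it.
// Assumption: lo <= hi (the original fragments do not state this).
inline float distanceToInterval(float v, float lo, float hi) {
    assert(lo <= hi);
    if (v < lo) return lo - v;
    if (v > hi) return v - hi;
    return 0.0f;
}

// Distance in the XZ plane from the point (x, z) to the axis-aligned
// rectangle [xMin, xMax] x [zMin, zMax]; 0 when the point is inside.
inline float distanceToRectXZ(float x, float z,
                              float xMin, float xMax,
                              float zMin, float zMax) {
    const float dx = distanceToInterval(x, xMin, xMax);
    const float dz = distanceToInterval(z, zMin, zMax);
    return std::sqrt(dx * dx + dz * dz);
}
```

Both call sites would then shrink to something like float dist = distanceToRectXZ(camPos.X(), camPos.Z(), xgMin, xgMax, zgMin, zgMax); with cPos in place of camPos in the second one.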

Based on your example, I'd expect it to find those two fragments (you don't have to point it at either one) and note that they are similar if you abstract away the variable names.

Here is the best collection on code clone detection I've seen:
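As a rough illustration of what "abstracting away the variable names" can mean, the sketch below renames each distinct identifier to a positional placeholder and drops whitespace, so that fragments differing only in naming or layout compare equal. This is a toy, not how CloneDR or Simian is implemented; the function name normalizeIdentifiers and the short keyword list are assumptions made for the example, and comments, string literals and preprocessor lines are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Rewrite every distinct identifier to id0, id1, ... in order of first
// appearance and drop all whitespace. Keywords and numbers are kept as-is.
std::string normalizeIdentifiers(const std::string& code) {
    // Deliberately tiny keyword list, just enough for the example.
    static const std::set<std::string> keywords = {
        "float", "double", "int", "if", "else", "return"};
    std::map<std::string, std::string> renamed;
    std::string out;
    std::size_t i = 0;
    auto isWordChar = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) || ch == '_';
    };
    while (i < code.size()) {
        const unsigned char c = static_cast<unsigned char>(code[i]);
        if (std::isspace(c)) {            // layout is ignored
            ++i;
        } else if (std::isalpha(c) || c == '_') {
            const std::size_t start = i;
            while (i < code.size() && isWordChar(code[i])) ++i;
            const std::string word = code.substr(start, i - start);
            if (keywords.count(word)) {
                out += word;
            } else {
                auto it = renamed.find(word);
                if (it == renamed.end()) {
                    it = renamed.emplace(
                        word, "id" + std::to_string(renamed.size())).first;
                }
                out += it->second;
            }
            out += ' ';                   // keep adjacent words apart
        } else if (std::isdigit(c)) {     // copy numeric literals verbatim
            while (i < code.size() && (isWordChar(code[i]) || code[i] == '.')) {
                out += code[i++];
            }
            out += ' ';
        } else {                          // operators and punctuation
            out += code[i++];
        }
    }
    return out;
}
```

Feeding the two fragments from the question through this function should give the same string for both, since they use their identifiers in the same order; changing an operator in one of them would make the strings differ.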

https://web.archive.org/web/20120502162147/http://students.cis.uab.edu/tairasr/clones/literature

There are many programs, but none of them seems to be clearly the best or the most popular. Decide what matters most to you and pick the one that suits your needs.

Licensed under: CC-BY-SA with attribution