Finding duplicate source code

https://stackoverflow.com/questions/4724880

12-10-2019

Question

I'm analyzing some legacy code: about 80,000 lines of old PL/SQL. At first look there is quite a lot of duplication in the source that needs to be removed. Instead of running diffs manually and looking at each file, there must be some tool or command-line kung fu out there to detect duplicate lines of source code.

My goal is to make an educated guess about the minimal size of a rewrite of the source and about how much actual knowledge is captured in this program. I wrote a basic static code analyzer to count the control statements (IF, ELSE, FOR, etc.) and functions in each file, but duplicated code still needs to be removed from my statistics.

Solution

Have you looked at Simian - Similarity Analyser? (Just checked and it's no longer free, but it is available for a period of 15 days for evaluation purposes.)

Simian (Similarity Analyser) identifies duplication in Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy source code and even plain text files. In fact, Simian can be used on any human-readable files such as ini files, deployment descriptors, you name it.

I have used it in practice and it does work well.

OTHER TIPS

Sonar has duplication detection and claims to support PL/SQL, though I've never used it for that.

You would need to beg/borrow/steal/write a PL/SQL parser and compare the resulting abstract syntax trees. With the size of the code base you have, that might be worthwhile. There would be other uses for the parser once you're done.
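If a full parser is more than you want to take on, a much cruder text-based check can give a first estimate. The sketch below is not AST comparison: it normalizes each line (drops `--` comments, collapses whitespace, uppercases) and hashes every sliding window of consecutive lines, reporting windows that appear more than once. The function names and the window size of 5 are my own choices for illustration, and the normalization is naive (it would also touch `--` or lowercase text inside string literals, and it ignores `/* */` comments).

```python
import hashlib
import re
from collections import defaultdict

WINDOW = 5  # assumed minimum block size, counted in non-empty normalized lines


def normalize(line):
    """Drop `--` comments, collapse whitespace, and uppercase the line."""
    line = re.sub(r"--.*$", "", line)
    return re.sub(r"\s+", " ", line).strip().upper()


def find_duplicate_blocks(sources, window=WINDOW):
    """Find windows of `window` consecutive lines that occur more than once.

    `sources` maps a file name to that file's text. The result maps a hash
    to a list of (file name, 1-based starting line) pairs.
    """
    seen = defaultdict(list)
    for name, text in sources.items():
        # Keep the original line numbers, skip lines that normalize to nothing.
        numbered = [(i, normalize(l)) for i, l in enumerate(text.splitlines(), 1)]
        numbered = [(i, l) for i, l in numbered if l]
        for start in range(len(numbered) - window + 1):
            chunk = "\n".join(l for _, l in numbered[start:start + window])
            digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
            seen[digest].append((name, numbered[start][0]))
    # Only windows seen at least twice count as duplication.
    return {h: locs for h, locs in seen.items() if len(locs) > 1}
```

To use it, read each source file into a dict keyed by file name and pass that to `find_duplicate_blocks`. Overlapping windows inside one long duplicated block are reported separately, so treat the number of hits as a rough upper bound rather than a count of distinct clones.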

How about this:

http://sourceforge.net/projects/sddforeclipse/

It is open source and is said to be used by commercial software. It is an Eclipse plugin, by the way.

Licensed under: CC-BY-SA with attribution
