What is a suitable lexer generator that I can use to strip identifiers from many language source files?

https://stackoverflow.com/questions/2120026

22-09-2019
|

Question

I'm working on a group project for my University which is going to be used for plagiarism detection in Computer Science.

My group is primarily going off the hashing/fingerprinting techniques described in this journal article: Winnowing: Local Algorithms for Document Fingerprinting. This is very similar to how the MOSS plagiarism detection system works.

We are basically taking k-gram hashes of fellow students source code and looking them up in a database for relevant matches (along with lots of optimization in how we determine which hashes to select as a document's fingerprints).

The first aspect of our project is the "Front-End" portion of it, which will hold some semantic knowledge about each file format our detection system can process. This will allow us to strip some details from the document that we no longer want for the purpose of plagiarism detection. Basically we want to be able to rename all variables in various programming languages to a constant string or letter.

What is a lightweight solution (lexer generator or something similar) that we can use to aid in renaming all variables in different languages source code files to constants?

Our project is being written in Java.

Ideally I'd simply like to be able to define a grammar for each language and then our front end will be able rename all identifiers in that languages source file to some constant. We would then do this for each file format we wanted to support (java, c++, python, etc).

Solution

For a lexer/parser generator, you should look at ANTLR. TXL, which is a textual transformation interpreter, is also worth a look. Ready-made grammars should be available for both.

OTHER TIPS

Apart from ANTLR, which was already suggested, you can also take a look at JFlex.

Be aware that there are some languages where it's not really possible to do what you're trying to do. Specifically, those where it's not possible to determine what is or isn't a variable based on the grammar. Tcl is an example of such, but there are a number of dynamic languages that have the same issue (Lisp?).

acacia-lex lexer has method replace.

In Lexer token define, what looks like identifiers, for example, "ident1" -> "[a..d]", "ident2" -> "[e..h]".

In replace method input map provide the info, which identifier type to replace with which constant (object), for example, "ident1" -> "ident1", "ident2" -> "ident2".

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow