Question

I am trying to create a lexical analyzer in C. The program reads another program as input and converts it into tokens; the source code is below:

#include <stdio.h>
#include <conio.h>
#include <string.h>

int main()  {
    FILE *fp;
    char read[50];
    char seprators [] = "\n";
    char *p;
    fp=fopen("C:\\Sum.c", "r");

    clrscr();

    while ( fgets(read, sizeof(read)-1, fp) !=NULL )    {
        //Get the first token
        p=strtok(read, seprators);

        //Get and print other tokens
        while (p!=NULL) {
            printf("%s\n", p);
            p=strtok(NULL, seprators);
        }
    }

    return 0;
}

And the contents of Sum.c are:

#include <stdio.h>

int main()  {
    int x;
    int y;
    int sum;

    printf("Enter two numbers\n");
    scanf("%d%d", &x, &y);

    sum=x+y;

    printf("The sum of these numbers is %d", sum);

    return 0;
}

I am not getting the correct output and only see a blank screen instead of the output.

Can anybody please tell me where I am going wrong? Thank you so much in advance.


Solution

You've asked a few questions since this one, so I guess you've moved on. Still, there are a few things worth noting about your problem and your start at a solution that can help others tackling a similar one. You'll also find that people can often be slow to answer things that are obviously homework. We often wait until homework deadlines have passed. :-)

First, I noted you used a few features specific to the Borland C compiler which are non-standard and make the solution neither portable nor generic. You could solve the problem without them just fine, and that is usually a good choice. For example, you used #include <conio.h> just to clear the screen with clrscr(), which is probably unnecessary and not relevant to the lexer problem.

I tested the program, and as written it works! It transcribes all the lines of the file Sum.c to stdout. If you only saw a blank screen, it is because the file could not be found: either you did not write it to your C:\ directory or it had a different name. As already mentioned by @WhozCraig, you need to check that the file was found and opened properly.
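A minimal sketch of that check might look like this (the file name is simply the one from your question; adjust it to wherever the file really lives):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *fp = fopen("C:\\Sum.c", "r");
    if (fp == NULL) {
        /* perror explains why the open failed, e.g. "No such file or directory" */
        perror("fopen");
        return EXIT_FAILURE;
    }

    /* ... tokenize the file here ... */

    fclose(fp);
    return 0;
}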

I see you are using the C function strtok to divide the input up into tokens. There are some nice examples of using it in the documentation that go beyond your simple case. As mentioned by @Grijesh Chauhan, there are more separators to consider than \n, or end-of-line. What about spaces and tabs, for example?
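A small sketch of that idea (the input line and the separator string are just assumptions for illustration):

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[] = "int x ;\tint y ;";   /* example input line */
    const char *separators = " \t\n";   /* split on spaces, tabs and newlines */

    char *p = strtok(line, separators);
    while (p != NULL) {
        printf("%s\n", p);
        p = strtok(NULL, separators);
    }
    return 0;
}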

However, in programs, things are not always separated by spaces and lines. Take this example:

result=(number*scale)+total;

If we only used white space as a separator, then we would not identify the individual words and would only pick up the whole expression, which is obviously not tokenization. We could add these characters to the separator list:

char seprators [] = "\n=(*)+;";

Then your code would pick out those words too. There is still a flaw in that strategy, because in programming languages those symbols are also tokens that need to be identified. The problem with programming language tokenization is that there are no clear separators between tokens.
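To see the flaw concretely, here is a small sketch using the expression from above; strtok throws the separator characters away, so the operators and punctuation never show up as tokens at all:

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[] = "result=(number*scale)+total;";
    const char *separators = " \t\n=(*)+;";

    /* strtok consumes the separator characters, so only the identifiers
       are printed: result, number, scale, total.
       The '=', '(', '*', ')', '+' and ';' tokens are lost. */
    for (char *p = strtok(line, separators); p != NULL; p = strtok(NULL, separators)) {
        printf("%s\n", p);
    }
    return 0;
}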

There is a lot of theory behind this, but basically we have to write down the patterns that form the basis of the tokens we want to recognise, and not look at the gaps between them, because as has been shown, there aren't any! These patterns are normally written as regular expressions. Computer science theory tells us that we can use finite state automata to match these regular expressions. Writing a lexer involves a particular style of coding, which looks roughly like this:

while ( NOT <<EOF>> ) {
    switch ( next_symbol() ) {

        case state_symbol[1]:
            ....
            break;

        case state_symbol[2]:
            ....
            break;

        default:
            error(diagnostic);
    }
}
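As a concrete, deliberately tiny illustration of that style, here is a sketch of a character-driven lexer that recognises identifiers, integer numbers and single-character symbols. The token categories and the output format are just assumptions made for the example, not part of your assignment:

#include <stdio.h>
#include <ctype.h>

/* A minimal sketch: read characters one at a time and decide, from the
   first character of each token, which pattern we are matching. */
int main(void) {
    FILE *fp = fopen("C:\\Sum.c", "r");   /* same file name as in the question */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    int c = fgetc(fp);
    while (c != EOF) {
        if (isspace(c)) {                        /* skip white space between tokens */
            c = fgetc(fp);
        } else if (isalpha(c) || c == '_') {     /* identifier or keyword */
            printf("IDENT: ");
            while (isalnum(c) || c == '_') {
                putchar(c);
                c = fgetc(fp);
            }
            putchar('\n');
        } else if (isdigit(c)) {                 /* integer literal */
            printf("NUMBER: ");
            while (isdigit(c)) {
                putchar(c);
                c = fgetc(fp);
            }
            putchar('\n');
        } else {                                 /* anything else: single-character token */
            printf("SYMBOL: %c\n", c);
            c = fgetc(fp);
        }
    }

    fclose(fp);
    return 0;
}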

So, now, perhaps the value of the academic assignment becomes clearer.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow