Removing strings from C source code [closed]

https://stackoverflow.com/questions/1294418

18-09-2019

Question

Can anyone point me to a program that strips off strings from C source code? Example

#include <stdio.h>
static const char *place = "world";
char * multiline_str = "one \
two \
three\n";
int main(int argc, char *argv[])
{
        printf("Hello %s\n", place);
        printf("The previous line says \"Hello %s\"\n", place);
        return 0;
}

becomes

#include <stdio.h>
static const char *place = ;
char * multiline_str = ;
int main(int argc, char *argv[])
{
        printf(, place);
        printf(, place);
        return 0;
}

What I am looking for is a program very much like stripcmt only that I want to strip strings and not comments.

The reason I am looking for an already developed program rather than just some handy regular expression is that once you start considering all the corner cases (quotes within strings, multi-line strings, etc.), things typically turn out to be (much) more complex than they first appear. There are also limits on what REs can achieve, and I suspect they are not sufficient for this task. If you do think you have an extremely robust regular expression, feel free to submit it, but please, no naive sed 's/"[^"]*"//g'-like suggestions.

(No need for special handling of (possibly un-ended) strings within comments, those will be removed first)

Support for multi-line strings with embedded newlines is not important (not legal C), but strings spanning multiple lines ending with \ at the end must be supported.

This is almost the same as some other questions, but I found no reference to any tools in them.

Solution

You can download the source code to StripCmt (.tar.gz - 5kB). It's trivially small, and shouldn't be too difficult to adapt to stripping strings instead (it's released under the GPL).
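For what it's worth, here is a minimal sketch of the kind of character-by-character scanner such an adaptation might boil down to. I have not checked how StripCmt itself is structured, so this is an assumption and not a port of its code; it is also written in Python rather than C to keep it short, and the function name strip_strings is made up for illustration. It assumes comments are already gone, keeps character literals (so that '"' does not open a string), and treats a backslash as always consuming the next character, which also covers a backslash at the end of a line.

```python
def strip_strings(src: str) -> str:
    """Remove double-quoted string literals from C source text.

    Assumes comments have already been stripped. Hypothetical helper,
    not taken from StripCmt.
    """
    out = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c == '"' or c == "'":
            # Scan forward to the matching unescaped quote. A backslash
            # always consumes the following character, newline included.
            j = i + 1
            while j < n and src[j] != c:
                j += 2 if src[j] == "\\" else 1
            end = min(j + 1, n)  # tolerate an unterminated literal at EOF
            if c == "'":
                out.append(src[i:end])  # character literals are kept as-is
            i = end
        else:
            out.append(c)
            i += 1
    return "".join(out)
```

Calling strip_strings on the whole file contents (rather than line by line) is what lets it handle strings continued with a trailing backslash.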

You might also want to investigate the official lexical language rules for C strings. I found this very quickly, but it might not be definitive. It defines a string as:

stringcon ::= "{ch}", where ch denotes any printable ASCII character (as specified by isprint()) other than " (double quotes) and the newline character.
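Note that this grammar, read literally, has no notion of escape sequences, so it would end a string at the first \" it meets. A quick way to see that, as a sketch (the name NAIVE_STRINGCON is mine, and the character class is my own transcription of "printable ASCII other than the double quote"):

```python
import re

# Printable ASCII is 0x20..0x7e; drop 0x22 (the double quote). Newline is not
# printable, so it is excluded automatically. No escape handling at all.
NAIVE_STRINGCON = re.compile(r'"[\x20\x21\x23-\x7e]*"')

# The second printf from the question, with its escaped quotes.
line = r'printf("The previous line says \"Hello %s\"\n", place);'

# The match stops at the first escaped quote, so pieces of the literal survive.
naive_result = NAIVE_STRINGCON.sub("", line)
print(naive_result)
```

So that definition is probably a simplified teaching grammar, and the escape rules from the actual C standard are what you want to follow.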

OTHER TIPS

All of the tokens in C (and most other programming languages) are "regular". That is, they can be matched by a regular expression.

A regular expression for C strings:

"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"

The regex isn't too hard to understand. Basically a string literal is a pair of double quotes surrounding a bunch of:

- non-special (non-quote/backslash/newline) characters
- escapes, which start with a backslash and then consist of one of:
  - a simple escape character
  - 1 to 3 octal digits
  - x and 1 or more hex digits

This is based on sections 6.1.4 and 6.1.3.4 of the C89/C90 spec. If anything else crept in in C99, this won't catch that, but that shouldn't be hard to fix.
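One thing that did come in with C99 is universal character names: \u followed by exactly four hex digits, or \U followed by exactly eight. A sketch of how the regex might be extended to cover them, written in verbose form so each alternative can be commented (the names C89_STRING and C99_STRING are just for illustration, and I have not tried to cover anything else that C99 may have added):

```python
import re

# The original C89 regex, for comparison.
C89_STRING = re.compile(
    r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
)

# Same structure, plus the two C99 universal-character-name forms.
C99_STRING = re.compile(r'''
    "(?:
        [^"\\\n]                    # ordinary character
      | \\['"?\\abfnrtv]            # simple escape
      | \\[0-7]{1,3}                # octal escape
      | \\x[0-9a-fA-F]+             # hex escape
      | \\u[0-9a-fA-F]{4}           # universal character name, 4 hex digits
      | \\U[0-9a-fA-F]{8}           # universal character name, 8 hex digits
    )*"
''', re.VERBOSE)

src = r'puts("caf\u00e9 \U0001F600"); x = 1;'
print(C89_STRING.sub("", src))  # the C89 regex does not accept \u, so nothing is stripped
print(C99_STRING.sub("", src))
```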

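Here's a Python script to filter a C source file, removing string literals:

import re, sys
# Matches one C89 string literal, including its escape sequences.
regex = re.compile(r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')
for line in sys.stdin:
  # Works one line at a time, so a literal continued with a trailing \ is not matched.
  print(regex.sub('', line.rstrip('\n')))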

EDIT:

It occurred to me after I posted the above that while it is true that all C tokens are regular, by not tokenizing everything we've got an opportunity for trouble. In particular, if a double quote shows up in what should be another token, we can be led down the garden path. You mentioned that comments have already been stripped, so the only other things we really need to worry about are character literals (though the approach I'm going to use can easily be extended to handle comments as well). Here's a more robust script that handles character literals:

import re, sys
# String and character literals share the same set of escape sequences.
str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""

regex = re.compile('|'.join([str_re, chr_re]))

def repl(m):
  # Keep character literals as they are; replace string literals with nothing.
  m = m.group(0)
  if m.startswith("'"):
    return m
  else:
    return ''

for line in sys.stdin:
  print(regex.sub(repl, line.rstrip('\n')))

Essentially we're finding string and character literal tokens, and then leaving char literals alone but stripping out string literals. The char literal regex is very similar to the string literal one.
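Because the scripts above read one line at a time, the multi-line string from the question (the one continued with a trailing \) would not be matched. A possible variation, as a sketch only: apply the same two regexes to the whole input at once and accept backslash-newline as one more escape. Strictly speaking, C splices continued lines before it tokenizes, so a backslash-newline could appear in places this does not cover (in the middle of another escape, say); treating it as an escape is an approximation. The names ESC, TOKEN and strip_string_literals are made up for this example.

```python
import re

# Escapes after a backslash: simple, octal, hex, or a newline (line continuation).
ESC = r'''\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+|\n)'''
STR = rf'"(?:[^"\\\n]|{ESC})*"'   # string literal
CHR = rf"'(?:[^'\\\n]|{ESC})'"    # character literal
TOKEN = re.compile(f"{STR}|{CHR}")

def strip_string_literals(source: str) -> str:
    """Drop string literals from the whole source text, keep character literals."""
    def repl(m):
        tok = m.group(0)
        return tok if tok.startswith("'") else ""
    return TOKEN.sub(repl, source)
```

To use it as a filter like the scripts above, something along the lines of sys.stdout.write(strip_string_literals(sys.stdin.read())) should do. Note that a stripped multi-line string collapses onto one line, which matches the expected output in the question.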

In ruby:

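#!/usr/bin/ruby
# Read the whole file so that strings continued with a trailing \ are matched too.
s = File.read(ARGV[0])
# A string is a quote, any mix of backslash-escaped and ordinary characters, then a quote.
puts(s.gsub(/"(\\(.|\n)|[^\\"\n])*"/, ""))

This prints to standard output.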

In Python using pyparsing:

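import sys
from pyparsing import dblQuotedString

# The file name is assumed here to be the first command-line argument.
with open(sys.argv[1]) as f:
    source = f.read()
# Replace every double-quoted string with nothing.
dblQuotedString.setParseAction(lambda: "")
print(dblQuotedString.transformString(source))

This also prints to stdout.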

Licensed under: CC-BY-SA with attribution
