Question

I am trying to write a parser that needs to identify string literals. If a string starts and ends with ' (i.e. a single quote), what regular expression will identify a string literal?

I'm using JavaCC to write the parser. Can anybody help me with the actual regular expression in token format? I have tried enough on my own.

E.g.

< INTEGER_VALUE : "0" | (["1"-"9"] (["0"-"9"])*) >

This is the token format that identifies an integer literal. I want the same token format for a string literal, where the string starts and ends with a single quote. I also tried using metacharacters (given in the http://www.vogella.com/articles/JavaRegularExpressions/article.html tutorial), but without success.

Solution

I'm assuming that you are using JavaCC. The answer depends on the syntax of strings in your language. Let's say any character other than an apostrophe is allowed in a string. That is, a string consists of two apostrophes with any number (0 or more) of non-apostrophe characters in between.

<STRING: "'" (~["'"])* "'">

Now many languages don't allow newlines or returns in strings. So here let's ban those too:

<STRING: "'" (~["'","\n","\r"])* "'">
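To sanity-check these two tokens outside JavaCC, here is a rough java.util.regex equivalent. It is only a sketch: the class and method names are made up for illustration, and the patterns are my own translation of the JavaCC expressions, not something JavaCC generates.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from JavaCC.
class SimpleStringLiteralDemo {
    // Rough equivalent of <STRING: "'" (~["'"])* "'">
    static final Pattern ANY_BUT_APOSTROPHE = Pattern.compile("'[^']*'");

    // Rough equivalent of <STRING: "'" (~["'","\n","\r"])* "'">
    static final Pattern SINGLE_LINE = Pattern.compile("'[^'\\n\\r]*'");

    // True if the whole input is a string literal under the first rule.
    static boolean matchesLoose(String s) {
        return ANY_BUT_APOSTROPHE.matcher(s).matches();
    }

    // True if the whole input is a string literal under the second rule.
    static boolean matchesSingleLine(String s) {
        return SINGLE_LINE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesLoose("'hello'"));       // true
        System.out.println(matchesLoose("'a\nb'"));        // true: newline allowed
        System.out.println(matchesSingleLine("'a\nb'"));   // false: newline banned
    }
}
```

Note that neither version can contain an apostrophe: 'it's' is rejected by both.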

Now the problem is: what if someone wants to put apostrophes, newlines or returns in a string? Some languages (e.g. C) use the backslash as an escape character, so let's say:

  • \' means apostrophe
  • \n means newline
  • \r means return
  • \\ means backslash
  • \x where x is any other character is considered an error

Here is the expression:

<STRING: "'"  ("\\" ("\\" | "n" | "r" | "'") | ~["\\","\n","\r","'"] )* "'">

That is, a string is two apostrophes with a sequence of 0 or more groups in between, where each group is either one of the two-character sequences \\, \n, \r, \', or a single character that is not a backslash, a newline, a return or an apostrophe.
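The token only recognizes the literal; turning it into the actual string value is a separate step. Here is a minimal sketch in plain Java, assuming the four escapes listed above. The regex is my translation of the JavaCC expression, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of JavaCC or its generated code.
class EscapedStringLiteralDemo {
    // Each group is a backslash followed by \, n, r or ',
    // or one character that is not \, newline, return or '.
    static final Pattern STRING =
        Pattern.compile("'(?:\\\\[\\\\nr']|[^\\\\\\n\\r'])*'");

    static boolean isStringLiteral(String s) {
        return STRING.matcher(s).matches();
    }

    // Converts a literal that already passed isStringLiteral into its value.
    static String unescape(String literal) {
        StringBuilder out = new StringBuilder();
        // Skip the opening and closing apostrophes.
        for (int i = 1; i < literal.length() - 1; i++) {
            char c = literal.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            char next = literal.charAt(++i); // character after the backslash
            switch (next) {
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                default:  out.append(next); break; // backslash or apostrophe
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String literal = "'it\\'s'";                  // the six characters 'it\'s'
        System.out.println(isStringLiteral(literal)); // true
        System.out.println(unescape(literal));        // it's
    }
}
```

In a JavaCC grammar you would typically call something like unescape on the token's image in the parser action that consumes STRING.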

Another approach is to use lexical states.

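// The opening apostrophe switches to the INSTRING state.
<DEFAULT> MORE: { "'" : INSTRING }
// Inside the string, keep accumulating escapes and ordinary characters.
<INSTRING> MORE: { "\\\\"
                 | "\\n"
                 | "\\r"
                 | "\\'"
                 | ~["\\","\n","\r","'"]
                 }
// The closing apostrophe completes the STRING token and returns to DEFAULT.
<INSTRING> TOKEN: { <STRING: "'"> : DEFAULT }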
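If the lexical-state version is hard to picture, the following hand-written scanner in plain Java does roughly the same thing with an explicit state variable. It is only an illustration of the idea, not what JavaCC generates, and all names in it are hypothetical.

```java
// Hypothetical hand-rolled equivalent of the DEFAULT/INSTRING states.
class LexicalStateSketch {
    enum State { DEFAULT, INSTRING }

    // Returns the image of the string literal starting at `start`,
    // or null if no well-formed literal starts there.
    static String scanString(String input, int start) {
        State state = State.DEFAULT;
        StringBuilder image = new StringBuilder(); // plays the role of MORE
        int i = start;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (state == State.DEFAULT) {
                if (c != '\'') return null;        // not the start of a string
                image.append(c);
                state = State.INSTRING;
                i++;
            } else if (c == '\'') {
                image.append(c);                   // closing quote: token done
                return image.toString();
            } else if (c == '\\') {
                if (i + 1 >= input.length()) return null;
                char next = input.charAt(i + 1);
                boolean known = next == '\\' || next == 'n'
                             || next == 'r' || next == '\'';
                if (!known) return null;           // \x for other x is an error
                image.append(c).append(next);
                i += 2;
            } else if (c == '\n' || c == '\r') {
                return null;                       // raw line breaks are banned
            } else {
                image.append(c);
                i++;
            }
        }
        return null;                               // unterminated string
    }

    public static void main(String[] args) {
        System.out.println(scanString("'ab' + x", 0)); // 'ab'
        System.out.println(scanString("'abc", 0));     // null
    }
}
```

One practical advantage of the state-based form is that each branch is a natural place to hang an action, for example to build the unescaped value while scanning.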

Other suggestions

That is not quite enough. Consider the following:

// 'here is comment'
'is't correct string?'

Here there are single quotes, but the text between them is certainly not a string literal. If you make sure that comments are stripped out first, and that any quote appearing inside a string is escaped as \' (as in most programming languages), then I believe everything will work with the approach you described.
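In a JavaCC grammar, comments are normally handled by a SKIP (or SPECIAL_TOKEN) production, so the string token never sees them. To illustrate the effect in plain Java, here is a small sketch that matches line comments first and collects only the string literals. The // comment syntax is an assumption taken from the example above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: skip // comments, collect string literals.
class CommentAwareScanSketch {
    // First alternative: a line comment, matched and then ignored.
    // Second alternative (group 1): a string literal with backslash escapes.
    static final Pattern TOKEN = Pattern.compile(
        "//[^\\n\\r]*|('(?:\\\\[\\\\nr']|[^\\\\\\n\\r'])*')");

    static List<String> stringLiterals(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            if (m.group(1) != null) {   // null means a comment was matched
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "// 'here is comment'\n'is\\'t correct string?'";
        // Only the escaped literal on the second line is reported.
        System.out.println(stringLiterals(source));
    }
}
```

Without the escape, the scan would stop at the second apostrophe and report 'is' as the literal, which is exactly the problem described above.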

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow