Formal grammar of XML

https://stackoverflow.com/questions/12814394

06-07-2021

Question

I'm trying to build a small parser for XML files in C. I know I could find a finished solution, but I only need some basic functionality for an embedded project. I'm trying to create a grammar that describes XML without attributes, just tags, but it does not seem to work and I have not been able to figure out why.

Here is the grammar:

   XML : FIRST_TAG NIZ
   NIZ : VAL NIZ | eps
   VAL : START VAL END
     | STR
     | eps

Here is the part of the C code that implements this grammar:

void check() {
    getSymbol();
    if( sym == FIRST_LINE ) {
        niz();
    }
    else {
        printf("FIRST_LINE EXPECTED");
        exit(1);
    }
}

void niz() {
    getSymbol();
    if( sym == ERROR )
        return;
    if( sym == START ) {
        back = 1;
        val();
        niz();
    }
    printf(" EPS OR START EXPECTED\n");
}

void val() {
    getSymbol();
    if( sym == ERROR )
        return;
    if( sym == START ) {
        back = 0;

        val();
        getSymbol();
        if( sym != END ) {
            printf("END EXPECTED");
            exit(1);
        }
        return;
    }
    if( sym == EMPTY_TAG || sym == STR)
        return;
    printf("START, STR, EMPTY_TAG OR EPS EXPECTED\n");
    exit(1);
}

void getSymbol() {
    int pom;

    if(back == 1) {
        back = 0;
        return;
    }
    sym = getNextToken(cmd + offset, &pom);
    offset += pom + 1;
}

EDIT: Here is an example of an XML file that does not satisfy this grammar:

<?xml version="1.0"?> 
<VATCHANGES> 
<DATE>15/08/2012</DATE>
<TIME>1452</TIME>
<EFDSERIAL>01KE000001</EFDSERIAL> 
<CHANGENUM>1</CHANGENUM> 
<VATRATE>A</VATRATE> 
<FROMVALUE>16.00</FROMVALUE> 
<TOVALUE>18.00</TOVALUE> 
<VATRATE>B</VATRATE> 
<FROMVALUE>2.00</FROMVALUE> 
<TOVALUE>0.00</TOVALUE> 
<VATRATE>C</VATRATE> 
<FROMVALUE>5.00</FROMVALUE> 
<TOVALUE>0.00</TOVALUE> 
<DATE>25/05/2010</DATE> 
<CHANGENUM>2</CHANGENUM> 
<VATRATE>C</VATRATE> 
<FROMVALUE>0.00</FROMVALUE> 
<TOVALUE>4.00</TOVALUE> 
</VATCHANGES>

It gives END EXPECTED at the output.

Answer

First, your grammar needs some work. Assuming the preamble is handled correctly, you have a basic error in the definition of NIZ.

NIZ : VAL NIZ | eps
VAL : START VAL END
    | STR
    | eps

So we enter NIZ and look for VAL first. The problem is that eps appears as an alternative of both VAL and NIZ. If VAL produces nothing (i.e. eps), it consumes no tokens in the process (it cannot, since eps is the production), and NIZ reduces to:

NIZ: eps NIZ | eps

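which isn't good: NIZ can then expand again and again without consuming a single token, so the parser has no basis for choosing between the two alternatives.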
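One possible way out, sketched here in C, is to keep eps only in the list rule and make VAL always consume at least one token. The rewritten rules in the comments (VAL : START NIZ END | STR) are an assumption made for this illustration, not a grammar taken from the question, and the names Sym, cur, and parse are hypothetical. The parser works on a pre-lexed array of symbols to stay short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical symbol set, modelled on the START/END/STR tokens in the question. */
typedef enum { SYM_START, SYM_END, SYM_STR, SYM_EOF } Sym;

static const Sym *cur;   /* lookahead: the next unread symbol */

static int niz(void);

/* VAL : START NIZ END | STR
 * No eps alternative here, so a successful VAL always consumes input. */
static int val(void) {
    if (*cur == SYM_STR) { cur++; return 1; }
    if (*cur != SYM_START) return 0;
    cur++;                                  /* consume START */
    if (!niz() || *cur != SYM_END) return 0;
    cur++;                                  /* consume END */
    return 1;
}

/* NIZ : VAL NIZ | eps
 * The only place where eps is allowed. The loop continues only while
 * the lookahead can actually begin a VAL. */
static int niz(void) {
    while (*cur == SYM_START || *cur == SYM_STR) {
        if (!val()) return 0;
    }
    return 1;   /* eps: the lookahead cannot start a VAL */
}

/* Returns 1 if the SYM_EOF-terminated array is a valid NIZ, else 0. */
static int parse(const Sym *symbols) {
    assert(symbols != NULL);
    cur = symbols;
    return niz() && *cur == SYM_EOF;
}
```

Because every loop iteration consumes at least one symbol, the parser always makes progress, and the decision between "another VAL" and eps depends only on the lookahead.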

Consider something more along these lines. I wrote this quickly, with no real thought given to anything beyond a purely basic construction.

XML:         START_LINE ELEMENT
ELEMENT:     OPENTAG BODY CLOSETAG
OPENTAG:     lt id(n) gt
CLOSETAG:    lt fs id(n) gt
BODY:        ELEMENT | VALUE
VALUE:       str | eps

This is super basic. Terminals include:

lt:    '<'
gt:    '>'
fs:    '/'
str:   any alphanumeric string excluding chars lt or gt.
id(n): any alphanumeric string excluding chars lt, gt, or fs.
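As a rough illustration of a lexer for the terminals above, here is a minimal sketch in C. The names Token, Lexer, lexer_init, and next_token are hypothetical. The in_tag flag is also an assumption of this sketch: it lets the lexer treat '/' as fs only between '<' and '>', so that a value such as 15/08/2012 still comes out as a single str.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_LT, TOK_GT, TOK_FS, TOK_ID, TOK_STR, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;   /* points into the source buffer; not NUL-terminated */
    size_t len;
} Token;

typedef struct {
    const char *pos;     /* next unread character */
    int in_tag;          /* nonzero between '<' and '>' */
} Lexer;

static void lexer_init(Lexer *lx, const char *src) {
    assert(src != NULL);
    lx->pos = src;
    lx->in_tag = 0;
}

static Token next_token(Lexer *lx) {
    Token t;
    const char *stop;

    /* Whitespace between tokens is skipped (a simplification of this sketch). */
    while (isspace((unsigned char)*lx->pos)) lx->pos++;

    t.start = lx->pos;
    t.len = 1;
    if (*lx->pos == '\0') { t.kind = TOK_EOF; t.len = 0; return t; }
    if (*lx->pos == '<')  { t.kind = TOK_LT; lx->in_tag = 1; lx->pos++; return t; }
    if (*lx->pos == '>')  { t.kind = TOK_GT; lx->in_tag = 0; lx->pos++; return t; }
    if (*lx->pos == '/' && lx->in_tag) { t.kind = TOK_FS; lx->pos++; return t; }

    /* Text run: an id inside a tag, a str outside of one.
     * A str may keep trailing whitespace; it ends only at '<' or '>'. */
    stop = lx->in_tag ? "<>/ \t\r\n" : "<>";
    t.len = strcspn(lx->pos, stop);
    t.kind = lx->in_tag ? TOK_ID : TOK_STR;
    lx->pos += t.len;
    return t;
}
```

Feeding it `<DATE>15/08/2012</DATE>` should yield lt, id, gt, str, lt, fs, id, gt, and then the end-of-input token. Tokens point into the source buffer instead of copying it, which may suit a memory-constrained embedded target.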

I can almost feel the wrath of the XML purists raining down on me right now, but the point I'm trying to get across is that, when a grammar is well-defined, the recursive descent parser (RDP) will literally write itself. Obviously the lexer (i.e. the token engine) needs to handle the terminals accordingly. Note: the id(n) is an id-stack to ensure you properly close the innermost tag, and is an attribute of your parser in accordance with how it manages tag ids. It's not traditional, but it makes things MUCH easier.
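To make the "writes itself" claim concrete, here is a minimal recursive descent sketch in C with one function per nonterminal and an id stack that checks each closing tag against the innermost open one. It reads characters directly instead of using a separate lexer, purely to stay short. All names (Parser, open_tag, close_tag, body, element, parse_xml) and the limits MAX_DEPTH and MAX_ID are hypothetical. It also deviates from the grammar above in one labeled way: BODY accepts a sequence of child elements, not just one, which is an assumption added so that documents like the one in the question can parse.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 16   /* hypothetical nesting limit */
#define MAX_ID    32   /* hypothetical tag-name limit, including the NUL */

typedef struct {
    const char *p;                  /* cursor into the source text */
    char ids[MAX_DEPTH][MAX_ID];    /* id stack of currently open tags */
    int depth;
} Parser;

static void skip_ws(Parser *ps) { while (isspace((unsigned char)*ps->p)) ps->p++; }

/* id: a run of alphanumerics. Returns its length (0 if there is none). */
static size_t scan_id(Parser *ps, const char **start) {
    *start = ps->p;
    while (isalnum((unsigned char)*ps->p)) ps->p++;
    return (size_t)(ps->p - *start);
}

/* OPENTAG: lt id gt. Pushes the id on the stack. */
static int open_tag(Parser *ps) {
    const char *id;
    size_t n;
    if (*ps->p != '<') return 0;
    ps->p++;
    n = scan_id(ps, &id);
    if (n == 0 || n >= MAX_ID || ps->depth >= MAX_DEPTH || *ps->p != '>') return 0;
    ps->p++;
    memcpy(ps->ids[ps->depth], id, n);
    ps->ids[ps->depth][n] = '\0';
    ps->depth++;
    return 1;
}

/* CLOSETAG: lt fs id gt. The id must match the top of the stack. */
static int close_tag(Parser *ps) {
    const char *id, *top;
    size_t n;
    if (ps->p[0] != '<' || ps->p[1] != '/') return 0;
    ps->p += 2;
    n = scan_id(ps, &id);
    if (ps->depth == 0 || *ps->p != '>') return 0;
    top = ps->ids[ps->depth - 1];
    if (strlen(top) != n || strncmp(top, id, n) != 0) return 0;
    ps->p++;
    ps->depth--;
    return 1;
}

static int element(Parser *ps);

/* BODY: one or more ELEMENTs (assumed extension), or VALUE (str | eps). */
static int body(Parser *ps) {
    skip_ws(ps);
    if (ps->p[0] == '<' && ps->p[1] != '/') {
        while (ps->p[0] == '<' && ps->p[1] != '/') {
            if (!element(ps)) return 0;
            skip_ws(ps);
        }
        return 1;
    }
    while (*ps->p != '\0' && *ps->p != '<' && *ps->p != '>') ps->p++;
    return 1;
}

/* ELEMENT: OPENTAG BODY CLOSETAG */
static int element(Parser *ps) {
    return open_tag(ps) && body(ps) && close_tag(ps);
}

/* XML: START_LINE ELEMENT. Returns 1 if src matches, 0 otherwise. */
static int parse_xml(const char *src) {
    Parser ps = { src, {{0}}, 0 };
    const char *end;
    assert(src != NULL);
    skip_ws(&ps);
    if (strncmp(ps.p, "<?xml", 5) != 0) return 0;   /* START_LINE is required */
    end = strstr(ps.p, "?>");
    if (end == NULL) return 0;
    ps.p = end + 2;
    skip_ws(&ps);
    if (!element(&ps)) return 0;
    skip_ws(&ps);
    return *ps.p == '\0' && ps.depth == 0;
}
```

A call such as parse_xml(text) only reports whether the input matches; a real parser would also hand the ids and values to the application. The fixed-size id stack avoids dynamic allocation, at the cost of hard limits on nesting depth and name length.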

This can/should clearly be expanded to include stand-alone element declarations and short-cut element closure. For example, this grammar allows for elements of this form:

<ElementName>...</ElementName>

but not of this form:

<ElementName/>

Nor does it account for short-cut termination (an SGML-style shorthand that standard XML does not permit) such as:

<ElementName>...</>

Accounting for such additions will obviously complicate the grammar considerably, but will also make the parser substantially more robust. Like I said, the sample above is basic with a capital B. If you're really going to embark on this, these are things you want to consider when designing your grammar, and consequently your RDP.

Anyway, just consider how a few reworks of your grammar can make this substantially easier on you.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow