How is the Python grammar used internally?

Question 1

A grammar is used to describe all possible strings in a language. It is also useful in specifying how a parser should parse the language.

In this grammar it seems like they are using their own version of EBNF, where a non-terminal is any lowercase word and a terminal is all uppercase or surrounded by quotes. For example, NEWLINE is a terminal, arith_expr is a non-terminal and 'if' is also a terminal. Any non-terminal can be replaced by anything to the right of the colon of it's respective production rule. For example, if you look at the first rule:

single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE

We can replace single_input with one of either a NEWLINE, a simple_stmt or a compound_stmt followed by a NEWLINE. Suppose we replaced it with "compound_stmt NEWLINE", then we would look for the production rule for compound_stmt:

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated

and choose which of these we want to use, and substitute it for "compound_stmt" (Keeping NEWLINE in it's place)

Suppose we wanted to generate the valid python program:

if 5 < 2 + 3 or not 1 == 5:
    raise

We could use the following derivation:

single_input
compound_stmt NEWLINE
if_stmt NEWLINE
'if' test ':' suite NEWLINE
'if' or_test ':' NEWLINE INDENT stmt stmt DEDENT NEWLINE
'if' and_test 'or' and_test ':' NEWLINE INDENT simple_stmt DEDENT NEWLINE
'if' not_test 'or' not_test ':' NEWLINE INDENT small_stmt DEDENT NEWLINE
'if' comparison 'or' 'not' not_test ':' NEWLINE INDENT flow_stmt DEDENT NEWLINE
'if' expr comp_op expr 'or' 'not' comparison ':' NEWLINE INDENT raise_stmt DEDENT NEWLINE
'if' arith_expr '<' arith_expr 'or' 'not' arith_expr comp_op arith_expr ':' NEWLINE INDENT 'raise' DEDENT NEWLINE
'if' term '<' term '+' term 'or' 'not' arith_expr == arith_expr ':' NEWLINE INDENT 'raise' DEDENT NEWLINE
'if' NUMBER '<' NUMBER '+' NUMBER 'or' 'not' NUMBER == NUMBER ':' NEWLINE INDENT 'raise' DEDENT NEWLINE

A couple of notes here, firstly, we must start with one of the non-terminals which is listed as a starting non-terminal. In that page, they list them as single_input, file_input, or eval_input. Secondly, a derivation is finished once all the symbols are terminal (hence the name). Thirdly, it is more common to do one substitution per line, for the sake of brevity I did all possible substitutions at once and started skipping steps near the end.

Given a string in the language, how do we find it's derivation? This is the job of a parser. A parser reverse-engineers a production sequence to first check that it is indeed a valid string, and furthermore how it can be derived from the grammar. It's worth noting that many grammars can describe a single language. However, for a given string, it's derivation will of course be different for each grammar. So technically we write a parser for a grammar not a language. Some grammars are easier to parse, some grammars are easier to read/understand. This one belongs in the former.

Also this doesn't specify the entire language, just what it looks like. A grammar says nothing about semantics.

If you're interested in more about parsing and grammar I recommend Grune, Jacobs - Parsing Techniques. It's free and good for self-study.

Question 2

The python grammar - as most others - is given in BNF or Backus–Naur Form. Try reading up on how to read it but the basic structure is:

<something> ::= (<something defined elsewhere> | [some fixed things]) [...]

This is read as a <something> is defined as something else or any of the fixed things repeated a multitude of times.

BNF is based on a nearly 2000 year old format for describing the permitted structure of a language, is incredibly terse and will describe all the allowed structures in a given language, not necessarily all those that would make sense.

Example

Basic arithmetic can be described as:

<simple arithmetic expression> ::= <numeric expr>[ ]...(<operator>[ ]...<numeric expr>|<simple arithmetic expression>)
<numeric expr> ::= [<sign>]<digit>[...][.<digit>[...]]
<sign> ::= +|-
<operator> ::= [+-*/]
<digit> ::= [0123456789]

Which says that a simple arithmetic operation is an, optionally signed, number consisting of one or more digits, possibly with a decimal point and one, or more, subsequent digits, optionally followed by spaces, followed by exactly one of +-*/, optionally followed by spaces, followed by either a number or another simple arithmetic operation, i.e. a number followed by, etc.

This describes, just about, all of the basic arithmetic operations and can be extended to include functions, etc. Notice that does allow invalid operations that are a valid syntax, e.g.: 22.34 / -0.0 is valid syntactically even though the result is not valid.

It can sometimes make you aware that operations are possible that you might not have thought of, e.g.: 56+-50 is a valid operation as is 2*-10 but 2*/3 is not.

Note that SGML and XML/Schema are both related but different methodologies for describing the structure of any language. YAML is another method for describing the allowed structures in a computer specific languages.

Disclaimer: My BNF is a little rusty so if I have made any major mistakes in the above my apologies and please correct me.

Question 3

That is basically an EBNF (Extended Backus–Naur Form) specification.

Question 4

When you write a program in a language, the very first thing your interpreter/compiler must do in order to go from a sequence of characters to actual action is to translate that sequence of characters in a higher complexity structure. To do so, first it chunks up your program in a sequence of tokens expressing what each "word" represents. For example, the construct

if foo == 3: print 'hello'

will be converted into

1,0-1,2:    NAME    'if'
1,3-1,6:    NAME    'foo'
1,7-1,9:    OP  '=='
1,10-1,11:  NUMBER  '3'
1,11-1,12:  OP  ':'
1,13-1,18:  NAME    'print'
1,19-1,26:  STRING  "'hello'"
2,0-2,0:    ENDMARKER   ''

But note that even something like "if if if if" is correctly made into tokens

1,0-1,2:    NAME    'if'
1,3-1,5:    NAME    'if'
1,6-1,8:    NAME    'if'
1,9-1,11:   NAME    'if'
2,0-2,0:    ENDMARKER   ''

What follows the tokenization is the parsing into a higher level structure that analyzes if the tokens actually make sense taken together, something that the latter example does not, but the first does. To do so, the parser must recognize the actual meaning of the tokens (e.g. the if is a keyword, and foo is a variable), then build a tree out of the tokens, organizing them in a hierarchy and see if this hierarchy actually makes sense. Here is where the grammar you are seeing comes in. That grammar is in BNF, which is a notation to express the constructs the language can recognize. That grammar is digested by a program (for example, bison) which has the magic property of taking that grammar and generate actual C code that does the heavy work for you, normally by recognizing the tokens, organizing them, returning you a parse tree, or tell you where there's a mistake.

Short version: developing a language is about defining tokens and how these tokens are put together to give something meaningful. This is done through the grammar, which you use to generate the actual "parser" code with automated tools.