I've been wondering why XML has an L in its name.

By itself, XML doesn't "do" anything. It's just a data storage format, not a language! Languages "do" things.

The way you get XML to "do" stuff, to turn it into a language proper, is to add xmlns attributes to its root element. Only then does it tell its environment what it's about.
One example is XHTML. It's active, it has links, hypertext, styles etc, all triggered by the xmlns. Without that, an XHTML file is just a bunch of data in markup nodes.

So why then is XML called a language? It doesn't describe anything, it doesn't interpret, it just is.

Edit: Maybe my question should have been broader. Since the answer is currently "because XML was named after SGML, which was named after GML, etc" the question should have been, why are markup languages (like XML) called languages?

Oh, and WRT the close votes: no, I'm not asking about the X. I'm asking about the L!

Solution

The real answer is that XML has an L in the name because a guy named Raymond Lorie was among the designers of the first "markup language" at IBM in the 1970s. The developers had to find a name for the language, so they chose GML because it matched the initials of the three developers (Goldfarb, Mosher and Lorie). They then created the backronym Generalized Markup Language.

This was later standardized as SGML (Standard Generalized Markup Language), and when XML was created, the developers wanted to retain the ML suffix to indicate the family relationship to SGML. They added the X in front because they thought it looked cool. (Even though it doesn't actually make sense: XML is a metalanguage which allows you to define extensible languages, but XML is not really extensible itself.)

As for your second question, whether XML can legitimately be called a language:

Any structured textual (or even binary) format which can be processed computationally can be called a language. A language doesn't "do" anything as such, but some software might process input in the language and "do" something based on it.

You note that XML is a "storage format", which is true, but a textual storage format can also be called a language. These terms are not mutually exclusive.

Programming languages are a subset of languages. E.g. HTML and CSS are languages but not programming languages, while JavaScript is a real programming language. That said, there is no formal definition of programming language either, and there is a large grey zone of languages which could be called either data formats or programming languages depending on your point of view.

Given this, XML is clearly a language, just not a programming language, though it can be used to define programming languages like XSLT.

Your point about namespaces is irrelevant. Namespaces are an optional feature of XML and do not change the semantics of an XML vocabulary. They are only needed to disambiguate element names if the format may contain multiple vocabularies.
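To make the disambiguation point concrete, here is a minimal sketch using the parser from Python's standard library. The two vocabularies and their `urn:example:...` namespace URIs are made up for illustration; the only thing shown is that two elements with the same local name `title` end up with different qualified names.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing two made-up vocabularies that both define <title>.
DOC = """
<root xmlns:book="urn:example:book" xmlns:person="urn:example:person">
  <book:title>Some Book</book:title>
  <person:title>Dr.</person:title>
</root>
"""

def child_names(xml_text):
    """Return the qualified names of the root element's direct children."""
    root = ET.fromstring(xml_text.strip())
    return [child.tag for child in root]

# ElementTree writes a qualified name as "{namespace-uri}local-name".
print(child_names(DOC))
```

Without the prefixes, both children would simply be called `title` and a consumer could not tell them apart by name. Nothing else about the document changes.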


Edit: reinierpost pointed out that you might have meant something different by the question than what I understood. Maybe you meant that specific vocabularies like XHTML, RSS, XSLT etc. are languages because they associate elements and attributes with particular semantics, but the XML standard itself does not define any semantics for specific elements and attributes, so it does not feel like a "real language".

My answer to this would be that XML does define both syntax and semantics; it just defines them at a different level. For example, it defines the syntax of elements and attributes and rules about how to process them. XML is a "metalanguage", which is still a kind of language (just like metadata is still data!). As an example, EBNF is also clearly a language, but its purpose is to define the syntax of other languages, so it is also a metalanguage.

Other tips

Because it is a language. A markup language, not a programming language.

Notice that natural human languages like English and Spanish don't "do" anything either. In fact, technically C++ and Java and the like don't "do" anything until they're fed into a compiler and the output gets executed. Doing stuff and being a language are largely orthogonal to each other.

Let Σ be a non-empty, finite set of symbols, called an alphabet. Then Σ* is the countably infinite set of finite words that can be formed by concatenating zero or more symbols from Σ. Any well-defined subset L ⊆ Σ* is a language.

Let's apply this to XML. Its alphabet is the Unicode character set U, which is non-empty and finite. Not every concatenation of zero or more Unicode characters is a well-formed XML document; for example, the string

<tag> soup &; not <//good>

is clearly not. The subset XML ⊂ U* of well-formed XML documents is decidable (or “recursive”). There exists a machine (algorithm or computer program) that takes as input any word w ∈ U* and, after a finite amount of time, outputs 1 if w ∈ XML and 0 otherwise. Such an algorithm is a sub-routine of any XML processing software. Not all languages are decidable. For example, the set of valid C programs that terminate in a finite amount of time is not (this is known as the halting problem). When one designs a new language, an important decision to make is whether it should be as powerful as possible or whether the expressiveness would better be restricted in favor of decidability.
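As a rough illustration of such a decision procedure, the sketch below simply delegates to the non-validating parser in Python's standard library and maps its verdict to 1 or 0. The function name `is_well_formed` is my own. It is a sketch of the idea, not a statement about how any particular XML software is structured.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return 1 if text parses as a well-formed XML document, else 0."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return 0
    return 1

print(is_well_formed("<tag> soup &; not <//good>"))
print(is_well_formed("<root>all evil</root>"))
```

The parser always terminates on a finite input, which is exactly the property that makes the language decidable.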

Some languages can be defined by means of a grammar that is said to produce the language. A grammar consists of

  • a finite set of literals (also called terminal symbols),
  • a disjoint finite set of variables of the grammar (also called non-terminal symbols),
  • a distinguished starting symbol, taken from the set of variables and
  • a finite set of rules (so-called productions) that allow certain kinds of replacements.

Any word that consists exclusively of literals and can be derived by starting with the starting symbol and then applying the given rules belongs to the language produced by the grammar.

For example, the following grammar (in rather informal notation) lets you derive exactly the integers in decimal notation.

  1. The literals of the grammar are the digits 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0.
  2. The variables are the symbols S and D.
  3. S is the starting symbol.
  4. Any occurrence of the variable S may be replaced
    • by the literal 0 or
    • by any of the literals other than 0 followed by the variable D.
  5. Any occurrence of the variable D may be replaced
    • by any of the literals followed by another instance of the variable D or
    • by the empty string.

Here is how we derive 42:

S —(apply rule 4, 2nd variant)→ 4 D —(apply rule 5, 1st variant)→ 42 D —(apply rule 5, 2nd variant)→ 42.
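The grammar above is small enough to turn into a recognizer by hand. The following sketch (function names are my own) checks whether a word can be derived from S, and compares the result against what I believe is the equivalent regular expression `0|[1-9][0-9]*`.

```python
import re

NONZERO = set("123456789")
DIGITS = set("0123456789")

def derivable(word):
    """Decide whether word can be derived from S using rules 4 and 5."""
    if word == "0":
        return True  # rule 4, 1st variant: S -> 0
    if not word or word[0] not in NONZERO:
        return False
    # rule 4, 2nd variant: S -> (non-zero digit) D
    # rule 5: D -> (digit) D, or D -> empty string
    return all(ch in DIGITS for ch in word[1:])

# The same language written as a regular expression.
PATTERN = re.compile(r"0|[1-9][0-9]*")

def matches_regex(word):
    return PATTERN.fullmatch(word) is not None

for w in ["42", "0", "042", ""]:
    print(repr(w), derivable(w), matches_regex(w))
```

Both functions accept 42 and 0 and reject 042 and the empty word, since a leading zero can only be produced by the first variant of rule 4, which ends the derivation.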

Depending on how elaborate the rules you allow in your grammar are, differently sophisticated machines are required to prove that a given word can actually be produced by the grammar. The example given above is a regular grammar, which is the simplest and least powerful class. The next more powerful class of grammars is called context-free. These grammars are also very simple to verify. XML (unless I'm overlooking some obscure feature I'm not aware of) can be described by a context-free grammar. The classification of grammars forms the Chomsky hierarchy of grammars (and therefore languages).

Every language that can be described by a grammar is at least semi-decidable (or “recursively enumerable”). That is, there exists a machine that, given a word that actually belongs to the language, derives a proof within finite time that it can be produced by the grammar, and will never output a wrong proof. Such a machine is called a verifier. Note that the machine may never halt when given a word that doesn't actually belong to the language. Clearly, we want our programming languages to be described by less powerful grammars for the benefit of being able to reject invalid programs within finite time.
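To illustrate the step from regular to context-free: properly nested brackets are a textbook example of a language that a context-free grammar can describe but a regular grammar cannot, because the recognizer has to remember an unbounded number of open brackets. The sketch below is a toy, not XML, but the stack plays the same role as the list of currently open elements in an XML parser. The name `is_nested` is my own.

```python
# Maps each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "["}

def is_nested(word):
    """Accept words over the alphabet ( ) [ ] whose brackets nest properly."""
    stack = []
    for ch in word:
        if ch in "([":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False  # symbol outside the alphabet
    return not stack  # every opened bracket must have been closed

print(is_nested("([])[]"))
print(is_nested("([)]"))
```

The check runs in a single pass and always halts, which is what "very simple to verify" amounts to in practice.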

Schemata are an addition to XML that allow refining the set of well-formed documents. A well-formed document that follows a certain schema is called valid according to that schema. For example, the string

<?xml version="1.0" encoding="utf-8" ?>
<root>all evil</root>

is a well-formed XML document but not a valid XHTML document. There exist schemata for XHTML, SVG, XSLT and what not else. Schema validation can also be done by an algorithm that is guaranteed to halt after a finite number of steps for every input. Such a program is called a validator or a validating parser. Schemata are defined by so-called schema definition languages, which are a way to formally define grammars. XSD is the official schema-definition language for XML and is, itself, XML-based. RELAX NG is a more elegant, much simpler and slightly less powerful alternative to XSD.
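Python's standard library has no schema validator, so the following is only a sketch of the distinction between well-formed and valid, under a loud assumption: instead of applying a real XHTML schema, it checks a single necessary condition, namely that the root element is `html` in the XHTML namespace. Real validation would need a schema-aware library. The function name `classify` is my own.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def classify(text):
    """Separate 'not well-formed' from 'well-formed but not valid XHTML'.

    Assumption: only the root element is inspected, so passing this
    check does not mean the document is valid XHTML.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"
    if root.tag != "{%s}html" % XHTML_NS:
        return "well-formed, but not valid XHTML"
    return "well-formed; passes this one XHTML check"

print(classify("<root>all evil</root>"))
print(classify("<root>"))
```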

Because you can define your own schemata, XML is called an extensible language, which is the origin of the “X” in “XML”.

You can define a set of rules that gives XML documents an interpretation as descriptions of computer programs. XSLT, mentioned earlier, is an example of such a programming language built with XML. More generally, you can serialize the abstract syntax tree of almost any programming language quite naturally into XML, if this is what you want.
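As a small sketch of what "giving XML documents an interpretation" can look like, here is a made-up three-element vocabulary (`num`, `add`, `mul`) for arithmetic expressions, together with an evaluator that walks the tree. The vocabulary and the function names are hypothetical and have nothing to do with XSLT. The point is only that the meaning lives in the program reading the XML, not in XML itself.

```python
import xml.etree.ElementTree as ET

def evaluate(node):
    """Interpret an element of the made-up vocabulary as a number."""
    if node.tag == "num":
        return int(node.text)
    values = [evaluate(child) for child in node]
    if node.tag == "add":
        return sum(values)
    if node.tag == "mul":
        result = 1
        for value in values:
            result *= value
        return result
    raise ValueError("unknown element: " + node.tag)

def run(xml_text):
    return evaluate(ET.fromstring(xml_text))

# The syntax tree of 1 + 2 * 3, serialized as XML.
PROGRAM = "<add><num>1</num><mul><num>2</num><num>3</num></mul></add>"
print(run(PROGRAM))  # prints 7
```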

In computer science, a formal language is just a set of strings, usually infinite and often described using rules (two common versions of those rules are regular expressions and formal grammars).

Note that this means that all a language needs is syntax. A language doesn't need to describe what each valid string means (that's called semantics).

Now, this means that programming languages are formal languages that also have semantics, which describe some computation. And, for example, XHTML is a formal language whose semantics describe (roughly and informally) how a hypertext document looks and behaves.

XML is still a language, even though it doesn't have semantics itself (but many languages derived from XML do, like XHTML and XAML).

Technically, binary formats are also languages, but they're not usually called that. In practice, the term "language" is reserved for human-readable formats.

A language is a method of conveying information.

A programming language is a method of conveying algorithms.

A markup language like XML is a language for conveying data.

XML is a meta-language. You use it to define specific languages. Languages never do anything; they just allow us to express things. Also, it is not true that XML is a "storage language". Just the opposite, in fact. You can store XML docs however you please. XML is better thought of as a transfer language.

PS. If you don't think XML "does" anything, you'll have to explain how it is that many systems (e.g. jetty) use XML as a (bad) programming language. It's a lamentable abuse of XML, but it exists in the wild, and that is just one of many examples.

Licensed under: CC-BY-SA with attribution