Question

In, for example, the Bash scripting language, the following creates a string called $VAR which begins at the first " quote and continues until the next unescaped " quote.

VAR="
    hello
world!

this string preserves all
    whitespace"

This makes it very easy to write multiline strings without concatenation or a million annoying \ns everywhere, and it makes the parser very easy to write (speaking from experience) because you can just gobble everything between unescaped quotes with a regex like "([^"\\]*(?:\\.[^"\\]*)*)" or so.
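As a rough illustration of that regex approach, here is a minimal Python sketch. The sample source text and the names STRING_RE, source and body are made up for the example, and re.DOTALL is an added assumption so that a backslash followed by a newline also counts as an escape.

```python
import re

# The regex from the question. DOTALL is an assumption added here so that
# "\\." also matches a backslash followed by a newline.
STRING_RE = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)

# Hypothetical source text: one multiline literal containing escaped quotes,
# followed by an unrelated statement.
source = 'VAR="\n    hello\nworld!\n\nsay \\"hi\\"\n    ok"\necho done\n'

match = STRING_RE.search(source)
body = match.group(1)      # everything between the unescaped quotes
rest = source[match.end():]  # what the lexer would continue with

print(repr(body))
print(repr(rest))
```

The literal newlines and the escaped quotes all end up inside one match, and scanning resumes right after the closing quote.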

Bash is (hopefully!) not a mission-critical or systems-programming language, but it is a systems-scripting language intended for *nx boxes on which everything is text, so perhaps it's apt.

Recall that Bash is written in C, and so this string is (probably) stored as \n\thello\nworld\n etc, but the point is the source written by the programmer (and the above is far more readable).

Many (I daresay C-influenced) "proper" Languages Used For Real Purposes find some unknown problem with allowing strings to contain literal newlines, and thus require one or more of the following:

  • escape sequences\n (which text-mode output turns into \r\n on Windows)

  • special syntax (""" multiline string """ in Python, `multiline string` in Go, or R"( raw string literal )" in C++11, etc.)

  • special functions to write newlines (Forth's CR, for example, although Forth gets a pass because it knows squat about strings)

I do not understand why more languages don't allow strings to be "implicitly" multiline.

Pros:

  • ease of use & practicality, clearer code, etc

  • simpler, more straightforward and thus more maintainable parser (at least, for hand-written ones)

Cons:

  • may make some code less readable, if abused

  • ?

Is there an explicit reason this is the case, or has it just been blindly(?) adopted from C like so many other things? Moreover, if I'm writing a parser or designing a language, is there a compelling argument as to why I should restrict string literals to a single line?


Solution

FWIW, OCaml accepts a limited form of multi-line string literal:

String literals are delimited by " (double quote) characters. The two double quotes enclose a sequence of either characters different from " and \, or escape sequences from the table given above for character literals.

To allow splitting long string literals across lines, the sequence \newline spaces-or-tabs (a backslash at the end of a line followed by any number of spaces and horizontal tabulations at the beginning of the next line) is ignored inside string literals.

and C++11 has raw string literals so you can code:

const char* s1 = R"foo(
Hello
World
)foo";

Hence several languages do have ways to write multi-line string literals.
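The OCaml continuation rule quoted above can be illustrated with a small Python sketch. It only mimics the rule as stated in the quoted text (backslash, newline, then any spaces or tabs are dropped); it is not OCaml's actual lexer, and the function name join_continued_lines is invented for the example.

```python
import re

def join_continued_lines(literal_body):
    """Drop each backslash-newline plus the spaces/tabs that follow it."""
    return re.sub(r'\\\n[ \t]*', '', literal_body)

# Hypothetical literal body: one continued line and one plain newline.
body = 'first part, \\\n      second part\nthird line'
joined = join_continued_lines(body)

print(repr(joined))  # 'first part, second part\nthird line'
```

Only the backslash-terminated line is joined; the plain newline before "third line" stays in the string.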

Other tips

What happens when you didn't mean to have a multi-line string, but instead forgot to close the quote?

The parser will chew through the code until it hits another quote in a completely different part of the program, then proceed as normal. This will very likely lead to confusing, unrelated errors since the string is no longer the parse error. At worst, you get a program that compiles properly and does something completely different.
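A minimal Python sketch of that failure mode, reusing the regex from the question as a stand-in for a real lexer. The two source snippets and the helper string_literals are hypothetical.

```python
import re

# Same regex as in the question, used here as a toy string lexer.
STRING_RE = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)

def string_literals(source):
    """Return the bodies of all string literals the regex finds."""
    return [m.group(1) for m in STRING_RE.finditer(source)]

good = 'a = "one"\nb = f(2)\nc = "three"\n'
bad = 'a = "one\nb = f(2)\nc = "three"\n'  # closing quote after one is missing

print(string_literals(good))  # ['one', 'three']
print(string_literals(bad))   # ['one\nb = f(2)\nc = ']
```

With the quote missing, the first literal swallows the next statement and ends at the opening quote of "three". The word three is then left outside any string, so whatever error is reported points at the last line rather than at the line with the actual mistake.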

This is compounded by partial processing of code in modern IDEs. As you're typing the string, you're going to cause this scenario naturally. That will cause the IDE to toss the cached AST since it sees a bunch of stuff has changed, leading to slower IntelliSense (and similar features).

The preprocessor has already given meaning to newline characters. You can't completely undo that at a higher level. Compare:

char s1[] = "This is how macros work in C\nExample\n    #define IS_GOOD 1\n";

with

char s2[] = "This is how macros work in C
Example
    #define IS_GOOD 0
";

Clearly the second is easier to read (in a hypothetical C compiler that accepts multiline string literals).

It also doesn't do what you expected. s2 doesn't contain an example of C code at all; what you actually got was:

char s2[] = "This is how macros work in C\nExample\n";

oops.

Or, you can change the preprocessor grammar also, by making it aware of quoting. Then you lose the ability to expand macros to definitions containing quotes. Hardly any better.
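The s2 example can be mimicked with a toy Python sketch. It assumes a purely line-oriented pass that consumes any line starting with # before string literals are recognised, which is the hypothetical scenario described above and not how a real C preprocessor is implemented. The function name strip_directives is invented.

```python
def strip_directives(source):
    """Toy pass: drop every line whose first non-blank character is '#'."""
    kept = []
    for line in source.split('\n'):
        if line.lstrip().startswith('#'):
            continue  # consumed as a directive, even though it sits inside quotes
        kept.append(line)
    return '\n'.join(kept)

source = (
    'char s2[] = "This is how macros work in C\n'
    'Example\n'
    '    #define IS_GOOD 0\n'
    '";\n'
)
after = strip_directives(source)
print(after)
```

Because the pass knows nothing about quoting, the #define line never reaches the string literal.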

Things that cause unexpected and confusing results are not desirable features.

I cannot answer the "why"; as far as I know, language designers tend to copy the "bad stuff" as often as the "good stuff" when designing a language based on other languages.

I do have to say that using a regex to parse your code is not the best way to do it. Writing a parser that keeps track of multiline strings might also be harder than you would expect, especially if indentation is part of the language.

What I can say about designing a new string format is to not even use the double quote ". If I were to design a language I would use parentheses to encapsulate "content" and implement a function in the standard library like so:

let foo = String(This is my string which can be multiline and does not
                 need escape characters for anything other than \( and \).)
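A rough Python sketch of how such a literal could be scanned, assuming the rules are exactly as stated: the body ends at the first unescaped ), and \( and \) are the only escapes. The function read_paren_string and the sample source are hypothetical.

```python
def read_paren_string(source, start):
    """Scan a body that begins just after 'String('.

    Returns (text, index just past the closing parenthesis).
    """
    out = []
    i = start
    while i < len(source):
        ch = source[i]
        if ch == '\\' and i + 1 < len(source) and source[i + 1] in '()':
            out.append(source[i + 1])  # \( or \) becomes a literal parenthesis
            i += 2
        elif ch == ')':
            return ''.join(out), i + 1  # first unescaped ) ends the literal
        else:
            out.append(ch)  # newlines and double quotes are ordinary characters
            i += 1
    raise SyntaxError('unterminated String( ... )')

source = 'let foo = String(multi\nline \\(with parens\\) and "quotes")'
start = source.index('String(') + len('String(')
text, end = read_paren_string(source, start)
print(repr(text))
```

Note that this format has the same runaway problem as quotes: a forgotten ) makes the scanner read on until the next one it finds.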
Licensed under: CC-BY-SA with attribution