Question

Some languages (e.g. C# and Java) have immutable strings while others (e.g. Ruby) have mutable ones. What are the reasons behind those design choices?

Solution

One reason immutable strings are good is that they make Unicode support easier. Modern Unicode can no longer fit efficiently into a fixed-size data cell, which kills the one-to-one correspondence between string index and memory address that gives mutable strings their advantage.


In the past, most Western applications used single-byte characters (various ASCII-based encodings, or EBCDIC...), so you could usually handle them efficiently by treating strings as byte buffers (as in traditional C applications).

When Unicode was fairly new, there wasn't much requirement for anything outside the first 16 bits, so Java used double-byte characters for its Strings (and StringBuffers). This used twice the memory, and ignored any problems that might occur from Unicode extensions beyond 16 bits, but it was convenient at the time.

Now Unicode is not so new, and while the most-used characters still fit in 16 bits, you can't really get away with pretending the Basic Multilingual Plane is all that exists. If you want to honestly claim Unicode support, you need either variable-length characters or even larger (32-bit?) character cells.

With variable-length characters, you can no longer index into an arbitrary-length string in O(1) time -- barring additional information, you need to count from the beginning to figure out what the N'th character is. This also kills the main advantage of mutable string buffers: the ability to seamlessly modify substrings in place.
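This mismatch is visible in Java itself: the `String` API is indexed by 16-bit `char` units, so a character outside the Basic Multilingual Plane occupies two units (a surrogate pair) and positional indexing no longer lines up with code points. A small sketch:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1F600 (an emoji) lies outside the BMP, so UTF-16 stores it
        // as a surrogate pair: two char units for one code point.
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                       // 4 char units
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points

        // charAt(1) returns half of a surrogate pair, not a character:
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```

So even with fixed-size 16-bit cells, "the Nth character" and "index N" are not the same thing once you leave the BMP.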

Fortunately, most string manipulation doesn't actually need this modify-in-place ability. Lexing, parsing, and search all proceed on a sequential, iterative basis, from beginning to end. General search-and-replace was never in-place to begin with, since the replacement string doesn't have to be the same length as the original.


Concatenating large numbers of substrings doesn't actually need modify-in-place to be efficient, either. You do need to be more careful about it, though, since (as others have pointed out) a naive concatenation loop can easily be O(N^2) by allocating a new string for each of N partial substrings...

One way to avoid naive concatenation is to provide a mutable StringBuffer or ConcatBuffer object designed to do concatenation efficiently. Another way would be to include an immutable string constructor that takes an iterator into a sequence of strings to be (efficiently) concatenated.
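Both approaches can be sketched in Java (the method names `naive` and `buffered` are illustrative, not a standard API): `StringBuilder` plays the mutable-buffer role, and `String.join` acts as the constructor-over-a-sequence:

```java
import java.util.List;

public class ConcatDemo {
    // Naive: each += allocates a fresh String and recopies everything
    // accumulated so far, so N appends cost O(N^2) character copies.
    static String naive(List<String> parts) {
        String result = "";
        for (String p : parts) result += p;
        return result;
    }

    // StringBuilder appends into a growable internal buffer,
    // giving amortized O(N) total work.
    static String buffered(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) sb.append(p);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> parts = List.of("foo", "bar", "baz");
        System.out.println(buffered(parts));       // foobarbaz
        // The "constructor over a sequence" style, built in:
        System.out.println(String.join("", parts)); // foobarbaz
    }
}
```

Both produce identical immutable results; only the intermediate allocations differ.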

But, more generally, it is possible to write an immutable string library that efficiently concatenates by reference. This kind of string is often called a "rope" or "cord" to suggest that it is at least a bit more heavyweight than the basic strings it's composed of, but for concatenation purposes it is much more efficient, since it doesn't need to recopy the data at all!

The Wikipedia article on ropes says they are O(log N) to concatenate, but the seminal book "Purely Functional Data Structures" by Okasaki shows how to support concatenation in O(1) time.
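A minimal rope sketch in Java (hypothetical class names; tree balancing is omitted for brevity): concatenation just allocates one node referencing both children, so no character data is copied, at the cost of indexing becoming O(depth):

```java
// A rope is a binary tree whose leaves hold ordinary strings.
abstract class Rope {
    abstract int length();
    abstract char charAt(int i);

    static Rope of(String s) { return new Leaf(s); }

    // O(1): share both operands by reference, copy nothing.
    Rope concat(Rope other) { return new Concat(this, other); }
}

final class Leaf extends Rope {
    final String s;
    Leaf(String s) { this.s = s; }
    int length() { return s.length(); }
    char charAt(int i) { return s.charAt(i); }
}

final class Concat extends Rope {
    final Rope left, right;
    final int len;
    Concat(Rope l, Rope r) {
        left = l; right = r;
        len = l.length() + r.length();  // cached so length() stays O(1)
    }
    int length() { return len; }
    // Descend into whichever child holds index i.
    char charAt(int i) {
        return i < left.length() ? left.charAt(i)
                                 : right.charAt(i - left.length());
    }
}
```

Since every node is immutable, ropes can also share arbitrary subtrees between many logical strings, which is exactly what reference-counted or garbage-collected sharing makes safe.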

OTHER TIPS

At least in the case of Java, part of the reason for making strings immutable was security and thread-safety. Java places a premium on runtime safety: it was originally designed to let set-top boxes and web browsers download and execute remote content without compromising the host system. To help enforce that safety, strings are immutable and the String class cannot be subclassed, so the Java runtime can pass strings to and from user code while guaranteeing their values remain constant. Otherwise an attacker could subclass String, pass what looks like a valid value into a security check, and then change it afterwards to gain access to the wrong data; or use a second thread to mutate a string that looked correct at the moment it was validated.

Additionally, immutability carries efficiency advantages in multithreaded systems, since no locking has to be done on the string. It also makes it possible to implement substring operations cheaply, since many strings can share the same underlying array of characters with different start and end offsets (Java's String did exactly this until Java 7 update 6, when substring was changed to copy instead, trading this optimization for more predictable memory behavior).

If you think about it, all the fundamental data types are immutable. You don't change integer 10 into 11, you replace 10 with 11. Making strings fundamental, and immutable, allows pooling and other optimizations that would not be possible otherwise.
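Java's string pool illustrates the point: because a `String` can never change, identical literals can safely refer to one shared object. A quick sketch:

```java
public class InternDemo {
    public static void main(String[] args) {
        // Compile-time string constants are pooled: both literals
        // refer to the same immutable object.
        String a = "hello";
        String b = "hello";
        System.out.println(a == b);              // true: same pooled object

        // Strings constructed at runtime are distinct objects
        // until explicitly interned into the pool.
        String c = new String("hello");
        System.out.println(a == c);              // false: different objects
        System.out.println(a == c.intern());     // true: pooled again
    }
}
```

None of this sharing would be safe if any holder of the reference could mutate the shared value.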

As for the cons, immutable strings require complementary mutable data structures (i.e. string buffers) to allow economical appending, reordering and other similar operations.

Such operations performed over immutable structures would require unreasonable amounts of resources.

Programming in Lua has a brilliant explanation on the matter.


To reflect further, some languages (like Common Lisp) provide both non-destructive and destructive functions, while others (like Python) offer both immutable and mutable sequence types, such as tuples versus lists.

To quote a book on Common Lisp:

If assignment is so fraught with peril, why not just omit it from the language? There are two reasons: expressiveness and efficiency. Assignment is the clearest way to alter shared data. And assignment is more efficient than binding. Binding creates a new storage location, which allocates storage, which consumes additional memory (if the binding never goes out of scope) or taxes the garbage collector (if the binding eventually does go out of scope).


However, as a counter-example, many interpreters for JavaScript (which has immutable strings) treat strings as mutable arrays at the implementation level.

In the same vein, Clojure has transients, which look like elegant pure functions over immutable data structures, but inside use mutable state for efficiency.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow