Why do most languages' definitions of substring allow substring("abc", 3) => ""?

https://softwareengineering.stackexchange.com/questions/302176

08-12-2020

Question

I've seen that in most languages, passing the length of a string as the start index to its substring method gives you an empty string. This is definitely helpful when writing algorithms that successively shorten a string down to the empty string. The problem I have is that it makes most of the summaries of these methods/functions look inaccurate when they describe the result.

For instance, the documentation for Java's String.substring states that "The substring begins at the specified beginIndex and extends to the character at index endIndex - 1." This makes sense for all values 0 <= i < len(string). As soon as you use len(string), however, what does that index refer to? In a language like C, using C strings, it naturally refers to the null terminator, which we treat as an empty string. The implementations I've seen specifically check that the indices fall in the range [0, len(string)]. When only one argument is specified, the length of the result is the difference between the length of the string and the start index, which happens to work out to 0 for startIndex = len(string).

I've come to believe this is just an unspoken convention among languages, going back to the roots of null-terminated strings, where the terminator acts as an empty string. Can anyone shed some light on this beyond "it's just the way it is"?

Solution

You're thinking about the meaning of this argument in the wrong way.

Slightly surprisingly, the value doesn't count characters. What it does count is positions between the characters in the string.

The string "abc" has four such positions: before the a, between a and b, between b and c, and after the c. As usual in computing, we count these positions from 0 to 3. By specifying two such positions, it's immediately clear which characters should and shouldn't belong to the substring: substr("abc", 0, 1) obviously means "a", and substr("abc", 2, 3) means "c". And substr("abc", 3) can only mean "start at the position after the c", which will obviously yield the empty string.
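The "positions between characters" reading can be made concrete with a small sketch. The helper substr below is hypothetical, written only to mirror the calls above; it is not any particular language's library function, and it leaves out range checking to keep the idea visible. It keeps exactly those characters that lie wholly between the two given positions.

```python
# Positions in "abc":   |a|b|c|
#                       0 1 2 3
# Character i sits between position i and position i + 1.

def substr(text, begin, end=None):
    """Hypothetical helper: begin and end are positions, not characters."""
    if end is None:
        end = len(text)  # one-argument form: run to the final position
    # Keep each character whose left edge is at or after begin
    # and whose right edge is at or before end.
    return "".join(
        ch for i, ch in enumerate(text) if begin <= i and i + 1 <= end
    )

print(repr(substr("abc", 0, 1)))
print(repr(substr("abc", 2, 3)))
print(repr(substr("abc", 3)))
```

With begin = 3, no character has its left edge at or after position 3, so nothing is kept and the result is the empty string.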

All substring functions I have ever seen work this way, although the only place I've seen this spelt out for me is the Emacs online help - that one is a real eye-opener. (Emacs is "focused with maniacal intensity on the deceptively simple-seeming problem of editing text" - Neal Stephenson)

Other tips

substring ("abc", 0) gives you three characters.

substring ("abc", 1) gives you two characters, one less.

substring ("abc", 2) gives you one character, one less again.

At this point you expect substring ("abc", 3) to give you one character less again, which would be zero characters. It's possible to return zero characters (result = "") so that's what it returns.

substring ("abc", 4) should give one character less again, which would be -1 characters. Clearly, that's not possible, so there's no reasonable result that this could return to you. Accordingly, it fails.
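The countdown above can be checked with a short sketch. The function substring here is a hypothetical stand-in with an explicit range check, assumed to behave like the implementations described in the question; Python's built-in slicing is more lenient and would not fail for a start index of 4, which is why the check is written out by hand.

```python
def substring(text, start):
    """Hypothetical one-argument substring with an explicit range check."""
    # Valid start positions run from 0 up to and including len(text).
    if not 0 <= start <= len(text):
        raise IndexError(f"start index {start} is outside 0..{len(text)}")
    return text[start:]

# Each step of the start index removes exactly one character,
# until there are no characters left to remove.
for start in range(5):
    try:
        result = substring("abc", start)
        print(start, repr(result), len(result))
    except IndexError as error:
        print(start, "fails:", error)
```

The result length is always len(text) - start, so the only start values with a sensible answer are those for which that difference is not negative.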

Your one-argument substring function would be an overload of a function that takes both a start and an end. To that function you pass the start index and the index one past the last character.

This means that the entire string is given by (0, length); it also means that an empty substring has both values the same.

Letting the index range from 0 to length inclusive also lets you call these functions on empty strings (for what it's worth).
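As a rough illustration of that overload relationship, here is a sketch in which the one-argument form simply delegates to the two-argument form. Both function names are made up for this example, and the range check is an assumption modelled on the [0, length] check mentioned in the question.

```python
def substring_between(text, start, end):
    """Hypothetical two-argument form: end is one past the last character."""
    if not 0 <= start <= end <= len(text):
        raise IndexError(f"range {start}..{end} is outside 0..{len(text)}")
    return text[start:end]

def substring_from(text, start):
    """Hypothetical one-argument form: delegate with end = length."""
    return substring_between(text, start, len(text))

print(repr(substring_between("abc", 0, 3)))  # the entire string
print(repr(substring_between("abc", 1, 1)))  # start == end: empty
print(repr(substring_from("abc", 3)))        # start == length: empty
print(repr(substring_from("", 0)))           # works on an empty string
```

Because the one-argument form always passes the length as the end, a start equal to the length is just the start == end case of the two-argument form.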

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange