Question

Is this possible? Given that C# uses immutable strings, one could expect that there would be a method along the lines of:

var expensive = ReadHugeStringFromAFile();
var cheap = expensive.SharedSubstring(1);

If there is no such function, why bother with making strings immutable? Or, alternatively, if strings are already immutable for other reasons, why not provide this method?

The specific reason I'm looking into this is file parsing. Simple recursive descent parsers (such as the ones generated by TinyPG, or ones easily written by hand) use Substring all over the place, so if you give them a large file to parse, the memory churn is enormous. Sure, there are workarounds: basically roll your own substring class, and then of course forget about being able to use String methods such as StartsWith or string libraries such as Regex, so you need to roll your own versions of these as well. I assume parser generators such as ANTLR basically do that, but my format is simple enough not to justify such a monster tool. Even TinyPG is probably overkill.

Somebody please tell me I am missing some obvious or not-so-obvious standard C# method call somewhere...

Solution

No, there's nothing like that.

.NET strings contain their text data directly, unlike Java strings, which (at least before Java 7 update 6) hold a reference to a char array, an offset and a length.

Both solutions have "wins" in some situations, and losses in others.

If you're absolutely sure this will be a killer for you, you could implement a Java-style string for use in your own internal APIs.

OTHER TIPS

As far as I know, all larger parsers use streams to parse from. Isn't that suitable for your situation?

The .NET Framework supports string interning. This is a partial solution, but it does not offer the possibility of reusing parts of a string. I think reusing substrings would cause some problems that are not obvious at first glance. If you have to do a lot of string manipulation, StringBuilder is the way to go.
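To illustrate what interning does and does not cover, here is a minimal sketch. It only uses string.Intern and ReferenceEquals; the InternDemo class name and the sample text are made up for the example.

```csharp
using System;

public static class InternDemo
{
    public static void Main()
    {
        // Build two equal strings at run time so they start out as separate objects.
        string a = new string(new[] { 't', 'o', 'k', 'e', 'n' });
        string b = new string(new[] { 't', 'o', 'k', 'e', 'n' });
        Console.WriteLine(ReferenceEquals(a, b));   // False

        // Interning maps equal strings to one shared instance.
        string ia = string.Intern(a);
        string ib = string.Intern(b);
        Console.WriteLine(ReferenceEquals(ia, ib)); // True

        // Interning works on whole strings only: Substring(1) still
        // allocates a new string and copies the characters into it.
        string sub = ia.Substring(1);
        Console.WriteLine(sub);                     // oken
    }
}
```

So interning removes duplicate whole strings, but each distinct substring is still its own copy.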

Nothing in C# provides you the out-of-the-box functionality you're looking for.

What you want is a rope, an immutable data structure that supports O(1) concatenation and O(log n) substrings. I can't find any C# implementations of a rope, but here is a Java one.

Barring that, there's nothing wrong with using TinyPG or ANTLR if that's the easiest way to get things done.

Well, you could use "unsafe" code to do the memory management yourself, which might let you do what you are looking for. Also, the StringBuilder class is great for situations where a string needs to be manipulated numerous times, since it doesn't create a new string with each manipulation.

You could easily write a trivial class to represent "cheap". It would just hold a reference to the original string, the index of the start of the substring and the length of the substring. A couple of methods would allow you to read the substring out when needed. A string cast operator would be ideal, as you could use

string text = myCheapObject;

and it would work seamlessly, as if it were an actual string. Adding support for a few handy methods like StartsWith would be quick and easy (they'd all be one-liners).
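A rough sketch of such a class might look like the following. The name StringSlice and its members are hypothetical, chosen for this example rather than taken from any library; only the string conversion copies characters.

```csharp
using System;

// Hypothetical type: a view over part of an existing string, with no copying.
public sealed class StringSlice
{
    private readonly string _source; // the original (expensive) string
    private readonly int _start;     // index of the first character of the view

    public int Length { get; }

    public StringSlice(string source, int start, int length)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (start < 0 || length < 0 || start > source.Length - length)
            throw new ArgumentOutOfRangeException();
        _source = source;
        _start = start;
        Length = length;
    }

    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= Length) throw new IndexOutOfRangeException();
            return _source[_start + index];
        }
    }

    // O(1): the new slice shares the same underlying string.
    public StringSlice Slice(int start)
    {
        if (start < 0 || start > Length) throw new ArgumentOutOfRangeException(nameof(start));
        return new StringSlice(_source, _start + start, Length - start);
    }

    // Ordinal comparison against the shared data, without allocating.
    public bool StartsWith(string prefix) =>
        prefix.Length <= Length &&
        string.CompareOrdinal(_source, _start, prefix, 0, prefix.Length) == 0;

    // The only place where characters are actually copied.
    public override string ToString() => _source.Substring(_start, Length);

    public static implicit operator string(StringSlice slice) => slice?.ToString();
}

public static class Demo
{
    public static void Main()
    {
        string expensive = "HEADER:payload";
        StringSlice cheap = new StringSlice(expensive, 0, expensive.Length).Slice(7);
        Console.WriteLine(cheap.StartsWith("pay")); // True
        string text = cheap;                        // implicit conversion copies here
        Console.WriteLine(text);                    // payload
    }
}
```

One caveat with this design: a slice keeps the whole original string alive for as long as the slice is referenced, so a tiny token can pin a huge file's worth of text in memory.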

The other option is to write a regular parser and store your tokens in a Dictionary from which you share references to the tokens rather than keeping multiple copies.
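As a sketch of the dictionary idea, assuming a hypothetical TokenPool helper (not a framework type) and a toy whitespace tokenizer:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: hands out one shared instance per distinct token text.
public sealed class TokenPool
{
    private readonly Dictionary<string, string> _tokens = new Dictionary<string, string>();

    public int Count => _tokens.Count;

    public string GetOrAdd(string token)
    {
        if (_tokens.TryGetValue(token, out string shared))
            return shared;          // reuse the instance stored earlier
        _tokens.Add(token, token);  // first occurrence becomes the shared one
        return token;
    }
}

public static class TokenPoolDemo
{
    public static void Main()
    {
        string input = "let x = x + x";
        var pool = new TokenPool();
        var tokens = new List<string>();

        foreach (string raw in input.Split(' '))
            tokens.Add(pool.GetOrAdd(raw));

        Console.WriteLine(tokens.Count);                          // 6
        Console.WriteLine(pool.Count);                            // 4
        Console.WriteLine(ReferenceEquals(tokens[1], tokens[3])); // True
    }
}
```

Note that each raw token is still allocated once before the lookup, so this reduces what stays in memory long-term rather than the short-lived allocations during parsing.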

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow