Question

I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:

key1:value1
key2:value2
key3:value3
...

This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with spaces, tabs, extra line breaks and stuff like that? I can probably use Replace to remove spaces and tabs, but are there any other "invisible" characters I'm missing?

Or maybe I can remove all characters that are not alphanumeric, ":" or line breaks (since line breaks are what separate one pair from another), and then remove all extra line breaks. If so, I don't know how to remove "all-except-some" characters.

Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show an "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.

Note: I do NOT need any whitespace at all, even inside a key or a value.

Solution

The requirements are too fuzzy. Consider:

"When is a space part of a key or a value?"
"When is a delimiter part of a key or a value?"
"When is a tab part of a key or a value?"
"Where does a key or a value end when it contains the delimiter?"

These problems will result in code filled with one-offs and a poor user experience. This is why we have language rules and grammars.

Define a simple grammar and take out most of the guesswork.

"{key}":"{value}",

Here each key and each value is contained within quotes, and pairs are separated by a delimiter (,). All characters outside the quotes can be ignored. You could use XML, but this may scare off less technical users.

Note, the quotes are arbitrary. Feel free to replace them with any pair of container characters that will not need much escaping (just beware the complexity).
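
As a rough illustration of how little code such a grammar needs, here is a minimal sketch of a reader for it. ParseQuotedPairs is a made-up name, and the sketch assumes that keys and values never contain a double quote (no escaping), that the trailing comma is optional, and that a repeated key simply overwrites the earlier one. It is written as C# 9 top-level statements; in an older project the method would go in a static class.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts every "key":"value" pair from the text and
// ignores anything outside the quotes. Assumes keys and values never contain
// a double quote (no escaping is handled).
static Dictionary<string, string> ParseQuotedPairs(string text)
{
    var pairs = new Dictionary<string, string>();
    foreach (Match m in Regex.Matches(text, "\"([^\"]*)\"\\s*:\\s*\"([^\"]*)\""))
    {
        // A repeated key overwrites the earlier value (an assumption of this sketch).
        pairs[m.Groups[1].Value] = m.Groups[2].Value;
    }
    return pairs;
}

var parsed = ParseQuotedPairs("  \"name\" : \"John Smith\",\n\n\t\"city\":\"New York\",");
Console.WriteLine(parsed["name"]); // John Smith
```

Stray spaces, tabs and blank lines between the pairs never reach the dictionary, while whitespace inside the quotes is kept on purpose.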

Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.

OTHER TIPS

I wrote this one recently when I finally got pissed off at too much undocumented garbage coming through in a feed and producing bad XML. It strips anything that doesn't fall between a space and the ~ in the ASCII table:

// Extension method: declare it in a static class; needs System.Text.RegularExpressions.
public static string StripControlChars(this string s)
{
    // Keep only printable ASCII, from space (\x20) to ~ (\x7E).
    return Regex.Replace(s, @"[^\x20-\x7E]", "");
}

Combined with the other RegEx examples already posted it should get you where you want to go.

If you use Regex (Regular Expressions) you can filter out all of that with one function.

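string newVariable = Regex.Replace(variable, @"\s", "");

That will remove all whitespace characters, including spaces, tabs, \n, and \r. It will not remove invisible characters that are not whitespace, such as other control characters.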
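
To make the reach of \s concrete, here is a small sketch (the sample strings are invented). In .NET, \s covers Unicode separators such as the non-breaking space, but not format characters such as the zero-width space. Note that running it over a whole file would also remove the line breaks that separate the pairs, so it is safer to apply it one line at a time.

```csharp
using System;
using System.Text.RegularExpressions;

// \s covers Unicode whitespace, so the non-breaking space (U+00A0) is removed
// along with the space, the tab and the line ending.
string spaced = "key 1\t:\u00A0value\r\n";
Console.WriteLine(Regex.Replace(spaced, @"\s", "")); // key1:value

// A zero-width space (U+200B) is a format character, not whitespace,
// so \s leaves it in place and the length stays at 5.
string hidden = "key\u200B1";
Console.WriteLine(Regex.Replace(hidden, @"\s", "").Length); // 5
```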

One of the "white" spaces that regularly bites us is the non-breaking space. Our system must also be compatible with MS-Dynamics, which is much more restrictive. First, I created a function that maps 8-bit characters to their approximate 7-bit counterparts, then I removed anything that was not in the x20 to x7F range, further limited by what the Dynamics interface accepts.

Regex.Replace(s, @"[^\x20-\x7F]", "")

should do that job.
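
The mapping function itself is not shown in this answer, so the following is only a guess at what such a step could look like, not the poster's code. ToPlainAscii is a made-up name. The sketch turns the non-breaking space into a normal space, uses Unicode decomposition to drop accents, and then strips what is left outside printable ASCII (it stops at \x7E rather than \x7F, so DEL goes too). It is written as C# 9 top-level statements.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not the original poster's mapping function.
static string ToPlainAscii(string s)
{
    // Treat the non-breaking space as an ordinary space.
    s = s.Replace('\u00A0', ' ');
    // Decompose accented letters ("é" becomes "e" plus a combining accent) ...
    string decomposed = s.Normalize(NormalizationForm.FormD);
    // ... then drop the combining marks so only the base letters remain.
    var kept = decomposed.Where(c =>
        CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
    // Finally remove whatever is still outside printable ASCII.
    return Regex.Replace(new string(kept.ToArray()), @"[^\x20-\x7E]", "");
}

Console.WriteLine(ToPlainAscii("caf\u00E9\u00A0au lait\u0007")); // cafe au lait
```

Characters with no decomposition (for example "ß" or "ø") are simply removed by the last step, so a real mapping table would still be needed for those.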

var split = textLine.Split(':').Select(s => s.Trim()).ToArray();

The Trim() function removes leading and trailing whitespace from each part. Note that this retains whitespace inside a key or value, which you may want to handle separately.

You can use string.Trim() to remove white-space characters:

var results = lines
        .Select(line => {
            // Split on the first ':' only; a line without ':' makes pair[1] throw.
            var pair = line.Split(new[] {':'}, 2);
            return new {
                Key = pair[0].Trim(),
                Value = pair[1].Trim(),
            };
        }).ToList();

However, if you want to remove all whitespace, including whitespace inside a key or value, you can use regular expressions:

var whiteSpaceRegex = new Regex(@"\s+", RegexOptions.Compiled);
var results = lines
        .Select(line => {
            var pair = line.Split(new[] {':'}, 2);
            return new {
                Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
                Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
            };
        }).ToList();
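
Putting the pieces together for the Dictionary<string, string> the question asks for, a minimal sketch could look like this. ParsePairs is a made-up name, and the sketch assumes that blank lines are skipped, that a line must contain exactly one ':' with a non-empty key and value, and that a duplicate key is an error. It is written as C# 9 top-level statements; in an older project the method would go in a static class.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that builds the dictionary described in the question.
static Dictionary<string, string> ParsePairs(IEnumerable<string> lines)
{
    var result = new Dictionary<string, string>();
    foreach (string raw in lines)
    {
        // The question wants no whitespace anywhere, so strip all of it.
        string line = Regex.Replace(raw, @"\s+", "");
        if (line.Length == 0) continue; // skip blank lines

        string[] pair = line.Split(':');
        // Exactly one colon, with a non-empty key and a non-empty value.
        if (pair.Length != 2 || pair[0].Length == 0 || pair[1].Length == 0)
            throw new FormatException("Invalid format: " + raw);

        result.Add(pair[0], pair[1]); // throws ArgumentException on a duplicate key
    }
    return result;
}

var settings = ParsePairs(new[] { " key1 : value1 ", "", "\tkey2:val ue2" });
Console.WriteLine(settings["key2"]); // value2
```

For a real file, File.ReadLines(path) from System.IO can supply the lines, and the caller can catch FormatException to show the "Invalid format" message.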

If it doesn't have to be fast, you could use LINQ:

string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());
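
The whitelist string above only lists a few sample characters. For the full "alphanumeric, ':' and line breaks" rule from the question, a sketch with explicit range checks might look like this. KeepAllowed is a made-up name, and limiting letters and digits to ASCII is an assumption of the sketch (char.IsLetterOrDigit would also let non-ASCII letters through).

```csharp
using System;
using System.Linq;

// Hypothetical helper: keeps ASCII letters, ASCII digits, ':' and line breaks,
// and drops every other character.
static string KeepAllowed(string tainted)
{
    return new string(tainted.Where(c =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') ||
        c == ':' || c == '\r' || c == '\n').ToArray());
}

Console.WriteLine(KeepAllowed("key 1\t: val-ue_1!")); // key1:value1
```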
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow