Question

I'm trying to remove strings with unrecognized characters from a string collection. What is the best way to accomplish this?

Solution

To remove strings that contain any characters you don't recognize (e.g. if you only want to accept lowercase letters, then "foo@bar" would be rejected):

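  1. Create a regular expression that defines the set of "recognized" characters, followed by a * quantifier, and that starts with ^ and ends with $. For example, if your "recognized" characters are uppercase A through Z, it would be ^[A-Z]*$ (without the quantifier, ^[A-Z]$ would only match strings that are exactly one character long)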
  2. Reject strings that don't match

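Note: In many regex flavors, including .NET, $ also matches just before a trailing newline, so a string such as "ABC\n" would still be accepted. If you need to handle that, you can tweak the pattern, for example by ending it with \z instead of $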
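
The two steps above might look like this in C#. This is only a sketch: it assumes "recognized" means lowercase ASCII letters, and the names WhitelistFilter and KeepRecognized are made up for illustration. It uses a * quantifier so the pattern covers the whole string, and \A and \z as anchors so a trailing newline is not accepted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WhitelistFilter
{
    // Assumption for illustration: "recognized" means lowercase a-z only.
    // \A and \z anchor to the absolute start and end of the string.
    private static readonly Regex Recognized = new Regex(@"\A[a-z]*\z");

    // Returns only the strings made up entirely of recognized characters.
    public static List<string> KeepRecognized(IEnumerable<string> input)
    {
        return input.Where(s => Recognized.IsMatch(s)).ToList();
    }

    public static void Main()
    {
        var strings = new List<string> { "foo", "foo@bar", "bar\n", "" };
        foreach (var s in KeepRecognized(strings))
        {
            // Brackets make the empty string visible in the output.
            Console.WriteLine("[" + s + "]");
        }
        // Prints [foo] and []; "foo@bar" and "bar\n" are rejected.
    }
}
```

Note that the empty string is kept, because it contains no unrecognized characters; replacing * with + would reject it.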

To remove strings that consist entirely of characters you don't recognize (e.g. if you want to accept lowercase letters, then "foo@bar" would be accepted because it contains at least one lowercase letter):

  1. Create a regular expression that defines the set of "recognized" characters, but with a ^ character inside the square brackets to negate it, followed by a * quantifier, and that starts with ^ and ends with $. For example, if your "recognized" characters are uppercase A through Z, it would be ^[^A-Z]*$
  2. Reject strings that DO match
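
As a rough C# sketch of this second variant, again assuming "recognized" means lowercase ASCII letters (NoRecognizedFilter and DropUnrecognizable are hypothetical names, not part of any library):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NoRecognizedFilter
{
    // Assumption for illustration: "recognized" means lowercase a-z only.
    // The negated class [^a-z] matches any other character, so this pattern
    // matches strings that contain no recognized character at all.
    private static readonly Regex NothingRecognized = new Regex(@"\A[^a-z]*\z");

    // Drops the strings that DO match, i.e. those without a single recognized character.
    public static List<string> DropUnrecognizable(IEnumerable<string> input)
    {
        return input.Where(s => !NothingRecognized.IsMatch(s)).ToList();
    }

    public static void Main()
    {
        var strings = new List<string> { "foo@bar", "@#!", "123", "x" };
        foreach (var s in DropUnrecognizable(strings))
        {
            Console.WriteLine(s);
        }
        // Prints foo@bar and x; "@#!" and "123" contain no lowercase letter.
    }
}
```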

OTHER TIPS

Since an array (assuming string[]) cannot be resized when you remove items, you will need to create a new one anyway. Basic LINQ filtering with ToArray() gives you that new array:

myArray = myArray.Where(s => !ContainsSpecialCharacters(s)).ToArray();
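
The line above does not say what ContainsSpecialCharacters does, so the following is only one possible sketch of it. It assumes "special" means anything other than a letter, a digit, or a space; the class name ArrayFilter and the RemoveSpecial wrapper are likewise made up for the example.

```csharp
using System;
using System.Linq;

public static class ArrayFilter
{
    // Hypothetical helper: the original line does not define it.
    // Assumption: "special" = anything that is not a letter, digit, or space.
    public static bool ContainsSpecialCharacters(string s)
    {
        return s.Any(c => !char.IsLetterOrDigit(c) && c != ' ');
    }

    // Builds a new array, since the existing one cannot shrink in place.
    public static string[] RemoveSpecial(string[] input)
    {
        return input.Where(s => !ContainsSpecialCharacters(s)).ToArray();
    }

    public static void Main()
    {
        string[] myArray = { "one", "two`", "thr^ee", "four five" };
        myArray = RemoveSpecial(myArray);
        Console.WriteLine(string.Join(", ", myArray));
        // Prints: one, four five
    }
}
```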

I would look at LINQ's Where method, along with a regular expression that matches the characters you want to exclude. For example, assuming regex is a Regex that matches any such character:

return myStringCollection.Where(s => !regex.IsMatch(s));

This does what you seem to want:

List<string> strings = new List<string>()
{
    "one",
    "two`",
    "thr^ee",
    "four"
};

List<char> invalid_chars = new List<char>()
{
    '`', '-', '^'
};

strings.RemoveAll(s => s.Any(c => invalid_chars.Contains(c)));
strings.ForEach(s => Console.WriteLine(s));

It generates the following output:

one
four

This question has some similar answers to what I think you are looking for. However, I think you want to include all letters, numbers, whitespace and punctuation, but exclude everything else. Is that accurate? If so, this should do it for you:

char[] arr = str.ToCharArray();

arr = Array.FindAll<char>(arr, (c => (char.IsLetterOrDigit(c) || 
                      char.IsWhiteSpace(c) || char.IsPunctuation(c))));
str = new string(arr);
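
The snippet above strips the unwanted characters out of a single string. If the goal is instead to drop whole strings from a collection, the same character test could be combined with LINQ's All. This is a sketch under that assumption; CategoryFilter, IsAllowed and KeepAllowed are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CategoryFilter
{
    // Same test as in the snippet above: letters, digits, whitespace, punctuation.
    private static bool IsAllowed(char c)
    {
        return char.IsLetterOrDigit(c) || char.IsWhiteSpace(c) || char.IsPunctuation(c);
    }

    // Keeps a string only if every one of its characters passes the test.
    public static List<string> KeepAllowed(IEnumerable<string> input)
    {
        return input.Where(s => s.All(IsAllowed)).ToList();
    }

    public static void Main()
    {
        var strings = new List<string> { "Hello, world!", "a^b", "cost $5", "two words" };
        foreach (var s in KeepAllowed(strings))
        {
            Console.WriteLine(s);
        }
        // Prints "Hello, world!" and "two words".
        // '^' and '$' are Unicode symbols, not punctuation, so those strings are dropped.
    }
}
```

Be aware that char.IsPunctuation follows the Unicode punctuation categories, so symbol characters such as ^, $ and + are not covered by it; add char.IsSymbol to the test if those should be allowed too.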
Licensed under: CC-BY-SA with attribution