Question

I am trying to build a phishing scanner for a class project, and I am stuck trying to get an e-mail saved in a text file to copy properly into an array for later processing. What I want is for each word to be in its own array index.

Here is my sample e-mail:

Subject: Insufficient Funds Notice
Date: September 25, 2013

Insufficient Funds Notice
Unfortunately, on 09/25/2013 your available balance in your Wells Fargo account XXXXXX4653 was insufficient to cover one or more of your checks, Debit Card purchases, or other transactions. 
An important notice regarding one or more of your payments is now available in your Messages & Alerts inbox. 
To read the message, click here, and first confirm your identity. 
Please make deposits to cover your payments, fees, and any other withdrawals or transactions you have initiated. If you have already taken care of this, please disregard this notice. 
We appreciate your business and thank you for your prompt attention to this matter. 
If you have questions after reading the notice in your inbox, please refer to the contact information in the notice. Please do not reply to this automated email. 
Sincerely, 
Wells Fargo Online Customer Service 
wellsfargo.com | Fraud Information Center
4f57e44c-5d00-4673-8eae-9123909604b6

I don't want any of the punctuation; all I need is the words and numbers.

Here is the code I have written for it so far.

    StreamReader sr1 = new StreamReader(lblDisplaySelectedFilePath.Text);
    string line = sr1.ReadToEnd();
    words = line.Split(' ');
    int wordslowercount = 0;
    foreach (string word in words)
    {
        words[wordslowercount] = word.ToLower();
        wordslowercount = wordslowercount + 1;   
    }

The issue with the above code is that I keep getting array entries where words are strung together and/or have "\r" or "\n" attached. Here is an example of an entry I don't want:

"notice\r\ndate:". I don't want the \r, the \n, or the :. Also, the two words should be in different indexes.


Solution

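The regex \W matches any non-word character, meaning anything other than a letter, digit, or underscore. Splitting on it separates the words and discards the white-space and punctuation between them. Splitting on \W+ treats a run of delimiters as one separator, and the Where clause filters out any empty strings left at the ends.

    // Requires: using System.Linq; using System.Text.RegularExpressions;
    string[] words = Regex.Split(inputString, @"\W+").Where(x => !string.IsNullOrWhiteSpace(x)).ToArray();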
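As a fuller sketch of the same idea, the class below lower-cases the text and then splits it on runs of non-word characters. The names WordSplitter and SplitWords are made up for illustration and are not part of the original answer. Keep in mind that \W also splits on ".", "/" and "-", so "09/25/2013" becomes three entries and "wellsfargo.com" becomes two.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Hypothetical helper: returns the lower-cased words and numbers in text.
    public static string[] SplitWords(string text)
    {
        // \W+ matches one or more non-word characters (white-space, punctuation).
        return Regex.Split(text.ToLowerInvariant(), @"\W+")
                    .Where(w => w.Length > 0)   // drop empty entries at either end
                    .ToArray();
    }
}
```

It could be called with the file contents from the question, for example `string[] words = WordSplitter.SplitWords(sr1.ReadToEnd());`.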

OTHER TIPS

    using System;
    using System.Text.RegularExpressions;

    public class Example
    {
        static string CleanInput(string strIn)
        {
            // Replace invalid characters with empty strings.
            try {
               return Regex.Replace(strIn, @"[^\w\.@-]", "",
                                    RegexOptions.None, TimeSpan.FromSeconds(1.5));
            }
            // If we timeout when replacing invalid characters,
            // we should return Empty.
            catch (RegexMatchTimeoutException) {
               return String.Empty;
            }
        }
    }
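Note that CleanInput as written also deletes white-space, so "notice\r\ndate:" comes back as "noticedate" and the words still run together. One possible adaptation, sketched here under that observation, replaces each unwanted character with a space and then splits on spaces. The names CleanAndSplit and ToWords are hypothetical. Because the pattern keeps ".", "@" and "-", a token such as "wellsfargo.com" survives intact, but a word at the end of a sentence keeps its trailing period.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CleanAndSplit
{
    // Hypothetical variant of CleanInput: unwanted characters become spaces
    // instead of being deleted, so neighbouring words stay separated.
    public static string[] ToWords(string strIn)
    {
        string cleaned;
        try
        {
            cleaned = Regex.Replace(strIn, @"[^\w.@-]", " ",
                                    RegexOptions.None, TimeSpan.FromSeconds(1.5));
        }
        catch (RegexMatchTimeoutException)
        {
            // Same fallback idea as CleanInput: give up and return nothing.
            return new string[0];
        }

        // Split on the spaces just introduced, skipping the empty entries
        // that consecutive spaces would otherwise produce.
        return cleaned.ToLowerInvariant()
                      .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```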

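Using line.Split(null) will split on all white-space characters, including \r and \n, rather than only on spaces. It does not remove punctuation, and each \r\n pair leaves an empty entry between the two characters unless StringSplitOptions.RemoveEmptyEntries is also passed. From the C# String.Split method documentation: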

If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the Char.IsWhiteSpace method.
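A minimal sketch of this approach, assuming that punctuation only needs to be stripped from the start and end of each token: split on white-space with RemoveEmptyEntries, then trim non-alphanumeric characters from both ends. The names WhitespaceSplitter, SplitOnWhitespace and TrimPunctuation are invented for the example. The null separator is cast to char[] so the compiler has no doubt about which Split overload is meant.

```csharp
using System;
using System.Linq;

public static class WhitespaceSplitter
{
    // Hypothetical helper: white-space split followed by punctuation trimming.
    public static string[] SplitOnWhitespace(string text)
    {
        return text.ToLowerInvariant()
                   // A null separator means "split on any white-space character".
                   .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                   .Select(TrimPunctuation)
                   .Where(t => t.Length > 0)   // drop tokens that were only punctuation
                   .ToArray();
    }

    // Removes leading and trailing characters that are not letters or digits.
    private static string TrimPunctuation(string token)
    {
        int start = 0, end = token.Length;
        while (start < end && !char.IsLetterOrDigit(token[start])) start++;
        while (end > start && !char.IsLetterOrDigit(token[end - 1])) end--;
        return token.Substring(start, end - start);
    }
}
```

Unlike splitting on \W, this keeps interior punctuation, so "09/25/2013" and "wellsfargo.com" each stay as a single entry.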

Licensed under: CC-BY-SA with attribution