Question

I have written a C# (WinForms) application named address_parser.exe, targeted at PCs running Windows XP, Vista, 7 and 8, with .NET Framework 3.5 as the minimum requirement.

The application reads in and parses text files (plain text files only, as I have no control over the input files so XML is not an option, unfortunately).

These text files contain a set of data, let's say an address, split over multiple, non-consecutive lines.

Please have a look at the following two text files as a demo:

address_type_1.txt:

Elm Grove
47

PO5 1JF


Southsea

and

address_type_2.txt:

Southsea

Albert Road



147b


PO4 0JW

Currently I have hard coded in my application where in the input file the street, the house number, the zip code and the city are located. So for each address file type I have created a set of rules stating which line contains which information.

In addition, I have a set of regular expressions that check the validity of each piece of information (street, house number, zip code, city).

Since these two sets of rules/checks (which line contains which information, and the regex pattern for each piece of information) vary for each address type, I would like to store them in some sort of config file. So instead of hard coding this, I would like to have a configuration file for each address type that my application can read to configure itself to parse that particular address file type.

I would like to get some ideas and inspiration from you. Please share your thoughts and best practices!

Thanks!

Below are some thoughts of mine, and code snippets I am using so far...

My currently hard coded address file parsing runs like this:

// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
public static Address Parse(string fileName)
{
    var a = new Address();
    a.OriginalFile = fileName;
    int i = 0; // zero-based index of the line currently being read
    using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.None))
    {
        using (var reader = new StreamReader(fs, Encoding.GetEncoding(65001))) // code page 65001 = UTF-8
        {
            // One validation pattern per field.
            Regex rgxStreet = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$");
            Regex rgxNumber = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,20}$");
            Regex rgxCity = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$");
            Regex rgxZIP = new Regex(@"^([0-9]){5}$");
            string rawLine;
            // ReadLine returns null at the end of the stream, so check before trimming.
            while ((rawLine = reader.ReadLine()) != null)
            {
                var line = rawLine.TrimEnd(';').Trim();
                // Hard-coded layout: which line index holds which field.
                if (i == 4 && rgxStreet.IsMatch(line))
                {
                    a.Street = line;
                }
                else if (i == 7 && rgxNumber.IsMatch(line))
                {
                    a.Number = line;
                }
                else if (i == 12 && (rgxZIP.IsMatch(line) || String.IsNullOrEmpty(line)))
                {
                    a.Zip = line;
                }
                else if (i == 15 && rgxCity.IsMatch(line))
                {
                    a.City = line;
                }
                i++;
            }
        }
    }
    return a;
}

As you can see, I am also using individual regular expressions on those 4 attributes to check if the stuff that I am reading is valid.

Now, I would like to modify this hard coded information (line X contains field Y with regular expression Z) so that I can support reading and parsing files where the same information is stored in a different order, or with different valid values.

The example above targets a file containing an address in Germany (ZIP code is 5 digits).

Parsing another type of text file, which contains an address in the UK, may look like this:

line 1: city;
line 2: zip;
line 20: street;
line 159: number;

In this example, the order of the information has changed, as has the regex needed for the zip code (UK postal codes contain both letters and digits, and I am assuming a length of 6 characters).

Instead of hard coding how to parse this type of file, I would like something like a config file that tells my application how to parse a specific type of file. Something like this:

#config file for UK address files:
#line;field;regex;
1;city;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$";
2;zip;@"^([A-Za-z0-9]){6}$";
20;street;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$";
159;number;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,20}$";

My question is: is this a good idea, or are there better ways to achieve this (to tell my application how a specific file needs to be read and parsed and its contents interpreted and validated)?

Thank you!

Solution

Yes, it is a good idea. You can use Newtonsoft.Json to help you load the config, like this:

private class StartSettings
{
    public string CityReg;
    public int CityNum;
    public string ZipReg;
    public int ZipNum;
    public string StreetReg;
    public int StreetNum;
    public string NumberReg;
    public int NumberNum;
}

var configString = File.ReadAllText(configFilePath);
var config = JsonConvert.DeserializeObject<StartSettings>(configString);
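
For illustration, a config file matching the StartSettings class above might look like the sketch below. The property names come from that class; the line numbers and patterns are copied from the question's UK example, and which counter they are compared against (zero-based or one-based) depends on the parsing loop. Note that every backslash in a regex has to be doubled inside a JSON string, and the C# @"..." wrapper is not needed.

```json
{
  "CityReg": "^([\\w\\.,:\\/\\\\\\-öäüÖÄÜß_\\s\\(\\)\\[\\]-[=;]]){0,128}$",
  "CityNum": 1,
  "ZipReg": "^([A-Za-z0-9]){6}$",
  "ZipNum": 2,
  "StreetReg": "^([\\w\\.,:\\/\\\\\\-öäüÖÄÜß_\\s\\(\\)\\[\\]-[=;]]){0,128}$",
  "StreetNum": 20,
  "NumberReg": "^([\\w\\.,:\\/\\\\\\-öäüÖÄÜß_\\s\\(\\)\\[\\]-[=;]]){0,20}$",
  "NumberNum": 159
}
```

One such file per address type would then take the place of the hard-coded line numbers and patterns.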

Then, to read the files, use:

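// Requires: using System.IO; using System.Linq; using System.Text; using System.Text.RegularExpressions;
Regex rgxStreet = new Regex(config.StreetReg);
Regex rgxNumber = new Regex(config.NumberReg);
Regex rgxCity = new Regex(config.CityReg);
Regex rgxZIP = new Regex(config.ZipReg);

var a = new Address();
int i = 0; // index of the current line, counted as in the question's code

// File.ReadAllLines is used here because File.ReadLines requires .NET 4.0 or later.
foreach (var line in File.ReadAllLines(fileName, Encoding.GetEncoding(65001))
                         .Select(l => l.TrimEnd(';').Trim()))
{
    if (config.CityNum == i && rgxCity.IsMatch(line))
        a.City = line;
    ...
    i++;
}
return a;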
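
As a possible variation on the approach above, the fixed set of fields could be replaced by a list of rules, so that adding a field to a config does not require a new property. The following is only a minimal sketch under that assumption: FieldRule and RuleParser are hypothetical names that do not appear in the question or in this answer, and line indices are assumed to be zero-based.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical rule type: one entry per field to extract.
public class FieldRule
{
    public int Line;       // zero-based index of the line that holds the field
    public string Field;   // field name, e.g. "city" or "zip"
    public string Pattern; // regular expression the trimmed line must match
}

public static class RuleParser
{
    // Applies each rule to its line. Fields whose line is missing or
    // fails validation are simply left out of the result.
    public static Dictionary<string, string> Parse(string[] lines, IEnumerable<FieldRule> rules)
    {
        var result = new Dictionary<string, string>();
        foreach (FieldRule rule in rules)
        {
            if (rule.Line < 0 || rule.Line >= lines.Length)
                continue; // the file is shorter than this rule expects

            // Same clean-up as in the question: drop trailing ';' and whitespace.
            string value = lines[rule.Line].TrimEnd(';').Trim();
            if (Regex.IsMatch(value, rule.Pattern))
                result[rule.Field] = value;
        }
        return result;
    }
}
```

The lines array can come from File.ReadAllLines, and the rule list can be loaded with JsonConvert.DeserializeObject<List<FieldRule>>(configString) from a JSON array of objects with Line, Field and Pattern properties. Returning a dictionary keeps the sketch independent of the Address class; mapping the entries onto Address properties is left to the caller.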

Other tips

Since I doubt it is possible to determine from a value alone whether it is a street or a city name, you need to specify at least some information about the format in which the input data is laid out.

If you are still able to choose the data format, go for XML.

Use XML and XmlSerializer like so:

[Serializable]
public class AdressData
{
    [XmlArrayItem("Adress")]
    public Adress[] Adresses;
}

[Serializable]
public class Adress
{
    public string Street {get; set;}
    public int Number {get; set;}
    public int Zip{get; set;}
    public string City{get; set;}
    public string State{get; set;}
}

Then use it like this:

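XmlSerializer serializer = new XmlSerializer(typeof(AdressData));
AdressData data;
// File.OpenRead is used because File.Open also requires a FileMode argument.
using (FileStream stream = File.OpenRead(fileName))
{
    data = (AdressData)serializer.Deserialize(stream);
}

foreach (Adress adress in data.Adresses)
{
    checkIfItExists(adress);
}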

Your XML should look like this:

<AdressData>
  <Adresses>
    <Adress>
      <Street>WhateverStr</Street>
      <Number>7</Number>
      <Zip>5675765</Zip>
      <City>Citytown</City>
      <State>Alabama</State>
    </Adress>
    <Adress>
      <!-- Order doesn't matter here -->
      <Number>7</Number>
      <Zip>5675765</Zip>
      <City>Citytown</City>
      <State>Alabama</State>
      <Street>WhateverStr</Street>
    </Adress>
  </Adresses>
</AdressData>

The order of the data in the XML doesn't matter, as long as it fits in the hierarchy. The serializer does some validation, e.g. it tries to parse numeric values. All you need to do is check whether the information itself is valid.

It is capable of parsing enums as well, so you could (though I wouldn't recommend it) create an enum containing all US state names...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow