Question

Assuming a postcode has the form A0A 0AA or A0 0AA, where A is any letter and 0 is any digit, I have written the following sed script to search a web page for a postcode.

s/\(([[:alnum:]]\{2,4\})\) \(([[:alnum:]]\{3\})\)/\1 \2/p

The aim is to store the first part (A0A) in the first group and the second part (0AA) in the second group, then print whatever is found. However, running this currently finds no postcodes.

Any ideas? Thanks.


Solution 2

It's hard to find something right with your regex.

  1. What are the inner, unescaped parentheses there for? Because they are unescaped, they are literally matched. They serve no purpose, in any case.
  2. Why are you trying to match two [:alnum:] blocks when your actual pattern requires [:alpha:] in some places and [:digit:] in others?
  3. Why {2,4}? You want two or three, not two, three or four. What you actually want is either letter-number-letter or letter-number.
  4. Because you don't specify word boundaries, even if you fix your regex, the first pattern will match A0 at the end of a word and the second pattern will match 0AA at the beginning of the word.
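Point 4 is easy to see in practice. The sketch below assumes GNU sed (for the \b extension) and uses a made-up input string in which the postcode-like characters sit inside longer words. It runs the same pattern once without word boundaries and once with them:

```shell
# Hypothetical input: "A0 0AA" is embedded inside longer words.
input='fooA0 0AAbar'

# No word boundaries: the pattern matches inside the surrounding words.
unbounded=$(printf '%s\n' "$input" |
  sed -n 's/\([[:alpha:]][[:digit:]]\) \([[:digit:]][[:alpha:]]\{2\}\)/[\1 \2]/p')

# With \b (GNU extension) on both ends nothing matches, so nothing is printed.
bounded=$(printf '%s\n' "$input" |
  sed -n 's/\b\([[:alpha:]][[:digit:]]\) \([[:digit:]][[:alpha:]]\{2\}\)\b/[\1 \2]/p')

printf 'unbounded: %s\n' "$unbounded"   # unbounded: foo[A0 0AA]bar
printf 'bounded:   %s\n' "$bounded"     # bounded:   (empty)
```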

You need to, at minimum:

  1. Drop the inner parentheses
  2. Change the {2,4} to {2,3}
  3. Add word boundary matches at the beginning and end of the regex

However, this will still not properly satisfy your requirements. It will match invalid patterns. What you really need to do is:

  1. Drop the inner parentheses
  2. Change the first pattern to match either [[:alpha:]][[:digit:]] or [[:alpha:]][[:digit:]][[:alpha:]] (there are two ways to do this, for example an alternation or an optional trailing letter).
  3. Change the second pattern to match [[:digit:]][[:alpha:]][[:alpha:]]
  4. Add word boundary matches at the beginning and end of the regex.

I didn't give a concrete example of how to do this because you asked for "any ideas". I'm assuming you want to try and fix this yourself given the right pointers.
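For readers who would rather compare their attempt against a worked version, the second list of steps could be sketched roughly as follows. This is only one possible reading, assuming GNU sed (the \b word boundary is a GNU extension); the function name extract_simple_postcode is made up for illustration.

```shell
# Sketch for the simplified format A0A 0AA or A0 0AA (GNU sed assumed for \b).
# extract_simple_postcode is a hypothetical helper name, not a standard tool.
extract_simple_postcode() {
  # Group 1: letter, digit, optional letter. Group 2: digit, letter, letter.
  sed -n 's/.*\b\([[:alpha:]][[:digit:]][[:alpha:]]\{0,1\}\) \([[:digit:]][[:alpha:]]\{2\}\)\b.*/\1 \2/p'
}

printf '%s\n' 'here is a postcode: A0A 0AA. some more text' | extract_simple_postcode
```

The optional letter is written as \{0,1\} to stay within basic regular expression syntax; an alternation would be the other way to express the same choice.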

Other suggestions

I realise you're asking about a subset of valid postcodes, but I hope this solution for UK postcodes will help. I'd approach the problem like this:

UK postcodes come in the following formats:

  • A9 9AA
  • A99 9AA
  • AA9 9AA
  • AA99 9AA
  • A9A 9AA
  • AA9A 9AA

A regex for the last part is easy: [0-9][A-Z]{2}

The first part is trickier. I'd split the problem into two:

  • The first four patterns above can be matched using [A-Z]{1,2}[0-9]{1,2}, i.e. one or two letters followed by one or two digits;
  • The last two patterns can be matched using [A-Z]{1,2}[0-9][A-Z], i.e. one or two letters, then a digit and a letter.

Putting it all together:

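sed -rn 's/.*\b(([A-Z]{1,2}[0-9]{1,2}|[A-Z]{1,2}[0-9][A-Z]) [0-9][A-Z]{2})\b.*/\1/p'

The \b word boundaries (a GNU sed extension) matter here: without the leading one, the greedy .* would swallow the first letter of a two-letter area, so SW1A 1AA would be reported as W1A 1AA.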
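As a quick check, the command can be wrapped in a small function and run over a few sample lines. This is a sketch assuming GNU sed; the function name extract_uk_postcode and the sample lines are made up for illustration.

```shell
# Assumes GNU sed: -r enables extended regexes and \b is a GNU extension.
# The leading \b anchors the match at the start of the outward code.
extract_uk_postcode() {
  sed -rn 's/.*\b(([A-Z]{1,2}[0-9]{1,2}|[A-Z]{1,2}[0-9][A-Z]) [0-9][A-Z]{2})\b.*/\1/p'
}

# Made-up sample lines: three with a postcode, one without.
printf '%s\n' \
  'Office: M1 1AE, Manchester' \
  'Send to SW1A 1AA today' \
  'no postcode on this line' \
  'Depot at EC1A 1BB' |
  extract_uk_postcode
```

Lines without a match are dropped because of -n, and only the last postcode on each line is reported, since the leading .* is greedy.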

It looks like you have some problems with your brackets. The following works for me:

$ sed -n 's/.*\b\([[:alnum:]]\{2,3\}\) \([[:alnum:]]\{3\}\)\b.*/\1 \2/p' <<< "here is a postcode: A0A 0AA. some more text"
A0A 0AA
Licensed under: CC-BY-SA with attribution