RegEx para los códigos postales del Reino Unido coincidentes

https://stackoverflow.com/questions/164979

03-07-2019
|

Pregunta

Estoy buscando una expresión regular que valide un código postal completo del Reino Unido solo dentro de una cadena de entrada. Todos los formularios de códigos postales que no sean comunes deben estar cubiertos, así como los habituales. Por ejemplo:

Matches

CW3 9SS
SE5 0EG
SE50EG
se5 0eg
WC2H 7LT

Sin coincidencia

aWC2H 7LT
WC2H 7LTa
WC2H

¿Cómo resuelvo este problema?

Solución

Recomiendo echar un vistazo a la Norma de Datos del Gobierno del Reino Unido para códigos postales [enlace ahora muerto; archivo de XML , consulte Wikipedia para discusión]. Hay una breve descripción sobre los datos y el esquema xml adjunto proporciona una expresión regular. Puede que no sea exactamente lo que quieres pero sería un buen punto de partida. El RegEx se diferencia ligeramente del XML, ya que la definición dada permite un carácter P en la tercera posición en el formato A9A 9AA.

El RegEx suministrado por el Gobierno del Reino Unido era:

([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})

Como se señaló en la discusión de Wikipedia, esto permitirá algunos códigos postales no reales (por ejemplo, los que comienzan con AA, ZY) y proporcionan una prueba más rigurosa que podrías probar.

Otros consejos

Parece que vamos a utilizar ^ (GIR? 0AA | [A-PR-UWYZ] ([0-9] {1,2} | ([A-HK-Y] [ 0-9] ([0-9ABEHMNPRV-Y])?) | [0-9] [A-HJKPS-UW])? [0-9] [ABD-HJLNP-UW-Z] {2}) $ , que es una versión ligeramente modificada de la sugerida por Minglis arriba.

Sin embargo, tendremos que investigar exactamente cuáles son las reglas, ya que las diversas soluciones enumeradas anteriormente parecen aplicar reglas diferentes en cuanto a qué letras están permitidas.

Después de algunas investigaciones, hemos encontrado más información. Aparentemente, una página en 'govtalk.gov.uk' apunta a una especificación de código postal govtalk-postcodes . Esto apunta a un esquema XML en Esquema XML que proporciona una declaración 'pseudo regex' de las reglas del código postal.

Hemos tomado eso y lo hemos trabajado un poco para darnos la siguiente expresión:

^((GIR &0AA)|((([A-PR-UWYZ][A-HK-Y]?[0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y]))) &[0-9][ABD-HJLNP-UW-Z]{2}))$

Esto hace que los espacios sean opcionales, pero te limita a un espacio (reemplaza '& amp;' con '{0,} para espacios ilimitados). Se supone que todo el texto debe estar en mayúsculas.

Si desea permitir minúsculas, con cualquier número de espacios, use:

^(([gG][iI][rR] {0,}0[aA]{2})|((([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y]?[0-9][0-9]?)|(([a-pr-uwyzA-PR-UWYZ][0-9][a-hjkstuwA-HJKSTUW])|([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y][0-9][abehmnprv-yABEHMNPRV-Y]))) {0,}[0-9][abd-hjlnp-uw-zABD-HJLNP-UW-Z]{2}))$

Esto no cubre los territorios de ultramar y solo aplica el formato, NO la existencia de diferentes áreas. Se basa en las siguientes reglas:

Puede aceptar los siguientes formatos:

"GIR 0AA"
A9 9ZZ
A99 9ZZ
AB9 9ZZ
AB99 9ZZ
A9C 9ZZ
AD9E 9ZZ

Donde:

9 puede ser cualquier número de un solo dígito.
A puede ser cualquier letra excepto Q, V o X.
B puede ser cualquier letra excepto I, J o Z.
C puede ser cualquier letra excepto I, L, M, N, O, P, Q, R, V, X, Y o Z.
D puede ser cualquier letra excepto I, J o Z.
E puede ser cualquiera de A, B, E, H, M, N, P, R, V, W, X o Y.
Z puede ser cualquier letra, excepto C, I, K, M, O o V.

Mis mejores deseos

Colin

No existe una expresión regular completa del código postal del Reino Unido que sea capaz de validar un código postal. Puede verificar que un código postal esté en el formato correcto usando una expresión regular; No es que realmente exista.

Los códigos postales son arbitrariamente complejos y cambian constantemente. Por ejemplo, el código de salida W1 no tiene, y puede que nunca, tenga todos los números entre 1 y 99, para cada área de código postal.

No puedes esperar que lo que hay actualmente sea verdad para siempre. Como ejemplo, en 1990, la oficina de correos decidió que Aberdeen se estaba llenando de gente. Agregaron un 0 al final de AB1-5, lo que lo convierte en AB10-50 y luego crearon una serie de códigos postales entre estos.

Cada vez que se crea una nueva calle, se crea un nuevo código postal. Es parte del proceso para obtener permiso para construir; las autoridades locales están obligadas a mantener esto actualizado con la Oficina de Correos (no es que todas lo hagan).

Además, como indican otros usuarios, están los códigos postales especiales como Girobank, GIR 0AA y el de las cartas a Santa, SAN TA1. Probablemente no quiera publicar nada allí, pero no lo hace. No parece estar cubierto por ninguna otra respuesta.

Luego, están los códigos postales BFPO, que ahora son cambiando a Un formato más estándar . Ambos formatos van a ser válidos. Por último, están los territorios de ultramar ^{fuente Wikipedia}.

+----------+----------------------------------------------+
| Postcode |                   Location                   |
+----------+----------------------------------------------+
| AI-2640  | Anguilla                                     |
| ASCN 1ZZ | Ascension Island                             |
| STHL 1ZZ | Saint Helena                                 |
| TDCU 1ZZ | Tristan da Cunha                             |
| BBND 1ZZ | British Indian Ocean Territory               |
| BIQQ 1ZZ | British Antarctic Territory                  |
| FIQQ 1ZZ | Falkland Islands                             |
| GX11 1AA | Gibraltar                                    |
| PCRN 1ZZ | Pitcairn Islands                             |
| SIQQ 1ZZ | South Georgia and the South Sandwich Islands |
| TKCA 1ZZ | Turks and Caicos Islands                     |
+----------+----------------------------------------------+

A continuación, debe tener en cuenta que el Reino Unido " " " Su sistema de código postal a muchos lugares del mundo. Cualquier cosa que valide un " Reino Unido " El código postal también validará los códigos postales de otros países.

Si desea validar un código postal del Reino Unido, la forma más segura de hacerlo es utilizar una búsqueda de códigos postales actuales. Hay una serie de opciones:

La encuesta de artillería publica Code-Point Open bajo una licencia de datos abiertos. Estará un poco atrasado, pero es gratis. Esto (probablemente, no puedo recordar) no incluirá datos de Irlanda del Norte, ya que la Encuesta de Artillería no tiene competencias allí. El mapeo en Irlanda del Norte está a cargo de Ordnance Survey of Northern Ireland y tienen su Pointer product. Podría usar esto y agregar los pocos que no se cubren con bastante facilidad.
Royal Mail lanza el Archivo de dirección de código postal (PAF) , esto incluye BFPO, que soy No estoy seguro de Code-Point Open. Se actualiza regularmente, pero cuesta dinero (y pueden ser francamente malos al respecto). PAF incluye la dirección completa en lugar de solo códigos postales y viene con su propio Programmers Guía . El Grupo de Usuarios de Datos Abiertos (ODUG) está actualmente presionando para que PAF sea liberado de forma gratuita, aquí hay una descripción de su posición .
Por último, hay AddressBase . Esta es una colaboración entre Ordnance Survey, Local Authorities, Royal Mail y una compañía que combina para crear un directorio definitivo de toda la información sobre todas las direcciones del Reino Unido (también han tenido bastante éxito). Se paga, pero si está trabajando con una Autoridad local, un departamento gubernamental o un servicio gubernamental, es gratuito para que lo usen. Hay mucha más información que los códigos postales incluidos.

I recently posted an answer to this question on UK postcodes for the R language. I discovered that the UK Government's regex pattern is incorrect and fails to properly validate some postcodes. Unfortunately, many of the answers here are based on this incorrect pattern.

I'll outline some of these issues below and provide a revised regular expression that actually works.

Note

My answer (and regular expressions in general):

Only validates postcode formats.
Does not ensure that a postcode legitimately exists.
- For this, use an appropriate API! See Ben's answer for more info.

_{If you don't care about the bad regex and just want to skip to the answer, scroll down to the Answer section.}

The Bad Regex

The regular expressions in this section should not be used.

This is the failing regex that the UK government has provided developers (not sure how long this link will be up, but you can see it in their Bulk Data Transfer documentation):

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$

Problems

Problem 1 - Copy/Paste

See regex in use here.

As many developers likely do, they copy/paste code (especially regular expressions) and paste them expecting them to work. While this is great in theory, it fails in this particular case because copy/pasting from this document actually changes one of the characters (a space) into a newline character as shown below:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))
[0-9][A-Za-z]{2})$

The first thing most developers will do is just erase the newline without thinking twice. Now the regex won't match postcodes with spaces in them (other than the GIR 0AA postcode).

To fix this issue, the newline character should be replaced with the space character:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                                                                                     ^

Problem 2 - Boundaries

See regex in use here.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
^^                     ^ ^                                                                                                                                            ^^

The postcode regex improperly anchors the regex. Anyone using this regex to validate postcodes might be surprised if a value like fooA11 1AA gets through. That's because they've anchored the start of the first option and the end of the second option (independently of one another), as pointed out in the regex above.

What this means is that ^ (asserts position at start of the line) only works on the first option ([Gg][Ii][Rr] 0[Aa]{2}), so the second option will validate any strings that end in a postcode (regardless of what comes before).

Similarly, the first option isn't anchored to the end of the line $, so GIR 0AAfoo is also accepted.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$

To fix this issue, both options should be wrapped in another group (or non-capturing group) and the anchors placed around that:

^(([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))$
^^                                                                                                                                                                      ^^

Problem 3 - Improper Character Set

See regex in use here.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                       ^^

The regex is missing a - here to indicate a range of characters. As it stands, if a postcode is in the format ANA NAA (where A represents a letter and N represents a number), and it begins with anything other than A or Z, it will fail.

That means it will match A1A 1AA and Z1A 1AA, but not B1A 1AA.

To fix this issue, the character - should be placed between the A and Z in the respective character set:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                        ^

Problem 4 - Wrong Optional Character Set

See regex in use here.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                                                                        ^

I swear they didn't even test this thing before publicizing it on the web. They made the wrong character set optional. They made [0-9] option in the fourth sub-option of option 2 (group 9). This allows the regex to match incorrectly formatted postcodes like AAA 1AA.

To fix this issue, make the next character class optional instead (and subsequently make the set [0-9] match exactly once):

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) [0-9][A-Za-z]{2})$
                                                                                                                                                ^

Problem 5 - Performance

Performance on this regex is extremely poor. First off, they placed the least likely pattern option to match GIR 0AA at the beginning. How many users will likely have this postcode versus any other postcode; probably never? This means every time the regex is used, it must exhaust this option first before proceeding to the next option. To see how performance is impacted check the number of steps the original regex took (35) against the same regex after having flipped the options (22).

The second issue with performance is due to the way the entire regex is structured. There's no point backtracking over each option if one fails. The way the current regex is structured can greatly be simplified. I provide a fix for this in the Answer section.

Problem 6 - Spaces

See regex in use here

This may not be considered a problem, per se, but it does raise concern for most developers. The spaces in the regex are not optional, which means the users inputting their postcodes must place a space in the postcode. This is an easy fix by simply adding ? after the spaces to render them optional. See the Answer section for a fix.

Answer

1. Fixing the UK Government's Regex

Fixing all the issues outlined in the Problems section and simplifying the pattern yields the following, shorter, more concise pattern. We can also remove most of the groups since we're validating the postcode as a whole (not individual parts):

See regex in use here

^([A-Za-z][A-Ha-hJ-Yj-y]?[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})$

This can further be shortened by removing all of the ranges from one of the cases (upper or lower case) and using a case-insensitive flag. Note: Some languages don't have one, so use the longer one above. Each language implements the case-insensitivity flag differently.

See regex in use here.

^([A-Z][A-HJ-Y]?[0-9][A-Z0-9]? ?[0-9][A-Z]{2}|GIR ?0A{2})$

Shorter again replacing [0-9] with \d (if your regex engine supports it):

See regex in use here.

^([A-Z][A-HJ-Y]?\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$

2. Simplified Patterns

Without ensuring specific alphabetic characters, the following can be used (keep in mind the simplifications from 1. Fixing the UK Government's Regex have also been applied here):

See regex in use here.

^([A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$

And even further if you don't care about the special case GIR 0AA:

^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$

3. Complicated Patterns

I would not suggest over-verification of a postcode as new Areas, Districts and Sub-districts may appear at any point in time. What I will suggest potentially doing, is added support for edge-cases. Some special cases exist and are outlined in this Wikipedia article.

Here are complex regexes that include the subsections of 3. (3.1, 3.2, 3.3).

In relation to the patterns in 1. Fixing the UK Government's Regex:

See regex in use here

^(([A-Z][A-HJ-Y]?\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA) ?\d[A-Z]{2}|BFPO ?\d{1,4}|(KY\d|MSR|VG|AI)[ -]?\d{4}|[A-Z]{2} ?\d{2}|GE ?CX|GIR ?0A{2}|SAN ?TA1)$

And in relation to 2. Simplified Patterns:

See regex in use here

^(([A-Z]{1,2}\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA) ?\d[A-Z]{2}|BFPO ?\d{1,4}|(KY\d|MSR|VG|AI)[ -]?\d{4}|[A-Z]{2} ?\d{2}|GE ?CX|GIR ?0A{2}|SAN ?TA1)$

3.1 British Overseas Territories

The Wikipedia article currently states (some formats slightly simplified):

AI-1111: Anguila
ASCN 1ZZ: Ascension Island
STHL 1ZZ: Saint Helena
TDCU 1ZZ: Tristan da Cunha
BBND 1ZZ: British Indian Ocean Territory
BIQQ 1ZZ: British Antarctic Territory
FIQQ 1ZZ: Falkland Islands
GX11 1ZZ: Gibraltar
PCRN 1ZZ: Pitcairn Islands
SIQQ 1ZZ: South Georgia and the South Sandwich Islands
TKCA 1ZZ: Turks and Caicos Islands
BFPO 11: Akrotiri and Dhekelia
ZZ 11 & GE CX: Bermuda (according to this document)
KY1-1111: Cayman Islands (according to this document)
VG1111: British Virgin Islands (according to this document)
MSR 1111: Montserrat (according to this document)

An all-encompassing regex to match only the British Overseas Territories might look like this:

See regex in use here.

^((ASCN|STHL|TDCU|BBND|[BFS]IQQ|GX\d{2}|PCRN|TKCA) ?\d[A-Z]{2}|(KY\d|MSR|VG|AI)[ -]?\d{4}|(BFPO|[A-Z]{2}) ?\d{2}|GE ?CX)$

3.2 British Forces Post Office

Although they've been recently changed it to better align with the British postcode system to BF# (where # represents a number), they're considered optional alternative postcodes. These postcodes follow(ed) the format of BFPO, followed by 1-4 digits:

See regex in use here

^BFPO ?\d{1,4}$

3.3 Santa?

There's another special case with Santa (as mentioned in other answers): SAN TA1 is a valid postcode. A regex for this is very simply:

^SAN ?TA1$

I had a look into some of the answers above and I'd recommend against using the pattern from @Dan's answer (c. Dec 15 '10), since it incorrectly flags almost 0.4% of valid postcodes as invalid, while the others do not.

Ordnance Survey provide service called Code Point Open which:

contains a list of all the current postcode units in Great Britain

I ran each of the regexs above against the full list of postcodes (Jul 6 '13) from this data using grep:

cat CSV/*.csv |
    # Strip leading quotes
    sed -e 's/^"//g' |
    # Strip trailing quote and everything after it
    sed -e 's/".*//g' |
    # Strip any spaces
    sed -E -e 's/ +//g' |
    # Find any lines that do not match the expression
    grep --invert-match --perl-regexp "$pattern"

There are 1,686,202 postcodes total.

The following are the numbers of valid postcodes that do not match each $pattern:

'^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$'
# => 6016 (0.36%)

'^(GIR ?0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]([0-9ABEHMNPRV-Y])?)|[0-9][A-HJKPS-UW]) ?[0-9][ABD-HJLNP-UW-Z]{2})$'
# => 0

'^GIR[ ]?0AA|((AB|AL|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|BX|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DD|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GY|GU|HA|HD|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|JE|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|ML|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|YO|ZE)(\d[\dA-Z]?[ ]?\d[ABD-HJLN-UW-Z]{2}))|BFPO[ ]?\d{1,4}$'
# => 0

Of course, these results only deal with valid postcodes that are incorrectly flagged as invalid. So:

'^.*$'
# => 0

I'm saying nothing about which pattern is the best regarding filtering out invalid postcodes.

^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$

Regular expression to match valid UK postcodes. In the UK postal system not all letters are used in all positions (the same with vehicle registration plates) and there are various rules to govern this. This regex takes into account those rules. Details of the rules: First half of postcode Valid formats [A-Z][A-Z][0-9][A-Z] [A-Z][A-Z][0-9][0-9] [A-Z][0-9][0-9] [A-Z][A-Z][0-9] [A-Z][A-Z][A-Z] [A-Z][0-9][A-Z] [A-Z][0-9] Exceptions Position - First. Contraint - QVX not used Position - Second. Contraint - IJZ not used except in GIR 0AA Position - Third. Constraint - AEHMNPRTVXY only used Position - Forth. Contraint - ABEHMNPRVWXY Second half of postcode Valid formats [0-9][A-Z][A-Z] Exceptions Position - Second and Third. Contraint - CIKMOV not used

http://regexlib.com/REDetails.aspx?regexp_id=260

Most of the answers here didn't work for all the postcodes I have in my database. I finally found one that validates with all, using the new regex provided by the government:

https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/413338/Bulk_Data_Transfer_-_additional_validation_valid_from_March_2015.pdf

It isn't in any of the previous answers so I post it here in case they take the link down:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$

UPDATE: Updated regex as pointed by Jamie Bull. Not sure if it was my error copying or it was an error in the government's regex, the link is down now...

UPDATE: As ctwheels found, this regex works with the javascript regex flavor. See his comment for one that works with the pcre (php) flavor.

According to this Wikipedia table

enter image description here

This pattern cover all the cases

(?:[A-Za-z]\d ?\d[A-Za-z]{2})|(?:[A-Za-z][A-Za-z\d]\d ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d{2} ?\d[A-Za-z]{2})|(?:[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d[A-Za-z] ?\d[A-Za-z]{2})

When using it on Android\Java use \\d

An old post but still pretty high in google results so thought I'd update. This Oct 14 doc defines the UK postcode regular expression as:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([**AZ**a-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$

from:

https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/359448/4__Bulk_Data_Transfer_-_additional_validation_valid.pdf

The document also explains the logic behind it. However, it has an error (bolded) and also allows lower case, which although legal is not usual, so amended version:

^(GIR 0AA)|((([A-Z][0-9]{1,2})|(([A-Z][A-HJ-Y][0-9]{1,2})|(([A-Z][0-9][A-Z])|([A-Z][A-HJ-Y][0-9]?[A-Z])))) [0-9][A-Z]{2})$

This works with new London postcodes (e.g. W1D 5LH) that previous versions did not.

This is the regex Google serves on their i18napis.appspot.com domain:

GIR[ ]?0AA|((AB|AL|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|BX|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DD|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GY|GU|HA|HD|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|JE|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|ML|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|YO|ZE)(\d[\dA-Z]?[ ]?\d[ABD-HJLN-UW-Z]{2}))|BFPO[ ]?\d{1,4}

Postcodes are subject to change, and the only true way of validating a postcode is to have the complete list of postcodes and see if it's there.

But regular expressions are useful because they:

are easy to use and implement
are short
are quick to run
are quite easy to maintain (compared to a full list of postcodes)
still catch most input errors

But regular expressions tend to be difficult to maintain, especially for someone who didn't come up with it in the first place. So it must be:

as easy to understand as possible
relatively future proof

That means that most of the regular expressions in this answer aren't good enough. E.g. I can see that [A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y] is going to match a postcode area of the form AA1A — but it's going to be a pain in the neck if and when a new postcode area gets added, because it's difficult to understand which postcode areas it matches.

I also want my regular expression to match the first and second half of the postcode as parenthesised matches.

So I've come up with this:

(GIR(?=\s*0AA)|(?:[BEGLMNSW]|[A-Z]{2})[0-9](?:[0-9]|(?<=N1|E1|SE1|SW1|W1|NW1|EC[0-9]|WC[0-9])[A-HJ-NP-Z])?)\s*([0-9][ABD-HJLNP-UW-Z]{2})

In PCRE format it can be written as follows:

/^
  ( GIR(?=\s*0AA) # Match the special postcode "GIR 0AA"
    |
    (?:
      [BEGLMNSW] | # There are 8 single-letter postcode areas
      [A-Z]{2}     # All other postcode areas have two letters
      )
    [0-9] # There is always at least one number after the postcode area
    (?:
      [0-9] # And an optional extra number
      |
      # Only certain postcode areas can have an extra letter after the number
      (?<=N1|E1|SE1|SW1|W1|NW1|EC[0-9]|WC[0-9])
      [A-HJ-NP-Z] # Possible letters here may change, but [IO] will never be used
      )?
    )
  \s*
  ([0-9][ABD-HJLNP-UW-Z]{2}) # The last two letters cannot be [CIKMOV]
$/x

For me this is the right balance between validating as much as possible, while at the same time future-proofing and allowing for easy maintenance.

I've been looking for a UK postcode regex for the last day or so and stumbled on this thread. I worked my way through most of the suggestions above and none of them worked for me so I came up with my own regex which, as far as I know, captures all valid UK postcodes as of Jan '13 (according to the latest literature from the Royal Mail).

The regex and some simple postcode checking PHP code is posted below. NOTE:- It allows for lower or uppercase postcodes and the GIR 0AA anomaly but to deal with the, more than likely, presence of a space in the middle of an entered postcode it also makes use of a simple str_replace to remove the space before testing against the regex. Any discrepancies beyond that and the Royal Mail themselves don't even mention them in their literature (see http://www.royalmail.com/sites/default/files/docs/pdf/programmers_guide_edition_7_v5.pdf and start reading from page 17)!

Note: In the Royal Mail's own literature (link above) there is a slight ambiguity surrounding the 3rd and 4th positions and the exceptions in place if these characters are letters. I contacted Royal Mail directly to clear it up and in their own words "A letter in the 4th position of the Outward Code with the format AANA NAA has no exceptions and the 3rd position exceptions apply only to the last letter of the Outward Code with the format ANA NAA." Straight from the horse's mouth!

<?php

    $postcoderegex = '/^([g][i][r][0][a][a])$|^((([a-pr-uwyz]{1}([0]|[1-9]\d?))|([a-pr-uwyz]{1}[a-hk-y]{1}([0]|[1-9]\d?))|([a-pr-uwyz]{1}[1-9][a-hjkps-uw]{1})|([a-pr-uwyz]{1}[a-hk-y]{1}[1-9][a-z]{1}))(\d[abd-hjlnp-uw-z]{2})?)$/i';

    $postcode2check = str_replace(' ','',$postcode2check);

    if (preg_match($postcoderegex, $postcode2check)) {

        echo "$postcode2check is a valid postcode<br>";

    } else {

        echo "$postcode2check is not a valid postcode<br>";

    }

?>

I hope it helps anyone else who comes across this thread looking for a solution.

Here's a regex based on the format specified in the documents which are linked to marcj's answer:

/^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$/

The only difference between that and the specs is that the last 2 characters cannot be in [CIKMOV] according to the specs.

Edit: Here's another version which does test for the trailing character limitations.

/^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-BD-HJLNP-UW-Z]{2}$/

Some of the regexs above are a little restrictive. Note the genuine postcode: "W1K 7AA" would fail given the rule "Position 3 - AEHMNPRTVXY only used" above as "K" would be disallowed.

the regex:

^(GIR 0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKPS-UW])[0-9][ABD-HJLNP-UW-Z]{2})$

Seems a little more accurate, see the Wikipedia article entitled 'Postcodes in the United Kingdom'.

Note that this regex requires uppercase only characters.

The bigger question is whether you are restricting user input to allow only postcodes that actually exist or whether you are simply trying to stop users entering complete rubbish into the form fields. Correctly matching every possible postcode, and future proofing it, is a harder puzzle, and probably not worth it unless you are HMRC.

Basic rules:

^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$

Postal codes in the U.K. (or postcodes, as they’re called) are composed of five to seven alphanumeric characters separated by a space. The rules covering which characters can appear at particular positions are rather complicated and fraught with exceptions. The regular expression just shown therefore sticks to the basic rules.

Complete rules:

If you need a regex that ticks all the boxes for the postcode rules at the expense of readability, here you go:

^(?:(?:[A-PR-UWYZ][0-9]{1,2}|[A-PR-UWYZ][A-HK-Y][0-9]{1,2}|[A-PR-UWYZ][0-9][A-HJKSTUW]|[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y]) [0-9][ABD-HJLNP-UW-Z]{2}|GIR 0AA)$

Source: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch04s16.html

Tested against our customers database and seems perfectly accurate.

I use the following regex that I have tested against all valid UK postcodes. It is based on the recommended rules, but condensed as much as reasonable and does not make use of any special language specific regex rules.

([A-PR-UWYZ]([A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y])?|[0-9]([0-9]|[A-HJKPSTUW])?) ?[0-9][ABD-HJLNP-UW-Z]{2})

It assumes that the postcode has been converted to uppercase and has not leading or trailing characters, but will accept an optional space between the outcode and incode.

The special "GIR0 0AA" postcode is excluded and will not validate as it's not in the official Post Office list of postcodes and as far as I'm aware will not be used as registered address. Adding it should be trivial as a special case if required.

First half of postcode Valid formats

[A-Z][A-Z][0-9][A-Z]
[A-Z][A-Z][0-9][0-9]
[A-Z][0-9][0-9]
[A-Z][A-Z][0-9]
[A-Z][A-Z][A-Z]
[A-Z][0-9][A-Z]
[A-Z][0-9]

Exceptions
Position 1 - QVX not used
Position 2 - IJZ not used except in GIR 0AA
Position 3 - AEHMNPRTVXY only used
Position 4 - ABEHMNPRVWXY

Second half of postcode

[0-9][A-Z][A-Z]

Exceptions
Position 2+3 - CIKMOV not used

Remember not all possible codes are used, so this list is a necessary but not sufficent condition for a valid code. It might be easier to just match against a list of all valid codes?

here's how we have been dealing with the UK postcode issue:

^([A-Za-z]{1,2}[0-9]{1,2}[A-Za-z]?[ ]?)([0-9]{1}[A-Za-z]{2})$

Explanation:

expect 1 or 2 a-z chars, upper or lower fine
expect 1 or 2 numbers
expect 0 or 1 a-z char, upper or lower fine
optional space allowed
expect 1 number
expect 2 a-z, upper or lower fine

This gets most formats, we then use the db to validate whether the postcode is actually real, this data is driven by openpoint https://www.ordnancesurvey.co.uk/opendatadownload/products.html

hope this helps

To check a postcode is in a valid format as per the Royal Mail's programmer's guide:

          |----------------------------outward code------------------------------| |------inward code-----|
#special↓       α1        α2    AAN  AANA      AANN      AN    ANN    ANA (α3)        N         AA
^(GIR 0AA|[A-PR-UWYZ]([A-HK-Y]([0-9][A-Z]?|[1-9][0-9])|[1-9]([0-9]|[A-HJKPSTUW])?) [0-9][ABD-HJLNP-UW-Z]{2})$

All postcodes on doogal.co.uk match, except for those no longer in use.

Adding a ? after the space and using case-insensitive match to answer this question:

'se50eg'.match(/^(GIR 0AA|[A-PR-UWYZ]([A-HK-Y]([0-9][A-Z]?|[1-9][0-9])|[1-9]([0-9]|[A-HJKPSTUW])?) ?[0-9][ABD-HJLNP-UW-Z]{2})$/ig);
Array [ "se50eg" ]

This one allows empty spaces and tabs from both sides in case you don't want to fail validation and then trim it sever side.

^\s*(([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})\s*$)

I wanted a simple regex, where it's fine to allow too much, but not to deny a valid postcode. I went with this (the input is a stripped/trimmed string):

/^([a-z0-9]\s*){5,7}$/i

Lengths 5 to 7 (not counting whitespace) means we allow the shortest possible postcodes like "L1 8JQ" as well as the longest ones like "OL14 5ET".

EDIT: Changed the 8 to a 7 so we don't allow 8 character postcodes.

To add to this list a more practical regex that I use that allows the user to enter an empty string is:

^$|^(([gG][iI][rR] {0,}0[aA]{2})|((([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y]?[0-9][0-9]?)|(([a-pr-uwyzA-PR-UWYZ][0-9][a-hjkstuwA-HJKSTUW])|([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y][0-9][abehmnprv-yABEHMNPRV-Y]))) {0,1}[0-9][abd-hjlnp-uw-zABD-HJLNP-UW-Z]{2}))$

This regex allows capital and lower case letters with an optional space in between

From a software developers point of view this regex is useful for software where an address may be optional. For example if a user did not want to supply their address details

Have a look at the python code on this page:

http://www.brunningonline.net/simon/blog/archives/001292.html

I've got some postcode parsing to do. The requirement is pretty simple; I have to parse a postcode into an outcode and (optional) incode. The good new is that I don't have to perform any validation - I just have to chop up what I've been provided with in a vaguely intelligent manner. I can't assume much about my import in terms of formatting, i.e. case and embedded spaces. But this isn't the bad news; the bad news is that I have to do it all in RPG. :-(

Nevertheless, I threw a little Python function together to clarify my thinking.

I've used it to process postcodes for me.

We were given a spec:

UK postcodes must be in one of the following forms (with one exception, see below): 
    § A9 9AA 
    § A99 9AA
    § AA9 9AA
    § AA99 9AA
    § A9A 9AA
    § AA9A 9AA
where A represents an alphabetic character and 9 represents a numeric character.
Additional rules apply to alphabetic characters, as follows:
    § The character in position 1 may not be Q, V or X
    § The character in position 2 may not be I, J or Z
    § The character in position 3 may not be I, L, M, N, O, P, Q, R, V, X, Y or Z
    § The character in position 4 may not be C, D, F, G, I, J, K, L, O, Q, S, T, U or Z
    § The characters in the rightmost two positions may not be C, I, K, M, O or V
The one exception that does not follow these general rules is the postcode "GIR 0AA", which is a special valid postcode.

We came up with this:

/^([A-PR-UWYZ][A-HK-Y0-9](?:[A-HJKS-UW0-9][ABEHMNPRV-Y0-9]?)?\s*[0-9][ABD-HJLNP-UW-Z]{2}|GIR\s*0AA)$/i

But note - this allows any number of spaces in between groups.

I have the regex for UK Postcode validation.

This is working for all type of Postcode either inner or outer

^((([A-PR-UWYZ][0-9])|([A-PR-UWYZ][0-9][0-9])|([A-PR-UWYZ][A-HK-Y][0-9])|([A-PR-UWYZ][A-HK-Y][0-9][0-9])|([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))) || ^((GIR)[ ]?(0AA))$|^(([A-PR-UWYZ][0-9])[ ]?([0-9][ABD-HJLNPQ-UW-Z]{0,2}))$|^(([A-PR-UWYZ][0-9][0-9])[ ]?([0-9][ABD-HJLNPQ-UW-Z]{0,2}))$|^(([A-PR-UWYZ][A-HK-Y0-9][0-9])[ ]?([0-9][ABD-HJLNPQ-UW-Z]{0,2}))$|^(([A-PR-UWYZ][A-HK-Y0-9][0-9][0-9])[ ]?([0-9][ABD-HJLNPQ-UW-Z]{0,2}))$|^(([A-PR-UWYZ][0-9][A-HJKS-UW0-9])[ ]?([0-9][ABD-HJLNPQ-UW-Z]{0,2}))$|^(([A-PR-UWYZ][A-HK-Y0-9][0-9][ABEHMNPRVWXY0-9])[ ]?([0-9][ABD-HJLNPQ-UW-Z]{0,2}))$

This is working for all type of format.

Example:

AB10-------------------->ONLY OUTER POSTCODE

A1 1AA------------------>COMBINATION OF (OUTER AND INNER) POSTCODE

WC2A-------------------->OUTER

The accepted answer reflects the rules given by Royal Mail, although there is a typo in the regex. This typo seems to have been in there on the gov.uk site as well (as it is in the XML archive page).

In the format A9A 9AA the rules allow a P character in the third position, whilst the regex disallows this. The correct regex would be:

(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKPSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})

Shortening this results in the following regex (which uses Perl/Ruby syntax):

(GIR 0AA)|([A-PR-UWYZ](([0-9]([0-9A-HJKPSTUW])?)|([A-HK-Y][0-9]([0-9ABEHMNPRVWXY])?))\s?[0-9][ABD-HJLNP-UW-Z]{2})

It also includes an optional space between the first and second block.

What i have found in nearly all the variations and the regex from the bulk transfer pdf and what is on wikipedia site is this, specifically for the wikipedia regex is, there needs to be a ^ after the first |(vertical bar). I figured this out by testing for AA9A 9AA, because otherwise the format check for A9A 9AA will validate it. For Example checking for EC1D 1BB which should be invalid comes back valid because C1D 1BB is a valid format.

Here is what I've come up with for a good regex:

^([G][I][R] 0[A]{2})|^((([A-Z-[QVX]][0-9]{1,2})|([A-Z-[QVX]][A-HK-Y][0-9]{1,2})|([A-Z-[QVX]][0-9][ABCDEFGHJKPSTUW])|([A-Z-[QVX]][A-HK-Y][0-9][ABEHMNPRVWXY])) [0-9][A-Z-[CIKMOV]]{2})$

I needed a version that would work in SAS with the PRXMATCH and related functions, so I came up with this:

^[A-PR-UWYZ](([A-HK-Y]?\d\d?)|(\d[A-HJKPSTUW])|([A-HK-Y]\d[ABEHMNPRV-Y]))\s?\d[ABD-HJLNP-UW-Z]{2}$

Test cases and notes:

/* 
Notes
The letters QVX are not used in the 1st position.
The letters IJZ are not used in the second position.
The only letters to appear in the third position are ABCDEFGHJKPSTUW when the structure starts with A9A.
The only letters to appear in the fourth position are ABEHMNPRVWXY when the structure starts with AA9A.
The final two letters do not use the letters CIKMOV, so as not to resemble digits or each other when hand-written.
*/

/*
    Bits and pieces
    1st position (any):         [A-PR-UWYZ]         
    2nd position (if letter):   [A-HK-Y]
    3rd position (A1A format):  [A-HJKPSTUW]
    4th position (AA1A format): [ABEHMNPRV-Y]
    Last 2 positions:           [ABD-HJLNP-UW-Z]    
*/


data example;
infile cards truncover;
input valid 1. postcode &$10. Notes &$100.;
flag = prxmatch('/^[A-PR-UWYZ](([A-HK-Y]?\d\d?)|(\d[A-HJKPSTUW])|([A-HK-Y]\d[ABEHMNPRV-Y]))\s?\d[ABD-HJLNP-UW-Z]{2}$/',strip(postcode));
cards;
1  EC1A 1BB  Special case 1
1  W1A 0AX   Special case 2
1  M1 1AE    Standard format
1  B33 8TH   Standard format
1  CR2 6XH   Standard format
1  DN55 1PT  Standard format
0  QN55 1PT  Bad letter in 1st position
0  DI55 1PT  Bad letter in 2nd position
0  W1Z 0AX   Bad letter in 3rd position
0  EC1Z 1BB  Bad letter in 4th position
0  DN55 1CT  Bad letter in 2nd group
0  A11A 1AA  Invalid digits in 1st group
0  AA11A 1AA  1st group too long
0  AA11 1AAA  2nd group too long
0  AA11 1AAA  2nd group too long
0  AAA 1AA   No digit in 1st group
0  AA 1AA    No digit in 1st group
0  A 1AA     No digit in 1st group
0  1A 1AA    Missing letter in 1st group
0  1 1AA     Missing letter in 1st group
0  11 1AA    Missing letter in 1st group
0  AA1 1A    Missing letter in 2nd group
0  AA1 1     Missing letter in 2nd group
;
run;

Below method will check the post code and provide complete info

const valid_postcode = postcode => {
    try {
        postcode = postcode.replace(/\s/g, "");
        const fromat = postcode
            .toUpperCase()
            .match(/^([A-Z]{1,2}\d{1,2}[A-Z]?)\s*(\d[A-Z]{2})$/);
        const finalValue = `${fromat[1]} ${fromat[2]}`;
        const regex = /^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$/i;
        return {
            isValid: regex.test(postcode),
            formatedPostCode: finalValue,
            error: false,
            info: 'It is a valid postcode'
        };
    } catch (error) {
        return { error: true , info: 'Invalid post code has been entered!'};
    }
};

valid_postcode('GU348RR')
result => {isValid: true, formatedPostCode: "GU34 8RR", error: false, info: "It is a valid postcode"}

valid_postcode('sdasd4746asd')
result => {error: true, info: "Invalid post code has been entered!"}

valid_postcode('787898523')
result => {error: true, info: "Invalid post code has been entered!"}

I stole this from an XML document and it seems to cover all cases without the hard coded GIRO:

%r{[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}}i

(Ruby syntax with ignore case)

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow