Linting UK Postcodes

I experimented with a few combinations of postcodes and found that if a letter came after the digits in the outward code then the library would treat the postcode as invalid, even when it is valid. For example, "Golden Square, London, W1R 3AD":

    >>> parse_uk_postcode('w1r3ad')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "…/ukpostcodeparser/parser.py", line 129, in parse_uk_postcode
        raise ValueError('Invalid postcode')
    ValueError: Invalid postcode

But then I tried another postcode, this one for "216 Oxford Street, London, W1D 1LA", and it did work:

    >>> parse_uk_postcode('W1D1LA')
    ('W1D', '1LA')

Before looking to patch the library I wanted to see if there were any other obvious solutions.

Googling for Regexes

I began googling for regexes which claimed to parse UK postcodes. I came across the following regex and ran it against the 2.5 million postcodes list.

    import re

    pattern = (
        '^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|' +
        '[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW]) [0-9][ABD-HJLNP-UW-Z]{2}|' +
        '(GIR 0AA)|(SAN TA1)|(BFPO (C/O )?[0-9]{1,4})|' +
        '((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA) 1ZZ))$'
    )
    _POSTCODE_RE = re.compile(pattern)


    def is_postcode(postcode):
        return _POSTCODE_RE.match(postcode) is not None


    """
    The layout of postcodes.csv looks like the following
    AB1 0AD,57.10056,-2.248342,385053,…
    AB1 0AE,57.084447,-2.255708,384600,…
    AB1 0AF,57.096659,-2.258103,384460,…
    """
    for line in open('postcodes.csv'):
        pieces = line.strip().split(',')

        # Skip any line that does not have more than one field
        if len(pieces) <= 1:
            print('line invalid', line)
            continue

        # Make sure the postcode is in upper case
        postcode = pieces[0].upper()

        if not is_postcode(postcode):
            print('invalid postcode', postcode)

This failed against 8,614 postcodes. Here is a sampling of a few of them:

    ➫ sort --random-sort results | head
    invalid postcode W1V 9PD
    invalid postcode W1Y 8HE
    invalid postcode W1Y 8DH
    invalid postcode W1P 7FW
    invalid postcode W1Y 1AR
    invalid postcode W1R 6JJ
    invalid postcode W1M 5AE
    invalid postcode W1R 1FH
    invalid postcode NPT 8ET
    invalid postcode W1R 0HD

Ignoring the old Newport postcodes, W1R was still being caught. I could see that only certain W1[A-Z] outward codes were being caught out, so I got a list of them together and found only 7 were being flagged up as invalid. I adjusted the regular expression to allow for M, N, P, R, V, X and Y after any digits in the outward code:

    pattern = (
        '^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|' +
        '[ABEHMNPRV-Y]))|[0-9][A-HJKMNPRS-UVWXY]) [0-9][ABD-HJLNP-UW-Z]{2}|' +
        '(GIR 0AA)|(SAN TA1)|(BFPO (C/O )?[0-9]{1,4})|' +
        '((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA) 1ZZ))$'
    )
    _POSTCODE_RE = re.compile(pattern)

I ran the script again and only the 2,418 deprecated Newport postcodes were seen as invalid. Seeing the pattern of the W1M, W1N, W1P, W1R, W1V, W1X and W1Y outward codes being the edge cases that tripped up Simon Hayward's library, I created a pull request.
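
As a quick sanity check of the adjustment described above, here is a minimal sketch that runs both versions of the regex against one sample postcode for each of the seven W1 outward codes. The W1M, W1P, W1R, W1V and W1Y samples come from this post; 'W1N 1AA' and 'W1X 1AA' use made-up inward codes purely for illustration. The names TEMPLATE, ORIGINAL_RE, ADJUSTED_RE and SAMPLES are my own and not part of any library.

```python
import re

# The two patterns differ only in the character class that may follow a
# single digit in the outward code, so that class is substituted in via %s.
TEMPLATE = (
    '^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|'
    '[ABEHMNPRV-Y]))|[0-9][%s]) [0-9][ABD-HJLNP-UW-Z]{2}|'
    '(GIR 0AA)|(SAN TA1)|(BFPO (C/O )?[0-9]{1,4})|'
    '((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA) 1ZZ))$'
)
ORIGINAL_RE = re.compile(TEMPLATE % 'A-HJKS-UW')
ADJUSTED_RE = re.compile(TEMPLATE % 'A-HJKMNPRS-UVWXY')

# One sample per affected outward code (two inward codes are hypothetical).
SAMPLES = [
    'W1M 5AE', 'W1N 1AA', 'W1P 7FW', 'W1R 3AD',
    'W1V 9PD', 'W1X 1AA', 'W1Y 8HE',
]

for postcode in SAMPLES:
    before = ORIGINAL_RE.match(postcode) is not None
    after = ADJUSTED_RE.match(postcode) is not None
    print(postcode, 'original:', before, 'adjusted:', after)
```

Each sample should be rejected by the original pattern and accepted by the adjusted one, while a postcode such as 'W1D 1LA' is accepted by both and an old Newport postcode such as 'NPT 8ET' is still rejected.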
