Using postal codes for better match #159

Open
niravmehta opened this issue Aug 8, 2019 · 6 comments

Comments

@niravmehta

Is there a reason why Placeholder does not use postal codes for lookups? In some ad-hoc tests I did, Placeholder could not find the correct places because of incomplete addresses, misspellings, etc.

I tried looking up the postal code (which was available) in Geonames post code data, found a match, then used location names from Geonames post code data in Placeholder, and then Placeholder could correctly identify the place.

I don't have much experience with this, so I don't know all the intricacies. I have read that post codes can be a mess. But if an address has a country and a postcode, we can pretty much find the exact location with just that information. At least that's what I am thinking.

What are the problems with using post codes then?

@missinglink
Member

Placeholder uses bounding-boxes in order to determine if tokens overlap spatially.
Our source of postcodes rarely has bounding-box information, and most openly licensed postcode datasets only contain points.

We could buffer the points to generate expanded bounding boxes; after that, it would be a matter of ensuring that the quality is good.
In general, the quality of the postcode data available globally is pretty poor, because postal services do not release the data to the public.
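
To illustrate the buffering idea, here is a minimal Python sketch. It is not Placeholder code: the function names, the default 5 km radius, and the `(min_lon, min_lat, max_lon, max_lat)` layout are all assumptions made for the example, and it ignores boxes that cross the antimeridian.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # rough length of one degree of latitude


def buffer_point_to_bbox(lat, lon, radius_km=5.0):
    """Expand a point into (min_lon, min_lat, max_lon, max_lat)."""
    dlat = radius_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with cos(latitude); the clamp avoids
    # dividing by zero at the poles.
    cos_lat = max(math.cos(math.radians(lat)), 1e-6)
    dlon = radius_km / (KM_PER_DEGREE_LAT * cos_lat)
    # Clamp to valid coordinates (a simplification: no antimeridian wrap).
    return (
        max(lon - dlon, -180.0),
        max(lat - dlat, -90.0),
        min(lon + dlon, 180.0),
        min(lat + dlat, 90.0),
    )


def bboxes_overlap(a, b):
    """True when two bounding boxes share any area or edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

A fixed radius is the crudest possible choice; postcode areas vary a lot in size, so the quality check mentioned above would likely need per-country tuning.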

It would be possible to do, but it would take some time to build and test. If you'd like to pursue the issue further, I'd be happy to assist.

@missinglink
Member

Can you please provide some examples that can be used as acceptance tests for this feature?

@niravmehta
Author

Here are some cases that didn't quite work, but that could be improved greatly with postal code lookups. Some of these contain misspellings, some are incomplete. State name expansion is another thing that could help here, but if we take care of postal codes, state names are covered too.

  • Osaki Tajiri Oonuki Aza Kitanagane, Jp04, 989-4302, Japan
  • Wattana, Th-10, 10110, Thailand
  • Mexico, DF, 06140, Mexico
  • Seropédica, RJ, 23898-500, Brazil
  • Avda. La Torrecilla, CO, 14013, Spain
  • Ganesh Meridian, SG Highway, GJ, 380060, India
  • Songjiang, Cn10, 201600, China

@niravmehta
Author

I'm working on adding postal code support. Feedback welcome...

Here's what I'm doing:

Objective: Expand the postal code to admin1/2/3 names (state / county / region / locality, etc.) and use the expanded form for the Placeholder search

  • Create a new table for postalcodes from Geonames postal codes CSV exports (allCountries, GB, NL)
  • Create a new table for ISO 3166 country names with their 2-letter and 3-letter codes (because WOF uses the 3-letter abbreviation but Geonames uses the 2-letter abbreviation)
  • Create a new countrycodes table that holds the countries subset of the WOF data from the "docs" table (placetype in country / empire / dependency / disputed, and abbr = ISO 3166 3-letter code). The new table helps speed up country searches.
  • Expect two additional parameters in search: country, postalcode (+ text)
  • For every search:
    • Tokenize and query for input country. Use existing tokens table, but use our newly created countrycodes table to get two letter country code + country name as in WOF data (since input name may not be standardized / may not match our names exactly)
    • Clean up the input postal code, look it up in the Geonames data, and pick up the placename, admin1, admin2 and admin3 values from that data.
    • Expanded postalcode = unique values from those four values concatenated
    • Create new search "text" as original text value + expanded postalcode + WOF country name
    • Perform usual Placeholder search to determine location
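
The per-search steps above can be sketched roughly as follows. This is a simplified Python illustration, not the code from my fork: the table and column names are hypothetical, the country lookup is a plain name/code match rather than a query against the tokens table, and the demo rows are illustrative values rather than rows copied from the Geonames export.

```python
import sqlite3


def clean_postalcode(raw):
    """Uppercase and drop everything except letters and digits."""
    return "".join(ch for ch in raw.upper() if ch.isalnum())


def make_demo_db():
    """In-memory database with hypothetical tables and illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postalcodes (country_code TEXT, postalcode_norm TEXT,"
        " placename TEXT, admin1 TEXT, admin2 TEXT, admin3 TEXT)"
    )
    conn.execute("CREATE TABLE countrycodes (alpha2 TEXT, alpha3 TEXT, name TEXT)")
    # A real import would read the Geonames CSV instead of hard-coding rows.
    conn.execute(
        "INSERT INTO postalcodes VALUES (?, ?, ?, ?, ?, ?)",
        ("ES", clean_postalcode("14013"), "Cordoba", "Andalucia", "Cordoba", ""),
    )
    conn.execute("INSERT INTO countrycodes VALUES ('ES', 'ESP', 'Spain')")
    return conn


def lookup_country(conn, country):
    """Match the input against the country name or either ISO 3166 code."""
    return conn.execute(
        "SELECT alpha2, name FROM countrycodes"
        " WHERE lower(name) = lower(?) OR alpha2 = upper(?) OR alpha3 = upper(?)",
        (country, country, country),
    ).fetchone()


def expand_postalcode(conn, alpha2, postalcode):
    """Return the unique, non-empty place/admin names for a postal code."""
    row = conn.execute(
        "SELECT placename, admin1, admin2, admin3 FROM postalcodes"
        " WHERE country_code = ? AND postalcode_norm = ? LIMIT 1",
        (alpha2, clean_postalcode(postalcode)),
    ).fetchone()
    unique = []
    for value in row or ():
        if value and value not in unique:
            unique.append(value)
    return unique


def build_search_text(conn, text, country, postalcode):
    """Original text + expanded postal code + standardized country name."""
    match = lookup_country(conn, country)
    if match is None:
        return text  # unknown country: fall back to the original text
    alpha2, country_name = match
    parts = [text] + expand_postalcode(conn, alpha2, postalcode) + [country_name]
    return " ".join(part for part in parts if part)
```

With the demo rows, `build_search_text(make_demo_db(), "Avda. La Torrecilla", "spain", "14013")` yields a string that carries the locality, the region and the standardized country name, and that string is what would go into the usual Placeholder search.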

Notes

  • The Geonames postalcodes data has centroid-based lon/lat. We could potentially reverse geocode it to find the exact location and lineage. This could be faster, but I'm not sure of its accuracy.
  • We could add postalcodes to the "docs" json as an additional field. But this would require matching all postalcodes to existing WOF entries. This I thought was too complex for our needs at the moment. But if we could use WOF postalcode data, this may turn out to be a lot simpler.
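
For the first note, reverse geocoding a centroid against bounding boxes could look something like the Python sketch below. The place records and their `bbox` field are made up for the example and do not reflect how WOF or Placeholder store geometry.

```python
def contains(bbox, lat, lon):
    """True when the point lies inside (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def bbox_area(bbox):
    """Area in square degrees; only used to rank boxes by size."""
    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])


def reverse_lookup(lat, lon, places):
    """Places whose bbox contains the point, most specific (smallest) first."""
    hits = [p for p in places if p.get("bbox") and contains(p["bbox"], lat, lon)]
    return sorted(hits, key=lambda p: bbox_area(p["bbox"]))


# Made-up records for illustration only.
DEMO_PLACES = [
    {"name": "Demo Country", "bbox": (0.0, 0.0, 10.0, 10.0)},
    {"name": "Demo Region", "bbox": (2.0, 2.0, 6.0, 6.0)},
    {"name": "Demo Town", "bbox": None},  # no bounding box: can never match
]
```

Sorting by area gives a rough lineage from the most specific place outwards. Records without a bounding box are silently skipped, so the accuracy depends entirely on how complete the bounding-box data is.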

@niravmehta
Author

Some updates:

  • I implemented the above logic and it works quite well.
  • It relies on structured input, rather than a single line of address text.
  • I added ISO 3166-2 region code support to expand state/region names. I took that data from IP2Location, which is licensed CC BY-SA 4.0, so I'm not sure it is compatible with Placeholder for a public release.
  • I added reverse geo search, but that doesn't work too well because many places don't have proper bounding box data.
  • There is a bit of work involved to get this whole thing working (importing external CSVs, other updates to SQLite db, a new route..).
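
As a rough illustration of the region code expansion, here is a small Python sketch. The lookup table is a hand-written stand-in with two ISO 3166-2 entries, not the IP2Location data, and the function name is hypothetical.

```python
# Keyed by (ISO 3166-1 alpha-2 country code, ISO 3166-2 subdivision suffix).
REGION_NAMES = {
    ("IN", "GJ"): "Gujarat",
    ("BR", "RJ"): "Rio de Janeiro",
}


def expand_region(alpha2, region_code, table=REGION_NAMES):
    """Return the region name, or the input code unchanged when unknown."""
    return table.get((alpha2.upper(), region_code.upper()), region_code)
```

Falling back to the unchanged code keeps inputs such as "Jp04", which are not ISO 3166-2 codes, from being dropped; they simply pass through to the search text as they are.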

Overall: I'm happy to share the code, but I'm not sure it will be useful, since it deviates from the core Placeholder promise of parsing a single line of text and also relies on some external data.

If enough people are interested, I can make this into a PR.

@niravmehta
Author

For anyone who's interested, all my changes are here:
https://github.com/niravmehta/placeholder/
