Better documentation of what parser is? #82
Good questions! I think the reason we haven't covered the first three here is that pelias/parser is intended as a standalone project, and as such it's unrelated to Pelias itself. But yeah, we should cover that, either here or somewhere else in the Pelias documentation 📖. Some more definitive answers:
In the beginning we used a regular-expression-based parser, and we found it to be too simple and not very accurate. So we met with Al, and Mapzen funded the original work on libpostal. Libpostal is based on a machine learning model; it's a black box which unfortunately no one except Al has contributed to very much. It hadn't seen much development in the last 2 years or so, although there's actually been some activity this year! Libpostal is amazing, but it has some negatives:
The 'Pelias Parser' was never intended to replace libpostal. In the process I tried to address some of the issues we had with libpostal.
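To illustrate why the early regular-expression approach mentioned above proved too simple, here's a minimal hedged sketch (hypothetical code, not the actual early Pelias parser): a single pattern handles the common "number street, city" shape and nothing else.

```python
import re

# A naive regex-based address parser (a hypothetical sketch, not real
# Pelias code). It only recognises the "number street, city" shape,
# which is roughly why a pure-regex approach was found too simple.
ADDRESS_RE = re.compile(
    r"^(?P<housenumber>\d+)\s+(?P<street>[^,]+),\s*(?P<city>.+)$"
)

def parse(text):
    """Return labelled address parts, or None when the pattern doesn't fit."""
    match = ADDRESS_RE.match(text.strip())
    return match.groupdict() if match else None
```

`parse("30 W 26th St, New York")` yields the three labelled parts, but `parse("Hauptstraße 42 Berlin")` returns `None` because the house number comes last and there's no comma, showing how quickly hand-written patterns break down on international formats.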
Hope that helps ;)
That's super helpful, thanks Peter! I might submit some PRs to the READMEs and docs to get the ball rolling from here. Re: https://github.com/pelias/parser/tree/master/test, that doesn't really give me an intuition about what each parser is good or bad at. That's the type of thing you probably have a good sense for, and it would be good to write down somewhere.
**Machine learning model**

My intuition is that the machine learning model is superior on all inputs it was trained on. Its weakness is that it performs poorly on anything it wasn't trained to recognise. Here's one example from an old issue of ours which predates the Pelias parser: pelias/api#795 (comment). I just linked the first one I found; there are many more reports on the libpostal GitHub if you're curious. But the thing is, it's a super difficult problem and there are always going to be edge cases and people opening issues about how their address doesn't parse correctly. I think one of the major strengths of this architecture is that, when the machine was trained, it saw many different formats of addresses from all over the world and hopefully learned to recognise their unique syntax patterns.

**Dictionary- and pattern-based natural language processing model**

This architecture is superior in terms of how easy it is to test and iterate on. The pelias/parser uses the dictionary files from libpostal. On top of the token matching are logical classifiers which are able to look at the context of the terms, their adjacency, etc. So this model shines in its flexibility; a machine learning model would not be so easy to edit to add things like plus codes, 'pizza new york', or dynamic dictionaries. The disadvantage is that it's written by humans, and we don't have knowledge of the whole world and its weird and wonderful addressing patterns, so new and unusual address formats will need to be added by humans as we encounter them.
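The dictionary-plus-classifier idea described above can be sketched roughly like this (a toy illustration only; the real pelias/parser is written in JavaScript and its dictionaries and classifiers are far richer): tokens are first labelled via dictionary lookups, then a simple adjacency rule refines the labels based on context.

```python
# Toy dictionary-and-pattern token classifier, in the spirit of the
# approach described above. Illustration only, not the real pelias/parser.
STREET_SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road"}

def classify_tokens(tokens):
    """Label tokens via dictionary lookups, then refine with adjacency rules."""
    labels = []
    for tok in tokens:
        low = tok.lower().rstrip(".")
        if low in STREET_SUFFIXES:
            labels.append("street_suffix")
        elif low.isdigit():
            labels.append("housenumber")
        else:
            labels.append("word")
    # Adjacency rule: plain words appearing before a street suffix are
    # treated as part of the street name.
    if "street_suffix" in labels:
        suffix_idx = labels.index("street_suffix")
        for j in range(suffix_idx):
            if labels[j] == "word":
                labels[j] = "street"
    return list(zip(tokens, labels))
```

For example, `classify_tokens("30 Main St".split())` labels `30` as a house number, `Main` as a street name, and `St` as a street suffix. Because the dictionaries and rules are plain data and code, adding a new address pattern is an edit and a test run, which is the testability advantage described above.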
My intuition is that both parsers do very well in the USA and in French/German-speaking countries; in other countries I suspect libpostal does better. But there are some inputs which libpostal
cc/ @blackmad this is a good example of where the
Hey team,
I'm reading through as much of the Pelias docs as I can find. I followed a link to pelias/parser, and after reading through it (as well as https://geocode.earth/blog/2019/improved-autocomplete-parsing-is-here) I think there could be some changes to the README that would help make it more understandable. Questions I still have:
I would try to update the docs myself but I'm unclear as to the answers.
Best,
David