First implementation of the Earley Parser. #1
…nd checks if a string is parsable
Added the benchmark test, couldn't come up with a good algorithm for parse tree :/
First implementation of the parse tree construction. Graph traversal is to be redone completely, but the tree constructor given an ordered list of items and edges should be correct.
Added comments
GraphParcour works correctly even for ambiguous grammars! Cheers!
Whoops, forgot an if(false)...
The first draft for all parse trees. Doesn't work AT ALL.
Implemented feedback.
Unit tests are live.
Removed a few println.
Comments!
Added better caching management, added a clear method to flush the cache, implementation of incremental parsing on the way: very fast update, a correct parsing table is generated, still needs a few tweaks for better prediction handling
Tweaks are all done
Test cases for update are done
Tests will be done locally.
Here is how the tests on my local machine work. As you suggested, I created a file containing a small statement, which is inserted between two statements. The statement was:
Moreover, I did the same with a very large class, which is inserted between two class declarations. Finally, I was also able to create a method that deletes a single statement from a whole tool program. As such, 3 new programs are created from a single one, and I have 3 tool programs to test: Between two parsings/updates, the cache is cleared and reinitialized with the original program. Here is a screenshot of the results. Here is an interpretation of the results:
As such, we can start to see a pattern from such results:
Hi Thomas,
I rewrote the tests to better show what you asked:
The last question is a bit tricky, and I don't have a definitive answer to it.
However, my intuition is that the first one is true, given that the first fiboN run takes twice as long to parse a small addition when it clearly should take approximately the same time.
Thanks for the explanation. Actually, it may very well be due to cache effects. You may want to take the average of these 10 runs (though the first run, without a warmed-up cache, is also interesting by itself).
These results are very promising. If possible, try to do this for more benchmarks (say, about 10). It would be best to automate the process of executing each benchmark ten times and calculating the average. Describe this experiment and the results in detail in the report. Tabulating the averages for each benchmark would be great for the experimental results section. For example, a table with the following columns would be good:
Benchmark Name | Initial Parse Time (ms) | Incremental Parse Time (ms) (Type 1, Type 2, Type 3 in subcolumns) | % Speed-up (Type 1, Type 2, Type 3)
% speed-up is the percentage ratio of non-incremental parse time over incremental parse time, i.e., (Non-incremental parse time / Incremental parse time) × 100. If it is 200, then incremental parsing was 2 times as fast as non-incremental traditional parsing.
Also tabulate the LL(1) parse time (on an LL(1) grammar) vs. Earley parse time (on the unmodified ambiguous tool grammar) vs. CYK parse time (on the unmodified ambiguous grammar).
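The averaging and speed-up computation suggested above could be automated along these lines. This is only a minimal sketch in Python: the helper names (`average_ms`, `speedup_percent`, `table_row`) are hypothetical and not part of this PR, and it assumes the initial parse time serves as the non-incremental baseline.

```python
import statistics
import time
from typing import Callable, Dict, List


def average_ms(run: Callable[[], object], repeats: int = 10) -> float:
    """Call `run` `repeats` times and return the mean wall-clock time in ms."""
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


def speedup_percent(non_incremental_ms: float, incremental_ms: float) -> float:
    """Percentage ratio from the comment above: 200 means 2x as fast."""
    return non_incremental_ms / incremental_ms * 100.0


def table_row(name: str, initial_ms: float, incremental_ms: Dict[str, float]) -> str:
    """Format one row: name | initial | incremental per type | speed-up per type.

    Assumption: the initial parse time is used as the non-incremental baseline.
    """
    kinds = sorted(incremental_ms)
    times = " | ".join(f"{incremental_ms[k]:.2f}" for k in kinds)
    gains = " | ".join(
        f"{speedup_percent(initial_ms, incremental_ms[k]):.0f}" for k in kinds
    )
    return f"{name} | {initial_ms:.2f} | {times} | {gains}"
```

Each benchmark would pass its own parse or update call as `run`; clearing the cache between runs, as described earlier in the thread, would belong inside that callable.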
Implemented:
What's left:
On incremental parsing:
The basic idea is as follows:
Given a new list of terminals, together with where the change starts in the original list (the same position as in the new list), where it ends in the original list, and where it ends in the new list, compute the parsing table as fast as possible.
Earley's algorithm allows us to directly copy the old parsing table up to where the change starts. Next, we perform regular Earley parsing steps until we reach the end of the change. We then keep performing those steps, but as soon as a newly added item was already in the original table (at the corresponding position) and is useful for constructing a tree, we can stop and fill the rest of the table with only the useful items from the original table.
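The first part of this idea, copying the old table up to where the change starts, can be sketched as follows. This is a simplified Python illustration, not the code in this PR: it is a recognizer only, it assumes a grammar without empty productions, and it does not implement the early stop after the change (it simply re-parses everything from the change onward). The names `earley_table` and `accepts` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Grammar = Dict[str, List[Tuple[str, ...]]]
# (lhs, rhs, dot position, origin column)
Item = Tuple[str, Tuple[str, ...], int, int]


def earley_table(grammar: Grammar, start: str, tokens: Sequence[str],
                 old_table: Optional[List[Set[Item]]] = None,
                 change_start: int = 0) -> List[Set[Item]]:
    """Build the Earley table, reusing old columns 0..change_start if given.

    Column k depends only on tokens[0:k], so every column up to the first
    changed token index is still valid. Assumes no empty productions.
    """
    n = len(tokens)
    table: List[Set[Item]] = [set() for _ in range(n + 1)]
    reuse = 0
    if old_table is not None:
        reuse = min(change_start, len(old_table) - 1, n)
        for k in range(reuse + 1):
            table[k] = set(old_table[k])
    else:
        table[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}

    for k in range(reuse, n + 1):
        worklist = list(table[k])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            new_items: List[Item] = []
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # Predict: expand the nonterminal after the dot.
                    new_items = [(sym, prod, 0, k) for prod in grammar[sym]]
                elif k < n and tokens[k] == sym:
                    # Scan: the terminal matches, advance into column k + 1.
                    table[k + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance every item that was waiting for `lhs`.
                new_items = [(l2, r2, d2 + 1, o2)
                             for (l2, r2, d2, o2) in table[origin]
                             if d2 < len(r2) and r2[d2] == lhs]
            for item in new_items:
                if item not in table[k]:
                    table[k].add(item)
                    worklist.append(item)
    return table


def accepts(start: str, table: List[Set[Item]]) -> bool:
    """True if a completed start rule spans the whole input."""
    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in table[-1])
```

For example, with the ambiguous grammar `{"E": [("E", "+", "E"), ("n",)]}`, parsing `n + n` and then calling `earley_table` on `n + n + n` with the old table and `change_start=3` reuses the first four columns and only computes the last two.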
Benchmark Commentary:
We can clearly see the difference. On a simple tool program (e.g., Fibonacci), given the tool grammar, the Earley parser takes around 5-7 milliseconds to parse it, while the CYK parser takes around 2500-3500 milliseconds.
Note that converting the grammar to CNF was done BEFORE parsing, so that isn't the bottleneck.
On another note, running the parsers on a much larger tool program (e.g., Merge-Sort), Earley takes approximately 40 milliseconds, while CYK had not finished parsing after 30 minutes...
A few more advanced tests revealed that in most cases, EarleyParse is even faster than LL1Parser at constructing the parse tree (though the speedup is negligible, ~5 ms).
Current conclusion:
In terms of determining whether a program is parsable or not, Earley is indisputably better than CYK (around 500x faster on Fibonacci).