First implementation of the Earley Parser. #1

Open
wants to merge 12 commits into base: master
Conversation


@rakachan rakachan commented Mar 6, 2017

Implemented:

  • Builds the parsing table.
  • Checks whether a string is parsable or not.
  • Constructs one parse tree.
  • A bunch of test cases.
  • Incremental parsing.

What's left:

  • Constructing all parse trees.

On incremental parsing:
The basic idea is as follows:
Given a new list of terminals, the index where the change starts (which is the same in the original and the new list), the index where it ends in the original list, and the index where it ends in the new list, compute the new parsing table as fast as possible.
Earley's algorithm allows us to copy the old parsing table verbatim up to where the change starts. From there, we perform regular Earley parsing steps until we reach the end of the change; we then keep performing those steps, but as soon as a newly added item already existed in the original table (at the corresponding position) and is useful for constructing a tree, we can stop and fill the rest of the table with only the useful items from the original table. A sketch of this idea is given below.
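To make the idea concrete, here is a minimal Scala sketch of that chart-reuse strategy. The `EarleyItem` representation, the `step` function (one predict/scan/complete iteration), and the convergence test are placeholders rather than the actual code of this PR, and the "useful for a tree" filter is simplified to plain set membership:

```scala
// Hedged sketch of the chart-reuse idea described above, NOT the code in this PR.
// `step` stands in for one regular Earley iteration (predict/scan/complete).
case class EarleyItem(rule: Int, dot: Int, origin: Int)

def incrementalTable(
    newTokens: IndexedSeq[String],
    oldTable: IndexedSeq[Set[EarleyItem]],
    changeStart: Int,                       // same index in old and new token lists
    changeEndOld: Int,                      // where the change ends in the old list
    changeEndNew: Int,                      // where the change ends in the new list
    step: (Array[Set[EarleyItem]], IndexedSeq[String], Int) => Set[EarleyItem]
): IndexedSeq[Set[EarleyItem]] = {
  val shift = changeEndNew - changeEndOld
  val table = Array.fill(newTokens.length + 1)(Set.empty[EarleyItem])

  // Items whose origin lies beyond the change must be re-indexed when they are
  // compared against, or copied from, the old table.
  def reindex(it: EarleyItem, delta: Int): EarleyItem =
    if (it.origin <= changeStart) it else it.copy(origin = it.origin + delta)

  // 1. Before the change: copy the old table verbatim.
  for (i <- 0 to changeStart) table(i) = oldTable(i)

  // 2. Across the change: regular Earley steps.
  for (i <- changeStart until changeEndNew) table(i + 1) = step(table, newTokens, i)

  // 3. After the change: keep stepping, but stop as soon as the freshly computed
  //    item set already existed (re-indexed) in the old table; from there on,
  //    the remaining sets can simply be copied instead of recomputed.
  var i = changeEndNew
  var converged = false
  while (i < newTokens.length && !converged) {
    table(i + 1) = step(table, newTokens, i)
    val oldIdx = i + 1 - shift
    converged = oldIdx >= 0 && oldIdx < oldTable.length &&
      table(i + 1).forall(it => oldTable(oldIdx).contains(reindex(it, -shift)))
    i += 1
  }
  if (converged)
    for (j <- i + 1 to newTokens.length) table(j) = oldTable(j - shift).map(reindex(_, shift))

  table.toIndexedSeq
}
```

In the real implementation the convergence check also has to keep only the items that are useful for constructing a tree, as described above.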

Benchmark Commentary:
We can clearly see the difference. On a simple tool program (i.e. Fibonacci), given the tool grammar, the Earley parser takes around 5-7 milliseconds to parse it, while the CYK parser takes around 2500-3500 milliseconds.
Note that converting the grammar to CNF was done BEFORE parsing, so that is not the bottleneck.
On another note, running the parsers on a much larger tool program (i.e. Merge Sort), Earley takes approximately 40 milliseconds, while CYK had not finished parsing after 30 minutes...
A few more advanced tests revealed that in most cases, EarleyParse is even faster than LL1Parser at constructing the parse tree (though the speed-up is negligible, ~5 ms).

Current conclusion:
In terms of determining whether a program is parsable or not, Earley is indisputably better than CYK (around 500x faster on Fibonacci).

Thomas Garcia added 5 commits March 6, 2017 16:47
Added the benchmark test, couldn't come up with a good algorithm for the parse tree :/

First implementation of the parse tree construction. The graph traversal (GraphParcour) is to be redone completely, but the tree constructor, given an ordered list of items and edges, should be correct.

added comments

GraphParcour works correctly even for ambiguous grammars! Cheers!

Whoops, forgot an if(false)...

the first draft for all parse trees. Doesn't work AT ALL.

Implemented feedback. Unit tests are live.

Removed a few println.

Comments!

Added better cache management and a clear method to flush the cache; implementation of incremental parsing is on the way: very fast update, a correct parsing table is generated, still needs a few tweaks for better prediction handling.

Tweaks are all done

Test cases for update are done

Tests will be done locally.
@rakachan
Author

Here is how the tests on the local machine work.

As you suggested, I created a file corresponding to a small statement, which will be inserted between two statements. The statement was:

```scala
if ((a<b)||true) {
    a = 5+2;
    println("hey there");
}
```

Moreover, I did the same with a very large class, which is inserted between two class declarations.

Finally, I was also able to create a method that deletes a single statement from a whole tool program.

As such, 3 new programs are created from a single one, and I have 3 tool programs to test:
Fibonacci (fiboN), Merge Sort (MS), and Binary Search Tree (BST).

Between two parsings/updates, the cache is cleared and reinitialized with the original program.
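For what it's worth, the measurement loop is essentially the following sketch; the `IncrementalParser` trait and its method names are placeholders, the real API of this PR differs:

```scala
// Placeholder interface; the actual parser in this PR exposes its own
// cache-flush ("clear") and incremental-update methods.
trait IncrementalParser[Token] {
  def clear(): Unit                          // flush the cache
  def parse(tokens: List[Token]): Boolean    // full parse
  def update(tokens: List[Token]): Boolean   // incremental re-parse
}

def timeMs[A](body: => A): Double = {
  val t0 = System.nanoTime()
  body
  (System.nanoTime() - t0) / 1e6
}

def run[Token](p: IncrementalParser[Token],
               original: List[Token],
               modified: List[Token],
               runs: Int = 10): Unit =
  for (r <- 1 to runs) {
    p.clear(); p.parse(original)             // reinitialize the cache with the original program
    val inc = timeMs(p.update(modified))     // incremental parse of the modified program
    p.clear()
    val full = timeMs(p.parse(modified))     // full parse, for comparison
    println(f"run $r%2d: incremental = $inc%.2f ms, full = $full%.2f ms")
  }
```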

Here is a screenshot of the results.

[screenshot: times]

Here is an interpretation of the results:

  • As you can see, Fibonacci being a pretty small program, incrementally parsing from it is not much of an improvement. (Note that the big addition is significantly larger than Fibonacci's whole program, which justifies the increase in time.)
  • A significant improvement is seen in sufficiently big programs such as MS or BST. Incrementally parsing either a small addition or a deletion is done pretty much instantly and is far faster than a full re-parsing.
  • However, for the big addition, incremental parsing is rarely better than a full re-parsing.

As such, we can start to see a pattern in these results (a rough heuristic is sketched after this list):

  • Should the modification be larger than half of the original file, it may be better to do a full re-parsing.
  • For a very small modification (whether it is an insertion or a deletion), it is almost certain that incremental parsing will be much faster than a full parse.
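Here is a rough sketch of how such a heuristic could be wired in; the 0.5 threshold only reflects the tentative pattern above, it is not a measured constant:

```scala
// Tentative heuristic based on the pattern above; the 0.5 threshold is a guess,
// not a measured constant.
def shouldUpdateIncrementally(originalSize: Int, modificationSize: Int): Boolean =
  modificationSize.toDouble / originalSize < 0.5

// e.g. a 30-token edit in a 2000-token program -> prefer the incremental update
// shouldUpdateIncrementally(2000, 30)  // true
```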

@ravimad
Contributor

ravimad commented Jun 4, 2017

Hi Thomas,
The following are some questions I have:
a) What are the units of time (seconds, ms, etc.)?
b) I suppose the times are shown only for incremental parsing,
e.g. small addition: 1 means incremental parsing took 1 (presumably ms) to parse the program with the small change. However, how much did the full parsing take on this program?
c) I see 10 sets of results (10 measurements for each benchmark). Are you running the benchmarks by creating different kinds of errors? In that case, the original time should be (approximately) identical in each of the 10 measurements for a given benchmark, right? Why is it so different?
E.g. fiboN takes anywhere from 1 ms to 9 ms.

@rakachan
Author

rakachan commented Jun 4, 2017

I rewrote the tests to better show what you asked:

  • Times are indeed in ms.
  • In parentheses are the times it took for a full parse.

[screenshot: res2]

The last question is a bit tricky, and I don't have a definitive answer to it.
First, the programs used are the same for all the tests. I chose to run them 10 times in a row because I saw that the results were not stable when executed one by one. By doing so, the results become consistent around the 3rd iteration.
Now, as to why it is so different, I have two possibilities in mind:

  • Either it's because the resources of my PC were not as dedicated to the first few tests as they should have been,
  • Or it's because things were cached in memory after the first test.

However, my intuition is that the first one is true, given that the first fiboN run takes twice as long to parse a small addition when it clearly should take approximately the same time.

@ravimad
Contributor

ravimad commented Jun 4, 2017

Thanks for the explanation. Actually, it may very well be because of cache effects. You may want to take the average of these 10 runs (though the first run, without a warmed-up cache, is also interesting by itself). These results are very promising.
If possible, try to do this for more benchmarks (say about 10 benchmarks). It would be best if you automate the process of executing each benchmark ten times and calculating the average. Describe this experiment and the results in detail in the report. I guess tabulating the averages for each benchmark would be great for the experimental results section. For example, a table with the following columns would be good.
Say Type 1 modification is a small addition, Type 2 modification is a big addition, and Type 3 modification is a deletion.

Benchmark Name | Initial Parse Time (ms) | Incremental Parse Time (ms) (for Type 1, Type 2, Type 3 in subcolumns) | % speed up (for Type 1, Type 2, Type 3)

% speed up is the percentage ratio of non-incremental parse time over incremental parse time, i.e.,

Non-incremental parse time
-------------------------- x 100
Incremental parse time

If it is 200, then incremental parsing was 2 times as fast as traditional non-incremental parsing.
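In code, the metric is simply (trivial sketch):

```scala
// % speed up as defined above: 200 means incremental parsing was twice as fast.
def speedUpPercent(nonIncrementalMs: Double, incrementalMs: Double): Double =
  nonIncrementalMs / incrementalMs * 100.0

// e.g. speedUpPercent(40.0, 2.0) == 2000.0  -> incremental is 20x faster
```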

Also, do tabulate the LL(1) parse time (on an LL(1) grammar) vs. the Earley parse time (on the unmodified ambiguous tool grammar) vs. the CYK parse time (on the unmodified ambiguous grammar).
Obviously CYK will be slow; it just gives an idea of how good Earley is compared to CYK. The real deal is between LL(1) and Earley. If the CYK parse time exceeds the Earley parse time by, say, a factor of 2x, then you can terminate it and report it as a timeout.
Feel free to include any other results you obtained that you feel are relevant.
(A remark: running Earley over an LL(1) grammar is a bit meaningless, since if one has an LL(1) grammar one may as well run an LL(1) parser. So ignore these results.)
