When submissions are > 1000 LoC, the performance degrades. I have not yet investigated whether this is due to CPU load, memory/paging issues, or a general algorithmic flaw.
I profiled match merging for a large dataset (>400 submissions, ~1500 LoC each, with basecode enabled):
Method profiling
Memory profiling
My current theory: due to the nature of match merging, matches below the minimum token match (MTM) must also be generated during greedy string tiling, since neighboring matches shorter than the MTM can later be merged into eligible matches. However, this effectively forces greedy string tiling to run down to a subsequence length of 2 (the default neighbor length), producing exponentially more subsequences and slowing down the matching. Hence, most of the overhead is algorithmic. For my dataset, enabling match merging comes with a ~4000% runtime overhead; most samples come from GreedyStringTiling.compareInternal().
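To make the effect concrete, here is a toy illustration (not JPlag's actual implementation): a naive search for common token subsequences between two submissions, counting how many matches clear a given minimum match length. Dropping the threshold from a typical MTM down to 2 inflates the number of matches the tiling has to process by orders of magnitude.

```java
import java.util.Random;

// Toy sketch: count maximal common token runs of length >= minMatch
// between two token sequences (naive O(n*m) scan, for illustration only).
public class MatchCountDemo {
    static int countMatches(int[] a, int[] b, int minMatch) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                int len = 0;
                while (i + len < a.length && j + len < b.length && a[i + len] == b[j + len]) {
                    len++;
                }
                if (len >= minMatch) {
                    count++; // one candidate match the tiling would have to consider
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int[] a = new int[800];
        int[] b = new int[800];
        // Small token alphabet, as produced by language-specific tokenization.
        for (int i = 0; i < a.length; i++) a[i] = rng.nextInt(20);
        for (int i = 0; i < b.length; i++) b[i] = rng.nextInt(20);
        System.out.println("minMatch=9: " + countMatches(a, b, 9));
        System.out.println("minMatch=2: " + countMatches(a, b, 2));
    }
}
```

On random token streams, matches of length >= 2 number in the thousands while matches of length >= 9 are vanishingly rare, which is consistent with the profiling result that most samples land in the tiling itself.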
Besides that issue, I see two minor performance issues in the MatchMerging class:
The method computeNeighbors() is inefficient due to the sorting.
The method removeTooShortMatches() is inefficient due to the type of deletion (a simple stream statement is more efficient here based on rudimentary tests).
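For the second point, a sketch of what the stream-based deletion could look like. The `Match` record here is a hypothetical stand-in, not JPlag's actual class; the idea is simply to replace repeated `List.remove()` calls, each O(n) on an `ArrayList`, with a single filtering pass.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for JPlag's match type.
record Match(int start, int length) {}

class MatchFilter {
    // One O(n) pass instead of a worst-case O(n^2) of individual removals.
    static List<Match> removeTooShortMatches(List<Match> matches, int minimumTokenMatch) {
        return matches.stream()
                .filter(match -> match.length() >= minimumTokenMatch)
                .collect(Collectors.toList());
    }
}
```

Returning a fresh list also avoids `ConcurrentModificationException` pitfalls that come with removing elements while iterating.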
Hotfix: Setting --neighbor-length to 5 reduces the overhead to ~190%.
One idea to work around this problem would be to enable match merging by default but increase the neighbor length dynamically depending on the average tokens per submission. We would need measurements across different datasets and neighbor lengths to see when the slow-down factor becomes excessive. Something like 2x to 4x is okay, but 40x is not. Then, we would have a CLI parameter for match merging: off, auto, manual.
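The "auto" mode could be a simple step function over the average token count. A minimal sketch, with placeholder thresholds that would need to be replaced by actual measurements:

```java
// Hypothetical heuristic for the proposed "auto" mode. The token thresholds
// below are placeholders, not measured values.
class NeighborLengthHeuristic {
    static int chooseNeighborLength(long totalTokens, int submissionCount) {
        long averageTokens = totalTokens / Math.max(1, submissionCount);
        if (averageTokens < 2000) return 2;  // small submissions: full merge sensitivity
        if (averageTokens < 5000) return 3;
        if (averageTokens < 10000) return 4;
        return 5;                            // large submissions: cap the overhead (cf. hotfix above)
    }
}
```

Manual mode would bypass this and use the user-supplied --neighbor-length value directly.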