When submissions are > 1000 LoC, the performance degrades. I have not yet investigated whether this is due to CPU load, memory/paging issues, or a general algorithmic flaw.
I profiled match merging for a large dataset (>400 submissions, ~1500 LoC each, with basecode enabled):
Method profiling
Memory profiling
My current theory: due to the nature of match merging, matches below the minimum token match (MTM) must also be generated during greedy string tiling, since neighboring matches shorter than the MTM can later be merged into eligible matches. However, this effectively forces greedy string tiling to run down to a subsequence length of 2 (the default neighbor length), producing exponentially more subsequences and slowing down the matching. Hence, most of the overhead is algorithmic. For my dataset, enabling match merging comes with a ~4000% runtime overhead; most samples come from GreedyStringTiling.compareInternal().
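To make the effect concrete, here is a toy illustration (not JPlag's actual implementation): a naive search for common token subsequences between two submissions, counting how many matches clear a given minimum match length. Dropping the threshold from a typical MTM down to 2 inflates the number of matches the tiling has to process by orders of magnitude.

```java
import java.util.Random;

// Toy sketch: count maximal common token runs of length >= minMatch
// between two token sequences (naive O(n*m) scan, for illustration only).
public class MatchCountDemo {
    static int countMatches(int[] a, int[] b, int minMatch) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                int len = 0;
                while (i + len < a.length && j + len < b.length && a[i + len] == b[j + len]) {
                    len++;
                }
                if (len >= minMatch) {
                    count++; // one candidate match the tiling would have to consider
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int[] a = new int[800];
        int[] b = new int[800];
        // Small token alphabet, as produced by language-specific tokenization.
        for (int i = 0; i < a.length; i++) a[i] = rng.nextInt(20);
        for (int i = 0; i < b.length; i++) b[i] = rng.nextInt(20);
        System.out.println("minMatch=9: " + countMatches(a, b, 9));
        System.out.println("minMatch=2: " + countMatches(a, b, 2));
    }
}
```

On random token streams, matches of length >= 2 number in the thousands while matches of length >= 9 are vanishingly rare, which is consistent with the profiling result that most samples land in the tiling itself.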
Besides that issue, I see two minor performance issues in the MatchMerging class:
The method computeNeighbors() is inefficient due to the sorting.
The method removeTooShortMatches() is inefficient due to the type of deletion (a simple stream statement is more efficient here based on rudimentary tests).
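For the second point, a sketch of what the stream-based deletion could look like. The `Match` record here is a hypothetical stand-in, not JPlag's actual class; the idea is simply to replace repeated `List.remove()` calls, each O(n) on an `ArrayList`, with a single filtering pass.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for JPlag's match type.
record Match(int start, int length) {}

class MatchFilter {
    // One O(n) pass instead of a worst-case O(n^2) of individual removals.
    static List<Match> removeTooShortMatches(List<Match> matches, int minimumTokenMatch) {
        return matches.stream()
                .filter(match -> match.length() >= minimumTokenMatch)
                .collect(Collectors.toList());
    }
}
```

Returning a fresh list also avoids `ConcurrentModificationException` pitfalls that come with removing elements while iterating.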
Hotfix: Setting --neighbor-length to 5 reduces the overhead to ~190%.
One idea to work around this problem would be to enable match merging by default but increase the neighbor length dynamically depending on the average tokens per submission. We would need measurements across different datasets and neighbor lengths to see when the slow-down factor becomes excessive. Something like 2x to 4x is okay, but 40x is not. Then, we would have a CLI parameter for match merging: off, auto, manual.
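The "auto" mode could be a simple step function over the average token count. A minimal sketch, with placeholder thresholds that would need to be replaced by actual measurements:

```java
// Hypothetical heuristic for the proposed "auto" mode. The token thresholds
// below are placeholders, not measured values.
class NeighborLengthHeuristic {
    static int chooseNeighborLength(long totalTokens, int submissionCount) {
        long averageTokens = totalTokens / Math.max(1, submissionCount);
        if (averageTokens < 2000) return 2;  // small submissions: full merge sensitivity
        if (averageTokens < 5000) return 3;
        if (averageTokens < 10000) return 4;
        return 5;                            // large submissions: cap the overhead (cf. hotfix above)
    }
}
```

Manual mode would bypass this and use the user-supplied --neighbor-length value directly.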