Add partial ratio #10

ljnsn · 2024-12-10T22:58:42Z

Adds

ScoreAlignment struct
partial_ratio
partial_ratio_alignment
partial_ratio_impl (short needle)

Plus tests.

maxbachmann

Thanks for the addition.

I made a preliminary review with some things I noticed. I believe this should cover most of it, but this will certainly require another look.

The Args interface requires a bit of design work to properly work for partial_ratio_alignment. I believe I have a solution for this, but I will need to test around with this tomorrow.

maxbachmann · 2024-12-11T18:33:30Z

src/fuzz.rs

+    let mut res = if len1 <= len2 {
+        partial_ratio_impl(
+            s1_iter.clone(),
+            len1,
+            s2_iter.clone(),
+            len2,
+            score_cutoff,
+            args.score_hint,
+        )
+    } else {
+        partial_ratio_impl(
+            s2_iter.clone(),
+            len2,
+            s1_iter.clone(),
+            len1,
+            score_cutoff,
+            args.score_hint,
+        )
+    };


If the two strings are swapped for the comparison we have to swap the start and end positions to fix them up. I guess that would come down to something along the lines of:

if len1 > len2 { let mut res = partial_ratio_alignment(s2, len2, s1, len1, args); std::mem::swap(&mut res.src_start, &mut res.dest_start); std::mem::swap(&mut res.src_end, &mut res.dest_end); return res; }

Alternatively you can construct a new ScoreAlignment and return that.

A test would make sense for this as well. I should add one to the C++ tests too.

src/fuzz.rs

maxbachmann · 2024-12-11T18:39:33Z

src/fuzz.rs

+    match alignment {
+        Some(alignment) => alignment.score,
+        None => 0.0,
+    }


This will not return None if a score below score_cutoff was passed.
For now this could be:

let score = match alignment { Some(alignment) => alignment.score, None => 0.0, } args.score_cutoff.score(score)

This will likely need to be changed when making the return type of partial_ratio_alignment alignment dependent on the presence of a score_cutoff.

Please add a test for this as well. fn issue206 might be a good place. For the other languages I managed to find ways to iterate over functions to reduce the boilerplate for some of these tests. I didn't look into ways to achieve the same in rust so far.

maxbachmann · 2024-12-11T18:41:52Z

src/fuzz.rs

+    let s1_char_set = s1_iter
+        .clone()
+        .map(|c| c.hash_char())
+        .collect::<HashSet<_>>();


things like the batchcomparator and the charset should be passed as argument. This is required so the same implementation can be used from PartialRatioBatchComparator. PartialRatioBatchComparator should get implemented.

maxbachmann · 2024-12-11T18:48:10Z

src/fuzz.rs

+    }
+
+    let s1_iter = s1.into_iter();
+    let s2_vec = s2.into_iter().collect::<Vec<_>>();


I am a bit unsure about this so far. In C++ we directly use random access, but that can only be done because we assume fast random access. In Rust e.g. string is utf8 and so has no fast random access.
I believe we could write this so random access isn't required. However we would still iterate the string multiple times and so maybe for those types it's still faster to just make the copy.

I would leave it like this for now, since partial_ratio is never super fast anyway. If we were to change it we should probably benchmark it first.

src/fuzz.rs

maxbachmann · 2024-12-11T18:59:55Z

src/fuzz.rs

+        assert_eq!(result.src_end, 3);
+        assert_eq!(result.dest_start, 1);
+        assert_eq!(result.dest_end, 4);
+    }


I think there are still a couple of other test in https://github.com/rapidfuzz/RapidFuzz/blob/main/tests/test_fuzz.py and https://github.com/rapidfuzz/rapidfuzz-cpp/blob/main/test/tests-fuzz.cpp that we could port over, but I didn't check this in depth so far.

ljnsn · 2024-12-12T00:34:52Z

Thanks for the comments! I've already taken a stab at improving typing and did the simplifications you suggested. Tests still to be extended.

maxbachmann · 2024-12-12T10:51:15Z

src/fuzz.rs

+/**
+implementation of partial_ratio for needles <= 64. assumes s1 is already the
+shorter string
+*/


It assumes len(s1) <= len(s2) not len(s1) < len(s2). I fixed the comment in the Python version as well.

The implementation should be based on the C++ implementation you can fine here: https://github.com/rapidfuzz/rapidfuzz-cpp/blob/c6a3ac87c42ddf52f502dc3ed7001c8c2cefb900/rapidfuzz/fuzz_impl.hpp#L68

The Python implementation only skips windows if they can't exist since a character is unique. The C++ implementation makes use of the fact that a score can only change by a certain distance per shift of the window to skip windows. An example would be:

"aaa" <-> "bcdef"

Here the alignments with the full length are:

"aaa" <-> "bcd"

"aaa" <-> "cde"

"aaa" <-> "def"

Now if we first calculate the distance for the alignments 1 and 3 both of the have a indel distance of 6. That way we know that alignment 2 can at best have an indel distance of 4.

Looking at the C++ implementation I feel like the actual min_score calculation in there is incorrect.. I believe instead of:

/* half of the cells that are not needed for known_edits can lead to a better score */ ptrdiff_t min_score = static_cast<ptrdiff_t>(std::min(scores[window.first], scores[window.second])) - static_cast<ptrdiff_t>(cell_diff + known_edits / 2);

this should be

/* half of the cells that are not needed for known_edits can lead to a better score */ size_t max_score_improvement = (cell_diff - known_edits / 2) / 2 * 2; ptrdiff_t min_score = static_cast<ptrdiff_t>(std::min(scores[window.first], scores[window.second])) - static_cast<ptrdiff_t>(max_score_improvement);

The current implementation doesn't lead to incorrect results but allows skipping less cells than we really could.

I improved the calculation of min_score in rapidfuzz-cpp to skip more alignments

ljnsn added 11 commits December 8, 2024 19:10

add score alignment struct

4c3d5f3

partial_ratio v1

83c1aab

use batch comparator for partial ratio impl

f5c87d8

change short needle name

e8900f4

fix wrong call

5f8aae9

unwrap to zero

9d65c4b

add tests

e23005a

collect into vec for perf

d98e40a

remove useless reconstruct

77fb71b

minor niceness improvement

b1f61bb

don't multiply score by 100

89f6094

maxbachmann requested changes Dec 11, 2024

View reviewed changes

ljnsn added 3 commits December 12, 2024 01:28

add alignment output type

158e579

derive clone and copy for score alignment

ef1e2be

simplify code and improve typing

e352abf

maxbachmann reviewed Dec 12, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add partial ratio #10

Add partial ratio #10

ljnsn commented Dec 10, 2024

maxbachmann left a comment

maxbachmann Dec 11, 2024 •

edited

Loading

maxbachmann Dec 11, 2024

maxbachmann Dec 11, 2024

maxbachmann Dec 11, 2024 •

edited

Loading

maxbachmann Dec 11, 2024

ljnsn commented Dec 12, 2024

maxbachmann Dec 12, 2024 •

edited

Loading

maxbachmann Dec 14, 2024

Add partial ratio #10

Are you sure you want to change the base?

Add partial ratio #10

Conversation

ljnsn commented Dec 10, 2024

maxbachmann left a comment

Choose a reason for hiding this comment

maxbachmann Dec 11, 2024 • edited Loading

Choose a reason for hiding this comment

maxbachmann Dec 11, 2024

Choose a reason for hiding this comment

maxbachmann Dec 11, 2024

Choose a reason for hiding this comment

maxbachmann Dec 11, 2024 • edited Loading

Choose a reason for hiding this comment

maxbachmann Dec 11, 2024

Choose a reason for hiding this comment

ljnsn commented Dec 12, 2024

maxbachmann Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

maxbachmann Dec 14, 2024

Choose a reason for hiding this comment

maxbachmann Dec 11, 2024 •

edited

Loading

maxbachmann Dec 11, 2024 •

edited

Loading

maxbachmann Dec 12, 2024 •

edited

Loading