-
Notifications
You must be signed in to change notification settings - Fork 3
Feature request: stateful calculation to cut down allocations #1
Comments
Haven't been really active lately so sorry for late response, but sure sounds like an obvious optimization :) Would you like to share your benchmark code? Had a quick look at your link and the only obvious difference seems to be how |
Hmm I'm kinda interested in what you say about correctness too, would you like to share your code and results for that too please? |
Sorry for the late reply. Sure, the issue I uncovered is alextanhongpin/stringdist#3. My reference implementation is documented at https://michael.stapelberg.ch/posts/2012-09-22-member_id/ The code I’m using to compare is: // Script to assign a 6-digit identifier to each member. Each identifier has a
// levenshtein-damerau-distance ("edit distance") of 5 to every other existing
// identifier. Identifiers are represented by an alphabet of 16 characters which
// are valid in bank transfer subjects and have a low risk of being mistaken
// with a different character (e.g. O and 0 are not included).
package main
import (
"crypto/sha1"
"fmt"
"log"
"strconv"
"strings"
"github.com/alextanhongpin/stringdist"
tdl "github.com/lmas/Damerau-Levenshtein"
)
var okay = [16]byte{'A', 'C', 'D', 'E', 'H', 'K', 'L', 'P', 'T', 'W', 'X', 'Y', '3', '4', '7', '9'}
// Generate a 6-character identifier by using the first 6 characters of the
// SHA1-hash of the argument. Since hash functions generate a very different
// output for slightly different inputs, this will quickly result in identifiers
// with a big-enough levenshtein-damerau-distance.
func generateNum(src int) string {
dig := sha1.Sum([]byte(strconv.Itoa(src)))
var out [6]byte
out[0] = okay[dig[0]&0xF0>>4]
out[1] = okay[dig[0]&0x0F]
out[2] = okay[dig[1]&0xF0>>4]
out[3] = okay[dig[1]&0x0F]
out[4] = okay[dig[2]&0xF0>>4]
out[5] = okay[dig[2]&0x0F]
return fmt.Sprintf("%c%c%c%c%c%c",
out[0], out[1],
out[2], out[3],
out[4], out[5])
}
func minDistance(dl *stringdist.TrueDamerauLevenshtein, others []string) func(string) int {
o := make([]string, len(others))
for idx, num := range others {
o[idx] = strings.ReplaceAll(num, "-", "")
}
return func(num string) int {
min := len(num)
for _, other := range o {
dist := dl.Calculate(num, other)
odist := tdl.Distance(num, other)
if dist != odist {
log.Printf("params: (%v, %v), stringdist = %v, tdl = %v", num, other, dist, odist)
}
if dist < min {
min = dist
}
}
return min
}
}
func findIdentifier(src int, minDistance func(string) int) string {
for minDistance(generateNum(src)) < 5 {
src++
}
num := generateNum(src)
return num[:2] + "-" + num[2:4] + "-" + num[4:]
}
func main() {
var existing []string
dl := stringdist.NewTrueDamerauLevenshtein()
src := 0
for i := 0; i < 10; i++ {
fmt.Printf("Generating identifier %d\n", i)
minDist := minDistance(dl, existing)
for minDist(generateNum(src)) < 5 {
src++
}
num := generateNum(src)
existing = append(existing, num)
identifier := num[:2] + "-" + num[2:4] + "-" + num[4:]
fmt.Printf("Done, assigning %s\n", identifier)
}
} |
Alright thanks, that's a pretty interesting use case. Also gave a quick look at that other library and lotsa other implementations in other languages too, but everyone differs in various ways and it's really difficult finding a known good implementation or test results. I'm going to open a separate issue to track that deep dive into the rabbit hole, finish it and continue with your issue and optimize allocation usage and stuff. |
When doing many calculations, the excessive amount of memory allocations (346278609 in my benchmark) slows down the program.
Other packages (e.g. https://github.com/alextanhongpin/stringdist) offer an API to create state (allocating then) and finish much faster in my benchmark. Unfortunately, that package suffers from a correctness issue, which is why I was looking into yours :)
What do you think about this performance optimization?
Thanks,
The text was updated successfully, but these errors were encountered: