Skip to content
This repository has been archived by the owner on Aug 3, 2024. It is now read-only.

Feature request: stateful calculation to cut down allocations #1

Closed
stapelberg opened this issue Apr 2, 2020 · 4 comments
Closed
Labels
bug Something isn't working enhancement New feature or request

Comments

@stapelberg
Copy link

When doing many calculations, the excessive amount of memory allocations (346278609 in my benchmark) slows down the program.

Other packages (e.g. https://github.com/alextanhongpin/stringdist) offer an API to create state (allocating then) and finish much faster in my benchmark. Unfortunately, that package suffers from a correctness issue, which is why I was looking into yours :)

What do you think about this performance optimization?

Thanks,

@lmas
Copy link
Owner

lmas commented Apr 18, 2020

Haven't been really active lately so sorry for late response, but sure sounds like an obvious optimization :)

Would you like to share your benchmark code?

Had a quick look at your link and the only obvious difference seems to be how da is initiated in the Calculate() functions, is that the excessive allocation you're talking about?

@lmas lmas added bug Something isn't working enhancement New feature or request labels Apr 18, 2020
@lmas
Copy link
Owner

lmas commented Apr 18, 2020

Hmm I'm kinda interested in what you say about correctness too, would you like to share your code and results for that too please?

@stapelberg
Copy link
Author

Sorry for the late reply.

Sure, the issue I uncovered is alextanhongpin/stringdist#3.

My reference implementation is documented at https://michael.stapelberg.ch/posts/2012-09-22-member_id/

The code I’m using to compare is:

// Script to assign a 6-digit identifier to each member. Each identifier has a
// levenshtein-damerau-distance ("edit distance") of 5 to every other existing
// identifier. Identifiers are represented by an alphabet of 16 characters which
// are valid in bank transfer subjects and have a low risk of being mistaken
// with a different character (e.g. O and 0 are not included).
package main

import (
	"crypto/sha1"
	"fmt"
	"log"
	"strconv"
	"strings"

	"github.com/alextanhongpin/stringdist"
	tdl "github.com/lmas/Damerau-Levenshtein"
)

var okay = [16]byte{'A', 'C', 'D', 'E', 'H', 'K', 'L', 'P', 'T', 'W', 'X', 'Y', '3', '4', '7', '9'}

// Generate a 6-character identifier by using the first 6 characters of the
// SHA1-hash of the argument. Since hash functions generate a very different
// output for slightly different inputs, this will quickly result in identifiers
// with a big-enough levenshtein-damerau-distance.
func generateNum(src int) string {
	dig := sha1.Sum([]byte(strconv.Itoa(src)))
	var out [6]byte
	out[0] = okay[dig[0]&0xF0>>4]
	out[1] = okay[dig[0]&0x0F]

	out[2] = okay[dig[1]&0xF0>>4]
	out[3] = okay[dig[1]&0x0F]

	out[4] = okay[dig[2]&0xF0>>4]
	out[5] = okay[dig[2]&0x0F]
	return fmt.Sprintf("%c%c%c%c%c%c",
		out[0], out[1],
		out[2], out[3],
		out[4], out[5])
}

func minDistance(dl *stringdist.TrueDamerauLevenshtein, others []string) func(string) int {
	o := make([]string, len(others))
	for idx, num := range others {
		o[idx] = strings.ReplaceAll(num, "-", "")
	}
	return func(num string) int {
		min := len(num)
		for _, other := range o {
			dist := dl.Calculate(num, other)
			odist := tdl.Distance(num, other)
			if dist != odist {
				log.Printf("params: (%v, %v), stringdist = %v, tdl = %v", num, other, dist, odist)
			}
			if dist < min {
				min = dist
			}
		}
		return min
	}
}

func findIdentifier(src int, minDistance func(string) int) string {
	for minDistance(generateNum(src)) < 5 {
		src++
	}
	num := generateNum(src)
	return num[:2] + "-" + num[2:4] + "-" + num[4:]
}

func main() {
	var existing []string
	dl := stringdist.NewTrueDamerauLevenshtein()
	src := 0
	for i := 0; i < 10; i++ {
		fmt.Printf("Generating identifier %d\n", i)
		minDist := minDistance(dl, existing)
		for minDist(generateNum(src)) < 5 {
			src++
		}
		num := generateNum(src)
		existing = append(existing, num)
		identifier := num[:2] + "-" + num[2:4] + "-" + num[4:]
		fmt.Printf("Done, assigning %s\n", identifier)
	}
}

@lmas
Copy link
Owner

lmas commented May 16, 2020

Alright thanks, that's a pretty interesting use case.

Also gave a quick look at that other library and lotsa other implementations in other languages too, but everyone differs in various ways and it's really difficult finding a known good implementation or test results. I'm going to open a separate issue to track that deep dive into the rabbit hole, finish it and continue with your issue and optimize allocation usage and stuff.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants