
Alignment issues after #986 (support timestamp for numbers) #1016

Open
MarkMLCode opened this issue Jan 23, 2025 · 4 comments · May be fixed by #1019
Comments

@MarkMLCode

It seems that the change recently implemented in #986 (support timestamp for numbers) causes issues with the alignment of the last word in a segment. Whenever there is noise at the end of the file, the entire span between the last word and the noise now seems to be attributed to the last word (about a second in my test). This even places the end of the word after the total duration of the file. In fact, I've noticed it doing this even when the file has no noise at the end (that is, the last word ends slightly after the file's total duration).

File used: test.wav
test sound file.zip

File duration: 7.8 s
Last word detected (before the change): {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}
Last word detected (after the change): {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}

Transcribe function results:
{'segments': [{'text': ' I focus my energy, aiming a forceful magic missile at the remaining spectres and try to send them reeling back,', 'start': 0.031, 'end': 7.827}], 'language': 'en'}

Align (before the change):

[{'word': 'I', 'start': 0.131, 'end': 0.232, 'score': 0.822}, {'word': 'focus', 'start': 0.373, 'end': 0.795, 'score': 0.821}, {'word': 'my', 'start': 0.835, 'end': 0.975, 'score': 0.953}, {'word': 'energy,', 'start': 1.136, 'end': 1.558, 'score': 0.72}, {'word': 'aiming', 'start': 1.98, 'end': 2.241, 'score': 0.857}, {'word': 'a', 'start': 2.301, 'end': 2.342, 'score': 0.5}, {'word': 'forceful', 'start': 2.442, 'end': 2.904, 'score': 0.893}, {'word': 'magic', 'start': 2.965, 'end': 3.286, 'score': 0.999}, {'word': 'missile', 'start': 3.366, 'end': 3.688, 'score': 0.748}, {'word': 'at', 'start': 3.748, 'end': 3.829, 'score': 0.744}, {'word': 'the', 'start': 3.849, 'end': 3.929, 'score': 0.825}, {'word': 'remaining', 'start': 3.969, 'end': 4.371, 'score': 0.719}, {'word': 'spectres', 'start': 4.431, 'end': 4.934, 'score': 0.851}, {'word': 'and', 'start': 5.295, 'end': 5.396, 'score': 0.817}, {'word': 'try', 'start': 5.456, 'end': 5.717, 'score': 0.937}, {'word': 'to', 'start': 5.757, 'end': 5.818, 'score': 0.777}, {'word': 'send', 'start': 5.878, 'end': 6.099, 'score': 0.89}, {'word': 'them', 'start': 6.119, 'end': 6.28, 'score': 0.865}, {'word': 'reeling', 'start': 6.421, 'end': 6.822, 'score': 0.925}, {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}]

Align (after the change):

[{'word': 'I', 'start': 0.031, 'end': 0.373, 'score': 0.92}, {'word': 'focus', 'start': 0.394, 'end': 0.796, 'score': 0.846}, {'word': 'my', 'start': 0.857, 'end': 0.998, 'score': 0.953}, {'word': 'energy,', 'start': 1.159, 'end': 1.562, 'score': 0.831}, {'word': 'aiming', 'start': 1.602, 'end': 2.247, 'score': 0.912}, {'word': 'a', 'start': 2.327, 'end': 2.348, 'score': 0.997}, {'word': 'forceful', 'start': 2.469, 'end': 2.932, 'score': 0.88}, {'word': 'magic', 'start': 2.992, 'end': 3.315, 'score': 0.999}, {'word': 'missile', 'start': 3.395, 'end': 3.717, 'score': 0.83}, {'word': 'at', 'start': 3.778, 'end': 3.838, 'score': 0.865}, {'word': 'the', 'start': 3.858, 'end': 3.939, 'score': 0.826}, {'word': 'remaining', 'start': 4.0, 'end': 4.402, 'score': 0.749}, {'word': 'spectres', 'start': 4.463, 'end': 4.946, 'score': 0.823}, {'word': 'and', 'start': 5.329, 'end': 5.41, 'score': 0.983}, {'word': 'try', 'start': 5.49, 'end': 5.752, 'score': 0.937}, {'word': 'to', 'start': 5.792, 'end': 5.833, 'score': 0.936}, {'word': 'send', 'start': 5.913, 'end': 6.115, 'score': 0.971}, {'word': 'them', 'start': 6.155, 'end': 6.296, 'score': 0.947}, {'word': 'reeling', 'start': 6.336, 'end': 6.84, 'score': 0.941}, {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}]

I cloned the project at tag v3.3.1 and tested it with and without the fix. I also tried reducing the changes made to alignment.py to a minimum to pinpoint the issue. The problem occurs even when only the changes to the get_trellis and backtrack functions are applied, so it seems to lie there. I haven't been able to tell exactly what is causing such a discrepancy.

Minimal changes branch: https://github.com/MarkMLCode/whisperX/tree/minimal-changes
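A minimal check for this symptom can be sketched as follows (a hypothetical helper, not part of WhisperX; it assumes the WhisperX-style word dicts with 'word', 'start', and 'end' keys in seconds, as shown in the outputs above):

```python
# Sketch: flag aligned words whose end time runs past the audio duration.
# `words` follows the WhisperX word-dict format shown above; `tol` allows
# a small tolerance for rounding in the reported duration.
def words_past_duration(words, duration, tol=0.0):
    """Return the words whose 'end' exceeds the file duration by more than tol seconds."""
    return [w for w in words if w["end"] > duration + tol]

# The "after the change" last word from the report ends at 7.847 s,
# past the 7.8 s file duration, so it gets flagged.
after = [{"word": "back,", "start": 6.941, "end": 7.847, "score": 0.966}]
print(words_past_duration(after, 7.8))
```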

@bfs18
Contributor

bfs18 commented Jan 25, 2025

Hi @MarkMLCode , thank you for your feedback. The updated versions of get_trellis and backtrack are adapted from the PyTorch Audio Forced Alignment Tutorial. The original code in the repository was based on an older version of the implementation. From my perspective, the new version is more accurate when implementing dynamic programming.
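For reference, the dynamic program from that tutorial can be sketched roughly like this (a NumPy paraphrase for illustration, not the actual whisperx/alignment.py code; `emission` is a (num_frames, num_labels) matrix of log-probabilities and `blank_id` is the CTC blank label):

```python
# Sketch of the trellis/backtrack scheme from the torchaudio forced-alignment
# tutorial, paraphrased in NumPy (NOT the actual whisperx/alignment.py code).
import numpy as np

def get_trellis(emission, tokens, blank_id=0):
    num_frames, _ = emission.shape
    num_tokens = len(tokens)
    trellis = np.zeros((num_frames, num_tokens))
    # Score of emitting only blanks before the first token.
    trellis[1:, 0] = np.cumsum(emission[1:, blank_id])
    # No tokens can have been consumed at frame 0.
    trellis[0, 1:] = -np.inf
    for t in range(num_frames - 1):
        trellis[t + 1, 1:] = np.maximum(
            trellis[t, 1:] + emission[t, blank_id],     # stay (emit blank)
            trellis[t, :-1] + emission[t, tokens[1:]],  # advance to next token
        )
    return trellis

def backtrack(trellis, emission, tokens, blank_id=0):
    # Walk back from the best final cell, choosing at each frame whether the
    # path stayed on the current token or advanced from the previous one.
    # (The tutorial version also raises on failure; omitted here for brevity.)
    t, j = trellis.shape[0] - 1, trellis.shape[1] - 1
    path = [(j, t)]
    while j > 0:
        stayed = trellis[t - 1, j] + emission[t - 1, blank_id]
        changed = trellis[t - 1, j - 1] + emission[t - 1, tokens[j]]
        t -= 1
        if changed > stayed:
            j -= 1
        path.append((j, t))
    while t > 0:
        t -= 1
        path.append((j, t))
    return path[::-1]
```

The key difference from older variants is how the boundary rows and columns of the trellis are initialized, which affects where the path is allowed to start and end.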

@MarkMLCode
Author

Hello @bfs18. While I haven't done exhaustive tests, in my experience the resulting alignment seems worse than before. The reason I noticed any change in the first place is that after updating WhisperX a few days ago, a feature that worked quite well before (detecting noise at the end of sentences to cut it out) basically stopped working. It's quite possible that the word alignments are better in some cases, but I cannot personally justify using the latest changes as they stand.

Maybe more tests should be done to compare the alignments before and after the change and determine which is better overall? As my example shows, the word alignments differ significantly throughout, not just for the last word, and it is hard to tell objectively whether they are better or worse. Alternatively, the new implementation could be made optional if it performs better in some cases but not all.
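One crude way to quantify such a comparison can be sketched as follows (a hypothetical helper, assuming both runs aligned the same word sequence, as in the two lists above):

```python
# Sketch: mean absolute shift (in seconds) of word boundaries between two
# alignment runs over the same transcript. Both inputs are lists of
# WhisperX-style word dicts with 'word', 'start', and 'end' keys.
def mean_boundary_shift(before, after):
    """Average |start diff| and |end diff| per boundary across matching words."""
    assert [w["word"] for w in before] == [w["word"] for w in after]
    shifts = [abs(a["start"] - b["start"]) + abs(a["end"] - b["end"])
              for b, a in zip(before, after)]
    # Two boundaries (start and end) per word.
    return sum(shifts) / (2 * len(shifts))
```

Against a reference alignment (hand-checked or from a trusted tool), the version with the smaller shift would be the better one; between the two runs alone, this only measures how much they disagree.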

@bfs18
Contributor

bfs18 commented Jan 26, 2025

Fixed in pull request #1019.

With the added lines, the result is:

[{'start': 0.031, 'end': 7.182, 'text': ' I focus my energy, aiming a forceful magic missile at the remaining specters, and try to send them reeling back.', 'words': [{'word': 'I', 'start': 0.031, 'end': 0.373, 'score': 0.92}, {'word': 'focus', 'start': 0.394, 'end': 0.796, 'score': 0.846}, {'word': 'my', 'start': 0.857, 'end': 0.998, 'score': 0.953}, {'word': 'energy,', 'start': 1.159, 'end': 1.542, 'score': 0.802}, {'word': 'aiming', 'start': 1.582, 'end': 2.247, 'score': 0.906}, {'word': 'a', 'start': 2.327, 'end': 2.348, 'score': 0.997}, {'word': 'forceful', 'start': 2.469, 'end': 2.932, 'score': 0.88}, {'word': 'magic', 'start': 2.992, 'end': 3.315, 'score': 0.999}, {'word': 'missile', 'start': 3.395, 'end': 3.717, 'score': 0.83}, {'word': 'at', 'start': 3.778, 'end': 3.838, 'score': 0.865}, {'word': 'the', 'start': 3.858, 'end': 3.939, 'score': 0.826}, {'word': 'remaining', 'start': 4.0, 'end': 4.402, 'score': 0.749}, {'word': 'specters,', 'start': 4.463, 'end': 4.946, 'score': 0.837}, {'word': 'and', 'start': 5.329, 'end': 5.41, 'score': 0.983}, {'word': 'try', 'start': 5.49, 'end': 5.752, 'score': 0.937}, {'word': 'to', 'start': 5.792, 'end': 5.833, 'score': 0.936}, {'word': 'send', 'start': 5.913, 'end': 6.115, 'score': 0.971}, {'word': 'them', 'start': 6.155, 'end': 6.296, 'score': 0.947}, {'word': 'reeling', 'start': 6.336, 'end': 6.84, 'score': 0.941}, {'word': 'back.', 'start': 6.941, 'end': 7.182, 'score': 0.999}]}]

@bfs18
Contributor

bfs18 commented Jan 26, 2025

Additionally, the previous implementation of get_trellis may fail to capture the final segment of the audio, even when using a wildcard. Specifically, in the following test case, the old version tends to omit the last few words of the audio.
--BhThOY2Ug_2.mp3
It is more appropriate to use the new get_trellis with head and tail word boundaries added to the text.
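The head/tail boundary idea can be sketched as follows (a hypothetical helper, not the actual PR code; `"|"` stands in for the word-delimiter token used by wav2vec2-style alignment dictionaries):

```python
# Sketch: wrap the transcript in explicit boundary tokens so the trellis has
# an anchor before the first word and after the last one. Trailing silence or
# noise can then attach to the tail boundary instead of stretching the last word.
def with_word_boundaries(text, boundary="|"):
    """Return the transcript framed by head and tail boundary tokens."""
    return f"{boundary}{text.strip()}{boundary}"
```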
