
Alignment issues after #986 (support timestamp for numbers) #1016

Open
MarkMLCode opened this issue Jan 23, 2025 · 4 comments · May be fixed by #1019
Comments

@MarkMLCode

It seems that the change recently implemented in #986 (support timestamp for numbers) causes issues with the alignment of the last word in a segment. Whenever there is noise at the end of the file, the entire span between the last word and the noise now seems to be attributed to the last word (about a second in my test). This even places the end of the word after the total duration of the file. In fact, I've noticed it doing this even when the file has no noise at the end (that is, the last word ends slightly after the file's total duration).

File used: test.wav
test sound file.zip

File duration: 7.8 s
Last word detected (before the change): {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}
Last word detected (after the change): {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}

Transcribe function results:
{'segments': [{'text': ' I focus my energy, aiming a forceful magic missile at the remaining spectres and try to send them reeling back,', 'start': 0.031, 'end': 7.827}], 'language': 'en'}

Align (before the change):

[{'word': 'I', 'start': 0.131, 'end': 0.232, 'score': 0.822}, {'word': 'focus', 'start': 0.373, 'end': 0.795, 'score': 0.821}, {'word': 'my', 'start': 0.835, 'end': 0.975, 'score': 0.953}, {'word': 'energy,', 'start': 1.136, 'end': 1.558, 'score': 0.72}, {'word': 'aiming', 'start': 1.98, 'end': 2.241, 'score': 0.857}, {'word': 'a', 'start': 2.301, 'end': 2.342, 'score': 0.5}, {'word': 'forceful', 'start': 2.442, 'end': 2.904, 'score': 0.893}, {'word': 'magic', 'start': 2.965, 'end': 3.286, 'score': 0.999}, {'word': 'missile', 'start': 3.366, 'end': 3.688, 'score': 0.748}, {'word': 'at', 'start': 3.748, 'end': 3.829, 'score': 0.744}, {'word': 'the', 'start': 3.849, 'end': 3.929, 'score': 0.825}, {'word': 'remaining', 'start': 3.969, 'end': 4.371, 'score': 0.719}, {'word': 'spectres', 'start': 4.431, 'end': 4.934, 'score': 0.851}, {'word': 'and', 'start': 5.295, 'end': 5.396, 'score': 0.817}, {'word': 'try', 'start': 5.456, 'end': 5.717, 'score': 0.937}, {'word': 'to', 'start': 5.757, 'end': 5.818, 'score': 0.777}, {'word': 'send', 'start': 5.878, 'end': 6.099, 'score': 0.89}, {'word': 'them', 'start': 6.119, 'end': 6.28, 'score': 0.865}, {'word': 'reeling', 'start': 6.421, 'end': 6.822, 'score': 0.925}, {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}]

Align (after the change):

[{'word': 'I', 'start': 0.031, 'end': 0.373, 'score': 0.92}, {'word': 'focus', 'start': 0.394, 'end': 0.796, 'score': 0.846}, {'word': 'my', 'start': 0.857, 'end': 0.998, 'score': 0.953}, {'word': 'energy,', 'start': 1.159, 'end': 1.562, 'score': 0.831}, {'word': 'aiming', 'start': 1.602, 'end': 2.247, 'score': 0.912}, {'word': 'a', 'start': 2.327, 'end': 2.348, 'score': 0.997}, {'word': 'forceful', 'start': 2.469, 'end': 2.932, 'score': 0.88}, {'word': 'magic', 'start': 2.992, 'end': 3.315, 'score': 0.999}, {'word': 'missile', 'start': 3.395, 'end': 3.717, 'score': 0.83}, {'word': 'at', 'start': 3.778, 'end': 3.838, 'score': 0.865}, {'word': 'the', 'start': 3.858, 'end': 3.939, 'score': 0.826}, {'word': 'remaining', 'start': 4.0, 'end': 4.402, 'score': 0.749}, {'word': 'spectres', 'start': 4.463, 'end': 4.946, 'score': 0.823}, {'word': 'and', 'start': 5.329, 'end': 5.41, 'score': 0.983}, {'word': 'try', 'start': 5.49, 'end': 5.752, 'score': 0.937}, {'word': 'to', 'start': 5.792, 'end': 5.833, 'score': 0.936}, {'word': 'send', 'start': 5.913, 'end': 6.115, 'score': 0.971}, {'word': 'them', 'start': 6.155, 'end': 6.296, 'score': 0.947}, {'word': 'reeling', 'start': 6.336, 'end': 6.84, 'score': 0.941}, {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}]

I cloned the project at tag v3.3.1 and tested it with and without the fix. I also tried reducing the changes made to alignment.py to a minimum to pinpoint the issue. The problem occurs even when only the changes to the get_trellis and backtrack functions are applied, so it seems to lie there. I haven't been able to tell exactly what is causing such a discrepancy.

Minimal changes branch: https://github.com/MarkMLCode/whisperX/tree/minimal-changes
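A minimal check for this symptom can be sketched as follows (a hypothetical helper, not part of WhisperX; it assumes the WhisperX-style word dicts with 'word', 'start', and 'end' keys in seconds, as shown in the outputs above):

```python
# Sketch: flag aligned words whose end time runs past the audio duration.
# `words` follows the WhisperX word-dict format shown above; `tol` allows
# a small tolerance for rounding in the reported duration.
def words_past_duration(words, duration, tol=0.0):
    """Return the words whose 'end' exceeds the file duration by more than tol seconds."""
    return [w for w in words if w["end"] > duration + tol]

# The "after the change" last word from the report ends at 7.847 s,
# past the 7.8 s file duration, so it gets flagged.
after = [{"word": "back,", "start": 6.941, "end": 7.847, "score": 0.966}]
print(words_past_duration(after, 7.8))
```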

@bfs18
Contributor

bfs18 commented Jan 25, 2025

Hi @MarkMLCode , thank you for your feedback. The updated versions of get_trellis and backtrack are adapted from the PyTorch Audio Forced Alignment Tutorial. The original code in the repository was based on an older version of the implementation. From my perspective, the new version is more accurate when implementing dynamic programming.
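For reference, the dynamic program from that tutorial can be sketched roughly like this (a NumPy paraphrase for illustration, not the actual whisperx/alignment.py code; `emission` is a (num_frames, num_labels) matrix of log-probabilities and `blank_id` is the CTC blank label):

```python
# Sketch of the trellis/backtrack scheme from the torchaudio forced-alignment
# tutorial, paraphrased in NumPy (NOT the actual whisperx/alignment.py code).
import numpy as np

def get_trellis(emission, tokens, blank_id=0):
    num_frames, _ = emission.shape
    num_tokens = len(tokens)
    trellis = np.zeros((num_frames, num_tokens))
    # Score of emitting only blanks before the first token.
    trellis[1:, 0] = np.cumsum(emission[1:, blank_id])
    # No tokens can have been consumed at frame 0.
    trellis[0, 1:] = -np.inf
    for t in range(num_frames - 1):
        trellis[t + 1, 1:] = np.maximum(
            trellis[t, 1:] + emission[t, blank_id],     # stay (emit blank)
            trellis[t, :-1] + emission[t, tokens[1:]],  # advance to next token
        )
    return trellis

def backtrack(trellis, emission, tokens, blank_id=0):
    # Walk back from the best final cell, choosing at each frame whether the
    # path stayed on the current token or advanced from the previous one.
    # (The tutorial version also raises on failure; omitted here for brevity.)
    t, j = trellis.shape[0] - 1, trellis.shape[1] - 1
    path = [(j, t)]
    while j > 0:
        stayed = trellis[t - 1, j] + emission[t - 1, blank_id]
        changed = trellis[t - 1, j - 1] + emission[t - 1, tokens[j]]
        t -= 1
        if changed > stayed:
            j -= 1
        path.append((j, t))
    while t > 0:
        t -= 1
        path.append((j, t))
    return path[::-1]
```

The key difference from older variants is how the boundary rows and columns of the trellis are initialized, which affects where the path is allowed to start and end.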

@MarkMLCode
Author

Hello @bfs18. While I haven't done exhaustive tests, in my experience the resulting alignment seems worse than before. The reason I noticed any change in the first place is that after updating WhisperX a few days ago, a feature that worked quite well before (detecting noise at the end of sentences to cut it out) basically stopped working. It's quite possible that the word alignments are better in some cases, but I cannot personally justify using the latest changes as they stand.

Maybe more tests should be done to compare the alignments before and after the change and determine which is better overall? As my example shows, the word alignments differ significantly throughout, not just for the last word, and it is hard to tell objectively whether they are better or worse. Alternatively, the new implementation could be made optional if it performs better in some cases but not all.
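One crude way to quantify such a comparison can be sketched as follows (a hypothetical helper, assuming both runs aligned the same word sequence, as in the two lists above):

```python
# Sketch: mean absolute shift (in seconds) of word boundaries between two
# alignment runs over the same transcript. Both inputs are lists of
# WhisperX-style word dicts with 'word', 'start', and 'end' keys.
def mean_boundary_shift(before, after):
    """Average |start diff| and |end diff| per boundary across matching words."""
    assert [w["word"] for w in before] == [w["word"] for w in after]
    shifts = [abs(a["start"] - b["start"]) + abs(a["end"] - b["end"])
              for b, a in zip(before, after)]
    # Two boundaries (start and end) per word.
    return sum(shifts) / (2 * len(shifts))
```

Against a reference alignment (hand-checked or from a trusted tool), the version with the smaller shift would be the better one; between the two runs alone, this only measures how much they disagree.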

@bfs18
Contributor

bfs18 commented Jan 26, 2025

Fixed in pull request #1019.

With the added lines, the result is:

[{'start': 0.031, 'end': 7.182, 'text': ' I focus my energy, aiming a forceful magic missile at the remaining specters, and try to send them reeling back.', 'words': [{'word': 'I', 'start': 0.031, 'end': 0.373, 'score': 0.92}, {'word': 'focus', 'start': 0.394, 'end': 0.796, 'score': 0.846}, {'word': 'my', 'start': 0.857, 'end': 0.998, 'score': 0.953}, {'word': 'energy,', 'start': 1.159, 'end': 1.542, 'score': 0.802}, {'word': 'aiming', 'start': 1.582, 'end': 2.247, 'score': 0.906}, {'word': 'a', 'start': 2.327, 'end': 2.348, 'score': 0.997}, {'word': 'forceful', 'start': 2.469, 'end': 2.932, 'score': 0.88}, {'word': 'magic', 'start': 2.992, 'end': 3.315, 'score': 0.999}, {'word': 'missile', 'start': 3.395, 'end': 3.717, 'score': 0.83}, {'word': 'at', 'start': 3.778, 'end': 3.838, 'score': 0.865}, {'word': 'the', 'start': 3.858, 'end': 3.939, 'score': 0.826}, {'word': 'remaining', 'start': 4.0, 'end': 4.402, 'score': 0.749}, {'word': 'specters,', 'start': 4.463, 'end': 4.946, 'score': 0.837}, {'word': 'and', 'start': 5.329, 'end': 5.41, 'score': 0.983}, {'word': 'try', 'start': 5.49, 'end': 5.752, 'score': 0.937}, {'word': 'to', 'start': 5.792, 'end': 5.833, 'score': 0.936}, {'word': 'send', 'start': 5.913, 'end': 6.115, 'score': 0.971}, {'word': 'them', 'start': 6.155, 'end': 6.296, 'score': 0.947}, {'word': 'reeling', 'start': 6.336, 'end': 6.84, 'score': 0.941}, {'word': 'back.', 'start': 6.941, 'end': 7.182, 'score': 0.999}]}]

@bfs18
Contributor

bfs18 commented Jan 26, 2025

Additionally, the previous implementation of get_trellis may fail to capture the final segment of the audio, even when using a wildcard. Specifically, in the following test case, the old version tends to omit the last few words of the audio.
--BhThOY2Ug_2.mp3
It is more appropriate to use the new get_trellis with head and tail word boundaries added to the text.
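The head/tail boundary idea can be sketched as follows (a hypothetical helper, not the actual PR code; `"|"` stands in for the word-delimiter token used by wav2vec2-style alignment dictionaries):

```python
# Sketch: wrap the transcript in explicit boundary tokens so the trellis has
# an anchor before the first word and after the last one. Trailing silence or
# noise can then attach to the tail boundary instead of stretching the last word.
def with_word_boundaries(text, boundary="|"):
    """Return the transcript framed by head and tail boundary tokens."""
    return f"{boundary}{text.strip()}{boundary}"
```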
