Alignment issues after #986 (support timestamp for numbers) #1016
Comments
Hi @MarkMLCode, thank you for your feedback. The updated versions of get_trellis and backtrack are adapted from the PyTorch Audio Forced Alignment Tutorial. The original code in the repository was based on an older version of that implementation. From my perspective, the new version implements the dynamic programming more accurately.
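For anyone following along who hasn't read the tutorial, here is a minimal pure-Python sketch of the kind of trellis recurrence under discussion. It assumes log-probability emissions and a blank class at index 0. `build_trellis` and its parameters are names made up for this illustration; this is not the WhisperX or torchaudio code, only the general shape of the dynamic program.

```python
import math


def build_trellis(emission, tokens, blank_id=0):
    """Illustrative forced-alignment trellis (hypothetical helper).

    emission[t][c] is the log-probability of class c at frame t.
    tokens is the list of target label ids to align.
    trellis[t][j] is the best log-score of having consumed tokens[:j]
    after t frames.
    """
    num_frames, num_tokens = len(emission), len(tokens)
    trellis = [[-math.inf] * (num_tokens + 1) for _ in range(num_frames + 1)]
    trellis[0][0] = 0.0  # nothing consumed before the first frame

    for t in range(num_frames):
        for j in range(num_tokens + 1):
            # Stay on the same token: this frame is scored as blank.
            stay = trellis[t][j] + emission[t][blank_id]
            # Advance: this frame emits tokens[j - 1].
            advance = -math.inf
            if j > 0:
                advance = trellis[t][j - 1] + emission[t][tokens[j - 1]]
            trellis[t + 1][j] = max(stay, advance)
    return trellis


# Toy input: 3 frames, 2 classes (0 = blank, 1 = the only target token).
toy_emission = [[-1.0, -2.0], [-3.0, -0.5], [-0.1, -4.0]]
toy_trellis = build_trellis(toy_emission, tokens=[1])
for row in toy_trellis:
    print(row)
```

Backtracking would then walk this table from the last frame to the first to recover which frames belong to which token. How the final frames are scored and where the walk starts is the kind of detail that could shift the end time of the last word.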
Hello @bfs18. While I haven't done exhaustive tests, in my experience the resulting alignment detection seems worse than it was before. The reason I noticed any change in the first place is that after updating WhisperX a few days ago, a feature that worked quite well before (detecting noise at the end of sentences to cut it out) basically stopped working. It's quite possible that the word alignments are better in some cases, but I cannot personally justify using the latest changes as they stand. Maybe some more tests should be done to compare the alignments before and after the change and determine which is better overall? As can be seen in my example, the word alignments between the two versions differ significantly throughout, not just for the last word. Whether the new ones are better or worse is hard to tell objectively. Alternatively, the new implementation could be made optional if it actually performs better in some cases, but not all.
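A before/after comparison like the one suggested above could start from something as simple as a per-word delta. This is a rough sketch, assuming both alignments contain the same words in the same order; `compare_alignments` is a hypothetical helper, not part of WhisperX. The sample entries are the last three words from the two alignment dumps in this issue.

```python
def compare_alignments(before, after):
    """Return (word, start_delta, end_delta) per word, with delta = after - before."""
    if len(before) != len(after):
        raise ValueError("alignments have different word counts")
    rows = []
    for b, a in zip(before, after):
        rows.append((
            b["word"],
            round(a["start"] - b["start"], 3),
            round(a["end"] - b["end"], 3),
        ))
    return rows


# Last three words of the "before" and "after" alignments reported in this issue.
before = [
    {"word": "them", "start": 6.119, "end": 6.28, "score": 0.865},
    {"word": "reeling", "start": 6.421, "end": 6.822, "score": 0.925},
    {"word": "back,", "start": 6.903, "end": 7.104, "score": 1.0},
]
after = [
    {"word": "them", "start": 6.155, "end": 6.296, "score": 0.947},
    {"word": "reeling", "start": 6.336, "end": 6.84, "score": 0.941},
    {"word": "back,", "start": 6.941, "end": 7.847, "score": 0.966},
]

for word, d_start, d_end in compare_alignments(before, after):
    print(f"{word:10s} start {d_start:+.3f}  end {d_end:+.3f}")
```

Deltas alone don't say which version is closer to the truth; that would need hand-labelled reference timestamps. They do make it easy to see where the two versions disagree most.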
Fixed in pull request #1019. With the added lines, the result is: [{'start': 0.031, 'end': 7.182, 'text': ' I focus my energy, aiming a forceful magic missile at the remaining specters, and try to send them reeling back.', 'words': [{'word': 'I', 'start': 0.031, 'end': 0.373, 'score': 0.92}, {'word': 'focus', 'start': 0.394, 'end': 0.796, 'score': 0.846}, {'word': 'my', 'start': 0.857, 'end': 0.998, 'score': 0.953}, {'word': 'energy,', 'start': 1.159, 'end': 1.542, 'score': 0.802}, {'word': 'aiming', 'start': 1.582, 'end': 2.247, 'score': 0.906}, {'word': 'a', 'start': 2.327, 'end': 2.348, 'score': 0.997}, {'word': 'forceful', 'start': 2.469, 'end': 2.932, 'score': 0.88}, {'word': 'magic', 'start': 2.992, 'end': 3.315, 'score': 0.999}, {'word': 'missile', 'start': 3.395, 'end': 3.717, 'score': 0.83}, {'word': 'at', 'start': 3.778, 'end': 3.838, 'score': 0.865}, {'word': 'the', 'start': 3.858, 'end': 3.939, 'score': 0.826}, {'word': 'remaining', 'start': 4.0, 'end': 4.402, 'score': 0.749}, {'word': 'specters,', 'start': 4.463, 'end': 4.946, 'score': 0.837}, {'word': 'and', 'start': 5.329, 'end': 5.41, 'score': 0.983}, {'word': 'try', 'start': 5.49, 'end': 5.752, 'score': 0.937}, {'word': 'to', 'start': 5.792, 'end': 5.833, 'score': 0.936}, {'word': 'send', 'start': 5.913, 'end': 6.115, 'score': 0.971}, {'word': 'them', 'start': 6.155, 'end': 6.296, 'score': 0.947}, {'word': 'reeling', 'start': 6.336, 'end': 6.84, 'score': 0.941}, {'word': 'back.', 'start': 6.941, 'end': 7.182, 'score': 0.999}]}]
Additionally, the previous implementation of get_trellis may fail to capture the final segment of the audio, even when using a wildcard. Specifically, in the following test case, the old version tends to omit the last few words of the audio.
It seems that the change recently implemented in #986 (support timestamp for numbers) causes issues with the alignment of the last word in a segment. Whenever there is a sound at the end of the file, the entire span between the last word and the noise is now detected as part of the last word (about a second in my test). This even places the end of the word after the total duration of the file. In fact, I've noticed this even when the file has no noise at the end: the last word's end is detected slightly past the end of the file.
File used: test.wav
test sound file.zip
File duration: 7.8 s
Last word detected in the list (before the change): {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}
Last word detected in the list (after the change): {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}
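As a quick sanity check for this symptom, one could flag any word whose end timestamp runs past the file duration, and optionally clamp it. This is only a sketch of a workaround, not a fix for get_trellis or backtrack. `words_past_duration` and `clamp_to_duration` are hypothetical helpers, and the 7.8 figure is the reported (likely rounded) duration, assumed to be in seconds.

```python
def words_past_duration(words, duration):
    """Return the words whose end timestamp exceeds the audio duration."""
    return [w for w in words if w["end"] > duration]


def clamp_to_duration(words, duration):
    """Return copies of the words with start/end capped at the audio duration."""
    clamped = []
    for w in words:
        end = min(w["end"], duration)
        start = min(w["start"], end)
        clamped.append({**w, "start": start, "end": end})
    return clamped


FILE_DURATION = 7.8  # as reported in this issue; assumed to be seconds

# Last word as reported before and after the change.
before_change = [{"word": "back,", "start": 6.903, "end": 7.104, "score": 1.0}]
after_change = [{"word": "back,", "start": 6.941, "end": 7.847, "score": 0.966}]

for label, words in (("before", before_change), ("after", after_change)):
    offenders = words_past_duration(words, FILE_DURATION)
    print(label, [w["word"] for w in offenders])
```

Clamping hides the overflow but does not recover the true word boundary, so it would not restore the trailing-noise detection described above.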
Transcribe function results:
{'segments': [{'text': ' I focus my energy, aiming a forceful magic missile at the remaining spectres and try to send them reeling back,', 'start': 0.031, 'end': 7.827}], 'language': 'en'}
Align (before the change):
[{'word': 'I', 'start': 0.131, 'end': 0.232, 'score': 0.822}, {'word': 'focus', 'start': 0.373, 'end': 0.795, 'score': 0.821}, {'word': 'my', 'start': 0.835, 'end': 0.975, 'score': 0.953}, {'word': 'energy,', 'start': 1.136, 'end': 1.558, 'score': 0.72}, {'word': 'aiming', 'start': 1.98, 'end': 2.241, 'score': 0.857}, {'word': 'a', 'start': 2.301, 'end': 2.342, 'score': 0.5}, {'word': 'forceful', 'start': 2.442, 'end': 2.904, 'score': 0.893}, {'word': 'magic', 'start': 2.965, 'end': 3.286, 'score': 0.999}, {'word': 'missile', 'start': 3.366, 'end': 3.688, 'score': 0.748}, {'word': 'at', 'start': 3.748, 'end': 3.829, 'score': 0.744}, {'word': 'the', 'start': 3.849, 'end': 3.929, 'score': 0.825}, {'word': 'remaining', 'start': 3.969, 'end': 4.371, 'score': 0.719}, {'word': 'spectres', 'start': 4.431, 'end': 4.934, 'score': 0.851}, {'word': 'and', 'start': 5.295, 'end': 5.396, 'score': 0.817}, {'word': 'try', 'start': 5.456, 'end': 5.717, 'score': 0.937}, {'word': 'to', 'start': 5.757, 'end': 5.818, 'score': 0.777}, {'word': 'send', 'start': 5.878, 'end': 6.099, 'score': 0.89}, {'word': 'them', 'start': 6.119, 'end': 6.28, 'score': 0.865}, {'word': 'reeling', 'start': 6.421, 'end': 6.822, 'score': 0.925}, {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}]
Align (after the change):
[{'word': 'I', 'start': 0.031, 'end': 0.373, 'score': 0.92}, {'word': 'focus', 'start': 0.394, 'end': 0.796, 'score': 0.846}, {'word': 'my', 'start': 0.857, 'end': 0.998, 'score': 0.953}, {'word': 'energy,', 'start': 1.159, 'end': 1.562, 'score': 0.831}, {'word': 'aiming', 'start': 1.602, 'end': 2.247, 'score': 0.912}, {'word': 'a', 'start': 2.327, 'end': 2.348, 'score': 0.997}, {'word': 'forceful', 'start': 2.469, 'end': 2.932, 'score': 0.88}, {'word': 'magic', 'start': 2.992, 'end': 3.315, 'score': 0.999}, {'word': 'missile', 'start': 3.395, 'end': 3.717, 'score': 0.83}, {'word': 'at', 'start': 3.778, 'end': 3.838, 'score': 0.865}, {'word': 'the', 'start': 3.858, 'end': 3.939, 'score': 0.826}, {'word': 'remaining', 'start': 4.0, 'end': 4.402, 'score': 0.749}, {'word': 'spectres', 'start': 4.463, 'end': 4.946, 'score': 0.823}, {'word': 'and', 'start': 5.329, 'end': 5.41, 'score': 0.983}, {'word': 'try', 'start': 5.49, 'end': 5.752, 'score': 0.937}, {'word': 'to', 'start': 5.792, 'end': 5.833, 'score': 0.936}, {'word': 'send', 'start': 5.913, 'end': 6.115, 'score': 0.971}, {'word': 'them', 'start': 6.155, 'end': 6.296, 'score': 0.947}, {'word': 'reeling', 'start': 6.336, 'end': 6.84, 'score': 0.941}, {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}]
I cloned the project at the tag v3.3.1 and tested it with and without the fix. I also tried reducing the changes made to alignment.py to a minimum to pinpoint the issue. The issue happens even when only the changes to the get_trellis and backtrack functions are applied, so the problem seems to lie there. I haven't been able to tell exactly what is causing such a discrepancy.
Minimal changes branch: https://github.com/MarkMLCode/whisperX/tree/minimal-changes