You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Test-Case: Steps how to reproduce the issue.
Audio contents: "The full name of Donald is Donald J. Trump Jr"
prompt = "Donald Duck"
model = WhisperForConditionalGeneration.from_pretrained(model_dir).to("cuda")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)
processor = WhisperProcessor.from_pretrained(model_dir)
prompt_ids = processor.get_prompt_ids(prompt)
input_features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to("cuda"), prompt_ids=prompt_ids, num_beams=4)
text = [processor.decode(predicted_id, skip_special_tokens=True) for predicted_id in predicted_ids]
transcript = text[0]
Output: The full name of Donald is Donald J. Trump Jr. Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Reproduction
Test-Case: Steps how to reproduce the issue.
Audio contents: "The full name of Donald is Donald J. Trump Jr"
prompt = "Donald Duck"
model = WhisperForConditionalGeneration.from_pretrained(model_dir).to("cuda")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)
processor = WhisperProcessor.from_pretrained(model_dir)
prompt_ids = processor.get_prompt_ids(prompt)
input_features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to("cuda"), prompt_ids=prompt_ids, num_beams=4)
text = [processor.decode(predicted_id, skip_special_tokens=True) for predicted_id in predicted_ids]
transcript = text[0]
Output: The full name of Donald is Donald J. Trump Jr. Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck
Expected behavior
Expected behavior
It has to give either "The full name of Donald is Donald J. Trump" or "The full name of Donald is Donald Duck", not infinite no of prompt key words.
The text was updated successfully, but these errors were encountered:
System Info
System Info
Hi @sanchit-gandhi and @gante
Using Prompt Feature like it is mentioned here (#22395) causing the model output to have too many repetitions and too much of hallucinations.
I recorded an audio and gave it to the Whisper ASR model with prompt like as mentioned below.
More details:
Transformers Commit: 1c7e5e2
Test-Case: Steps how to reproduce the issue.
Audio contents: "The full name of Donald is Donald J. Trump Jr"
prompt = "Donald Duck"
model = WhisperForConditionalGeneration.from_pretrained(model_dir).to("cuda")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)
processor = WhisperProcessor.from_pretrained(model_dir)
prompt_ids = processor.get_prompt_ids(prompt)
input_features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to("cuda"), prompt_ids=prompt_ids, num_beams=4)
text = [processor.decode(predicted_id, skip_special_tokens=True) for predicted_id in predicted_ids]
transcript = text[0]
Output: The full name of Donald is Donald J. Trump Jr. Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck
Link to the audio: https://drive.google.com/file/d/1ud-B0uepD8Sk6ArkvJdqPmFWYpCmAooi/view?usp=drive_link
Who can help?
No response
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
Reproduction
Test-Case: Steps how to reproduce the issue.
Audio contents: "The full name of Donald is Donald J. Trump Jr"
prompt = "Donald Duck"
model = WhisperForConditionalGeneration.from_pretrained(model_dir).to("cuda")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)
processor = WhisperProcessor.from_pretrained(model_dir)
prompt_ids = processor.get_prompt_ids(prompt)
input_features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to("cuda"), prompt_ids=prompt_ids, num_beams=4)
text = [processor.decode(predicted_id, skip_special_tokens=True) for predicted_id in predicted_ids]
transcript = text[0]
Output: The full name of Donald is Donald J. Trump Jr. Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck
Expected behavior
Expected behavior
It has to give either "The full name of Donald is Donald J. Trump" or "The full name of Donald is Donald Duck", not infinite no of prompt key words.
The text was updated successfully, but these errors were encountered: