Using preprocess_phi_3_new in LAVIS/open_flamingo/train/sft_data_utils.py gets labels all -100. #776
Comments
Hi @JHW5981, thank you for trying out our code. In my local environment, the current code works with the latest phi3-mini model/tokenizer. The commented-out code in lines 225-226 worked with a previous version of the phi3 tokenizer (before the said phi3 model update). Could you check whether your local phi3 model is up to date?
@azshue Thank you for your response. Replacing the tokenizer with the newest Phi-3 tokenizer does not solve my problem. I downloaded the weights from Salesforce/xgen-mm-phi3-mini-instruct-r-v1 and used the following code to load the tokenizer:
To update the tokenizer, I replaced the files 'special_tokens_map.json', 'tokenizer_config.json', 'tokenizer.json', and 'tokenizer.model' with those from microsoft/Phi-3-mini-4k-instruct. After reloading the tokenizer, the issue persists. I wonder whether lines 225-226 in the code are essential. On line 221, the <|assistant|> token is already added. If lines 225-226 are not commented out, the <|assistant|> token is not masked. I believe this is inconsistent with how user-defined tokens should normally be masked. By the way, do you know why using the tokenizer to convert text to ids does not add special tokens, even when I explicitly set add_special_tokens=True?😂
Hello, thank you for your wonderful work.
I ran into a problem while re-implementing LazySupervisedDataset and am stuck at the point of retrieving the training labels: all labels are -100.
Below is a screenshot of my dataset:
I reuse your LazySupervisedDataset unchanged. When I initialize it with data_path, tokenizer, image_processor, and args, it runs without any issues. However, when I check the labels it generates, the tensor is entirely -100.
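To confirm the symptom on more than one sample, a small helper like the one below can count how many label positions are actually supervised. This is only a sketch: `summarize_labels` is a hypothetical name, and it assumes the labels are a flat list or a 1-D tensor that uses -100 as the ignore value.

```python
IGNORE_INDEX = -100  # ignore value used for masked label positions


def summarize_labels(labels):
    """Return (num_supervised, num_total) for a flat sequence of label ids.

    Accepts a plain list or any 1-D array/tensor that exposes .tolist().
    """
    values = labels.tolist() if hasattr(labels, "tolist") else list(labels)
    supervised = sum(1 for v in values if v != IGNORE_INDEX)
    return supervised, len(values)


# Toy inputs (not real dataset output): one healthy sample, one fully masked.
healthy = summarize_labels([-100, -100, 17, 42])
fully_masked = summarize_labels([-100, -100, -100])
```

A sample for which the first number is 0 contributes nothing to the loss, which is the situation described here.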
I debugged this strange behavior and found that the issue occurs because of the following piece of code:
First, once the if-clause above reaches the “user round,” cur_len is never equal to total_len, so the line target[:] = IGNORE_INDEX is always executed.
Second, the code at line 226 does not skip the bos token; instead, it skips the "<|user|>" token. I don’t understand the reasoning behind this behavior.
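The effect of the length check can be reproduced in isolation. The sketch below is not the code from sft_data_utils.py: it uses a toy whitespace tokenizer, and `build_labels` and `per_round_offset` are hypothetical names. It only illustrates how a per-round offset of one token (for example, a bos correction that the tokenizer does not need) makes cur_len fall short of total_len and masks the whole sample.

```python
IGNORE_INDEX = -100


def tokenize(text):
    """Toy tokenizer (assumption): one token per whitespace-separated piece."""
    return text.split()


def build_labels(rounds, per_round_offset=0):
    """Mask instruction tokens and keep response tokens as labels.

    rounds: list of (instruction, response) strings.
    per_round_offset: tokens subtracted from each round's lengths, mimicking
    a correction such as "skip one bos token per round".
    """
    tokens = []
    for instruction, response in rounds:
        tokens += tokenize(instruction) + tokenize(response)
    total_len = len(tokens)

    target = list(tokens)
    cur_len = 0
    for instruction, response in rounds:
        instruction_len = len(tokenize(instruction)) - per_round_offset
        round_len = len(tokenize(instruction + " " + response)) - per_round_offset
        # Mask the instruction part of this round.
        target[cur_len:cur_len + instruction_len] = [IGNORE_INDEX] * instruction_len
        cur_len += round_len
    # Mask anything left after the last round.
    target[cur_len:] = [IGNORE_INDEX] * (total_len - cur_len)

    if cur_len != total_len:
        # Length mismatch: the whole sample is ignored.
        target[:] = [IGNORE_INDEX] * total_len
    return target


rounds = [("<|user|> What is 2+2? <|end|> <|assistant|>", "4 <|end|>")]
ok = build_labels(rounds)                          # response tokens survive
broken = build_labels(rounds, per_round_offset=1)  # every label is -100
```

With the offset set to 0 the two response tokens remain as labels; with an offset of 1 the final cur_len is 7 while total_len is 8, so the mismatch branch fires and every label becomes -100.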