
Inquiry on Data Deduplication, Random Lower-Casing, and PoT Prompts Diversity #6

lyf-00 opened this issue May 11, 2024 · 0 comments


Hello,

I truly admire your work on fine-tuning LLMs for mathematical reasoning, and I have a few questions about the data preprocessing. I would appreciate your insights into the following aspects:

1. Data Deduplication Impact

  • In the data preparation phase, you mentioned that deduplication was applied. Could you specify what percentage of the data was removed through this process? How does this affect the overall quality and diversity of the final dataset?
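To clarify what I mean, here is a minimal sketch of exact, hash-based deduplication over whitespace- and case-normalized text; the function name and normalization choices are my own assumptions and may not match your actual pipeline (for example, if you used fuzzy or near-duplicate matching instead):

```python
import hashlib

def deduplicate(examples):
    # Hash whitespace- and case-normalized text and keep only the first
    # occurrence of each hash (exact-match deduplication).
    seen, kept = set(), []
    for text in examples:
        key = hashlib.md5(" ".join(text.split()).lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```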

2. Effects of Random Lower-Casing

  • The preprocessing steps include randomly lower-casing a certain percentage of the inputs. What was the rationale behind this choice? Does varying the letter case affect the model's fine-tuning results?
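For concreteness, this is the kind of augmentation I have in mind (a minimal sketch; the 10% probability is a placeholder of my own, since the actual percentage is exactly what I am asking about):

```python
import random

def maybe_lowercase(text, prob=0.10):
    # Lower-case the whole input with probability `prob`; otherwise
    # return it unchanged. The 10% rate is only a placeholder.
    return text.lower() if random.random() < prob else text
```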

3. Diversity of PoT Prompts

  • The training process incorporates a diverse set of Python prompts for the PoT. Could you share some insights on how this diversity compares to using a single prompt style in terms of model performance? What led to the decision to use such a varied approach?
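For reference, here is a rough sketch of what I understand by "diverse PoT prompts": sampling a prompt template per training example instead of using one fixed prompt style. The template strings below are purely illustrative and not taken from your repository:

```python
import random

# Illustrative PoT prompt templates; the project's actual prompts may differ.
POT_TEMPLATES = [
    "Write a Python program to solve the following problem.\n{question}\n# Solution:",
    "{question}\nLet's solve this by writing Python code step by step.",
    "Solve the problem below with a short Python program.\nProblem: {question}",
]

def build_pot_prompt(question):
    # Pair each training question with a randomly chosen template,
    # rather than a single fixed prompt style.
    return random.choice(POT_TEMPLATES).format(question=question)
```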

I am also looking forward to any papers or further documentation you might release on this project, as I believe they would be incredibly informative.

Thank you for your dedication to this project and for taking the time to address my inquiries. Your work is truly inspiring.

Best regards,
lyf-00
