
Inquiry on Data Deduplication, Random Lower-Casing, and PoT Prompts Diversity #6

lyf-00 opened this issue May 11, 2024 · 0 comments


Hello,

I truly admire your work on fine-tuning LLMs for mathematical reasoning, and I have a few questions about the data preprocessing. I would appreciate your insights into the following aspects:

1. Data Deduplication Impact

  • In the data preparation phase, you mentioned that deduplication was applied. Could you specify what percentage of the data was removed through this process? How does this affect the overall quality and diversity of the final dataset?
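To clarify what I mean, here is a minimal sketch of exact, hash-based deduplication over whitespace- and case-normalized text; the function name and normalization choices are my own assumptions and may not match your actual pipeline (for example, if you used fuzzy or near-duplicate matching instead):

```python
import hashlib

def deduplicate(examples):
    # Hash whitespace- and case-normalized text and keep only the first
    # occurrence of each hash (exact-match deduplication).
    seen, kept = set(), []
    for text in examples:
        key = hashlib.md5(" ".join(text.split()).lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```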

2. Effects of Random Lower-Casing

  • The preprocessing steps include randomly lower-casing a certain percentage of the inputs. What was the rationale behind this choice? Does varying the letter case affect the model's fine-tuning results?
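For concreteness, this is the kind of augmentation I have in mind (a minimal sketch; the 10% probability is a placeholder of my own, since the actual percentage is exactly what I am asking about):

```python
import random

def maybe_lowercase(text, prob=0.10):
    # Lower-case the whole input with probability `prob`; otherwise
    # return it unchanged. The 10% rate is only a placeholder.
    return text.lower() if random.random() < prob else text
```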

3. Diversity of PoT Prompts

  • The training process incorporates a diverse set of Python prompts for the PoT. Could you share some insights on how this diversity compares to using a single prompt style in terms of model performance? What led to the decision to use such a varied approach?
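For reference, here is a rough sketch of what I understand by "diverse PoT prompts": sampling a prompt template per training example instead of using one fixed prompt style. The template strings below are purely illustrative and not taken from your repository:

```python
import random

# Illustrative PoT prompt templates; the project's actual prompts may differ.
POT_TEMPLATES = [
    "Write a Python program to solve the following problem.\n{question}\n# Solution:",
    "{question}\nLet's solve this by writing Python code step by step.",
    "Solve the problem below with a short Python program.\nProblem: {question}",
]

def build_pot_prompt(question):
    # Pair each training question with a randomly chosen template,
    # rather than a single fixed prompt style.
    return random.choice(POT_TEMPLATES).format(question=question)
```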

I am also looking forward to any papers or further documentation you might release on this project, as I believe they would be incredibly informative.

Thank you for your dedication to this project and for taking the time to address my inquiries. Your work is truly inspiring.

Best regards,
lyf-00
