Source of the exams came from http://www.zizzs.com/gk/shitiku/165462.html. They were first downloaded as image format, then manually combined into a single pdf file. The pdf download option from the site has too much extra stuff.
Unfortunately, currently the pdf files are semi-manually converted to markdown (github doesn't render it well but vscode does). I have tried marker to hopefully speed-up the process but the results were not usable. Probably due to the model was trained on mostly vectorized pdf data with much higher resolution. In this case, the exam pdf files were converted from jpg/png so the resolution is much smaller.
Then use convert.py
to generate a csv file from manually typed markdown which is used as inputs for
the models.
messages = [
{"role": "user", "content": "已知 $f(x) = \begin{cases}\sqrt{x}, x > 0 \\1, x \leq 0\end{cases}$,$f(3)=$"},
]
由于函数 $f(x)$ 在 $x > 0$ 的部分定义为 $f(x) = \sqrt{x}$,因此当 $x = 3$ 时,$3 > 0$,我们可以直接使用这个定义来计算 $f(3)$。
所以,$f(3) = \sqrt{3}$。
因此,$f(3)$ 的值是 $\sqrt{3}$。
All of the rules were based purely on observations. They are experimental and could lead to a pretty big variation in terms of model responses. Unfortunately this is also the biggest pain point of evaluating these models due to being closely related to BPE.
- Insert
$
for equations and numbers. - Use language specific punctuation characters -
,。?
for Chinese and,.?
for English except within the equations. - Use
newline
character for multiple choice questions and comprehensive questions.
All model generation were configured with temperature = 0
, and max_length = 4096
No partial credits and questions that involve image/graph/table understanding as well as proof are skipped.
model | shanghai | beijing | national1 | national2 |
---|---|---|---|---|
deepseek-coder | 77/99 | 57/133 | 63/106 | 92/118 |
claude-3-5-sonnet | 52/99 | 53/133 | 55/106 | 65/118 |
gpt-4o | 62/99 | 55/133 | 60/106 | 77/118 |
model | shanghai | beijing | national1 | national2 |
---|---|---|---|---|
deepseek-coder | 60/99 | 61/133 | 63/106 | 74/118 |
claude-3-5-sonnet | 65/99 | 34/133 | 54/106 | 71/118 |
gpt-4o | 65/99 | 60/133 | 60/106 | 86/118 |
itemization of grades can be found in this spreadsheet
- This is just a mock test to set everything up in an efficient manner, need to test on some bigger and better models.
- make the response generation as deterministic as possible for reproducibility.
- manually format all questions (actually the format is almost automated through better prompting claude 3.5 to extract the texts).
- add the rest of math exam
- automate the benchmark scoring.
- enable 150/150 generation and evaluation.
- Optimize the prompt. The answers could be very sensitive to the prompts. See notes for more details on testing with different prompt styles.