curious how good LLMs are at gaokao (高考)

Data Collection

Source of the exams came from http://www.zizzs.com/gk/shitiku/165462.html. They were first downloaded as image format, then manually combined into a single pdf file. The pdf download option from the site has too much extra stuff.

Unfortunately, currently the pdf files are semi-manually converted to markdown (github doesn't render it well but vscode does). I have tried marker to hopefully speed-up the process but the results were not usable. Probably due to the model was trained on mostly vectorized pdf data with much higher resolution. In this case, the exam pdf files were converted from jpg/png so the resolution is much smaller.

Then use convert.py to generate a csv file from manually typed markdown which is used as inputs for the models.

Eval example

Prompt:

messages = [
    {"role": "user",   "content": "已知 $f(x) = \begin{cases}\sqrt{x}, x > 0 \\1, x \leq 0\end{cases}$，$f(3)=$"},
]

Response (from Yi-1.5-9B-Chat):

由于函数 $f(x)$ 在 $x > 0$ 的部分定义为 $f(x) = \sqrt{x}$，因此当 $x = 3$ 时，$3 > 0$，我们可以直接使用这个定义来计算 $f(3)$。

所以，$f(3) = \sqrt{3}$。

因此，$f(3)$ 的值是 $\sqrt{3}$。

Question Formatting rules

All of the rules were based purely on observations. They are experimental and could lead to a pretty big variation in terms of model responses. Unfortunately this is also the biggest pain point of evaluating these models due to being closely related to BPE.

Insert before and after $ for equations and numbers.
Use language specific punctuation characters - ，。？ for Chinese and ,.? for English except within the equations.
Use newline character for multiple choice questions and comprehensive questions.

TODO: add some examples to illustrate this problem

Evaluations (graded by hand)

All model generation were configured with temperature = 0, and max_length = 4096

No partial credits and questions that involve image/graph/table understanding as well as proof are skipped.

Prompted in Chinese

model	shanghai	beijing	national1	national2
deepseek-coder	77/99	57/133	63/106	92/118
claude-3-5-sonnet	52/99	53/133	55/106	65/118
gpt-4o	62/99	55/133	60/106	77/118

Prompted in English (translated by Claude 3.5)

model	shanghai	beijing	national1	national2
deepseek-coder	60/99	61/133	63/106	74/118
claude-3-5-sonnet	65/99	34/133	54/106	71/118
gpt-4o	65/99	60/133	60/106	86/118

itemization of grades can be found in this spreadsheet

TODO:

This is just a mock test to set everything up in an efficient manner, need to test on some bigger and better models.
make the response generation as deterministic as possible for reproducibility.
manually format all questions (actually the format is almost automated through better prompting claude 3.5 to extract the texts).
add the rest of math exam
automate the benchmark scoring.
enable 150/150 generation and evaluation.
Optimize the prompt. The answers could be very sensitive to the prompts. See notes for more details on testing with different prompt styles.

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
data		data
figures		figures
inspect_activations		inspect_activations
response		response
Dockerfile		Dockerfile
README.md		README.md
convert.py		convert.py
llm_gaokao_claude.py		llm_gaokao_claude.py
llm_gaokao_deepseek.py		llm_gaokao_deepseek.py
llm_gaokao_gemini.py		llm_gaokao_gemini.py
llm_gaokao_gpt4.py		llm_gaokao_gpt4.py
llm_gaokao_qwen.py		llm_gaokao_qwen.py
llm_gaokao_yi-large.py		llm_gaokao_yi-large.py
notes.md		notes.md
run.sh		run.sh
translate.py		translate.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

curious how good LLMs are at gaokao (高考)

Data Collection

Eval example

Prompt:

Response (from Yi-1.5-9B-Chat):

Question Formatting rules

TODO: add some examples to illustrate this problem

Evaluations (graded by hand)

Prompted in Chinese

Prompted in English (translated by Claude 3.5)

TODO:

About

Releases

Packages

Languages

g1y5x3/llm_gaokao

Folders and files

Latest commit

History

Repository files navigation

curious how good LLMs are at gaokao (高考)

Data Collection

Eval example

Prompt:

Response (from Yi-1.5-9B-Chat):

Question Formatting rules

TODO: add some examples to illustrate this problem

Evaluations (graded by hand)

Prompted in Chinese

Prompted in English (translated by Claude 3.5)

TODO:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages