Add quantized qwen2-0.5b #490

Merged
merged 2 commits into from
Jun 27, 2024

Conversation

bil-ash
Contributor

@bil-ash bil-ash commented Jun 26, 2024

Add quantized (q4f16) qwen2-0.5b to the list of supported models. PR mlc-ai/binary-mlc-llm-libs#128 must be merged before merging this.

to support quantized qwen2-0.5b
Contributor

@CharlieFRuan CharlieFRuan left a comment

Thanks a lot for the contribution. Some minor changes: one on naming consistency, and one on the required MB after calculation.

src/config.ts Outdated
modelVersion +
"/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm",
low_resource_required: true,
vram_required_MB: 500,//rough estimate
Contributor

Suggested change
- vram_required_MB: 500,//rough estimate
+ vram_required_MB: 944.62,

src/config.ts Outdated
@@ -601,6 +601,19 @@ export const prebuiltAppConfig: AppConfig = {
},
},
// Qwen-2
{
model: "https://huggingface.co/mlc-ai/Qwen2-0.5B-Instruct-q4f16_1-MLC",
model_id: "Qwen2-0.5B-Instruct-q4f16-MLC",
Contributor

Suggested change
- model_id: "Qwen2-0.5B-Instruct-q4f16-MLC",
+ model_id: "Qwen2-0.5B-Instruct-q4f16_1-MLC",

@bil-ash
Contributor Author

bil-ash commented Jun 27, 2024

By the way, what is the formula for calculating the VRAM required?
Also, how does web-llm acquire VRAM? Does it take up the entire VRAM specified by vram_required_MB at initialization, or does it take a limited amount of VRAM and then acquire more if required?

@CharlieFRuan
Contributor

Thanks for making the changes!

For VRAM, it is mainly three parts: model size, intermediate buffer size (for various matrix multiplications, etc.), and KV cache size. The sum of the first two is estimated during python -m mlc_llm compile and shown in the output. For KV cache size, it is context_window_size * head_dim * num_kv_heads * num_layers * 2 (K and V) * 2 (num bytes for f16, 4 for f32).

The vram_required_MB is merely an estimate, so it plays no role at all at runtime. I believe web-llm takes up most (if not all) of the required VRAM at initialization. Therefore, the prefill chunk size plays a role in limiting the amount of memory required for long prompts: it simply chunks them to keep the intermediate buffer size the same.
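As a rough illustration of the three-part breakdown above, here is a minimal TypeScript sketch. The names `KvCacheShape`, `kvCacheBytes`, and `estimateVramMB` are hypothetical helpers, not part of web-llm, and the sketch assumes "MB" means 1024 * 1024 bytes. The model size plus intermediate buffer size is taken as a single input, since that sum is what the compile step reports.

```typescript
// Hypothetical helper types and functions; none of these exist in web-llm.
type KvDtype = "f16" | "f32";

interface KvCacheShape {
  contextWindowSize: number; // context_window_size
  headDim: number;           // head_dim
  numKvHeads: number;        // num_kv_heads
  numLayers: number;         // num_layers
}

// KV cache size in bytes, following the formula in the comment above:
// context_window_size * head_dim * num_kv_heads * num_layers * 2 (K and V) * bytes per element.
function kvCacheBytes(shape: KvCacheShape, dtype: KvDtype = "f16"): number {
  const bytesPerElement = dtype === "f16" ? 2 : 4;
  const kAndV = 2;
  return (
    shape.contextWindowSize *
    shape.headDim *
    shape.numKvHeads *
    shape.numLayers *
    kAndV *
    bytesPerElement
  );
}

// Total estimate: (model + intermediate buffer, as reported by compile) + KV cache.
// Assumption: 1 MB = 1024 * 1024 bytes.
function estimateVramMB(
  compileReportedMB: number,
  shape: KvCacheShape,
  dtype: KvDtype = "f16",
): number {
  return compileReportedMB + kvCacheBytes(shape, dtype) / (1024 * 1024);
}
```

Because the result is only an estimate for the config entry, nothing here would affect how memory is actually allocated at runtime.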

@CharlieFRuan CharlieFRuan merged commit 1da0f76 into mlc-ai:main Jun 27, 2024
@bil-ash
Contributor Author

bil-ash commented Jun 27, 2024

> Thanks for making the changes!
>
> For VRAM, it is mainly three parts: model size, intermediate buffer size (for various matrix multiplications, etc.), and KV cache size. The sum of the first two is estimated during python -m mlc_llm compile and shown in the output. For KV cache size, it is context_window_size * head_dim * num_kv_heads * num_layers * 2 (K and V) * 2 (num bytes for f16, 4 for f32).
>
> The vram_required_MB is merely an estimation, so it does not play a role at all in runtime. I believe it takes up most (if not all) of the required VRAM at initialization. Therefore, prefill chunk size would play a role in limiting the amount of memory required for long prompts -- it would simply chunk them to keep the intermediate buffer size the same.

So, VRAM = model + intermediate buffer + KV cache.
For this case, could you please specify all three components?
I am asking because I would like to have Qwen2-0.5B with a 32k context. If we reduce the prefill chunk size to 1k, the calculation would be something like:
VRAM for 32k context = model + (intermediate buffer)/2 + 8 * KV cache
so maybe there won't be much of a difference in VRAM.

@CharlieFRuan
Contributor

The head_dim etc. can be found in mlc-chat-config.json, so the KV cache is ctx * 64 * 2 * 24 * 2 * 2 bytes. For 4K, that is 48 MB, so the other two components are 944.62 - 48 = 896.62 MB. For 32K, it would be 8 * 48 MB = 384 MB. I'm personally not sure whether Qwen2 0.5B would support such a long context.
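The arithmetic above can be checked with a short sketch. The constants are the Qwen2-0.5B values quoted in the comment (64, 2, 24); `qwen2KvCacheMB` is a hypothetical helper name, and the sketch assumes 1 MB = 1024 * 1024 bytes, which is the convention under which 4K comes out to exactly 48 MB.

```typescript
// Values quoted in the comment above for Qwen2-0.5B (from mlc-chat-config.json).
const HEAD_DIM = 64;
const NUM_KV_HEADS = 2;
const NUM_LAYERS = 24;
const K_AND_V = 2;      // one cache for keys, one for values
const BYTES_F16 = 2;    // bytes per f16 element
const BYTES_PER_MB = 1024 * 1024; // assumption: binary megabytes

// Hypothetical helper: KV cache size in MB for a given context window.
function qwen2KvCacheMB(contextWindowSize: number): number {
  const bytes =
    contextWindowSize * HEAD_DIM * NUM_KV_HEADS * NUM_LAYERS * K_AND_V * BYTES_F16;
  return bytes / BYTES_PER_MB;
}

for (const ctx of [4096, 32768]) {
  console.log(`ctx=${ctx}: KV cache = ${qwen2KvCacheMB(ctx)} MB`);
}
// prints "ctx=4096: KV cache = 48 MB" and "ctx=32768: KV cache = 384 MB"
```

Each token costs 64 * 2 * 24 * 2 * 2 = 12288 bytes of KV cache, so the size scales linearly with the context window.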

@bil-ash
Contributor Author

bil-ash commented Jun 27, 2024

Thanks for the info

@CharlieFRuan
Contributor

I just ran compile again and got the following info:

  • For 1024 chunk size: 896.62 MB (Parameters: 265.12 MB. Temporary buffer: 631.50 MB)
  • For 2048 chunk size: 1528.12 MB (Parameters: 265.12 MB. Temporary buffer: 1263.00 MB)

I guess I used the wrong estimate. It should be 1528.12 + 48 = 1576.12 MB instead, since the wasm you uploaded uses a 2048 chunk size.
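To make the correction concrete, the sketch below adds the 4K KV cache (48 MB, from the earlier comment) to each of the two compile estimates. `CompileEstimate` and `totalVramMB` are hypothetical names used only for illustration; the numbers are the ones reported in this thread.

```typescript
// Hypothetical record of one line of the compile output quoted above.
interface CompileEstimate {
  prefillChunkSize: number;
  parametersMB: number;
  temporaryBufferMB: number;
}

const compileEstimates: CompileEstimate[] = [
  { prefillChunkSize: 1024, parametersMB: 265.12, temporaryBufferMB: 631.5 },
  { prefillChunkSize: 2048, parametersMB: 265.12, temporaryBufferMB: 1263.0 },
];

// KV cache for a 4K context window, as computed earlier in the thread.
const KV_CACHE_MB_4K = 48;

// Total = parameters + temporary buffer + KV cache.
function totalVramMB(estimate: CompileEstimate, kvCacheMB: number): number {
  return estimate.parametersMB + estimate.temporaryBufferMB + kvCacheMB;
}

for (const estimate of compileEstimates) {
  const total = totalVramMB(estimate, KV_CACHE_MB_4K);
  // toFixed(2) hides floating-point rounding noise in the sum.
  console.log(`chunk size ${estimate.prefillChunkSize}: ${total.toFixed(2)} MB`);
}
// prints "chunk size 1024: 944.62 MB" and "chunk size 2048: 1576.12 MB"
```

The parameter size is the same in both cases; only the temporary buffer doubles with the chunk size, which is why the 2048 build needs about 631.5 MB more.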

jzhao62 pushed a commit to jzhao62/web-llm that referenced this pull request Dec 8, 2024
Add quantized(q4f16) qwen2-0.5b to the list of supported models.
[PR](mlc-ai/binary-mlc-llm-libs#128) must be
merged before merging this.