
After the update, rwkv6's loss becomes NaN #19

Closed
JL-er opened this issue May 16, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@JL-er commented May 16, 2024

I'm currently using the version from a few days ago, and the loss is normal.

@yzhangcs (Member)

Oh, it looks like you may need to switch back to logsigmoid; -exp is not stable yet.
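
For context, a minimal sketch of the two decay parameterizations being discussed; the tensor names here are illustrative, not fla's actual code:

```python
import torch
import torch.nn.functional as F

w = torch.randn(2, 8, 64, requires_grad=True)  # raw decay logits (illustrative)

# logsigmoid parameterization: log decay in (-inf, 0); its gradient
# sigmoid(-w) stays bounded in (0, 1), so training is numerically tame
g_stable = F.logsigmoid(w)

# -exp parameterization: also <= 0, but exp(w) is unbounded, so both the
# value and its gradient -exp(w) can blow up to inf for large w
g_unstable = -torch.exp(w)
```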

@JL-er (Author) commented May 16, 2024

[screenshot: loss curves]
This works; the loss is very stable, with essentially no deviation.

@JL-er (Author) commented May 16, 2024

[screenshot]
It should be a problem with this update.

@yzhangcs (Member)

This update fixes potential NaNs during inference, so I don't think it's the issue.
It's possibly caused by a potential inf gradient of -exp; I will check it, thank you.

@JL-er (Author) commented May 16, 2024

RWKV-PEFT has added fla, and it currently works. But as soon as I switch to the new fla, the loss goes NaN. If there are future fla updates, let me know and I can test them.

@JL-er (Author) commented May 16, 2024

[screenshot]
I don't know why fla's rwkv6 is actually not faster than CUDA; when I tested gla before, it was much faster.

@yzhangcs (Member)

Have you compared the kernel speeds?
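
For an apples-to-apples comparison it helps to time the kernels in isolation rather than end-to-end training; a sketch using CUDA events (the helper and the example argument lists are assumptions, not fla utilities):

```python
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    # average milliseconds per call, measured with CUDA events
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# e.g. bench_ms(fla_rwkv6_op, q, k, v, w, u) vs. bench_ms(cuda_rwkv6_op, ...)
```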

@JL-er (Author) commented May 16, 2024

I'll find time to test it. By the way, there's another issue: when I'm doing state tuning, swapping in the fla kernel raises an error.
[screenshot of the error]
It's probably because the state's gradient isn't saved, so I'd like to ask how to solve this?

@yzhangcs (Member)

You can enable gradients for h0 manually.

@yzhangcs (Member)

Would taking h0 as a learnable param be OK? Like h0 = nn.Parameter(torch.zeros(key_dim, head_dim)).
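
A slightly fuller sketch of that idea; the class, shapes, and names are assumptions, to be matched to whatever layout the kernel expects for its initial state:

```python
import torch
import torch.nn as nn

class LearnableInitState(nn.Module):
    # hypothetical helper: one learnable initial state per head,
    # broadcast across the batch on each forward pass
    def __init__(self, num_heads: int, key_dim: int, head_dim: int):
        super().__init__()
        self.h0 = nn.Parameter(torch.zeros(num_heads, key_dim, head_dim))

    def forward(self, batch_size: int) -> torch.Tensor:
        # expand keeps h0 a leaf parameter; its grad sums over the batch
        return self.h0.unsqueeze(0).expand(batch_size, -1, -1, -1)
```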

@JL-er (Author) commented May 16, 2024

[screenshots]
It runs fine when I use the CUDA kernel, but not with fla; normally the state's gradient is saved automatically when the kernel computes it.

@JL-er (Author) commented May 16, 2024

One more thing: I've frozen all the other weights here and kept only the state's gradient.
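
That setup would look something like the following; `model` and the `h0` parameter name follow the hypothetical sketch above:

```python
# freeze everything, then leave gradients enabled only for the state
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("h0")
```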

@yzhangcs (Member)

I see; currently there is no access to the gradients of the states.
We will add an option later.

@JL-er (Author) commented May 16, 2024

Thank you.

@yzhangcs yzhangcs pinned this issue May 17, 2024
@yzhangcs yzhangcs unpinned this issue May 17, 2024
@sustcsonglin sustcsonglin added the bug Something isn't working label May 18, 2024
@yzhangcs (Member) commented May 24, 2024

@JL-er Hi, check it out: 1547448

We no longer truncate the gradients of the h states for RWKV6, for ease of state tuning.
Do contact us if you meet any bugs or numerical stability issues :-D
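
A minimal state-tuning sketch against that change, assuming the chunk_rwkv6 op accepts `initial_state`/`output_final_state` keyword arguments as in fla's ops of that period (import path, shapes, and dtypes here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F
from fla.ops.rwkv6 import chunk_rwkv6  # import path is an assumption

B, H, T, K, V = 2, 4, 128, 64, 64
q = torch.randn(B, H, T, K, device='cuda')
k = torch.randn(B, H, T, K, device='cuda')
v = torch.randn(B, H, T, V, device='cuda')
w = F.logsigmoid(torch.randn(B, H, T, K, device='cuda'))  # log decay <= 0
u = torch.randn(H, K, device='cuda')

# learnable initial state: with this fix, backward should now reach h0
h0 = torch.zeros(B, H, K, V, device='cuda', requires_grad=True)
o, ht = chunk_rwkv6(q, k, v, w, u, initial_state=h0, output_final_state=True)
o.sum().backward()
print(h0.grad is not None)  # True once state gradients are propagated
```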

yzhangcs added a commit that referenced this issue May 25, 2024
@JL-er (Author) commented May 27, 2024

Testing on rwkv-peft works perfectly; clipping is no longer needed. However, infctx training at 6000 ctx len occasionally went NaN before (I will retest). Thank you very much!

@sustcsonglin (Collaborator)

FYI, we've recently fixed a bug that caused NaNs when the log decay is very small. #77 (comment)
