disabling cache has poor generation results #17
Hey, I'll take a look at this today, thanks. Probably an issue with masking.
Still unsolved, but I believe this is an issue with CUDA only. CPU doesn't have the same behavior. Additionally, the generations always start to go bogus once the sequence length is 32, which seems relevant somehow.
It's definitely just an issue with CUDA, and occurs when the sequence length is 32. So far I've been unable to reproduce it with unit tests for specific ops in dfdx. It's possible the issue is matmul error accumulation in f16. Without the kv cache, the dot product between rows and columns will accrue more error as the sequence length gets larger. I tried changing all the matmuls to convert to f32 before computation, and it does seem like it improves the 32nd/33rd token, but generation still craps out after that. This is only a partial fix anyway, because the tensors still lose precision when converting from f32 -> f16.
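To make the precision argument concrete, here is a minimal standalone sketch (not dfdx code; it only uses the `half` crate, and the values and lengths are made up) that accumulates the same dot product in pure f16 and in f32. The gap between the two grows with vector length, which is the kind of drift a cache-less attention matmul would see as the sequence gets longer:

```rust
// Sketch of f16 accumulation error vs an f32 reference. Illustrative only.
use half::f16;

fn main() {
    for len in [16usize, 32, 64, 128] {
        // Pretend these are one row of Q and one column of K^T.
        let a: Vec<f32> = (0..len).map(|i| 0.1 + (i as f32) * 1e-3).collect();
        let b: Vec<f32> = (0..len).map(|i| 0.2 - (i as f32) * 1e-3).collect();

        // Reference dot product computed entirely in f32.
        let exact: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();

        // Same dot product with every multiply and add rounded to f16,
        // mimicking a pure-f16 matmul kernel.
        let mut acc = f16::from_f32(0.0);
        for (x, y) in a.iter().zip(&b) {
            let prod = f16::from_f32(x * y);                     // product rounded to f16
            acc = f16::from_f32(acc.to_f32() + prod.to_f32());   // sum rounded to f16
        }

        println!("len={len:4} exact={exact:.6} f16={:.6}", acc.to_f32());
    }
}
```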
@coreylowman pretty cool analysis, very nicely done. From your analysis, I am unclear how an overflow could occur within a single token of generation (from 32 to 33, with the dot-product change). Are there other overflow or bit-related operations that could be involved?
More information: I don't think this is necessarily related to the kv cache. Notably, when you disable the cache but generate really long sequences, you also see this problem. Generation goes well for a while, but ends up repeating text eventually. My biggest suspect is the reductions in dfdx. They currently don't handle grid-strided indexing (i.e. using gridDim.x) in the kernels, so maybe once arrays get big enough, they fail.
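As a rough illustration of the grid-stride suspicion, here is a plain-Rust sketch (not the actual CUDA kernels; the function names and sizes are invented) of the two indexing schemes. Without the grid-stride loop, any element past the number of launched threads is silently skipped, which would only show up once tensors get large enough:

```rust
// Simulates each "thread" handling exactly one element, the equivalent of
// `idx = blockIdx.x * blockDim.x + threadIdx.x` with no follow-up loop.
// Everything past `total_threads` is never touched.
fn sum_one_element_per_thread(data: &[f32], total_threads: usize) -> f32 {
    data.iter().take(total_threads).copied().sum()
}

// Grid-stride version: thread i also handles i + stride, i + 2*stride, ...
// (`for (i = idx; i < n; i += blockDim.x * gridDim.x)` in CUDA), so coverage
// doesn't depend on the launch size.
fn sum_grid_strided(data: &[f32], total_threads: usize) -> f32 {
    let mut total = 0.0;
    for tid in 0..total_threads {
        let mut i = tid;
        while i < data.len() {
            total += data[i];
            i += total_threads;
        }
    }
    total
}

fn main() {
    let data = vec![1.0f32; 100_000];
    let threads = 65_536; // hypothetical grid size
    println!("one-per-thread: {}", sum_one_element_per_thread(&data, threads)); // 65536
    println!("grid-strided:   {}", sum_grid_strided(&data, threads));           // 100000
}
```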
The weird part is that this test passes, so maybe it isn't softmax/reductions:

```rust
#[test]
fn test_large_softmax() {
    let dev: TestDevice = Default::default();
    let t: Tensor<Rank3<64, 64, 64>, TestDtype, _> = dev.sample_normal();
    let r = t.leaky_trace().softmax::<Axis<2>>();
    let summed = r.sum::<_, Axis<2>>();
    assert_close_to_literal!(summed, [[1.0; 64]; 64]);
}
```

Where the generations start to get weird without cache is roughly
There is a long history of problems with bf16 and f16 interaction in Python frameworks. I am not an expert in the internals here, but if both formats are in play, that could be a problem. There is an excellent study comparing float16 and bfloat16.
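For reference, a quick illustrative sketch with the `half` crate (my own example, not code from this repo) showing the trade-off: bf16 keeps f32's exponent range but only about two to three significant digits, while f16 keeps more mantissa bits but overflows past ~65504. Mixing the two, or assuming one behaves like the other, shifts where rounding or overflow shows up:

```rust
use half::{bf16, f16};

fn main() {
    // Range: f16 overflows to infinity, bf16 does not.
    println!("f16(70000)  = {}", f16::from_f32(70_000.0));  // inf
    println!("bf16(70000) = {}", bf16::from_f32(70_000.0)); // ~70144

    // Precision: bf16 rounds 1.001 all the way back to 1.0.
    println!("f16(1.001)  = {}", f16::from_f32(1.001).to_f32());  // ~1.0009766
    println!("bf16(1.001) = {}", bf16::from_f32(1.001).to_f32()); // 1.0
}
```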
With cache, while the answers are nonsense, at least they are coherent :)