disabling cache has poor generation results #17

Open
sdake opened this issue May 11, 2023 · 7 comments

@sdake commented May 11, 2023

ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model llama-7b-hf --disable-cache generate "Why is pi round?"
Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
Thread: Why is pi round?
I've been wondering about this for a while and couldn't find an answer... Ifс theters, i.s_Q
 ini£тuc1-cksont< Sec>ar$le to--.e
d in>
  inient-< ${<s A  А ${ various
 channel Banels cBp  Sack Bchn c channel Kaz
cyclemasens.chD channelーAя
O я_  CлusesN- n= Ps FigénBTアbollageest
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model llama-7b-hf --disable-cache generate "Why is pi round?"
Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
What is the real reason that pi is round?
I know the story that when Archimedes proved that pi was irr butures, he,he and,h is//**  cz.daly, July wasz cQ.l inkxz toell>((>/.F
 Middle
 WCF,pp m  MA cError apadd Ledethodaten
 inien MAFaceerfaces.IкяEDєeP UITableView a MAtingack tcrit<0xE4><0xE7>leftAуad<0xEB>C areз о דneanate ab

With the cache enabled the answers are still nonsense, but at least they are coherent :)

ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model llama-7b-hf generate "Why is pi round?"
Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
What is the definition of a number that is not prime?
The number that is not a prime is called a composite number and if it is not a factor of 1
What is the smallest number that can be divided by 3 numbers and still have the original number as a remainder?
The smallest number that can be divided by 3 numbers and still have the original number as a remainder is 17. To prove this we can use the fact that the original number must have a remainder of 1 (after being divided by 3). The numbers that have a remainder of 1 when divided by 3 are
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model llama-7b-hf generate "Why is pi round?"
Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
Thread: Why is pi round?
I've been wondering about this for a while and couldn't find an answer...I'm sure it's a silly question, but I just can't figure it out. Why is pi round? If it was, say 4.00 or 6.00, that would be one thing, but 3.14??
So I thought that maybe if you took the square root of 3.14, it would be ~ 1.5, which would be about the middle of 1 and 2, which is 1
@coreylowman (Owner)

Hey, I'll take a look at this today, thanks. Probably an issue with masking.

@coreylowman (Owner)

Still unsolved, but I believe this is an issue with CUDA only; the CPU doesn't show the same behavior. Additionally, the generations always start to go bogus once the sequence length reaches 32, which seems relevant somehow.

@coreylowman (Owner)

It's definitely just an issue with CUDA, and it occurs when the sequence length reaches 32. So far I've been unable to reproduce it with unit tests for specific ops in dfdx.

It's possible the issue is matmul error accumulation in f16. Without the kv cache, the dot product between rows and columns accrues more error as the sequence length gets larger. I tried changing all the matmuls to convert to f32 before computation, and it does seem to improve the 32nd/33rd token, but generation still craps out after that. This is only a partial fix though, because the tensors still lose precision when converting from f32 -> f16.
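As a standalone illustration of that error accumulation (this is not dfdx's actual matmul path; it just assumes the half crate for an f16 type), rounding the accumulator back to f16 after every add drifts away from an f32 accumulator once the running sum gets large relative to f16's 10-bit mantissa:

    use half::f16;

    fn main() {
        // Accumulate n small products, rounding back to f16 after every add
        // (roughly what an f16 matmul does), versus keeping the sum in f32.
        let n = 2048;
        let (x, y) = (0.0123f32, 0.0456f32);

        let mut acc_f16 = f16::from_f32(0.0);
        let mut acc_f32 = 0.0f32;
        for _ in 0..n {
            let prod = f16::from_f32(x).to_f32() * f16::from_f32(y).to_f32();
            // Each add rounds to f16's ~3 decimal digits of precision.
            acc_f16 = f16::from_f32(acc_f16.to_f32() + prod);
            acc_f32 += prod;
        }
        println!("f16 accumulator: {}", acc_f16.to_f32());
        println!("f32 accumulator: {}", acc_f32);
    }

Converting inputs to f32 before the matmul keeps the products and the accumulation exact, but the result still has to round back down to f16 afterwards, which lines up with the "partial fix" observation above.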

@sdake (Author) commented May 15, 2023

@coreylowman pretty cool analysis, very nicely done. From your analysis, I am unclear how an overflow would show up with just one more generated token (going from 32 to 33, with the dot-product change). Are there other overflow or bit-related operations that could be involved?

@coreylowman (Owner)

More information: I don't think this is necessarily related to the kv cache. Notably, when you disable the cache but generate really long sequences, you also see this problem.

Running cargo run -r -F cuda -- --bench -n 2048 generate "Here is the meaning of life described in the most verbose way possible"

Generates well for a while, but ends up repeating text eventually.

My biggest suspect is the reductions in dfdx. They currently don't handle grid-strided indexing (i.e. using gridDim.x) in the kernels, so maybe they fail once arrays get big enough.

@coreylowman (Owner)

The weird part is that this test passes, so maybe it isn't softmax/reductions:

    #[test]
    fn test_large_softmax() {
        let dev: TestDevice = Default::default();
        let t: Tensor<Rank3<64, 64, 64>, TestDtype, _> = dev.sample_normal();
        let r = t.leaky_trace().softmax::<Axis<2>>();
        let summed = r.sum::<_, Axis<2>>();
        assert_close_to_literal!(summed, [[1.0; 64]; 64]);
    }

The shape where the generations start to get weird without cache is roughly Rank4<1, 32, 32, 32>, so I'd expect this test to fail if the problem were softmax.
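A hypothetical variant of the test above at roughly that failing shape, reusing the same dfdx test helpers, might look like the following (an untested sketch, not taken from the thread):

    #[test]
    fn test_softmax_at_failing_shape() {
        let dev: TestDevice = Default::default();
        // Roughly the attention-weights shape where generation degrades without cache.
        let t: Tensor<Rank4<1, 32, 32, 32>, TestDtype, _> = dev.sample_normal();
        let r = t.leaky_trace().softmax::<Axis<3>>();
        let summed = r.sum::<_, Axis<3>>();
        // Each softmax row should still sum to 1.
        assert_close_to_literal!(summed, [[[1.0; 32]; 32]; 1]);
    }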

@sdake (Author) commented May 22, 2023

@coreylowman,

There is a long history of problems with bf16 and f16 interaction in Python frameworks. I am not an expert in the dfdx implementation. Is there any typecasting occurring between float16 and bfloat16?

If so, that could be a problem. There are excellent studies comparing float16 and bfloat16.
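To illustrate why a cast between the two formats is lossy in either direction (a standalone sketch using the half crate, which provides both types; this says nothing about what dfdx actually does internally): f16 spends its bits on precision (5 exponent / 10 mantissa bits) while bf16 spends them on range (8 exponent / 7 mantissa bits), so each format drops something the other keeps.

    use half::{bf16, f16};

    fn main() {
        // Range: 70000 is above f16's max finite value (65504) but well within bf16's range.
        println!("f16: {}  bf16: {}", f16::from_f32(70000.0), bf16::from_f32(70000.0));
        // Precision: 1.001 survives (approximately) in f16 but rounds to 1.0 in bf16,
        // whose spacing near 1.0 is 2^-7 ≈ 0.0078.
        println!("f16: {}  bf16: {}", f16::from_f32(1.001), bf16::from_f32(1.001));
    }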
