disabling cache has poor generation results #17
Hey, I'll take a look at this today, thanks. Probably an issue with masking.
Still unsolved, but I believe this is an issue with CUDA only. CPU doesn't have the same behavior. Additionally, the generations always start to go bogus once the sequence length is 32, which seems relevant somehow.
It's definitely just an issue with CUDA, and occurs when the sequence length is 32. So far I've been unable to reproduce it with unit tests for specific ops in dfdx. It's possible the issue is matmul error accumulation in f16. Without the kv cache, the dot product between rows and columns will accrue more error as the sequence length gets larger. I tried changing all the matmuls to convert to f32 before computation, and it does seem like it improves the 32nd/33rd token, but generation still craps out after that. This is only a partial fix anyway, because the tensors still lose precision when converting from f32 -> f16.
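To make the precision argument concrete, here is a minimal standalone sketch (not dfdx code; it only uses the `half` crate, and the values and lengths are made up) that accumulates the same dot product in pure f16 and in f32. The gap between the two grows with vector length, which is the kind of drift a cache-less attention matmul would see as the sequence gets longer:

```rust
// Sketch of f16 accumulation error vs an f32 reference. Illustrative only.
use half::f16;

fn main() {
    for len in [16usize, 32, 64, 128] {
        // Pretend these are one row of Q and one column of K^T.
        let a: Vec<f32> = (0..len).map(|i| 0.1 + (i as f32) * 1e-3).collect();
        let b: Vec<f32> = (0..len).map(|i| 0.2 - (i as f32) * 1e-3).collect();

        // Reference dot product computed entirely in f32.
        let exact: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();

        // Same dot product with every multiply and add rounded to f16,
        // mimicking a pure-f16 matmul kernel.
        let mut acc = f16::from_f32(0.0);
        for (x, y) in a.iter().zip(&b) {
            let prod = f16::from_f32(x * y);                     // product rounded to f16
            acc = f16::from_f32(acc.to_f32() + prod.to_f32());   // sum rounded to f16
        }

        println!("len={len:4} exact={exact:.6} f16={:.6}", acc.to_f32());
    }
}
```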
@coreylowman pretty cool analysis, very nicely done. From your analysis, I am unclear how an overflow could occur within a single token of generation (from 32 to 33, with the dot-product change). Are there other overflow or bit-related operations that could be involved?
More information: I don't think this is necessarily related to the kv cache. Notably, when you disable the cache but generate really long sequences, you also see this problem. Generation goes well for a while, but ends up repeating text eventually. My biggest suspect is the reductions in dfdx. They currently don't handle grid-strided indexing (i.e. using gridDim.x) in the kernels, so maybe once arrays get big enough, they fail.
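As a rough illustration of the grid-stride suspicion, here is a plain-Rust sketch (not the actual CUDA kernels; the function names and sizes are invented) of the two indexing schemes. Without the grid-stride loop, any element past the number of launched threads is silently skipped, which would only show up once tensors get large enough:

```rust
// Simulates each "thread" handling exactly one element, the equivalent of
// `idx = blockIdx.x * blockDim.x + threadIdx.x` with no follow-up loop.
// Everything past `total_threads` is never touched.
fn sum_one_element_per_thread(data: &[f32], total_threads: usize) -> f32 {
    data.iter().take(total_threads).copied().sum()
}

// Grid-stride version: thread i also handles i + stride, i + 2*stride, ...
// (`for (i = idx; i < n; i += blockDim.x * gridDim.x)` in CUDA), so coverage
// doesn't depend on the launch size.
fn sum_grid_strided(data: &[f32], total_threads: usize) -> f32 {
    let mut total = 0.0;
    for tid in 0..total_threads {
        let mut i = tid;
        while i < data.len() {
            total += data[i];
            i += total_threads;
        }
    }
    total
}

fn main() {
    let data = vec![1.0f32; 100_000];
    let threads = 65_536; // hypothetical grid size
    println!("one-per-thread: {}", sum_one_element_per_thread(&data, threads)); // 65536
    println!("grid-strided:   {}", sum_grid_strided(&data, threads));           // 100000
}
```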
The weird part is that this test passes, so maybe it isn't softmax/reductions:

```rust
#[test]
fn test_large_softmax() {
    let dev: TestDevice = Default::default();
    let t: Tensor<Rank3<64, 64, 64>, TestDtype, _> = dev.sample_normal();
    let r = t.leaky_trace().softmax::<Axis<2>>();
    let summed = r.sum::<_, Axis<2>>();
    assert_close_to_literal!(summed, [[1.0; 64]; 64]);
}
```

Where the generations start to get weird without cache is roughly
There is a long history of problems with bf16 and f16 interaction in Python frameworks. I am not an expert in the internals here, but if both formats are in play, that could be a problem. There is an excellent study comparing float16 and bfloat16.
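For reference, a quick illustrative sketch with the `half` crate (my own example, not code from this repo) showing the trade-off: bf16 keeps f32's exponent range but only about two to three significant digits, while f16 keeps more mantissa bits but overflows past ~65504. Mixing the two, or assuming one behaves like the other, shifts where rounding or overflow shows up:

```rust
use half::{bf16, f16};

fn main() {
    // Range: f16 overflows to infinity, bf16 does not.
    println!("f16(70000)  = {}", f16::from_f32(70_000.0));  // inf
    println!("bf16(70000) = {}", bf16::from_f32(70_000.0)); // ~70144

    // Precision: bf16 rounds 1.001 all the way back to 1.0.
    println!("f16(1.001)  = {}", f16::from_f32(1.001).to_f32());  // ~1.0009766
    println!("bf16(1.001) = {}", bf16::from_f32(1.001).to_f32()); // 1.0
}
```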
With cache, while the answers are nonsense, at least they are coherent :)