
Initialize KVCache page_tables to zero to prevent NaNs #833

Closed
wants to merge 2 commits into from

Conversation

stbaione
Contributor

Description

I was seeing outputs from llama3.1_70b_tp8 that had token corruption:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'
data:  Washington, D.C.
Name the capital of France. Paris.
Name the capital of China. Beijing.
Name the!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The output from unsharded 70b was fine, but the output from sharded 70b showed this token corruption. After further investigation, it turned out this was also occurring for llama3.1_8b_tp8 when the input prompt was sufficiently long.

The key difference I saw between the sharded and unsharded models is that the KVCache for sharded models was being initialized with NaN values.

For example, these were the debug stats of kv_cache_shard_0, prior to prefill even being executed:

Shape: (256, 655360)
Dtype: float16
Total elements: 167772160
Dimensions: 2

Statistics:
- NaN count: 6
- Inf count: 0

Other tensors were initialized with even more NaN values. This was kv_cache_shard_7:

Shape: (256, 655360)
Dtype: float16
Total elements: 167772160
Dimensions: 2

Statistics:
- NaN count: 8631855
- Inf count: 240
Skipping additional statistics due to NaN/Inf values

For unsharded_70b and unsharded_8b, none of the device_arrays for the KVCache contained NaN values. This appears to be what eventually caused the sharded models to emit all corrupted tokens.
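To see why even a handful of NaNs in the cache is enough to corrupt every subsequent token: once a NaN enters the attention scores, the softmax normalization spreads it to every position, so every downstream logit becomes NaN. A minimal NumPy illustration of this propagation (not the actual sharktank attention kernels):

```python
import numpy as np

# A single NaN among cached attention scores poisons the whole softmax:
# exp(NaN) is NaN, the normalizing sum becomes NaN, and dividing by it
# turns *every* probability NaN -- corrupting all sampled tokens.
scores = np.array([0.1, 0.5, np.nan, 0.2], dtype=np.float16)
probs = np.exp(scores) / np.sum(np.exp(scores))
assert np.isnan(probs).all()
```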

After explicitly initializing page_tables to zero, the issue went away and output looked good across multiple requests of varying length:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Hello, how are you today?",
        "sampling_params": {"max_completion_tokens": 20}
    }'
data:  I'm doing great, thanks for asking! I just got back from a fantastic vacation in Hawaii, and

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the meaning of life, the universe, and everything?",
        "sampling_params": {"max_completion_tokens": 20}
    }'
data:  The answer, according to Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the meaning of life, the universe, and everything?",
        "sampling_params": {"max_completion_tokens": 100}
    }'
data:  The answer, according to Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy," is 42. But what does that really mean?
In the book, a supercomputer named Deep Thought is asked to find the answer to the ultimate question of life, the universe, and everything. After 7.5 million years of computation, Deep Thought finally reveals that the answer is 42. However, the characters realize that the answer doesn't actually answer the ultimate question, and they

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the meaning of life, the universe, and everything?",
        "sampling_params": {"max_completion_tokens": 100}
    }'
data:  The answer, according to Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy," is 42. But what does that really mean?
In the book, a supercomputer named Deep Thought is asked to find the answer to the ultimate question of life, the universe, and everything. After thinking for 7.5 million years, Deep Thought finally reveals that the answer is 42. However, the characters in the book then realize that they don't actually know what the ultimate
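The fix itself amounts to making the allocation well-defined: instead of leaving the page-table buffers uninitialized (where stale device memory can decode as NaN/Inf bit patterns in float16), allocate them zero-filled. A rough NumPy sketch of the idea, with hypothetical names (`allocate_page_table` is not the real shortfin API):

```python
import numpy as np

def allocate_page_table(num_pages: int, page_size: int, dtype=np.float16):
    # np.empty would return whatever bytes happen to be in memory; for
    # float16, some of those bit patterns decode as NaN or Inf. np.zeros
    # guarantees the KV cache starts from a well-defined state.
    return np.zeros((num_pages, page_size), dtype=dtype)

# Scaled-down shape for the demo (the real shards were (256, 655360)).
table = allocate_page_table(256, 640)
assert np.isnan(table).sum() == 0
assert np.isinf(table).sum() == 0
```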

@stbaione stbaione requested review from rsuderman and renxida January 16, 2025 16:04
@ScottTodd
Member

Duplicate of #738 ? Should we sync and merge that one instead?

@stbaione
Contributor Author

> Duplicate of #738 ? Should we sync and merge that one instead?

Oooooh nice, I didn't see that PR. Yeah, the change is exactly the same, but that one also has a test fix, so it makes sense to sync and merge over there.

@stbaione
Contributor Author

Closing in favor of #738

@stbaione stbaione closed this Jan 16, 2025