
Initialize KVCache page_tables to zero to prevent NaNs #833

Closed
wants to merge 2 commits into from

Conversation

stbaione
Contributor

Description

I was seeing outputs from llama3.1_70b_tp8 that had token corruption:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'
data:  Washington, D.C.
Name the capital of France. Paris.
Name the capital of China. Beijing.
Name the!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The output from unsharded 70b was fine, but the output from sharded 70b showed this token corruption. After further investigation, it turned out this was also occurring for llama3.1_8b_tp8 when the input prompt was sufficiently long.

The key difference I saw between the sharded and unsharded models is that the KVCache for sharded models was being initialized with NaN values.

For example, these were the debug stats of kv_cache_shard_0, prior to prefill even being executed:

Shape: (256, 655360)
Dtype: float16
Total elements: 167772160
Dimensions: 2

Statistics:
- NaN count: 6
- Inf count: 0

Other tensors were initialized with even more NaN values. This was kv_cache_shard_7:

Shape: (256, 655360)
Dtype: float16
Total elements: 167772160
Dimensions: 2

Statistics:
- NaN count: 8631855
- Inf count: 240
Skipping additional statistics due to NaN/Inf values

For unsharded_70b and unsharded_8b, none of the device_arrays for the KVCache contained NaN values. This appears to be what eventually caused the sharded models to emit all corrupted tokens.
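To see why even a handful of NaNs in the cache is enough to corrupt every subsequent token: once a NaN enters the attention scores, the softmax normalization spreads it to every position, so every downstream logit becomes NaN. A minimal NumPy illustration of this propagation (not the actual sharktank attention kernels):

```python
import numpy as np

# A single NaN among cached attention scores poisons the whole softmax:
# exp(NaN) is NaN, the normalizing sum becomes NaN, and dividing by it
# turns *every* probability NaN -- corrupting all sampled tokens.
scores = np.array([0.1, 0.5, np.nan, 0.2], dtype=np.float16)
probs = np.exp(scores) / np.sum(np.exp(scores))
assert np.isnan(probs).all()
```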

After explicitly initializing page_tables to zero, the issue went away and output looked good across multiple requests of varying length:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Hello, how are you today?",
        "sampling_params": {"max_completion_tokens": 20}
    }'
data:  I'm doing great, thanks for asking! I just got back from a fantastic vacation in Hawaii, and

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the meaning of life, the universe, and everything?",
        "sampling_params": {"max_completion_tokens": 20}
    }'
data:  The answer, according to Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the meaning of life, the universe, and everything?",
        "sampling_params": {"max_completion_tokens": 100}
    }'
data:  The answer, according to Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy," is 42. But what does that really mean?
In the book, a supercomputer named Deep Thought is asked to find the answer to the ultimate question of life, the universe, and everything. After 7.5 million years of computation, Deep Thought finally reveals that the answer is 42. However, the characters realize that the answer doesn't actually answer the ultimate question, and they

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the meaning of life, the universe, and everything?",
        "sampling_params": {"max_completion_tokens": 100}
    }'
data:  The answer, according to Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy," is 42. But what does that really mean?
In the book, a supercomputer named Deep Thought is asked to find the answer to the ultimate question of life, the universe, and everything. After thinking for 7.5 million years, Deep Thought finally reveals that the answer is 42. However, the characters in the book then realize that they don't actually know what the ultimate
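The fix itself amounts to making the allocation well-defined: instead of leaving the page-table buffers uninitialized (where stale device memory can decode as NaN/Inf bit patterns in float16), allocate them zero-filled. A rough NumPy sketch of the idea, with hypothetical names (`allocate_page_table` is not the real shortfin API):

```python
import numpy as np

def allocate_page_table(num_pages: int, page_size: int, dtype=np.float16):
    # np.empty would return whatever bytes happen to be in memory; for
    # float16, some of those bit patterns decode as NaN or Inf. np.zeros
    # guarantees the KV cache starts from a well-defined state.
    return np.zeros((num_pages, page_size), dtype=dtype)

# Scaled-down shape for the demo (the real shards were (256, 655360)).
table = allocate_page_table(256, 640)
assert np.isnan(table).sum() == 0
assert np.isinf(table).sum() == 0
```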

@stbaione stbaione requested review from rsuderman and renxida January 16, 2025 16:04
@ScottTodd
Member

Duplicate of #738 ? Should we sync and merge that one instead?

@stbaione
Contributor Author

> Duplicate of #738 ? Should we sync and merge that one instead?

Oooooh nice, I didn't see that PR. Yeah, the change is exactly the same, but that one also has a test fix, so it makes sense to sync and merge over there.

@stbaione
Contributor Author

Closing in favor of #738

@stbaione stbaione closed this Jan 16, 2025