TPS loss under bot #8
Thanks, I've been able to reproduce this on my end. I made everything deterministic to see whether there was a parameter I was missing, but with the same parameters, context, and sequence of inputs I got the exact same response from both webui and the bot; the only difference was that webui generated at 5.98 tokens/s while the bot generated at 2.30 tokens/s. Let me see where this thread goes.
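For anyone trying to reproduce this, the comparison only makes sense with sampling pinned down. A minimal sketch of reproducible settings, assuming webui-style parameter names (these names are illustrative, not the bot's exact config):

```python
# Illustrative settings that make generation deterministic, so any speed
# difference between webui and the bot must come from the surrounding code.
params = {
    "do_sample": False,    # greedy decoding: no randomness in token choice
    "temperature": 1.0,    # ignored under greedy decoding, kept for clarity
    "seed": 0,             # fixed seed, in case any sampling path remains
    "max_new_tokens": 200,
}
```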
I removed all the async code, the Discord bot code, and everything else not needed to run the prompt, and called the API directly. Now I'm getting 6.15 tokens/s, so it looks like the slowdown has something to do with async.
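A minimal timing harness along those lines, assuming webui's API extension is running and exposes /api/v1/generate (the endpoint, port, and payload shape are assumptions; adjust them to match your webui version):

```python
import time
import requests

# Call the generation API directly, with no Discord or async code in the way,
# and measure the raw tokens/s the backend can deliver.
payload = {"prompt": "Write a short story about a robot.", "max_new_tokens": 200}

start = time.time()
resp = requests.post("http://127.0.0.1:5000/api/v1/generate", json=payload)
elapsed = time.time() - start

text = resp.json()["results"][0]["text"]
n_tokens = len(text.split())  # rough estimate; a real test would use the model's tokenizer
print(f"~{n_tokens} tokens in {elapsed:.2f}s (~{n_tokens / elapsed:.2f} tokens/s)")
```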
I found the issue. The line of code that streams the generated text to Discord is blocking token generation and slowing it down. Python async is concurrent but runs on a single thread, so that blocking call stalls the event loop while tokens are being produced.
Oh boy, I didn't realize this was going to be such a headache. I'm working on something else at the moment, so I'm going to put this on the back-burner for a week or two. If you prefer performance over streaming, just remove the line I mentioned and you'll get the same performance as with webui.
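In the meantime, here is a minimal sketch of one way to keep streaming without stalling the event loop, assuming a discord.py bot; generate_stream() is a hypothetical stand-in for the bot's real blocking token generator:

```python
import asyncio
import queue
import threading
import time

def generate_stream(prompt):
    # Placeholder for the bot's real blocking generator (hypothetical).
    for word in ("example", "tokens", "streamed", "one", "at", "a", "time"):
        time.sleep(0.1)  # simulate slow, blocking generation
        yield word + " "

def produce_tokens(prompt, q):
    # Runs in a worker thread, so generation never blocks the event loop.
    for token in generate_stream(prompt):
        q.put(token)
    q.put(None)  # sentinel: generation finished

async def stream_reply(channel, prompt):
    q = queue.Queue()
    threading.Thread(target=produce_tokens, args=(prompt, q), daemon=True).start()

    reply = await channel.send("...")
    text, n = "", 0
    while True:
        token = await asyncio.to_thread(q.get)  # wait for a token off the event loop
        if token is None:
            break
        text += token
        n += 1
        if n % 20 == 0:  # throttle edits; Discord rate-limits message edits anyway
            await reply.edit(content=text)
    await reply.edit(content=text)  # final edit with the complete text
```

The key point is that the generator runs in its own thread and the coroutine only awaits, so token production is never paused by Discord I/O.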
I appreciate your diligence in looking into this issue!
Under
python3 server.py --wbits 4 --model ozcur_alpaca-native-4bit --verbose --listen --gpu-memory 5 --groupsize 128
via the UI, I get about 20 tokens per second:
Output generated in 9.68 seconds (20.56 tokens/s, 199 tokens, context 42)
Under the bot with the same flags, I get only about 2 tokens per second:
python3 bot.py --wbits 4 --model ozcur_alpaca-native-4bit --verbose --listen --gpu-memory 5 --groupsize 128
Output generated in 10.30 seconds (2.04 tokens/s, 21 tokens, context 170)
Could the context be the issue? Adding more to the context decreased the rate to about 17 tokens/s:
Output generated in 7.73 seconds (17.20 tokens/s, 133 tokens, context 70)
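For comparison, the logged rate is just tokens divided by wall-clock seconds, so the three runs can be recomputed and compared directly (assuming the 17.20 tokens/s run is also from the UI, the slowdown from extra context there is far smaller than the 10x gap under the bot):

```python
# Recompute tokens/s for the three logged runs: tokens / seconds.
runs = [
    ("webui, context 42", 199, 9.68),
    ("bot, context 170", 21, 10.30),
    ("context 70", 133, 7.73),
]
for label, tokens, seconds in runs:
    print(f"{label}: {tokens / seconds:.2f} tokens/s")
```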
Can the input for this bot be optimized?