TPS loss under bot #8

Open
sfxworks opened this issue Apr 9, 2023 · 5 comments

sfxworks commented Apr 9, 2023

Running python3 server.py --wbits 4 --model ozcur_alpaca-native-4bit --verbose --listen --gpu-memory 5 --groupsize 128 and using the UI, I get about 20 tokens per second:

Output generated in 9.68 seconds (20.56 tokens/s, 199 tokens, context 42)

Under the bot with the same flags, I get only about 2 tokens per second:
python3 bot.py --wbits 4 --model ozcur_alpaca-native-4bit --verbose --listen --gpu-memory 5 --groupsize 128
Output generated in 10.30 seconds (2.04 tokens/s, 21 tokens, context 170)

Is the context the issue? Adding more context decreased the rate to about 17 tokens/s:

Output generated in 7.73 seconds (17.20 tokens/s, 133 tokens, context 70)

Can the input for this bot be optimized?

xNul (Owner) commented Apr 9, 2023

Thanks, I've been able to reproduce on my end.

I made everything deterministic to see if there was a parameter I was missing. With the same parameters, context, and sequence of inputs, I was able to produce the exact same response on both the webui and the bot; the only difference was that the webui generated tokens at 5.98 tokens/s while the bot generated them at 2.30 tokens/s. Let me see where this thread goes.
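
A minimal sketch of that kind of deterministic setup, assuming Hugging Face transformers-style generation parameters (the webui's actual parameter names may differ): greedy decoding plus a fixed seed means both runs must produce identical text, so any remaining difference is purely speed.

```python
import random

import torch

# Fix the sources of randomness so the webui and the bot produce identical output.
random.seed(0)
torch.manual_seed(0)

# Greedy decoding removes sampling randomness entirely; with the same prompt and
# context, the generated tokens are then fully determined by the model.
generation_kwargs = dict(
    do_sample=False,
    max_new_tokens=200,
)
```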

xNul (Owner) commented Apr 9, 2023

I removed all the async code, the Discord bot logic, and any other code not needed to run the prompt, and called the API directly. Now I'm getting 6.15 tokens/s, so the slowdown likely has something to do with the async code.
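
A rough sketch of what that standalone comparison looks like, with no discord.py or asyncio involved; generate_reply here is a stand-in for the webui's streaming generation call, not its real API:

```python
import time

def time_generation(prompt, generate_reply):
    """Time a bare, synchronous streaming call with nothing else in the loop.

    `generate_reply` is assumed to be a generator that yields the text produced
    so far, which is roughly how the webui streams output.
    """
    start = time.time()
    text = ""
    for text in generate_reply(prompt):
        pass  # consume the stream as fast as it is produced
    elapsed = time.time() - start
    tokens = len(text.split())  # crude token count, good enough for a comparison
    print(f"Output generated in {elapsed:.2f} seconds "
          f"({tokens / elapsed:.2f} tokens/s, {tokens} tokens)")
```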

xNul (Owner) commented Apr 10, 2023

I found the issue. This line of code, which streams the generated text to Discord, blocks the token generation and slows it down. Since Python's asyncio is concurrent but single-threaded, simply throwing the Message.edit calls into async tasks won't work. The only option is to run Message.edit in a separate process, which means doing some work with multiprocessing or a message queue. I'm looking into the different options.
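
A simplified sketch of the pattern being described (not the bot's actual code): the token stream and the Message.edit calls share one coroutine, so every edit's round-trip to Discord is time the generation loop spends waiting.

```python
# Sketch only: `generate_stream` is a hypothetical async generator yielding the
# partial reply; the real streaming code in the bot is structured differently.
async def stream_reply(message, generate_stream, prompt):
    async for partial_text in generate_stream(prompt):
        # Each edit is an HTTP round-trip to Discord. The next chunk of tokens
        # is not requested until it returns, which is what drags tokens/s down.
        await message.edit(content=partial_text)
```

Scheduling the edit with asyncio.create_task keeps it on the same single-threaded event loop, which is why, as noted above, making it "more async" doesn't recover the throughput.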

xNul (Owner) commented Apr 10, 2023

Since the Client object in discord.py can't be serialized, it can't be moved to another process and used to make edits there. That means that, in order to keep both performance and response streaming, I'll need to move all of the LLM logic into another process instead. From that process I can send the partial text back over IPC to the Discord process, which then makes the message edits for streaming.

Oh boy, I didn't realize this was going to be such a headache. I'm working on something else at the moment, so I'm going to put this on the back burner for a week or two. If you prefer performance over streaming, just remove the line I mentioned and you'll get the same performance as the webui.
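
A rough sketch of the separate-process plan described above, assuming a multiprocessing.Queue for the IPC; names like generate_reply are placeholders rather than the bot's real functions. The LLM runs in its own process and only plain strings cross the process boundary, so the discord.py Client never has to be serialized.

```python
import asyncio
import multiprocessing as mp

def generate_reply(prompt):
    # Stand-in for the webui's streaming generation call: yields the text so far.
    text = ""
    for word in prompt.split():
        text += word + " "
        yield text

def llm_worker(prompt, queue):
    # Runs in a separate process, so token generation never competes with the
    # Discord event loop for the same thread.
    for partial_text in generate_reply(prompt):
        queue.put(partial_text)
    queue.put(None)  # sentinel: generation finished

async def stream_to_discord(message, prompt):
    queue = mp.Queue()
    proc = mp.Process(target=llm_worker, args=(prompt, queue))
    proc.start()
    loop = asyncio.get_running_loop()
    while True:
        # queue.get() blocks, so hand it to the default executor to keep the
        # Discord event loop responsive while waiting for the next chunk.
        partial_text = await loop.run_in_executor(None, queue.get)
        if partial_text is None:
            break
        await message.edit(content=partial_text)  # a real bot would also rate-limit these edits
    proc.join()
```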

sfxworks (Author) commented

I appreciate your diligence in looking into this issue!
