# Optimizations

## Idle RAM management

Proposed in refactor: v1.9.0 + scheduler, idle RAM management, Observation rewrite (#88)

The current setup for all Stable Diffusion clients is as follows: load the requested model into RAM, use that model to generate the current prompt, and leave the model (weights, prompt dependencies, etc.) in memory until it is either overridden by another model or the process is shut down. This is beneficial for lower-end hardware, as it removes the need to reload the model on each new prompt generation, saving anywhere from 30-90s between prompts.

However, on higher-end hardware (especially the M3 Pro/Max), loading an SDXL model into RAM takes at most 2-3s. As a result, these clients reserve 30-50GB of active memory for as long as the process is running, all to save this particular user a second or two (2-3s in the worst case). Furthermore, you can restart the Python process and load the previous model back into RAM, which results in only ~5GB of idle memory usage and adds a measly 1-2s to each generation queue.

As such, I propose two separate strategies that I plan to implement (as options) within SwiftDiffusion:

| Setup | Idle RAM usage (SDXL) | Added time (per queue) |
| --- | --- | --- |
| default (current) | 30-50GB+ | 0s |
| restartWithLoad | 5-6GB | 1-2s |
| startOnQueue | 1-2GB | 2-3s |
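
As a rough sketch, these options could be modeled as a simple setting enum in SwiftDiffusion; `IdleRamStrategy` and its case documentation are illustrative only, not existing code:

```swift
/// Hypothetical setting mirroring the table above (the name and cases
/// are assumptions, not SwiftDiffusion's actual API).
enum IdleRamStrategy: String, Codable, CaseIterable {
    /// Current behavior: keep the process and model resident (30-50GB+ idle, 0s added).
    case `default`
    /// Restart the Python process after each queue, then reload the last model (5-6GB idle, 1-2s added).
    case restartWithLoad
    /// Restart the Python process with no model loaded; load on the next queue (1-2GB idle, 2-3s added).
    case startOnQueue
}
```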

After a generation queue has finished successfully:

- `restartWithLoad`: end the Python process, start a new Python process, load the last model into RAM
- `startOnQueue`: end the Python process, start a new Python process with no loaded model

On a new generation queue (see the combined sketch after this list):

- `restartWithLoad`: make the generation request
- `startOnQueue`: load the model into RAM, then make the generation request
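
Combining the two lists, here is a minimal sketch of both lifecycle hooks, building on the `IdleRamStrategy` enum above. `PythonProcess`, `ModelCheckpoint`, `didFinishQueue`, and `startQueue` are placeholder names standing in for whatever process and model wrappers SwiftDiffusion actually uses:

```swift
// Placeholder types; all names below are assumptions, not the project's real API.
struct ModelCheckpoint { let name: String }

final class PythonProcess {
    func loadModel(_ model: ModelCheckpoint) { /* send load request to the backend */ }
    func generate(_ prompt: String) { /* send generation request to the backend */ }
    func terminate() { /* kill the underlying Python process */ }
}

final class GenerationController {
    var strategy: IdleRamStrategy = .default
    var process = PythonProcess()
    var lastModel: ModelCheckpoint?

    /// After a generation queue has finished successfully.
    func didFinishQueue() {
        switch strategy {
        case .default:
            break                        // leave the model resident in RAM
        case .restartWithLoad:
            process.terminate()
            process = PythonProcess()    // end Py process, start Py process
            if let model = lastModel {
                process.loadModel(model) // load last model back into RAM (~5GB idle)
            }
        case .startOnQueue:
            process.terminate()
            process = PythonProcess()    // start Py process with no loaded model
        }
    }

    /// On a new generation queue.
    func startQueue(model: ModelCheckpoint, prompt: String) {
        lastModel = model
        if strategy == .startOnQueue {
            process.loadModel(model)     // pay the 2-3s load cost here
        }
        process.generate(prompt)         // make the generation request
    }
}
```

One way to read the trade-off: `restartWithLoad` reloads the model right after a queue finishes, so the next request starts with the model already in RAM, while `startOnQueue` defers the load to the next request; this matches the 1-2s versus 2-3s added-time figures in the table above.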