
v0.2.0

@netaddi netaddi released this 24 Jun 06:31
· 987 commits to main since this release

We are releasing version 0.2.0 of rtp-llm, featuring several major updates:

  • rpc mode of scheduler
  • device backend implementation of models
  • more quantization methods

rpc mode

RPC mode reimplements the inference scheduler in C++, eliminating the performance bottleneck in query batching.

To use RPC mode, start with the environment variable USE_RPC_MODEL=1.

device backend with fully managed gpu memory

The newly introduced device implementation preallocates all GPU memory and optimizes GPU memory usage.

To use the device backend, you must first enable RPC mode, then also start with the environment variable USE_NEW_DEVICE_IMPL=1.
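As a rough illustration of how the two switches relate, the sketch below builds the environment for a server process. The helper name `build_rtp_llm_env` is hypothetical and not part of rtp-llm; only the variable names USE_RPC_MODEL and USE_NEW_DEVICE_IMPL come from these notes.

```python
import os


def build_rtp_llm_env(use_rpc=True, use_device_backend=False, base_env=None):
    """Return a copy of the environment with the rtp-llm feature switches set.

    Hypothetical helper for illustration; it is not part of rtp-llm.
    """
    if use_device_backend and not use_rpc:
        # The device backend requires RPC mode, per the release notes.
        raise ValueError("USE_NEW_DEVICE_IMPL=1 requires USE_RPC_MODEL=1")
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if use_rpc:
        env["USE_RPC_MODEL"] = "1"
    if use_device_backend:
        env["USE_NEW_DEVICE_IMPL"] = "1"
    return env


env = build_rtp_llm_env(use_rpc=True, use_device_backend=True, base_env={})
print(env)
```

The returned dict can be passed as the `env` argument of `subprocess.Popen` together with whatever command you normally use to start the server.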

Set DEVICE_RESERVE_MEMORY_BYTES to change the number of bytes of GPU memory reserved for rtp-llm. A negative value reserves all available memory except that many bytes, which are left free. The default is -134217728 (preallocate all GPU memory but leave 128MB free).
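The sign convention can be checked with a little arithmetic. The sketch below is only an interpretation of the rule described above, not rtp-llm's actual code: the function name `bytes_to_preallocate` is hypothetical, and capping a positive value at the available memory is an assumption.

```python
# Default from the release notes: -(128 * 1024 * 1024) bytes.
DEFAULT_DEVICE_RESERVE_MEMORY_BYTES = -134217728


def bytes_to_preallocate(setting, available_bytes):
    """Interpret a DEVICE_RESERVE_MEMORY_BYTES value (illustrative only)."""
    if setting < 0:
        # Negative: take everything available except |setting| bytes.
        return max(available_bytes + setting, 0)
    # Non-negative: reserve this many bytes.
    # Capping at the available amount is an assumption of this sketch.
    return min(setting, available_bytes)


available = 80 * 1024**3  # e.g. a GPU with 80 GiB currently free
reserved = bytes_to_preallocate(DEFAULT_DEVICE_RESERVE_MEMORY_BYTES, available)
print(available - reserved)  # bytes left free for other users of the GPU
```

With the default setting, the difference between available and preallocated memory is exactly 128 * 1024 * 1024 bytes.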

HOST_RESERVE_MEMORY_BYTES works similarly but reserves host memory, which improves framework performance. The default is 2GB.

quantization

SmoothQuant and OmniQuant are supported on Llama and Qwen models.

Using SmoothQuant requires a smoothquant.ini file under the checkpoint directory.

Using OmniQuant, GPTQ or AWQ requires adding quantization fields to the config:

"quantization_config": {
    "bits": 8,
    "quant_method": "omni_quant"
}
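If the config is edited programmatically, the fields can be merged in as in the minimal sketch below. The helper `with_quantization_config` and the `model_type` entry are hypothetical placeholders; only the `quantization_config` keys and the `omni_quant` value come from these notes, and the method strings for GPTQ and AWQ are not shown here.

```python
import json


def with_quantization_config(config, quant_method="omni_quant", bits=8):
    """Return a copy of a config dict with the quantization fields added.

    Hypothetical helper for illustration; it is not part of rtp-llm.
    """
    updated = dict(config)  # shallow copy, the input dict is left untouched
    updated["quantization_config"] = {
        "bits": bits,
        "quant_method": quant_method,
    }
    return updated


# "model_type" is a stand-in for whatever the existing config already contains.
config = with_quantization_config({"model_type": "qwen"})
print(json.dumps(config, indent=4))
```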

All quantization methods are now supported on GPUs starting from SM70.

other improvements

  • GLM4, GLM4V, llava-next, Qwen2 supported
  • optimized performance on NVIDIA A100