Merge branch 'master' into cb-by-default-int8-respect-ir
ilya-lavrenov committed Jan 22, 2025
2 parents 6cf44c2 + aa552d1 commit 8654d1f
Showing 27 changed files with 382 additions and 212 deletions.
75 changes: 46 additions & 29 deletions samples/cpp/text_generation/README.md
@@ -2,7 +2,7 @@

These samples showcase the use of OpenVINO's inference capabilities for text generation tasks, including different decoding strategies such as beam search, multinomial sampling, and speculative decoding. Each sample has a specific focus and demonstrates a unique aspect of text generation.
The applications don't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU.
There are also Jupyter notebooks for some samples. You can find links to them in the appropriate sample descritions.
There are also Jupyter notebooks for some samples. You can find links to them in the appropriate sample descriptions.

## Table of Contents
1. [Download and Convert the Model and Tokenizers](#download-and-convert-the-model-and-tokenizers)
@@ -11,25 +11,50 @@ There are also Jupyter notebooks for some samples. You can find links to them in
4. [Support and Contribution](#support-and-contribution)

## Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

It's not required to install [../../export-requirements.txt](../../export-requirements.txt) for deployment if the model has already been exported.

Install [../../export-requirements.txt](../../export-requirements.txt) if model conversion is required.
```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
pip install --upgrade-strategy eager -r ../../export-requirements.txt
optimum-cli export openvino --model <model> <output_folder>
```
If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli.
```sh
pip install --upgrade-strategy eager -r ../../export-requirements.txt
huggingface-cli download <model> --local-dir <output_folder>
```
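The download can also be scripted with the `huggingface_hub` Python package (installed together with `huggingface-cli`). A minimal sketch; the repo id below is only an illustration, substitute any model from the collection linked above:
```python
from huggingface_hub import snapshot_download

# Illustrative repo id -- replace with a model from the OpenVINO collection on Hugging Face.
local_dir = snapshot_download(
    repo_id="OpenVINO/TinyLlama-1.1B-Chat-v1.0-int4-ov",
    local_dir="TinyLlama-1.1B-Chat-v1.0-int4-ov",
)
print("Model downloaded to", local_dir)
```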

## Sample Descriptions
### Common information
Follow [Get Started with Samples](https://docs.openvino.ai/2024/learn-openvino/openvino-samples/get-started-demos.html) to get common information about OpenVINO samples.
Follow the [build instructions](https://github.com/openvinotoolkit/openvino.genai/blob/master/src/docs/BUILD.md) to build the GenAI samples.

GPUs usually provide better performance compared to CPUs. Modify the source code to change the device for inference to the GPU.

Discrete GPUs (dGPUs) usually provide better performance compared to CPUs. It is recommended to run larger models on a dGPU with 32GB+ RAM. For example, the model meta-llama/Llama-2-13b-chat-hf can benefit from being run on a dGPU. Modify the source code to change the device for inference to the GPU.
See https://github.com/openvinotoolkit/openvino.genai/blob/master/SUPPORTED_MODELS.md for the list of supported models.

See https://github.com/openvinotoolkit/openvino.genai/blob/master/src/README.md#supported-models for the list of supported models.
Install [../../deployment-requirements.txt](../../deployment-requirements.txt) to run the samples.
```sh
pip install --upgrade-strategy eager -r ../../deployment-requirements.txt
```

### 1. Greedy Causal LM (`greedy_causal_lm`)
### 1. Chat Sample (`chat_sample`)
- **Description:**
Interactive chat interface powered by OpenVINO.
Here is a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) that provides an example of LLM-powered text generation in Python.
Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat-v1.0, etc
- **Main Feature:** Real-time chat-like text generation.
- **Run Command:**
```bash
./chat_sample <MODEL_DIR>
```
#### Missing chat template
If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add a chat template to the tokenizer_config.json of your model.
The following template can be used as a default, but it may not work properly with every model:
```
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
```
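If you prefer to patch the file programmatically, the sketch below (standard library only; `model_dir` is a placeholder for your exported model folder) adds the default template above to an existing `tokenizer_config.json` without overwriting a template that is already present:
```python
import json
from pathlib import Path

model_dir = Path("model_dir")  # placeholder: folder produced by optimum-cli or huggingface-cli
config_path = model_dir / "tokenizer_config.json"

tokenizer_config = json.loads(config_path.read_text(encoding="utf-8"))
# setdefault keeps an existing chat_template untouched and only fills in the default one.
tokenizer_config.setdefault(
    "chat_template",
    "{% for message in messages %}{% if (message['role'] == 'user') %}"
    "{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}"
    "{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}"
    "{% endif %}{% endfor %}",
)
config_path.write_text(json.dumps(tokenizer_config, indent=2), encoding="utf-8")
print("chat_template written to", config_path)
```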

### 2. Greedy Causal LM (`greedy_causal_lm`)
- **Description:**
Basic text generation using a causal language model.
Here is a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-question-answering) that provides an example of LLM-powered text generation in Python.
@@ -40,7 +65,7 @@ Recommended models: meta-llama/Llama-2-7b-hf, etc
./greedy_causal_lm <MODEL_DIR> "<PROMPT>"
```

### 2. Beam Search Causal LM (`beam_search_causal_lm`)
### 3. Beam Search Causal LM (`beam_search_causal_lm`)
- **Description:**
Uses beam search for more coherent text generation.
Here is a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-question-answering) that provides an example of LLM-powered text generation in Python.
@@ -51,23 +76,6 @@ Recommended models: meta-llama/Llama-2-7b-hf, etc
./beam_search_causal_lm <MODEL_DIR> "<PROMPT 1>" ["<PROMPT 2>" ...]
```

### 3. Chat Sample (`chat_sample`)
- **Description:**
Interactive chat interface powered by OpenVINO.
Here is a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) that provides an example of LLM-powered text generation in Python.
Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat-v1.0, etc
- **Main Feature:** Real-time chat-like text generation.
- **Run Command:**
```bash
./chat_sample <MODEL_DIR>
```
#### Missing chat template
If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add a chat template to the tokenizer_config.json of your model.
The following template can be used as a default, but it may not work properly with every model:
```
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
```

### 4. Multinomial Causal LM (`multinomial_causal_lm`)
- **Description:** Text generation with multinomial sampling for diversity.
Recommended models: meta-llama/Llama-2-7b-hf, etc
@@ -104,7 +112,16 @@ Recommended models: meta-llama/Llama-2-13b-hf as main model and TinyLl
./speculative_decoding_lm <MODEL_DIR> <DRAFT_MODEL_DIR> "<PROMPT>"
```

### 7. Encrypted Model Causal LM (`encrypted_model_causal_lm`)
### 7. LoRA Greedy Causal LM (`lora_greedy_causal_lm`)
- **Description:**
This sample demonstrates greedy decoding using Low-Rank Adaptation (LoRA) fine-tuned causal language models. LoRA enables efficient fine-tuning, reducing resource requirements for adapting large models to specific tasks.
- **Main Feature:** Lightweight fine-tuning with LoRA for efficient text generation
- **Run Command:**
```bash
./lora_greedy_causal_lm <MODEL_DIR> <ADAPTER_SAFETENSORS_FILE> "<PROMPT>"
```

### 8. Encrypted Model Causal LM (`encrypted_model_causal_lm`)
- **Description:**
LLMPipeline and Tokenizer objects can be initialized directly from a memory buffer, e.g. when the user stores only encrypted files and decrypts them on the fly.
The following code snippet demonstrates how to load the model from the memory buffer:
@@ -120,7 +137,7 @@ For the sake of brevity the code above does not include Tokenizer decryption. Fo
./encrypted_model_causal_lm <MODEL_DIR> "<PROMPT>"
```

### 8. LLMs benchmarking sample (`benchmark_genai`)
### 9. LLMs benchmarking sample (`benchmark_genai`)
- **Description:**
This sample script demonstrates how to benchmark an LLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text, and calculating various performance metrics.

2 changes: 1 addition & 1 deletion samples/export-requirements.txt
@@ -7,7 +7,7 @@ numpy<2.0.0; sys_platform == 'darwin'
einops==0.8.0 # For Qwen
transformers_stream_generator==0.0.5 # For Qwen
diffusers==0.32.2 # For image generation pipelines
timm==1.0.13 # For exporting InternVL2
timm==1.0.14 # For exporting InternVL2
torchvision # For visual language models
transformers>=4.43 # For Whisper
hf_transfer # for faster model downloads; should be used with env var HF_HUB_ENABLE_HF_TRANSFER=1
73 changes: 45 additions & 28 deletions samples/python/text_generation/README.md
@@ -2,7 +2,7 @@

These samples showcase the use of OpenVINO's inference capabilities for text generation tasks, including different decoding strategies such as beam search, multinomial sampling, and speculative decoding. Each sample has a specific focus and demonstrates a unique aspect of text generation.
The applications don't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU.
There are also Jupyter notebooks for some samples. You can find links to them in the appropriate sample descritions.
There are also Jupyter notebooks for some samples. You can find links to them in the appropriate sample descriptions.

## Table of Contents
1. [Download and Convert the Model and Tokenizers](#download-and-convert-the-model-and-tokenizers)
@@ -11,25 +11,50 @@ There are also Jupyter notebooks for some samples. You can find links to them in
4. [Support and Contribution](#support-and-contribution)

## Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

It's not required to install [../../export-requirements.txt](../../export-requirements.txt) for deployment if the model has already been exported.

Install [../../export-requirements.txt](../../export-requirements.txt) if model conversion is required.
```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
pip install --upgrade-strategy eager -r ../../export-requirements.txt
optimum-cli export openvino --model <model> <output_folder>
```
If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli.
```sh
pip install --upgrade-strategy eager -r ../../export-requirements.txt
huggingface-cli download <model> --local-dir <output_folder>
```

## Sample Descriptions
### Common information
Follow [Get Started with Samples](https://docs.openvino.ai/2024/learn-openvino/openvino-samples/get-started-demos.html) to get common information about OpenVINO samples.
Follow the [build instructions](https://github.com/openvinotoolkit/openvino.genai/blob/master/src/docs/BUILD.md) to build the GenAI samples.

GPUs usually provide better performance compared to CPUs. Modify the source code to change the device for inference to the GPU.

Discrete GPUs (dGPUs) usually provide better performance compared to CPUs. It is recommended to run larger models on a dGPU with 32GB+ RAM. For example, the model meta-llama/Llama-2-13b-chat-hf can benefit from being run on a dGPU. Modify the source code to change the device for inference to the GPU.
See https://github.com/openvinotoolkit/openvino.genai/blob/master/SUPPORTED_MODELS.md for the list of supported models.

See https://github.com/openvinotoolkit/openvino.genai/blob/master/src/README.md#supported-models for the list of supported models.
Install [../../deployment-requirements.txt](../../deployment-requirements.txt) to run the samples.
```sh
pip install --upgrade-strategy eager -r ../../deployment-requirements.txt
```

### 1. Greedy Causal LM (`greedy_causal_lm`)
### 1. Chat Sample (`chat_sample`)
- **Description:**
Interactive chat interface powered by OpenVINO.
Here is a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) that provides an example of LLM-powered text generation in Python.
Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat-v1.0, etc
- **Main Feature:** Real-time chat-like text generation.
- **Run Command:**
```bash
python chat_sample.py model_dir
```
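For reference, the core of such a chat loop with the `openvino_genai` Python API looks roughly like the sketch below. It is a simplified version, not the full sample; `model_dir`, the device, and `max_new_tokens` are placeholders:
```python
import openvino_genai

pipe = openvino_genai.LLMPipeline("model_dir", "CPU")  # switch to "GPU" to run on a GPU

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

def streamer(subword):
    # Print tokens as they are produced; returning False keeps generation running.
    print(subword, end="", flush=True)
    return False

pipe.start_chat()
while True:
    try:
        prompt = input("question:\n")
    except EOFError:
        break
    pipe.generate(prompt, config, streamer)
    print("\n----------")
pipe.finish_chat()
```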
#### Missing chat template
If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add a chat template to the tokenizer_config.json of your model.
The following template can be used as a default, but it may not work properly with every model:
```
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
```

### 2. Greedy Causal LM (`greedy_causal_lm`)
- **Description:**
Basic text generation using a causal language model.
Here is a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-question-answering) that provides an example of LLM-powered text generation in Python.
@@ -40,7 +65,7 @@ Recommended models: meta-llama/Llama-2-7b-hf, etc
python greedy_causal_lm.py [-h] model_dir prompt
```

### 2. Beam Search Causal LM (`beam_search_causal_lm`)
### 3. Beam Search Causal LM (`beam_search_causal_lm`)
- **Description:**
Uses beam search for more coherent text generation.
Here is a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-question-answering) that provides an example of LLM-powered text generation in Python.
@@ -51,23 +76,6 @@ Recommended models: meta-llama/Llama-2-7b-hf, etc
python beam_search_causal_lm.py model_dir prompt [prompts ...]
```
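Compared to greedy decoding, the sample mainly differs in the generation config. A minimal sketch of the relevant settings (the values are illustrative, not tuned):
```python
import openvino_genai

pipe = openvino_genai.LLMPipeline("model_dir", "CPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 20
config.num_beams = 15           # total beams kept per step
config.num_beam_groups = 3      # diverse beam search groups
config.diversity_penalty = 1.0  # discourage groups from repeating each other's tokens
config.num_return_sequences = config.num_beams

# Several prompts can be passed at once, mirroring the sample's command line.
print(pipe.generate(["Why is the Sun yellow?", "What is OpenVINO?"], config))
```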

### 3. Chat Sample (`chat_sample`)
- **Description:**
Interactive chat interface powered by OpenVINO.
Here is a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) that provides an example of LLM-powered text generation in Python.
Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat-v1.0, etc
- **Main Feature:** Real-time chat-like text generation.
- **Run Command:**
```bash
python chat_sample.py model_dir
```
#### Missing chat template
If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add a chat template to the tokenizer_config.json of your model.
The following template can be used as a default, but it may not work properly with every model:
```
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
```

### 4. Multinomial Causal LM (`multinomial_causal_lm`)
- **Description:** Text generation with multinomial sampling for diversity.
Recommended models: meta-llama/Llama-2-7b-hf, etc
@@ -104,7 +112,16 @@ Recommended models: meta-llama/Llama-2-13b-hf as main model and TinyLl
python speculative_decoding_lm.py model_dir draft_model_dir prompt
```
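Roughly, the sample attaches a small draft model to the main pipeline. The sketch below shows the idea; it assumes the `draft_model` helper and the `num_assistant_tokens` setting of the GenAI Python API, and the directories and values are placeholders:
```python
import openvino_genai

# The draft model may run on the same device as the main model or on a different one.
draft = openvino_genai.draft_model("draft_model_dir", "CPU")
pipe = openvino_genai.LLMPipeline("model_dir", "CPU", draft_model=draft)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
config.num_assistant_tokens = 5  # tokens proposed by the draft model per step

print(pipe.generate("What is OpenVINO?", config))
```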

### 7. LLMs benchmarking sample (`benchmark_genai`)
### 7. LoRA Greedy Causal LM (`lora_greedy_causal_lm`)
- **Description:**
This sample demonstrates greedy decoding using Low-Rank Adaptation (LoRA) fine-tuned causal language models. LoRA enables efficient fine-tuning, reducing resource requirements for adapting large models to specific tasks.
- **Main Feature:** Lightweight fine-tuning with LoRA for efficient text generation
- **Run Command:**
```bash
python lora_greedy_causal_lm.py model_dir adapter_safetensors_file prompt
```
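A rough sketch of what such a sample does with the GenAI adapter API is shown below; it assumes the `Adapter`/`AdapterConfig` bindings of the `openvino_genai` package, and the paths are placeholders:
```python
import openvino_genai

adapter = openvino_genai.Adapter("adapter.safetensors")  # placeholder path to the LoRA weights
adapter_config = openvino_genai.AdapterConfig(adapter)

# Registering the adapter at construction time lets it be switched on or off per generate() call.
pipe = openvino_genai.LLMPipeline("model_dir", "CPU", adapters=adapter_config)

# With the adapter applied...
print(pipe.generate("What is OpenVINO?", max_new_tokens=100, adapters=adapter_config))
# ...and without it, for comparison.
print(pipe.generate("What is OpenVINO?", max_new_tokens=100, adapters=openvino_genai.AdapterConfig()))
```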

### 8. LLMs benchmarking sample (`benchmark_genai`)
- **Description:**
This sample script demonstrates how to benchmark an LLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text, and calculating various performance metrics, as sketched below.
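The metrics are exposed through the `perf_metrics` object attached to generation results; a minimal sketch of reading them (accessor names follow the GenAI `PerfMetrics` API and should be checked against the current documentation):
```python
import openvino_genai

pipe = openvino_genai.LLMPipeline("model_dir", "CPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 20

result = pipe.generate(["The Sky is blue because"], config)
metrics = result.perf_metrics
print(f"Load time:  {metrics.get_load_time():.2f} ms")
print(f"TTFT:       {metrics.get_ttft().mean:.2f} ± {metrics.get_ttft().std:.2f} ms")
print(f"TPOT:       {metrics.get_tpot().mean:.2f} ± {metrics.get_tpot().std:.2f} ms/token")
print(f"Throughput: {metrics.get_throughput().mean:.2f} ± {metrics.get_throughput().std:.2f} tokens/s")
```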

8 changes: 1 addition & 7 deletions src/cpp/src/continuous_batching_impl.cpp
@@ -168,13 +168,6 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::step() {
m_pipeline_metrics.max_cache_usage = std::max(m_pipeline_metrics.max_cache_usage, scheduler_output.m_cache_usage);
_register_step_cache_usage(scheduler_output.m_cache_usage);
m_pipeline_metrics.avg_cache_usage = _get_current_running_average_cache_usage();

m_batch_size = 0; // total number of running sequences
for (size_t i = 0; i < scheduler_output.m_scheduled_sequence_groups_ids.size(); ++i) {
size_t seq_group_id = scheduler_output.m_scheduled_sequence_groups_ids[i];
SequenceGroup::CPtr sequence_group = m_requests[seq_group_id];
m_batch_size += sequence_group->num_running_seqs();
}
}

// if no tokens were scheduled, we are out of memory => free all requests and return
@@ -222,6 +215,7 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::step() {
static ManualTimer timer("sample");
timer.start();
sampler_output = m_sampler->sample(m_requests, logits, m_is_validation_mode_enabled);
m_batch_size = sampler_output.num_generated_tokens;
timer.end();
}

2 changes: 1 addition & 1 deletion src/cpp/src/continuous_batching_impl.hpp
@@ -33,7 +33,7 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public ContinuousBatc

// for perf metrics
float m_load_time_ms = 0.0f;
size_t m_batch_size = 0; // stored number of scheduled sequences on last step
size_t m_batch_size = 0; // stored number of processed tokens on last step

// flag to enable validation mode for sampler
bool m_is_validation_mode_enabled = false;