🇨🇳Chinese | 🌐English | 📖Documentation | ❓Issues | 💬Discussions | ⚔️Arena
🤗 Hugging Face • 🤖 ModelScope • 🐿️ Machine Heart (机器之心) SOTA! Models • 🟣 wisemodel • 🤗 Online Demo
This project builds on Llama-3, Meta's latest open-source large language model, and is the third generation of the Chinese-LLaMA-Alpaca open-source LLM series (1st gen, 2nd gen). It open-sources the Llama-3-Chinese base model and the Llama-3-Chinese-Instruct instruction-tuned model. These models continue pre-training the original Llama-3 on large-scale Chinese data and are then fine-tuned on selected instruction data, which strengthens basic Chinese semantic understanding and instruction following and yields a significant improvement over the second-generation models.
- 🚀 Open-source Llama-3-Chinese base model and Llama-3-Chinese-Instruct instruction model (v1, v2, v3)
- 🚀 Released pre-training scripts and instruction fine-tuning scripts, allowing users to further train or fine-tune the model as needed
- 🚀 Released alpaca_zh_51k, stem_zh_instruction, ruozhiba_gpt4 (4o/4T) instruction data
- 🚀 Provides a tutorial for quickly quantizing and deploying large models locally using a personal computer's CPU/GPU
- 🚀 Supports 🤗transformers, llama.cpp, text-generation-webui, vLLM, Ollama and other Llama-3 ecosystems
Chinese Mixtral | Chinese LLaMA-2 & Alpaca-2 Large Models | Chinese LLaMA & Alpaca Large Models | Multimodal Chinese LLaMA & Alpaca Large Models | Multimodal VLE | Chinese MiniRBT | Chinese LERT | Chinese-English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge Distillation Tool TextBrewer | Model Pruning Tool TextPruner | Distillation and Pruning Integrated GRAIN
[2024/05/30] Released Llama-3-Chinese-8B-Instruct-v3, which performs better on downstream tasks than v1/v2. For details, see: 📚Version 3.0 Release Log
[2024/05/08] Released Llama-3-Chinese-8B-Instruct-v2, which is tuned directly on Meta-Llama-3-8B-Instruct with 5M instructions. For details, see: 📚Version 2.0 Release Log
[2024/05/07] Added pre-training and SFT scripts. For details, see: 📚Version 1.1 Release Log
[2024/04/30] Released the Llama-3-Chinese-8B base model and Llama-3-Chinese-8B-Instruct instruction model. For details, see: 📚Version 1.0 Release Log
[2024/04/19] 🚀 Officially launched the Chinese-LLaMA-Alpaca-3 project
Section | Description |
---|---|
💁🏻♂️Model Introduction | Briefly introduces the technical features of the models related to this project |
⏬Model Download | Download addresses for the Chinese Llama-3 large models |
💻Inference and Deployment | Describes how to quantize the model and deploy it using a personal computer to experience the large model |
💯Model Performance | Reports model performance on selected tasks |
📝Training and Fine-Tuning | Introduces how to train and fine-tune the Chinese Llama-3 large models |
❓Frequently Asked Questions | Replies to some common questions |
This project has launched the Chinese open-source large models Llama-3-Chinese and Llama-3-Chinese-Instruct based on Meta Llama-3. The main features are as follows:
- Llama-3 has significantly expanded its vocabulary from 32K to 128K and switched to a BPE vocabulary.
- Preliminary experiments show that the Llama-3 vocabulary encodes Chinese about as efficiently as our expanded vocabulary in Chinese LLaMA-2, reaching about 95% of its encoding efficiency in tests on Wikipedia data.
- Based on our experience and experimental conclusions with Chinese Mixtral 1, we did not expand the vocabulary further.
- Llama-3 has increased the native context window length from 4K to 8K, allowing for further processing of longer context information.
- Users can also use methods like PI, NTK, and YaRN to extend the model's long context capabilities to support longer text processing.
- Llama-3 adopts the Grouped Query Attention (GQA) mechanism used in the large parameter version of Llama-2, which further enhances the model's efficiency.
- Llama-3-Instruct uses a new instruction template, which is not compatible with Llama-2-chat; it should be used strictly following the official instruction template. (See instruction template)
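As a rough illustration of the Position Interpolation (PI) idea mentioned in the list above, the sketch below rescales token positions so that a longer sequence maps back into the natively supported 8K range before the rotary angles are computed. The function names, the head dimension of 128, the RoPE base of 500000, and the 16K target length are assumptions made for illustration and are not taken from this project's code; NTK and YaRN use different scaling rules and are not shown.

```python
def rope_angles(position, head_dim=128, base=500000.0):
    """Rotary angles for one position: position * base^(-2i / head_dim)."""
    return [position * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def interpolated_position(position, trained_len=8192, target_len=16384):
    """Linear position interpolation: squeeze positions into the trained range."""
    scale = trained_len / target_len
    return position * scale


# A token at position 12000 lies outside the 8K window; after interpolation
# it is treated as a (fractional) position inside the trained range.
scaled = interpolated_position(12000)
angles = rope_angles(scaled)
print(scaled)      # position actually fed to the rotary embedding
print(len(angles)) # one angle per pair of head dimensions
```

The trade-off in this scheme is that neighbouring tokens end up closer together in position space, which is why PI is usually combined with a short fine-tuning run.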
Here's a comparison of the models in this project and recommended usage scenarios. For chat interactions, please choose the Instruct version.
Comparison Item | Llama-3-Chinese-8B | Llama-3-Chinese-8B-Instruct |
---|---|---|
Model Type | Base Model | Instruction/Chat Model (similar to ChatGPT) |
Model Size | 8B | 8B |
Training Type | Causal-LM (CLM) | Instruction Fine-Tuning |
Training Method | LoRA + Full emb/lm-head | LoRA + Full emb/lm-head |
Initial Model | Meta-Llama-3-8B | v1: Llama-3-Chinese-8B v2: Meta-Llama-3-8B-Instruct v3: mix of inst/inst-v2/inst-meta |
Training Corpus | Unlabeled general corpus (approx. 120GB) | Labeled instruction data (approx. 5 million entries) |
Vocabulary Size | Original vocabulary (128,256) | Original vocabulary (128,256) |
Supported Context Length | 8K | 8K |
Input Template | Not required | Requires Llama-3-Instruct template |
Applicable Scenarios | Text continuation: Given a context, let the model generate the following text | Instruction understanding: Q&A, writing, chatting, interaction, etc. |
Here is a comparison between different versions of Instruct. Unless there is a clear preference, please prioritize using the Instruct-v3 version.
Comparison Item | Instruct-v1 | Instruct-v2 | Instruct-v3 |
---|---|---|---|
Release Date | 2024/4/30 | 2024/5/8 | 2024/5/30 |
Base Model | Original Meta-Llama-3-8B | Original Meta-Llama-3-8B-Instruct | (See Training Method) |
Training Method | First stage: pre-training with 120G Chinese corpus; second stage: fine-tuning with 5 million instruction data | Direct fine-tuning with 5 million instruction data | Model merging using inst-v1, inst-v2, and inst-meta, followed by fine-tuning with a small amount of instruction data |
Chinese Proficiency | 49.3 / 51.5 | 51.6 / 51.6 | 55.2 / 54.8 👍🏻 |
English Proficiency | 63.21 | 66.68 | 66.81 👍🏻 |
Long Text Capability | 29.6 | 46.4 👍🏻 | 40.5 |
LLM Arena Win Rate / Elo | 49.4% / 1430 | 66.1% / 1559 | 83.6% / 1627 👍🏻 |
Note
Chinese proficiency results are from C-Eval (valid); English proficiency results are from Open LLM Leaderboard (avg); long text capability results are from LongBench (avg). For detailed performance, please refer to the 💯 Model Performance section.
Model Name | Full Version | LoRA Version | GGUF Version |
---|---|---|---|
Llama-3-Chinese-8B-Instruct-v3 (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | N/A | [🤗Hugging Face] [🤖ModelScope] |
Llama-3-Chinese-8B-Instruct-v2 (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |
Llama-3-Chinese-8B-Instruct (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |
Llama-3-Chinese-8B (base model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |
Model Type Description:
- Full Model: Can be used directly for training and inference, no other merging steps required.
- LoRA Model: Needs to be merged with the original base model to convert into a full version, merging steps: 💻 Model Merging Steps
- v1 base model: Meta-Llama-3-8B
- v2 base model: Meta-Llama-3-8B-Instruct
- GGUF Model: Quantization format released by llama.cpp, compatible with common large model inference tools such as ollama; recommended for users who only need inference and deployment. Models whose names carry the -im suffix were generated with an importance matrix, which generally gives better performance.
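To make the idea of merging a LoRA model into its base model concrete, here is a minimal sketch of the underlying arithmetic: the low-rank update is scaled and added to the base weight, W' = W + (alpha / rank) * B * A. It uses plain Python lists and toy values, all of which are hypothetical; it covers only the low-rank part and is not this project's merge script, so please follow the Model Merging Steps for real checkpoints.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (rank, in_features)
B = [[0.5], [1.0]]  # shape (out_features, rank)
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the full version can be used without extra steps.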
Note
If access to Hugging Face is blocked, consider using a mirror site (such as hf-mirror.com); please look up the specific setup steps yourself.
The models in this project primarily support the following quantization, inference, and deployment methods. Please refer to the corresponding tutorials for detailed information.
Tool | Features | CPU | GPU | Quantization | GUI | API | vLLM | Tutorial |
---|---|---|---|---|---|---|---|---|
llama.cpp | Rich GGUF quantization options and efficient local inference | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [link] |
🤗transformers | Native transformers inference interface | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [link] |
Imitation OpenAI API Calls | Server demo with an interface similar to OpenAI API | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | [link] |
text-generation-webui | Front-end Web UI deployment method | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [link] |
LM Studio | Multi-platform chat software with interface | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [link] |
Ollama | Local large model inference | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [link] |
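For the OpenAI-style server demo listed in the table, a client request would presumably follow the usual chat-completions shape. The sketch below only builds such a request with the standard library and does not send it; the base URL, port, endpoint path, and model name are hypothetical placeholders, so check the linked tutorial for the values the server actually expects.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text,
                       system_text="You are a helpful assistant."):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",  # assumed path
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local server address and model name.
request = build_chat_request(
    "http://localhost:8000", "llama-3-chinese-8b-instruct", "你好"
)
print(request.full_url)
# urllib.request.urlopen(request) would send it once a server is running.
```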
To evaluate the effectiveness of the related models, this project conducted both generative performance evaluations and objective performance evaluations (NLU type), assessing the large models from different perspectives. Users are recommended to test on tasks of their interest and choose models suitable for those tasks.
- This project has launched an online model battle platform, modeled after the FastChat Chatbot Arena, where users can browse and rate the quality of model responses. The platform reports metrics such as win rates and Elo scores, and shows the head-to-head win rates between models. ⚔️ Model Arena: http://llm-arena.ymcui.com
- The examples directory provides output samples of Llama-3-Chinese-8B-Instruct and Chinese-Mixtral-Instruct, and compares scores using GPT-4-turbo, with Llama-3-Chinese-8B-Instruct averaging a score of 8.1 and Chinese-Mixtral-Instruct averaging 7.8. 📄 Output Sample Comparison: examples
- This project has joined the Machine Heart SOTA! Model platform; an online demo will be added later: https://sota.jiqizhixin.com/project/chinese-llama-alpaca-3
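For readers unfamiliar with Elo scores, the sketch below shows the standard Elo expected-score and update formulas. It is a generic illustration only: the K factor of 32 and the function names are assumptions, and the arena's actual rating procedure is not described here and may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b); score_a is 1 for a win, 0.5 draw, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


# Two equally rated models: the winner gains exactly what the loser gives up.
print(update(1500, 1500, 1.0))
# A higher-rated model is expected to win more than half of its battles.
print(expected_score(1627, 1559) > 0.5)
```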
C-Eval is a comprehensive Chinese fundamental model evaluation suite, with its validation and test sets comprising 1.3K and 12.3K multiple-choice questions respectively, covering 52 subjects. For C-Eval inference code, please refer to this project: 📖GitHub Wiki
Models | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
---|---|---|---|---|
Llama-3-Chinese-8B-Instruct-v3 | 55.2 | 54.8 | 52.1 | 52.4 |
Llama-3-Chinese-8B-Instruct-v2 | 51.6 | 51.6 | 49.7 | 49.8 |
Llama-3-Chinese-8B-Instruct | 49.3 | 51.5 | 48.3 | 49.4 |
Llama-3-Chinese-8B | 47.0 | 50.5 | 46.1 | 49.0 |
Meta-Llama-3-8B-Instruct | 51.3 | 51.3 | 49.5 | 51.0 |
Meta-Llama-3-8B | 49.3 | 51.2 | 46.1 | 49.4 |
Chinese-Mixtral-Instruct (8x7B) | 51.7 | 55.0 | 50.0 | 51.5 |
Chinese-Mixtral (8x7B) | 45.8 | 54.2 | 43.1 | 49.1 |
Chinese-Alpaca-2-13B | 44.3 | 45.9 | 42.6 | 44.0 |
Chinese-LLaMA-2-13B | 40.6 | 42.7 | 38.0 | 41.6 |
CMMLU is another comprehensive Chinese evaluation dataset specifically designed to assess language models' knowledge and reasoning capabilities in a Chinese context, covering topics from basic subjects to advanced professional levels, with a total of 11.5K multiple-choice questions. For CMMLU inference code, please refer to this project: 📖GitHub Wiki
Models | Test (0-shot) | Test (5-shot) |
---|---|---|
Llama-3-Chinese-8B-Instruct-v3 | 54.4 | 54.8 |
Llama-3-Chinese-8B-Instruct-v2 | 51.8 | 52.4 |
Llama-3-Chinese-8B-Instruct | 49.7 | 51.5 |
Llama-3-Chinese-8B | 48.0 | 50.9 |
Meta-Llama-3-8B-Instruct | 53.0 | 53.5 |
Meta-Llama-3-8B | 47.8 | 50.8 |
Chinese-Mixtral-Instruct (8x7B) | 50.0 | 53.0 |
Chinese-Mixtral (8x7B) | 42.5 | 51.0 |
Chinese-Alpaca-2-13B | 43.2 | 45.5 |
Chinese-LLaMA-2-13B | 38.9 | 42.5 |
MMLU is an English evaluation dataset for assessing natural language understanding capabilities, one of the main datasets used today for evaluating large models' capabilities, with its validation and test sets comprising 1.5K and 14.1K multiple-choice questions respectively, covering 57 subjects. For MMLU inference code, please refer to this project: 📖GitHub Wiki
Models | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
---|---|---|---|---|
Llama-3-Chinese-8B-Instruct-v3 | 64.7 | 65.0 | 64.8 | 65.9 |
Llama-3-Chinese-8B-Instruct-v2 | 62.1 | 63.9 | 62.6 | 63.7 |
Llama-3-Chinese-8B-Instruct | 60.1 | 61.3 | 59.8 | 61.8 |
Llama-3-Chinese-8B | 55.5 | 58.5 | 57.3 | 61.1 |
Meta-Llama-3-8B-Instruct | 63.4 | 64.8 | 65.1 | 66.4 |
Meta-Llama-3-8B | 58.6 | 62.5 | 60.5 | 65.0 |
Chinese-Mixtral-Instruct (8x7B) | 65.1 | 69.6 | 67.5 | 69.8 |
Chinese-Mixtral (8x7B) | 63.2 | 67.1 | 65.5 | 68.3 |
Chinese-Alpaca-2-13B | 49.6 | 53.2 | 50.9 | 53.5 |
Chinese-LLaMA-2-13B | 46.8 | 50.0 | 46.6 | 51.8 |
LongBench is a benchmark for evaluating the long-text understanding of large models. It consists of 6 categories and 20 tasks; most tasks have an average length of 5K to 15K, and the benchmark contains approximately 4.75K test samples in total. Below are the evaluation results of this project's models on the Chinese tasks (including code tasks). For LongBench inference code, please refer to this project: 📖GitHub Wiki
Models | Single-doc QA | Multi-doc QA | Summarization | Few-Shot Learning | Code | Synthesis | Average |
---|---|---|---|---|---|---|---|
Llama-3-Chinese-8B-Instruct-v3 | 20.3 | 28.8 | 24.5 | 28.1 | 59.4 | 91.9 | 40.5 |
Llama-3-Chinese-8B-Instruct-v2 | 57.3 | 27.1 | 13.9 | 30.3 | 60.6 | 89.5 | 46.4 |
Llama-3-Chinese-8B-Instruct | 44.1 | 24.0 | 12.4 | 33.5 | 51.8 | 11.5 | 29.6 |
Llama-3-Chinese-8B | 16.4 | 19.3 | 4.3 | 28.7 | 14.3 | 4.6 | 14.6 |
Meta-Llama-3-8B-Instruct | 55.1 | 15.1 | 0.1 | 24.0 | 51.3 | 94.5 | 40.0 |
Meta-Llama-3-8B | 21.2 | 22.9 | 2.7 | 35.8 | 65.9 | 40.8 | 31.6 |
Chinese-Mixtral-Instruct (8x7B) | 50.3 | 34.2 | 16.4 | 42.0 | 56.1 | 89.5 | 48.1 |
Chinese-Mixtral (8x7B) | 32.0 | 23.7 | 0.4 | 42.5 | 27.4 | 14.0 | 23.3 |
Chinese-Alpaca-2-13B-16K | 47.9 | 26.7 | 13.0 | 22.3 | 46.6 | 21.5 | 29.7 |
Chinese-LLaMA-2-13B-16K | 36.7 | 17.7 | 3.1 | 29.8 | 13.8 | 3.0 | 17.3 |
Chinese-Alpaca-2-7B-64K | 44.7 | 28.1 | 14.4 | 39.0 | 44.6 | 5.0 | 29.3 |
Chinese-LLaMA-2-7B-64K | 27.2 | 16.4 | 6.5 | 33.0 | 7.8 | 5.0 | 16.0 |
Open LLM Leaderboard is an English LLM benchmark maintained by the HuggingFaceH4 team, covering the ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K datasets. Below are the evaluation results of this project's models.
Models | ARC | HellaS | MMLU | TQA | WinoG | GSM8K | Average |
---|---|---|---|---|---|---|---|
Llama-3-Chinese-8B-Instruct-v3 | 63.40 | 80.51 | 67.90 | 53.57 | 76.24 | 59.21 | 66.81 |
Llama-3-Chinese-8B-Instruct-v2 | 62.63 | 79.72 | 66.48 | 53.93 | 76.72 | 60.58 | 66.68 |
Llama-3-Chinese-8B-Instruct | 61.26 | 80.24 | 63.10 | 55.15 | 75.06 | 44.43 | 63.21 |
Llama-3-Chinese-8B | 55.88 | 79.53 | 63.70 | 41.14 | 77.03 | 37.98 | 59.21 |
Meta-Llama-3-8B-Instruct | 60.75 | 78.55 | 67.07 | 51.65 | 74.51 | 68.69 | 66.87 |
Meta-Llama-3-8B | 59.47 | 82.09 | 66.69 | 43.90 | 77.35 | 45.79 | 62.55 |
Chinese-Mixtral-Instruct (8x7B) | 67.75 | 85.67 | 71.53 | 57.46 | 83.11 | 55.65 | 70.19 |
Chinese-Mixtral (8x7B) | 67.58 | 85.34 | 70.38 | 46.86 | 82.00 | 0.00 | 58.69 |
Note: The MMLU results here differ from those reported elsewhere in our repo because the evaluation scripts differ.
Under llama.cpp, the quantization performance of Llama-3-Chinese-8B (base model) was tested, as shown in the table below. The actual speed is slightly slower than the second-generation Llama-2-7B.
Metric | F16 | Q8_0 | Q6_K | Q5_K | Q5_0 | Q4_K | Q4_0 | Q3_K | Q2_K |
---|---|---|---|---|---|---|---|---|---|
Size (GB) | 14.97 | 7.95 | 6.14 | 5.34 | 5.21 | 4.58 | 4.34 | 3.74 | 2.96 |
BPW | 16.00 | 8.50 | 6.56 | 5.70 | 5.57 | 4.89 | 4.64 | 4.00 | 3.16 |
PPL | 5.130 | 5.135 | 5.148 | 5.181 | 5.222 | 5.312 | 5.549 | 5.755 | 11.859 |
PP Speed | 5.99 | 6.10 | 7.17 | 7.34 | 6.65 | 6.38 | 6.00 | 6.85 | 6.43 |
TG Speed | 44.03 | 26.08 | 21.61 | 22.33 | 20.93 | 18.93 | 17.09 | 22.50 | 19.21 |
Note
- Model size: in GB
- BPW (Bits-Per-Weight): average number of bits per parameter; for example, Q8_0 has an actual average precision of 8.50 bits
- PPL (Perplexity): Measured with an 8K context (natively supported length), lower values are better
- PP/TG speed: prompt processing (PP) and text generation (TG) speeds on the Apple M3 Max (Metal), in ms/token; lower values are faster
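The relationship between BPW, model size, and speed in the table can be checked with a little arithmetic. The sketch below estimates file size from bits per weight and converts ms/token into tokens per second. It assumes a parameter count of roughly 8.03 billion and that the size column is expressed in GiB (2^30 bytes); both are assumptions that happen to reproduce the table values closely, and file metadata is ignored.

```python
GIB = 1024 ** 3


def estimate_size_gib(n_params, bits_per_weight):
    """Approximate model file size in GiB, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / GIB


def tokens_per_second(ms_per_token):
    """Convert a latency in ms/token into a throughput in tokens/s."""
    return 1000.0 / ms_per_token


N_PARAMS = 8.03e9  # assumed parameter count for an 8B Llama-3 model

for name, bpw in [("F16", 16.00), ("Q8_0", 8.50), ("Q4_K", 4.89)]:
    print(name, round(estimate_size_gib(N_PARAMS, bpw), 2))

# F16 text generation at 44.03 ms/token, expressed as tokens per second.
print(round(tokens_per_second(44.03), 1))
```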
- Pre-training with unlabeled data: 📖Pre-training Scripts Wiki
- Fine-tuning with labeled data for instructions: 📖Instruction Fine-Tuning Scripts Wiki
Our Llama-3-Chinese-Instruct adopts the original instruction template of Llama-3-Instruct. The following is a chat example.
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant. 你是一个乐于助人的助手。<|eot_id|><|start_header_id|>user<|end_header_id|>
你好<|eot_id|><|start_header_id|>assistant<|end_header_id|>
你好!有什么可以帮助你的吗?<|eot_id|>
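A prompt in this format can be assembled programmatically. The helper below is a minimal sketch whose function name and message structure are chosen for illustration; it assumes that each header is followed by two newlines, as in the official Llama-3-Instruct template. In practice, the chat template shipped with the tokenizer is the safer choice.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the Llama-3-Instruct format."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant. 你是一个乐于助人的助手。"},
    {"role": "user", "content": "你好"},
]
prompt = build_llama3_prompt(messages)
print(prompt)
```

Generation should stop at `<|eot_id|>`, which marks the end of the assistant turn.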
Below is some of the instruction data open-sourced by this project. For more details, please see: 📚 Instruction Data
Data Name | Description | Quantity |
---|---|---|
alpaca_zh_51k | Alpaca data translated using gpt-3.5 | 51K |
stem_zh_instruction | STEM data collected using gpt-3.5, including physics, chemistry, medicine, biology, earth sciences | 256K |
ruozhiba_gpt4 | ruozhiba Q&A data obtained using GPT-4o and GPT-4T | 2449 |
Please check the FAQ to see if a solution already exists before submitting an issue. For specific questions and answers, refer to the project's 📖GitHub Wiki
Question 1: Why is there no vocabulary expansion as in the first and second generations?
Question 2: Will there be a 70B version released?
Question 3: Why is the instruction model no longer called Alpaca?
Question 4: Can the models from this repository be used commercially?
Question 5: Why not perform full pre-training instead of using LoRA?
Question 6: Why is the conversational performance of Llama-3-Chinese not good?
Question 7: Why does the instruction model reply saying it is ChatGPT?
Question 8: What are the differences between v1 and v2 of the Instruct model?
This project is developed based on Meta's Llama-3 model. Please strictly adhere to the Llama-3 open-source license agreement during use. If using third-party code, comply with the relevant open-source licenses. The accuracy of the model-generated content may be affected by computational methods, random factors, and loss of quantization precision, hence, no guarantees are provided regarding the accuracy of model outputs, nor will any liability be accepted for losses resulting from the use of related resources and outputs. If using the models for commercial purposes, developers must comply with local laws and regulations to ensure the legality of the model outputs. No responsibility will be taken for any products or services derived from this project.
If you have questions, please submit them in the GitHub Issues. Ask politely and help build a harmonious discussion community.
- Before submitting an issue, check if the FAQ addresses your question and consider reviewing past issues that might solve your problem.
- When submitting an issue, please use the project's issue template to help quickly identify specific problems.
- Duplicate or irrelevant issues will be handled by stable-bot; thank you for your understanding.