add output examples (#19)

Co-authored-by: ymcui <16095339+ymcui@users.noreply.github.com>
ymcui · May 2, 2024 · 8c8a80f
1 parent 6834c31
commit 8c8a80f
Showing 5 changed files with 88 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -126,6 +126,7 @@
 ### 生成效果评测
 
 - 本项目仿照[Fastchat Chatbot Arena](https://chat.lmsys.org/?arena)推出了模型在线对战平台，可浏览和评测模型回复质量。对战平台提供了胜率、Elo评分等评测指标，并且可以查看两两模型的对战胜率等结果。**⚔️ 模型竞技场：[http://llm-arena.ymcui.com](http://llm-arena.ymcui.com/)**
+- examples目录中提供了Llama-3-Chinese-8B-Instruct和Chinese-Mixtral-Instruct的输出样例，并通过GPT-4-turbo进行了打分对比，**Llama-3-Chinese-8B-Instruct平均得分为8.1、Chinese-Mixtral-Instruct平均得分为7.8**。**📄 输出样例对比：[examples](https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/blob/main/examples)**
 - 本项目已入驻机器之心SOTA!模型平台，后期将实现在线体验：https://sota.jiqizhixin.com/project/chinese-llama-alpaca-3
 
 

diff --git a/README_EN.md b/README_EN.md
@@ -126,6 +126,7 @@ To evaluate the effectiveness of the related models, this project conducted both
 ### Generative Performance Evaluation
 
 - This project has launched an online model battle platform, modeled after the [Fastchat Chatbot Arena](https://chat.lmsys.org/?arena), where users can browse and evaluate the quality of model responses. The battle platform provides metrics such as win rates and Elo scores, and allows viewing the win rates between different models. **⚔️ Model Arena: [http://llm-arena.ymcui.com](http://llm-arena.ymcui.com/)**
+- The examples directory provides output samples from Llama-3-Chinese-8B-Instruct and Chinese-Mixtral-Instruct, together with a score comparison rated by GPT-4-turbo, **in which Llama-3-Chinese-8B-Instruct averages 8.1 and Chinese-Mixtral-Instruct averages 7.8**. **📄 Output Sample Comparison: [examples](https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/blob/main/examples)**
 - This project has joined the Machine Heart SOTA! Model platform, with online experiences to be implemented later: https://sota.jiqizhixin.com/project/chinese-llama-alpaca-3
 
 

diff --git a/examples/README.md b/examples/README.md
@@ -0,0 +1,7 @@
+### llama3-vs-mixtral.md
+
+Output comparison between Llama-3-Chinese-8B-Instruct and Chinese-Mixtral-Instruct
+
+### gpt-4-turbo-rating.md
+
+GPT-4-turbo's scores and analysis for the outputs of Llama-3-Chinese-8B-Instruct and Chinese-Mixtral-Instruct
diff --git a/examples/gpt-4-turbo-rating.md b/examples/gpt-4-turbo-rating.md
@@ -0,0 +1,47 @@
+# GPT-4-turbo Rating Details
+
+Note: System 1 is Llama-3-Chinese-8B-Instruct (Q8_0); System 2 is Chinese-Mixtral-Instruct (Q6_k)
+
+### Rating Output
+
+**No.1 Prompt:**
+
+- **System 1:** 8/10. The response is concise and logically sound, directly addressing the premise by using the categorical statement about cats and water to deduce that Xiao Ming's pet might not be a cat. The inclusion of alternative pet options like dogs or rabbits adds a thoughtful touch. However, the response could be enhanced by a more explicit connection between the general rule and the specific case of Xiao Ming's pet, which is implied but not deeply explored.
+- **System 2:** This system also correctly addresses the logic of the prompt, affirming that Xiao Ming's pet might not be a cat based on the given information. The explanation is clear and adheres closely to the premise. However, it slightly lacks depth compared to System 1, as it doesn't expand beyond the immediate logical conclusion to explore or suggest alternatives, making it slightly less informative.
+
+**No.2 Prompt:**
+- **System 1:** 7/10. This response offers a general but somewhat inaccurate explanation of Fermat's Last Theorem, incorrectly stating it remained unsolved until modern efforts. It rightly notes the theorem's complexity and the areas of mathematics involved but confuses some details about the resolution methods and historical context. The inclusion of different mathematical techniques is a good attempt to explain the proof's complexity but lacks precision in explaining why these specific areas are challenging.
+- **System 2:** 9/10. This explanation is more accurate, clearly stating that the theorem was proven by Andrew Wiles in 1994. It succinctly describes the theorem's statement and effectively communicates why it was hard to prove, linking the difficulty to high-dimensional geometry and the need for advanced mathematical tools. This response is well-rounded, providing historical context, an accurate description of the theorem, and a concise explanation of the complexities involved in its proof.
+
+**No.3 Prompt:**
+- **System 1:** 9/10. This response is straightforward and mathematically accurate, efficiently using the formula from Newton's second law to calculate the acceleration. The step-by-step calculation is clear, making it easy for readers to understand the derivation of the acceleration from the given force and mass.
+- **System 2:** 9/10. Like System 1, this response is correct and provides a clear, step-by-step calculation of the acceleration based on the given values. The approach is almost identical to System 1, effectively explaining the application of Newton's second law. Both systems demonstrate an excellent understanding of the physics concept and explain it in an accessible manner.
+
+**No.4 Prompt:**
+- **System 1:** 8/10. This system provides a very detailed process for balancing the chemical equation, breaking down each step in a way that is easy to follow. It clearly explains the need for comparing the atomic numbers on each side of the equation and adjusts the molecule ratios accordingly. It makes a small error in describing the balancing process (such as adding water molecules and then removing them, which is a bit confusing), but eventually arrives at the correct balanced equation: C3H8 + 5O2 → 3CO2 + 4H2O.
+- **System 2:** 6/10. This system also approaches the task by ensuring each atom type is balanced in the equation, but it ends up providing an incorrect final balanced equation. It mistakenly suggests that the balanced equation is C3H8 + 5O2 → 4CO2 + 6H2O, which does not actually match the correct stoichiometry based on the given reactants and the conservation of atoms. This error in the final balancing impacts its overall score despite a relatively clear explanation.
+
+**No.5 Prompt:**
+- **System 1:** 7/10. This response provides a comprehensive explanation of how to encrypt and decrypt a text using Caesar cipher with a clear, illustrative example. The explanation of the encryption and decryption process is detailed and accurate. However, the example provided for the encrypted text does not match the given plaintext and key, indicating an error in either the example text or the encryption method described.
+- **System 2:** 8/10. The response correctly explains the decryption method for a Caesar cipher and provides a clear example that is consistent with the description. The decryption steps are accurately described, and the final decrypted text matches the description provided, making it a valid and educational example. The response effectively communicates the concept and practice of a Caesar cipher, despite a slight confusion in the description of the direction of letter movement (mentioned as backward which should ideally be forward for decryption).
+
+**No.6 Prompt:**
+- **System 1:** 9/10. This system provides a comprehensive and detailed explanation of the Haversine formula along with a practical Python implementation. It correctly explains the assumptions made (considering Earth as a sphere), outlines the formula, and provides a complete code example which also includes a real-world application calculating the distance between New York City and San Francisco. The explanation is clear, and the inclusion of code makes it highly informative and practical. A minor deduction is due to the lack of mathematical clarification on radians conversion directly within the formula.
+- **System 2:** 7/10. This system also explains the Haversine formula and provides the steps needed to calculate the great circle distance. It breaks down the process into understandable parts, including the conversion of degrees to radians, which is helpful. However, it lacks an actual implementation or example that shows how the formula works in practice, which makes it less effective compared to System 1. Additionally, the explanation does not go beyond theoretical descriptions and lacks contextual application, which makes it slightly less engaging and useful for practical purposes.
+
+**No.7 Prompt:**
+- **System 1:** 9/10. The response is humorously on point, recognizing the misconception in the question and correctly advising to visit an electronics repair shop or service center for VR glasses, as they are electronic devices. The answer is practical, relevant, and directly addresses the user's issue without unnecessary complications.
+- **System 2:** 9/10. Similarly to System 1, this response accurately identifies that VR glasses are electronic devices and that the appropriate course of action would be to consult with a repair professional or a service center rather than a medical specialist. The answer is concise, clear, and correctly directs the user to seek help from the right type of professional. Both responses effectively meet the needs of the query with appropriate advice.
+
+**No.8 Prompt:**
+- **System 1:** 7/10. System 1 offers a comprehensive analysis of the potential impacts of an increased Earth rotation speed. It explores a variety of effects, from changes in day length to shifts in temperature distribution and winds. The explanation is detailed and covers multiple aspects of how life on Earth could be affected. However, the response does include some scientific inaccuracies, such as the implication that increased rotation would lead to more sunlight at the equator, which is a misunderstanding of how day length and solar intensity work. Despite this, the response is well-structured and informative.
+- **System 2:** 8/10. This system also presents a broad and detailed exploration of the potential impacts due to an increase in Earth's rotation speed. It correctly identifies key changes such as altered day-night cycles, shifts in climate patterns, and the potential impacts on ocean tides and the Earth's magnetic field. The response is coherent and logically structured, presenting the information in a way that is easy to understand and follow. It provides a slightly more accurate scientific discussion compared to System 1, especially in terms of climate changes and satellite orbits.
+
+**No.9 Prompt:**
+- **System 1:** 9/10. System 1's response effectively captures multiple dimensions of why "involution" or intense internal competition has become more severe. It discusses socio-economic pressures, changes in societal values, educational systems, and the influence of social media, all of which contribute to the phenomenon. The explanation is well-rounded, offering a broad perspective on the sociological and psychological factors driving involution. The response is detailed, clear, and addresses the complexity of the issue comprehensively.
+- **System 2:** 8/10. System 2 also provides a solid explanation of the reasons behind the increasing severity of involution. It discusses societal competition, educational reforms, raised expectations, shifts in social values, and the information age. While the response covers similar grounds as System 1, it is somewhat less detailed in discussing the specific mechanisms through which these factors exacerbate involution, particularly in terms of how societal values and educational changes directly influence individual behaviors. Nonetheless, it is still a strong and informative response.
+
+**No.10 Prompt:**
+- **System 1:** 8/10. This Python script using BeautifulSoup and pandas is well structured and effectively targets the task of scraping book titles, authors, and prices from an online bookstore and saving them in a CSV file. The use of `pandas` for handling data and storing it in CSV format makes the code clean and efficient. However, the explanation could be enhanced by noting the potential need for handling exceptions or errors during the web scraping process, which is critical in real-world applications to ensure robustness.
+- **System 2:** 7/10. System 2 also provides a clear Python script for scraping information using BeautifulSoup but opts for the standard `csv` library for data handling instead of `pandas`. This approach is also effective but typically requires more manual management of data, as seen in the loop for writing rows to the CSV file. The script could benefit from additional features like error handling and might not be as streamlined as using `pandas`. However, it is still a solid approach, particularly for those who prefer or need to avoid additional dependencies like pandas.
+
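As a consistency check on the averages quoted in the README hunks (8.1 and 7.8), the per-prompt scores listed in `gpt-4-turbo-rating.md` can be averaged directly. The sketch below is not the author's scoring script; the helper names `mean` and `implied_missing` are hypothetical and introduced only for illustration. System 2's entry for prompt No.1 carries no explicit score in the rating output, so it is recorded as `None` rather than guessed, and the second helper only reports what value the quoted 7.8 average would imply for it.

```python
# Scores transcribed from the rating output above (prompts No.1 to No.10).
# System 2's No.1 entry has no explicit score in the source, hence None.
system1 = [8, 7, 9, 8, 7, 9, 9, 7, 9, 8]
system2 = [None, 9, 9, 6, 8, 7, 9, 8, 8, 7]


def mean(scores):
    """Average of the scores that are present (None entries are skipped)."""
    present = [s for s in scores if s is not None]
    return sum(present) / len(present)


def implied_missing(scores, reported_mean):
    """Value the missing entries must sum to for the reported mean to hold."""
    present = [s for s in scores if s is not None]
    return reported_mean * len(scores) - sum(present)


print(round(mean(system1), 2))                  # 8.1, matches the README
print(round(mean(system2), 2))                  # 7.89 over the nine listed scores
print(round(implied_missing(system2, 7.8), 2))  # 7.0
```

Under these assumptions, System 1's listed scores reproduce the 8.1 average exactly, while System 2's 7.8 average is consistent with the listed scores only if the unlisted No.1 score is 7.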