Update Llama.cpp & Llama 3 Support (#55)
# Update Llama.cpp & Llama 3 Support

## ⚙️ Release Notes 
- Updates llama.cpp
- Supports Llama 3
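
For orientation, here is a minimal, self-contained sketch of the Llama 3 instruct chat layout that the new prompt formatting targets. It is not the code from this pull request: `Role`, `Message`, and `formatLlama3` are hypothetical names used only for illustration, and the whitespace between header and content follows Meta's published template as I understand it, which may differ slightly from the closure added in this change.

```swift
/// Hypothetical stand-ins for illustration; SpeziLLM uses its own context types.
enum Role: String {
    case system, user, assistant
}

struct Message {
    let role: Role
    let content: String
}

/// Renders messages in the Llama 3 instruct layout:
/// <|start_header_id|>{role}<|end_header_id|>, the content, then <|eot_id|>.
func formatLlama3(_ messages: [Message]) -> String {
    var prompt = "<|begin_of_text|>"
    for message in messages {
        prompt += "<|start_header_id|>\(message.role.rawValue)<|end_header_id|>\n\n"
        prompt += "\(message.content)<|eot_id|>"
    }
    // End with an open assistant header so the model generates the next reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt
}

let prompt = formatLlama3([
    Message(role: .system, content: "You are a helpful assistant."),
    Message(role: .user, content: "Hello!")
])
print(prompt)
```

Ending the prompt with an open assistant header is what cues the model to produce the next turn; each earlier turn is closed with `<|eot_id|>`.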

## 📝 Code of Conduct & Contributing Guidelines 

By creating this pull request, you agree to follow our [Code
of
Conduct](https://github.com/StanfordSpezi/.github/blob/main/CODE_OF_CONDUCT.md)
and [Contributing
Guidelines](https://github.com/StanfordSpezi/.github/blob/main/CONTRIBUTING.md):
- [x] I agree to follow the [Code of
Conduct](https://github.com/StanfordSpezi/.github/blob/main/CODE_OF_CONDUCT.md)
and [Contributing
Guidelines](https://github.com/StanfordSpezi/.github/blob/main/CONTRIBUTING.md).

---------

Co-authored-by: Vishnu Ravi <vishnur@stanford.edu>
PSchmiedmayer and vishnuravi authored May 13, 2024
1 parent cbaf204 commit 94f14f6
Showing 11 changed files with 104 additions and 23 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/build-and-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -136,4 +136,6 @@ jobs:
needs: [buildandtest_ios, buildandtest_visionos, buildandtest_macos, buildandtestuitests_ios, buildandtestuitests_ipad, buildandtestuitests_visionos]
uses: StanfordSpezi/.github/.github/workflows/create-and-upload-coverage-report.yml@v2
with:
- coveragereports: 'SpeziLLM-iOS.xcresult SpeziLLM-visionOS.xcresult SpeziLLM-macOS.xcresult TestApp-iOS.xcresult TestApp-iPad.xcresult TestApp-visionOS.xcresult'
+ coveragereports: 'SpeziLLM-iOS.xcresult SpeziLLM-visionOS.xcresult SpeziLLM-macOS.xcresult TestApp-iOS.xcresult TestApp-iPad.xcresult TestApp-visionOS.xcresult'
+ secrets:
+ token: ${{ secrets.CODECOV_TOKEN }}
2 changes: 1 addition & 1 deletion Package.swift
Expand Up @@ -28,7 +28,7 @@ let package = Package(
],
dependencies: [
.package(url: "https://github.com/StanfordBDHG/OpenAI", .upToNextMinor(from: "0.2.8")),
- .package(url: "https://github.com/StanfordBDHG/llama.cpp", .upToNextMinor(from: "0.2.1")),
+ .package(url: "https://github.com/StanfordBDHG/llama.cpp", .upToNextMinor(from: "0.3.3")),
.package(url: "https://github.com/StanfordSpezi/Spezi", from: "1.2.1"),
.package(url: "https://github.com/StanfordSpezi/SpeziFoundation", from: "1.0.4"),
.package(url: "https://github.com/StanfordSpezi/SpeziStorage", from: "1.0.2"),
Expand Down
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -47,7 +47,7 @@ Spezi LLM provides a number of targets to help developers integrate LLMs in thei
- [SpeziLLM](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillm): Base infrastructure of LLM execution in the Spezi ecosystem.
- [SpeziLLMLocal](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillmlocal): Local LLM execution capabilities directly on-device. Enables running open-source LLMs like [Meta's Llama2 models](https://ai.meta.com/llama/).
- [SpeziLLMLocalDownload](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillmlocaldownload): Download and storage manager of local Language Models, including onboarding views.
- - [SpeziLLMOpenAI](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillmopenai): Integration with [OpenAIs GPT models](https://openai.com/gpt-4) via using OpenAIs API service.
+ - [SpeziLLMOpenAI](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillmopenai): Integration with OpenAI's GPT models via using OpenAI's API service.
- [SpeziLLMFog](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillmfog): Discover and dispatch LLM inference jobs to Fog node resources within the local network.

The section below highlights the setup and basic use of the [SpeziLLMLocal](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillmlocal), [SpeziLLMOpenAI](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillmopenai), and [SpeziLLMFog](https://swiftpackageindex.com/stanfordspezi/spezillm/documentation/spezillmfog) targets in order to integrate Language Models in a Spezi-based application.
Expand Down
Expand Up @@ -119,16 +119,6 @@ public struct LLMLocalContextParameters: Sendable {
wrapped.rope_freq_scale = newValue
}
}

- /// Set the usage of experimental `mul_mat_q` kernels
- var useMulMatQKernels: Bool {
- get {
- wrapped.mul_mat_q
- }
- set {
- wrapped.mul_mat_q = newValue
- }
- }
-
/// If `true`, offload the KQV ops (including the KV cache) to GPU
var offloadKQV: Bool {
Expand Down Expand Up @@ -173,10 +163,10 @@ public struct LLMLocalContextParameters: Sendable {
/// If `true`, the mode is set to embeddings only
var embeddingsOnly: Bool {
get {
- wrapped.embedding
+ wrapped.embeddings
}
set {
- wrapped.embedding = newValue
+ wrapped.embeddings = newValue
}
}

Expand All @@ -191,7 +181,6 @@ public struct LLMLocalContextParameters: Sendable {
/// - threadCountBatch: Number of threads used by LLM for batch processing, defaults to the processor count of the device.
/// - ropeFreqBase: RoPE base frequency, defaults to `0` indicating the default from model.
/// - ropeFreqScale: RoPE frequency scaling factor, defaults to `0` indicating the default from model.
- /// - useMulMatQKernels: Usage of experimental `mul_mat_q` kernels, defaults to `true`.
/// - offloadKQV: Offloads the KQV ops (including the KV cache) to GPU, defaults to `true`.
/// - kvKeyType: ``GGMLType`` of the key of the KV cache, defaults to ``GGMLType/f16``.
/// - kvValueType: ``GGMLType`` of the value of the KV cache, defaults to ``GGMLType/f16``.
Expand All @@ -205,7 +194,6 @@ public struct LLMLocalContextParameters: Sendable {
threadCountBatch: UInt32 = .init(ProcessInfo.processInfo.processorCount),
ropeFreqBase: Float = 0.0,
ropeFreqScale: Float = 0.0,
- useMulMatQKernels: Bool = true,
offloadKQV: Bool = true,
kvKeyType: GGMLType = .f16,
kvValueType: GGMLType = .f16,
Expand All @@ -221,7 +209,6 @@ public struct LLMLocalContextParameters: Sendable {
self.threadCountBatch = threadCountBatch
self.ropeFreqBase = ropeFreqBase
self.ropeFreqScale = ropeFreqScale
- self.useMulMatQKernels = useMulMatQKernels
self.offloadKQV = offloadKQV
self.kvKeyType = kvKeyType
self.kvValueType = kvValueType
Expand Down
80 changes: 80 additions & 0 deletions Sources/SpeziLLMLocal/LLMLocalSchema+PromptFormatting.swift
Expand Up @@ -12,6 +12,86 @@ import SpeziLLM
extension LLMLocalSchema {
/// Holds default prompt formatting strategies for [Llama2](https://ai.meta.com/llama/) as well as [Phi-2](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/) models.
public enum PromptFormattingDefaults {
+ /// Prompt formatting closure for the [Llama3](https://ai.meta.com/llama/) model
+ public static let llama3: (@Sendable (LLMContext) throws -> String) = { chat in // swiftlint:disable:this closure_body_length
+ /// BOS token of the LLM, used at the start of each prompt passage.
+ let BEGINOFTEXT = "<|begin_of_text|>"
+ /// The system identifier.
+ let SYSTEM = "system"
+ /// The user identifier.
+ let USER = "user"
+ /// The assistant identifier.
+ let ASSISTANT = "assistant"
+ /// The start token for enclosing the role of a particular message, e.g. <|start_header_id|>{role}<|end_header_id|>
+ let STARTHEADERID = "<|start_header_id|>"
+ /// The end token for enclosing the role of a particular message, e.g. <|start_header_id|>{role}<|end_header_id|>
+ let ENDHEADERID = "<|end_header_id|>"
+ /// The token that signifies the end of the message in a turn.
+ let EOTID = "<|eot_id|>"
+
+ guard chat.first?.role == .system else {
+ throw LLMLocalError.illegalContext
+ }
+
+ var systemPrompts: [String] = []
+ var initialUserPrompt: String = ""
+
+ for contextEntity in chat {
+ if contextEntity.role != .system {
+ if contextEntity.role == .user {
+ initialUserPrompt = contextEntity.content
+ break
+ } else {
+ throw LLMLocalError.illegalContext
+ }
+ }
+
+ systemPrompts.append(contextEntity.content)
+ }
+
+ /// Build the initial Llama3 prompt structure
+ ///
+ /// Template of the prompt structure:
+ /// <|begin_of_text|>
+ /// <|start_header_id|>user<|end_header_id|>
+ /// {{ user_message }}<|eot_id|>
+ /// <|start_header_id|>assistant<|end_header_id|>
+ var prompt = """
+ \(BEGINOFTEXT)
+ \(STARTHEADERID)\(SYSTEM)\(ENDHEADERID)
+ \(systemPrompts.joined(separator: " "))\(EOTID)
+ \(STARTHEADERID)\(USER)\(ENDHEADERID)
+ \(initialUserPrompt)\(EOTID)
+ """ + " " // Add a spacer to the generated output from the model
+
+ for contextEntity in chat.dropFirst(2) {
+ if contextEntity.role == .assistant() {
+ /// Append response from assistant to the Llama3 prompt structure
+ prompt += """
+ \(STARTHEADERID)\(ASSISTANT)\(ENDHEADERID)
+ \(contextEntity.content)
+ \(EOTID)
+ """
+ } else if contextEntity.role == .user {
+ /// Append response from user to the Llama3 prompt structure
+ prompt += """
+ \(STARTHEADERID)\(USER)\(ENDHEADERID)
+ \(contextEntity.content)
+ \(EOTID)
+ """ + " " // Add a spacer to the generated output from the model
+ }
+ }
+
+ prompt +=
+ """
+ \(STARTHEADERID)\(ASSISTANT)\(ENDHEADERID)
+ """
+
+ return prompt
+ }
+
/// Prompt formatting closure for the [Llama2](https://ai.meta.com/llama/) model
public static let llama2: (@Sendable (LLMContext) throws -> String) = { chat in // swiftlint:disable:this closure_body_length
/// BOS token of the LLM, used at the start of each prompt passage.
Expand Down
2 changes: 1 addition & 1 deletion Sources/SpeziLLMLocal/LLMLocalSession+Generation.swift
Expand Up @@ -131,7 +131,7 @@ extension LLMLocalSession {
return
}

- var nextStringPiece = String(llama_token_to_piece(self.modelContext, nextTokenId))
+ var nextStringPiece = String(llama_token_to_piece(self.modelContext, nextTokenId, true))
// As first character is sometimes randomly prefixed by a single space (even though prompt has an additional character)
if decodedTokens == 0 && nextStringPiece.starts(with: " ") {
nextStringPiece = String(nextStringPiece.dropFirst())
Expand Down
2 changes: 1 addition & 1 deletion Sources/SpeziLLMLocal/LLMLocalSession+Tokenization.swift
Expand Up @@ -75,7 +75,7 @@ extension LLMLocalSession {
/// - Note: Used only for debug purposes
func detokenize(tokens: [LLMLocalToken]) -> [(LLMLocalToken, String)] {
tokens.reduce(into: [(LLMLocalToken, String)]()) { partialResult, token in
- partialResult.append((token, String(llama_token_to_piece(self.modelContext, token))))
+ partialResult.append((token, String(llama_token_to_piece(self.modelContext, token, true))))
}
}
}
Expand Up @@ -12,6 +12,17 @@ import Foundation
extension LLMLocalDownloadManager {
/// Defaults of possible LLMs to download via the ``LLMLocalDownloadManager``.
public enum LLMUrlDefaults {
+ /// Llama 3 8B model with `Q4_K_M` quantization in its instruct variation (~5 GB)
+ public static var llama3InstructModelUrl: URL {
+ guard let url = URL(string: "https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf") else {
+ preconditionFailure("""
+ SpeziLLM: Invalid LLMUrlDefaults LLM download URL.
+ """)
+ }
+
+ return url
+ }
+
/// Llama 2 7B model with `Q4_K_M` quantization in its chat variation (~3.5GB)
public static var llama2ChatModelUrl: URL {
guard let url = URL(string: "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf") else {
Expand Down
3 changes: 2 additions & 1 deletion Tests/UITests/TestApp/LLMLocal/LLMLocalChatTestView.swift
Expand Up @@ -27,7 +27,8 @@ struct LLMLocalChatTestView: View {
with: LLMLocalSchema(
modelPath: .cachesDirectory.appending(path: "llm.gguf"),
parameters: .init(maxOutputLength: 512),
- contextParameters: .init(contextWindowSize: 1024)
+ contextParameters: .init(contextWindowSize: 1024),
+ formatChat: LLMLocalSchema.PromptFormattingDefaults.llama3
)
)
}
Expand Down
Expand Up @@ -20,7 +20,7 @@ struct LLMLocalOnboardingDownloadView: View {
var body: some View {
LLMLocalDownloadView(
downloadDescription: "LLM_DOWNLOAD_DESCRIPTION",
- llmDownloadUrl: LLMLocalDownloadManager.LLMUrlDefaults.llama2ChatModelUrl /// By default, download the Llama2 model
+ llmDownloadUrl: LLMLocalDownloadManager.LLMUrlDefaults.llama3InstructModelUrl /// By default, download the Llama3 model
) {
onboardingNavigationPath.nextStep()
}
Expand Down
2 changes: 1 addition & 1 deletion Tests/UITests/TestApp/Resources/Localizable.xcstrings
Expand Up @@ -56,7 +56,7 @@
"en" : {
"stringUnit" : {
"state" : "translated",
- "value" : "By default, the application downloads the Llama 2 7B model in its chat variation. The size of the model is around 3.5GB."
+ "value" : "By default, the application downloads the Llama 3 8B model in its instruct variation. The size of the model is around 5GB."
}
}
}
Expand Down
