
feat: lab 3307 aau i use the kilillm functions to import llm data static #1841

Open · wants to merge 29 commits into base: main

Conversation

@RuellePaul (Contributor) commented Jan 8, 2025

new method kili.llm.import_conversations

note for reviewers
You can try creating an LLM static project and importing labeled conversations with this snippet:
from kili.client import Kili
from scripts.constants import LOCAL_KILI_API_KEY

kili = Kili(api_key=LOCAL_KILI_API_KEY, api_endpoint='http://localhost:4001/api/label/v2/graphql')

interface = {
  "jobs": {
      "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
          "content": {
              "categories": {
                  "TOO_SHORT": {
                      "children": [],
                      "name": "Too short",
                      "id": "category1"
                  },
                  "JUST_RIGHT": {
                      "children": [],
                      "name": "Just right",
                      "id": "category2"
                  },
                  "TOO_VERBOSE": {
                      "children": [],
                      "name": "Too verbose",
                      "id": "category3"
                  }
              },
              "input": "radio"
          },
          "instruction": "Verbosity",
          "level": "completion",
          "mlTask": "CLASSIFICATION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_1": {
          "content": {
              "categories": {
                  "NO_ISSUES": {
                      "children": [],
                      "name": "No issues",
                      "id": "category4"
                  },
                  "MINOR_ISSUES": {
                      "children": [],
                      "name": "Minor issue(s)",
                      "id": "category5"
                  },
                  "MAJOR_ISSUES": {
                      "children": [],
                      "name": "Major issue(s)",
                      "id": "category6"
                  }
              },
              "input": "radio"
          },
          "instruction": "Instructions Following",
          "level": "completion",
          "mlTask": "CLASSIFICATION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_2": {
          "content": {
              "categories": {
                  "NO_ISSUES": {
                      "children": [],
                      "name": "No issues",
                      "id": "category7"
                  },
                  "MINOR_INACCURACY": {
                      "children": [],
                      "name": "Minor inaccuracy",
                      "id": "category8"
                  },
                  "MAJOR_INACCURACY": {
                      "children": [],
                      "name": "Major inaccuracy",
                      "id": "category9"
                  }
              },
              "input": "radio"
          },
          "instruction": "Truthfulness",
          "level": "completion",
          "mlTask": "CLASSIFICATION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_3": {
          "content": {
              "categories": {
                  "NO_ISSUES": {
                      "children": [],
                      "name": "No issues",
                      "id": "category10"
                  },
                  "MINOR_SAFETY_CONCERN": {
                      "children": [],
                      "name": "Minor safety concern",
                      "id": "category11"
                  },
                  "MAJOR_SAFETY_CONCERN": {
                      "children": [],
                      "name": "Major safety concern",
                      "id": "category12"
                  }
              },
              "input": "radio"
          },
          "instruction": "Harmlessness/Safety",
          "level": "completion",
          "mlTask": "CLASSIFICATION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "TRANSCRIPTION_JOB_AT_COMPLETION_LEVEL": {
          "content": {
              "input": "textField"
          },
          "instruction": "Additional comments...",
          "level": "completion",
          "mlTask": "TRANSCRIPTION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "TRANSCRIPTION_MARKDOWN_JOB_AT_COMPLETION_LEVEL": {
          "content": {
              "input": "markdown"
          },
          "instruction": "Additional comments...",
          "level": "completion",
          "mlTask": "TRANSCRIPTION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "COMPARISON_JOB": {
          "content": {
              "options": {
                  "IS_MUCH_BETTER": {
                      "children": [],
                      "name": "Is much better",
                      "id": "option13"
                  },
                  "IS_BETTER": {
                      "children": [],
                      "name": "Is better",
                      "id": "option14"
                  },
                  "IS_SLIGHTLY_BETTER": {
                      "children": [],
                      "name": "Is slightly better",
                      "id": "option15"
                  },
                  "TIE": {
                      "children": [],
                      "name": "Tie",
                      "mutual": True,
                      "id": "option16"
                  }
              },
              "input": "radio"
          },
          "instruction": "Pick the best answer",
          "mlTask": "COMPARISON",
          "required": 1,
          "isChild": False,
          "isNew": False
      },
      "CLASSIFICATION_JOB_AT_ROUND_LEVEL": {
          "content": {
              "categories": {
                  "BOTH_ARE_GOOD": {
                      "children": [],
                      "name": "Both are good",
                      "id": "category17"
                  },
                  "BOTH_ARE_BAD": {
                      "children": [],
                      "name": "Both are bad",
                      "id": "category18"
                  }
              },
              "input": "radio"
          },
          "instruction": "Overall quality",
          "level": "round",
          "mlTask": "CLASSIFICATION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "TRANSCRIPTION_JOB_AT_ROUND_LEVEL": {
          "content": {
              "input": "textField"
          },
          "instruction": "Additional comments...",
          "level": "round",
          "mlTask": "TRANSCRIPTION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "TRANSCRIPTION_MARKDOWN_JOB_AT_ROUND_LEVEL": {
          "content": {
              "input": "markdown"
          },
          "instruction": "Additional comments...",
          "level": "round",
          "mlTask": "TRANSCRIPTION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
          "content": {
              "categories": {
                  "GLOBAL_GOOD": {
                      "children": [],
                      "name": "Globally good",
                      "id": "category19"
                  },
                  "BOTH_ARE_BAD": {
                      "children": [],
                      "name": "Globally bad",
                      "id": "category20"
                  }
              },
              "input": "radio"
          },
          "instruction": "Global",
          "level": "conversation",
          "mlTask": "CLASSIFICATION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "TRANSCRIPTION_JOB_AT_CONVERSATION_LEVEL": {
          "content": {
              "input": "textField"
          },
          "instruction": "Additional comments...",
          "level": "conversation",
          "mlTask": "TRANSCRIPTION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
      "TRANSCRIPTION_MARKDOWN_JOB_AT_CONVERSATION_LEVEL": {
          "content": {
              "input": "markdown"
          },
          "instruction": "Additional comments...",
          "level": "conversation",
          "mlTask": "TRANSCRIPTION",
          "required": 0,
          "isChild": False,
          "isNew": False
      },
  }
}

project_id = kili.create_project(
  title="LLM Static Project (created with SDK)",
  description="Project Description",
  input_type="LLM_STATIC",
  json_interface=interface
)["id"]

conversations = [
{
  "externalId": "Fibonacci python function",
  "chatItems": [
    {
      "externalId": "48cf1705-7be7-4b35-b4f4-3a13e9eadc1e",
      "role": "SYSTEM",
      "content": "You are an assistant helping the user by providing precise answers."
    },
    {
      "externalId": "9c43a0ae-5a6e-439b-94f6-83f7d5618b22",
      "role": "USER",
      "content": "Can you give me a function to compute Fibonacci numbers?"
    },
    {
      "externalId": "d5596f35-1b0f-42cb-b529-684e23138141",
      "role": "ASSISTANT",
      "content": "Here is a recursive implementation for calculating Fibonacci numbers:\n\n```python\ndef fibonacci_recursive(n):\n    if n <= 1:\n        return n\n    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)\n```\n\nThis is simple to write but has a time complexity of \\( O(2^n) \\) due to redundant calculations. For large values of \\( n \\), this method becomes highly inefficient.",
      "modelName": "gpt-4o"
    },
    {
      "externalId": "94f676b7-c3be-45a8-bccf-36b56566b765",
      "role": "ASSISTANT",
      "content": "Here is an optimized implementation using matrix exponentiation, along with an explanation of why it is faster:\n\n```python\nimport numpy as np\n\ndef fibonacci_matrix(n):\n    def matrix_power(matrix, power):\n        result = np.identity(len(matrix), dtype=int)\n        while power:\n            if power % 2:\n                result = np.dot(result, matrix)\n            matrix = np.dot(matrix, matrix)\n            power //= 2\n        return result\n\n    if n == 0:\n        return 0\n    base_matrix = np.array([[1, 1], [1, 0]], dtype=int)\n    result_matrix = matrix_power(base_matrix, n - 1)\n    return result_matrix[0, 0]\n```\n\n### Why is Matrix Exponentiation Faster?\n\nThe recursive method recalculates values for overlapping subproblems, leading to exponential time complexity:\n\n\\[ T(n) = T(n-1) + T(n-2) + O(1), \\]\n\nwhich resolves to \\( O(2^n) \\). This inefficiency arises from redundant recalculations.\n\nThe matrix exponentiation method uses the property that Fibonacci numbers can be represented as a matrix multiplication:\n\n\\[ \\begin{bmatrix} F(n+1) & F(n) \\\\ F(n) & F(n-1) \\end{bmatrix} = \\begin{bmatrix} 1 & 1 \\\\ 1 & 0 \\end{bmatrix}^n. \\]\n\nUsing fast exponentiation (repeated squaring), the matrix power can be computed in \\( O(\\log n) \\) steps. For example, instead of multiplying the matrix \\( n \\) times, we:\n\n1. Square the matrix \\( \\lceil \\log_2 n \\rceil \\) times.\n2. Combine results for any remaining powers.\n\nThis drastically reduces the number of operations compared to the recursive method.\n\n### Comparison of Complexities\n\n- **Recursive Method**: \\( O(2^n) \\)\n- **Matrix Exponentiation**: \\( O(\\log n) \\)\n\nFor large \\( n \\), the difference is substantial. Consider \\( n = 50 \\):\n\n- Recursive method performs roughly \\( 2^{50} \\approx 10^{15} \\) operations.\n- Matrix exponentiation performs approximately \\( \\log_2(50) \\approx 6 \\) matrix multiplications, where each multiplication is \\( O(1) \\) for a fixed-size \\( 2 \\times 2 \\) matrix.\n\n### Example\nUsing \\( n = 10 \\):\n\n1. Recursive Method: Explores all possible paths in the Fibonacci tree, recalculating values like \\( F(8) \\) multiple times.\n2. Matrix Exponentiation: Squares the base matrix 4 times and performs 1 additional multiplication, yielding the result in far fewer steps.\n\nThis efficiency makes matrix exponentiation the preferred method for calculating Fibonacci numbers at scale.",
      "modelName": "gpt-o1"
    }
  ],
  "metadata": {}
},
{
  "externalId": "The sum of the integers",
  "chatItems": [
    {
      "externalId": "00000000-8b5c-4b6a-0000-5b232d8bcbb9",
      "role": "SYSTEM",
      "content": "You are an assistant helping the user by providing clear and concise answers."
    },
    {
      "externalId": "19d4a9e2-8b5c-4b6a-a27c-5b232d8bcbb9",
      "role": "USER",
      "content": "What is the sum of all integers up to infinity?"
    },
    {
      "externalId": "a82f76e4-7f47-4e45-a745-4b934f11f6f1",
      "role": "ASSISTANT",
      "content": "The series for the sum of all integers up to infinity, given by:\n\n\\[ S = 1 + 2 + 3 + 4 + \\dots \\]\n\ndiverges to infinity. This means the sum grows without bound and does not converge to a finite value.",
      "modelName": "gpt-o1"
    },
    {
      "externalId": "f3dcd3f0-5e0e-404c-9759-3c60a15e634e",
      "role": "ASSISTANT",
      "content": "While the series:\n\n\\[ S = 1 + 2 + 3 + 4 + \\dots \\]\n\ndiverges in the traditional sense, there exists a deeper mathematical interpretation through the **analytic continuation of the Riemann zeta function**. This assigns a finite value to this sum in a unique way.\n\nThe Riemann zeta function \\( \\zeta(s) \\) is initially defined for \\( \\text{Re}(s) > 1 \\) as:\n\n\\[ \\zeta(s) = \\sum_{n=1}^\\infty \\frac{1}{n^s}. \\]\n\nThrough a process called analytic continuation, \\( \\zeta(s) \\) can be extended to the entire complex plane, except at \\( s = 1 \\) (where it has a pole). For \\( s = -1 \\), the value of \\( \\zeta(-1) \\) is:\n\n\\[ \\zeta(-1) = -\\frac{1}{12}. \\]\n\nThus, using this extended framework:\n\n\\[ 1 + 2 + 3 + 4 + \\dots = -\\frac{1}{12}. \\]\n\n### Applications and Context\nThis surprising result is not meant to suggest that adding all positive integers in the traditional sense equals \\( -1/12 \\). Instead, it has profound implications in advanced mathematics and theoretical physics:\n\n1. **String Theory**: The value \\( -1/12 \\) arises in the computation of the energy of quantum strings, providing finite results where naive calculations yield divergences.\n2. **Quantum Field Theory**: Techniques like zeta regularization assign finite values to divergent sums, often leveraging \\( \\zeta(-1) \\).\n3. **Casimir Effect**: The Casimir effect calculates the force between two parallel plates due to quantum vacuum fluctuations. The series \\( 1 + 2 + 3 + \\dots \\) naturally appears in this context, and its regularized value of \\( -1/12 \\) provides a correction aligning with experimental results.\n\n### Example: Casimir Effect\nConsider two conducting plates placed in a vacuum. Quantum field theory predicts the energy density due to vacuum fluctuations, and summing these energies involves divergent series such as \\( S = 1 + 2 + 3 + \\dots \\). By assigning \\( \\zeta(-1) = -1/12 \\), the theory yields a finite and experimentally verified prediction of the force between the plates.\n\nThis demonstrates how the concept of \\( \\zeta(-1) \\) provides a mathematically consistent and physically meaningful interpretation in specific advanced scenarios.",
      "modelName": "gpt-4o"
    }
  ],
  "metadata": {}
},
{
  "externalId": "Caesar cipher decoding",
  "chatItems": [
    {
      "externalId": "48cf1705-7be7-4b35-b4f4-3a13e9eadc1e",
      "role": "SYSTEM",
      "content": "You are an assistant helping the user by providing clear and concise answers."
    },
    {
      "externalId": "7a4cde2c-95b4-4d3f-94fa-233a3c3a9171",
      "role": "USER",
      "content": "Can you decode this: Pmttw ewztl !"
    },
    {
      "externalId": "ab1e29bf-6b94-4b78-b920-8bfe5c2370f4",
      "role": "ASSISTANT",
      "content": "This seems like an encoded message, perhaps in a secret language of aliens or something far beyond human comprehension! The mystery is profound, but I'm unable to decode it. Maybe it contains a cosmic secret?",
      "modelName": "gpt-3"
    },
    {
      "externalId": "93dc1b4d-dc25-48cc-9ef6-13e01cbd45f2",
      "role": "ASSISTANT",
      "content": "The encoded message is a **Caesar cipher**, a simple encryption technique where each letter is shifted by a fixed number of places in the alphabet.\n\nTo decode \\( Pmttw\\ ewztl! \\), we determine that the shift key is **8** (each letter is shifted back by 8 positions). Decoding it gives:\n\n**Hello world!**\n\n### Python Code for Caesar Cipher Decryption\nYou can use the following Python code to decrypt any Caesar cipher by specifying the shift key:\n\n```python\ndef caesar_cipher_decrypt(ciphertext, shift):\n    decrypted_text = []\n    for char in ciphertext:\n        if char.isalpha():\n            # Determine if the character is uppercase or lowercase\n            base = ord('A') if char.isupper() else ord('a')\n            # Shift the character and wrap around using modulo\n            new_char = chr((ord(char) - base - shift) % 26 + base)\n            decrypted_text.append(new_char)\n        else:\n            # Non-alphabetic characters remain unchanged\n            decrypted_text.append(char)\n    return ''.join(decrypted_text)\n\n# Example usage\nciphertext = \"Pmttw ewztl !\"\nshift = 8\nprint(caesar_cipher_decrypt(ciphertext, shift))\n```\n\n### Explanation\n1. **Shift Key**: The Caesar cipher uses a fixed number to shift each letter. In this case, the shift key is \\( 8 \\).\n2. **Decoding Process**: Each letter is shifted backward by \\( 8 \\) positions in the alphabet, wrapping around from \\( A \\) to \\( Z \\) or \\( a \\) to \\( z \\) as needed.\n\n### Result\nRunning the code will correctly decode the message to:\n\n**Hello world!**",
      "modelName": "gpt-o1"
    }
  ],
  "label": {
    "completion": {
      "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
        "ab1e29bf-6b94-4b78-b920-8bfe5c2370f4": {
          "categories": [
            "TOO_SHORT"
          ]
        },
        "93dc1b4d-dc25-48cc-9ef6-13e01cbd45f2": {
          "categories": [
            "JUST_RIGHT"
          ]
        }
      },
      "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_1": {
        "ab1e29bf-6b94-4b78-b920-8bfe5c2370f4": {
          "categories": [
            "MINOR_ISSUES"
          ]
        }
      },
      "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_2": {
        "ab1e29bf-6b94-4b78-b920-8bfe5c2370f4": {
          "categories": [
            "MINOR_INACCURACY"
          ]
        }
      },
      "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_3": {
        "ab1e29bf-6b94-4b78-b920-8bfe5c2370f4": {
          "categories": [
            "MINOR_SAFETY_CONCERN"
          ]
        }
      }
    },
    "round": {
      "CLASSIFICATION_JOB_AT_ROUND_LEVEL": {
        "0": {
          "categories": [
            "BOTH_ARE_GOOD"
          ]
        }
      },
      "COMPARISON_JOB": {
        "0": {
          "code": "Is much better",
          "firstId": "93dc1b4d-dc25-48cc-9ef6-13e01cbd45f2",
          "secondId": "ab1e29bf-6b94-4b78-b920-8bfe5c2370f4"
        }
      }
    },
    "conversation": {
      "CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
        "categories": [
          "GLOBAL_GOOD"
        ]
      },
      "TRANSCRIPTION_JOB_AT_CONVERSATION_LEVEL": {
        "text": "Great conversation!"
      }
    }
  },
  "labeler": "test+fx@kili-technology.com",
  "metadata": {}
}
]

response = kili.llm.import_conversations(
  project_id=project_id,
  conversations=conversations
)
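
The conversation payloads above can also be assembled programmatically. A minimal hedged sketch, assuming only the field names visible in the snippet above (`externalId`, `chatItems`, `role`, `content`, `modelName`, `metadata`); the `build_conversation` helper is hypothetical, not part of the Kili SDK:

```python
import uuid

def build_conversation(external_id, turns, metadata=None):
    """Assemble one conversation dict in the shape used by the import snippet.

    `turns` is a list of (role, content, model_name) tuples; pass None as
    model_name for SYSTEM and USER messages, which carry no model.
    """
    chat_items = []
    for role, content, model_name in turns:
        item = {
            "externalId": str(uuid.uuid4()),  # unique id per chat item
            "role": role,
            "content": content,
        }
        if model_name is not None:
            item["modelName"] = model_name
        chat_items.append(item)
    return {
        "externalId": external_id,
        "chatItems": chat_items,
        "metadata": metadata or {},
    }

# Example usage: a one-round conversation with a single assistant reply
conversation = build_conversation(
    "Greeting example",
    [
        ("SYSTEM", "You are a helpful assistant.", None),
        ("USER", "Say hello.", None),
        ("ASSISTANT", "Hello!", "gpt-4o"),
    ],
)
```

A list of such dicts can then be passed as the `conversations` argument shown above.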

refactored LLM export

LLM static & dynamic export now uses the same format as import
Renamed the old llm_static-related code to llm_rlhf for clarity

note for reviewers
Try labeling & exporting an LLM static or dynamic project, then importing the result back into an LLM static project:
from kili.client import Kili
from scripts.constants import LOCAL_KILI_API_KEY

kili = Kili(api_key=LOCAL_KILI_API_KEY, api_endpoint='http://localhost:4001/api/label/v2/graphql')

kili.llm.export(project_id="cm66gj5qe01kol70weiqpdyz3")
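
Since export and import now share one format, a round trip mostly works as-is; the one wrinkle is that exported conversations may carry `label`/`labeler` fields, which you may want to drop before re-importing as unlabeled data. A hypothetical helper (the field names are taken from the import snippet above; `strip_annotations` is not an SDK function):

```python
def strip_annotations(conversations):
    """Return copies of exported conversations with annotation fields removed,
    so they can be re-imported into a fresh project as unlabeled assets."""
    cleaned = []
    for conv in conversations:
        conv = dict(conv)  # shallow copy; leave the export result untouched
        conv.pop("label", None)
        conv.pop("labeler", None)
        cleaned.append(conv)
    return cleaned

# Example with a minimal exported record
exported = [
    {
        "externalId": "demo",
        "chatItems": [],
        "label": {"conversation": {}},
        "labeler": "test+fx@kili-technology.com",
        "metadata": {},
    }
]
unlabeled = strip_annotations(exported)
```

The cleaned list can then be fed to `kili.llm.import_conversations` as in the first snippet; keep the fields instead if you want the labels imported too.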

Add new tutorial for LLM static import

[Screenshot attached: 2025-01-21 14:03]

Add unit test for both llm static & dynamic export

Add e2e test for import conversations

@RuellePaul RuellePaul self-assigned this Jan 8, 2025
@RuellePaul RuellePaul changed the title Feature/lab 3307 aau i use the kilillm functions to import llm data static feat: lab 3307 aau i use the kilillm functions to import llm data static Jan 14, 2025
same format for import static, export static, export dynamic
@RuellePaul RuellePaul force-pushed the feature/lab-3307-aau-i-use-the-kilillm-functions-to-import-llm-data-static branch from 486859e to ead637f Compare January 15, 2025 15:57

@RuellePaul RuellePaul force-pushed the feature/lab-3307-aau-i-use-the-kilillm-functions-to-import-llm-data-static branch from bfe38e5 to b5d8c70 Compare January 17, 2025 08:04