forked from dotnet/docs
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 6 changed files with 178 additions and 28 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
--- | ||
title: Overview of ML.NET | ||
description: Learn about ML.NET, an open-source machine learning framework for .NET that includes an API, the Model Builder extension for Visual Studio, and a command-line interface. | ||
ms.date: 04/15/2024 | ||
--- | ||
|
||
# Overview of ML.NET | ||
|
||
ML.NET is an open-source, cross-platform machine learning framework for .NET developers that lets you integrate custom machine learning models into .NET applications. It comprises three parts: an [API](how-does-mldotnet-work.md) that's distributed as a set of NuGet packages, a Visual Studio extension called [Model Builder](automate-training-with-model-builder.md), and a [command-line interface](automate-training-with-cli.md) that's installed as a .NET tool. | ||
|
||
ML.NET packages: | ||
|
||
- [Microsoft.ML](https://www.nuget.org/packages/Microsoft.ML) | ||
- [Microsoft.ML.AutoML](https://www.nuget.org/packages/Microsoft.ML.AutoML) | ||
- [Microsoft.ML.Probabilistic](https://www.nuget.org/packages/Microsoft.ML.Probabilistic) | ||
- [Microsoft.ML.Tokenizers](https://www.nuget.org/packages/Microsoft.ML.Tokenizers) | ||
- [Many other packages](https://www.nuget.org/profiles/MLNET) | ||
|
||
Visual Studio extension: | ||
|
||
- [Model Builder extension for Visual Studio](https://marketplace.visualstudio.com/items?itemName=MLNET.ModelBuilder2022) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
--- | ||
title: What's new in ML.NET | ||
titleSuffix: "" | ||
description: Discover what's new in ML.NET. | ||
ms.date: 04/15/2024 | ||
ms.topic: whats-new | ||
|
||
#Customer intent: As a developer, I want to know what the new features are in ML.NET. | ||
|
||
--- | ||
|
||
# What's new in ML.NET | ||
|
||
> [!NOTE] | ||
> This article is a work in progress. | ||
|
||
You can find all of the release notes for the ML.NET API in the [dotnet/machinelearning repo](https://github.com/dotnet/machinelearning/tree/main/docs/release-notes). | ||
|
||
## New deep-learning tasks | ||
|
||
ML.NET 3.0 added support for the following deep-learning tasks: | ||
|
||
- Object detection (backed by TorchSharp) | ||
- Named entity recognition (NER) | ||
- Question answering (QA) | ||
|
||
These trainers are included in the [Microsoft.ML.TorchSharp](https://www.nuget.org/packages/Microsoft.ML.TorchSharp) package. For more information, see [Announcing ML.NET 3.0](https://devblogs.microsoft.com/dotnet/announcing-ml-net-3-0/). | ||
|
||
## AutoML | ||
|
||
In ML.NET 3.0, the AutoML sweeper was updated to support the sentence similarity, question answering, and object detection tasks. For more information about AutoML, see [How to use the ML.NET Automated Machine Learning (AutoML) API](../how-to-guides/how-to-use-the-automl-api.md). | ||
|
||
## Additional tokenizer support | ||
|
||
[Microsoft.ML.Tokenizers](https://devblogs.microsoft.com/dotnet/announcing-ml-net-2-0/#tokenizer-support) is an open-source, cross-platform tokenization library. When it was introduced, the library was scoped to the [Byte-Pair Encoding (BPE)](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokenization strategy, which covered the language scenarios in ML.NET. ML.NET 4.0 Preview 1 added support for the `Tiktoken` tokenizer. | ||
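
As a rough illustration of the byte-pair encoding idea mentioned above, the following sketch performs a single merge step on a list of symbols: it finds the most frequent adjacent pair and fuses each occurrence into one symbol. This is a toy written for this explanation, not the implementation used by Microsoft.ML.Tokenizers; the `ToyBpe` class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Toy illustration of one byte-pair encoding merge step.
// Assumption: this is NOT how Microsoft.ML.Tokenizers is implemented.
public static class ToyBpe
{
    // Returns the most frequent adjacent pair of symbols, or null if there are
    // fewer than two symbols. Ties go to the pair that appears first.
    public static (string Left, string Right)? MostFrequentPair(IReadOnlyList<string> symbols)
    {
        var counts = new Dictionary<(string, string), int>();
        for (int i = 0; i + 1 < symbols.Count; i++)
        {
            var pair = (symbols[i], symbols[i + 1]);
            counts[pair] = counts.GetValueOrDefault(pair) + 1;
        }

        (string, string)? best = null;
        int bestCount = 0;
        for (int i = 0; i + 1 < symbols.Count; i++)
        {
            var pair = (symbols[i], symbols[i + 1]);
            if (counts[pair] > bestCount)
            {
                best = pair;
                bestCount = counts[pair];
            }
        }
        return best;
    }

    // Replaces each left-to-right, non-overlapping occurrence of the pair
    // with a single merged symbol.
    public static List<string> Merge(IReadOnlyList<string> symbols, (string Left, string Right) pair)
    {
        var merged = new List<string>();
        int i = 0;
        while (i < symbols.Count)
        {
            if (i + 1 < symbols.Count && symbols[i] == pair.Left && symbols[i + 1] == pair.Right)
            {
                merged.Add(pair.Left + pair.Right);
                i += 2;
            }
            else
            {
                merged.Add(symbols[i]);
                i++;
            }
        }
        return merged;
    }
}
```

A BPE trainer repeats this step until it reaches a target vocabulary size, recording each merge so that the same merges can be replayed on new text at encoding time.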
|
||
The following examples show how to use the `Tiktoken` text tokenizer. | ||
|
||
:::code language="csharp" source="./snippets/csharp/Tiktoken.cs" id="Tiktoken"::: | ||
|
||
### About tokenization | ||
|
||
Tokenization is a fundamental component in the preprocessing of natural language text for AI models. Tokenizers are responsible for breaking down a string of text into smaller, more manageable parts, often referred to as *tokens*. When using services like Azure OpenAI, you can use tokenizers to get a better understanding of cost and manage context. When working with self-hosted or local models, tokens are the inputs provided to those models. | ||
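
To make the "manage context" point concrete, here is a minimal sketch of trimming a conversation history to a token budget. It's an illustration under stated assumptions, not part of the ML.NET API: `ContextBudget`, `FitToBudget`, and `CountWords` are hypothetical names, and the token counter is passed in as a delegate so that any tokenizer can be plugged in. The word-based counter is a stand-in for demonstration only and doesn't produce real token counts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for keeping a conversation within a token budget.
public static class ContextBudget
{
    // Keeps the most recent messages whose combined token count fits within maxTokens.
    // countTokens can be any counter, for example one backed by a real tokenizer.
    public static List<string> FitToBudget(
        IReadOnlyList<string> messages,
        int maxTokens,
        Func<string, int> countTokens)
    {
        var kept = new List<string>();
        int used = 0;

        // Walk from newest to oldest so that the most recent context survives.
        for (int i = messages.Count - 1; i >= 0; i--)
        {
            int cost = countTokens(messages[i]);
            if (used + cost > maxTokens)
            {
                break;
            }
            used += cost;
            kept.Add(messages[i]);
        }

        kept.Reverse(); // Restore chronological order.
        return kept;
    }

    // Stand-in counter (assumption): counts whitespace-separated words, not real tokens.
    public static int CountWords(string text) =>
        text.Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries).Length;
}
```

With a tokenizer like the one in the Tiktoken snippet, the delegate could presumably be `text => tokenizer.CountTokens(text)`, since that snippet shows `CountTokens` returning an `int` for a string.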
|
||
## Model Builder (Visual Studio extension) | ||
|
||
Model Builder has been updated to consume the ML.NET 3.0 release. Model Builder version 17.18.0 added question answering (QA) and named entity recognition (NER) scenarios. | ||
|
||
You can find all of the Model Builder release notes in the [dotnet/machinelearning-modelbuilder repo](https://github.com/dotnet/machinelearning-modelbuilder/tree/main/docs/release-notes). | ||
|
||
## See also | ||
|
||
- [Blog post: Announcing ML.NET 3.0](https://devblogs.microsoft.com/dotnet/announcing-ml-net-3-0/) |
13 changes: 13 additions & 0 deletions
docs/machine-learning/whats-new/snippets/csharp/Project.csproj
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
<Project Sdk="Microsoft.NET.Sdk"> | ||
|
||
<PropertyGroup> | ||
<OutputType>Exe</OutputType> | ||
<TargetFramework>net9.0</TargetFramework> | ||
<Nullable>enable</Nullable> | ||
</PropertyGroup> | ||
|
||
<ItemGroup> | ||
<PackageReference Include="Microsoft.ML.Tokenizers" Version="0.22.0-preview.24179.1" /> | ||
</ItemGroup> | ||
|
||
</Project> |
65 changes: 65 additions & 0 deletions
docs/machine-learning/whats-new/snippets/csharp/Tiktoken.cs
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
using System; | ||
using System.Collections.Generic; | ||
using Microsoft.ML.Tokenizers; | ||
|
||
internal class TiktokenExample | ||
{ | ||
public static void RunIt() | ||
{ | ||
// <Tiktoken> | ||
Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4"); | ||
string text = "Hello, World!"; | ||
|
||
// Encode to IDs. | ||
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text); | ||
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}"); | ||
// encodedIds = {9906, 11, 4435, 0} | ||
|
||
// Decode IDs to text. | ||
string decodedText = tokenizer.Decode(encodedIds); | ||
Console.WriteLine($"decodedText = {decodedText}"); | ||
// decodedText = Hello, World! | ||
|
||
// Get token count. | ||
int idsCount = tokenizer.CountTokens(text); | ||
Console.WriteLine($"idsCount = {idsCount}"); | ||
// idsCount = 4 | ||
|
||
// Full encoding. | ||
EncodingResult result = tokenizer.Encode(text); | ||
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Tokens)}'}}"); | ||
// result.Tokens = {'Hello', ',', ' World', '!'} | ||
Console.WriteLine($"result.Offsets = {{{string.Join(", ", result.Offsets)}}}"); | ||
// result.Offsets = {(0, 5), (5, 1), (6, 6), (12, 1)} | ||
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Ids)}}}"); | ||
// result.Ids = {9906, 11, 4435, 0} | ||
|
||
// Encode up to number of tokens limit. | ||
int index1 = tokenizer.IndexOfTokenCount( | ||
text, | ||
maxTokenCount: 1, | ||
out string processedText1, | ||
out int tokenCount1 | ||
); // Encode up to one token. | ||
Console.WriteLine($"processedText1 = {processedText1}"); | ||
// processedText1 = Hello, World! | ||
Console.WriteLine($"tokenCount1 = {tokenCount1}"); | ||
// tokenCount1 = 1 | ||
Console.WriteLine($"index1 = {index1}"); | ||
// index1 = 5 | ||
|
||
int index2 = tokenizer.LastIndexOfTokenCount( | ||
text, | ||
maxTokenCount: 1, | ||
out string processedText2, | ||
out int tokenCount2 | ||
); // Encode from end up to one token. | ||
Console.WriteLine($"processedText2 = {processedText2}"); | ||
// processedText2 = Hello, World! | ||
Console.WriteLine($"tokenCount2 = {tokenCount2}"); | ||
// tokenCount2 = 1 | ||
Console.WriteLine($"index2 = {index2}"); | ||
// index2 = 12 | ||
// </Tiktoken> | ||
} | ||
} |
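
The output comments in the snippet above suggest how to read the numbers it prints: each offset looks like an (index, length) pair into the original string, `index1` looks like the end of the prefix that fits in the token limit, and `index2` looks like the start of the suffix that fits. That reading is an assumption drawn from the sample output, not from the library's documentation. The following sketch uses plain string operations to show it; `OffsetDemo` and its members are hypothetical and don't call the tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that interprets the numbers printed by the Tiktoken snippet.
public static class OffsetDemo
{
    // Slices the text by (Index, Length) pairs, the offset shape shown in the sample output.
    public static List<string> SliceByOffsets(
        string text,
        IEnumerable<(int Index, int Length)> offsets) =>
        offsets.Select(o => text.Substring(o.Index, o.Length)).ToList();

    // Assumption: an index from IndexOfTokenCount marks the exclusive end of the prefix.
    public static string PrefixUpTo(string text, int index) => text.Substring(0, index);

    // Assumption: an index from LastIndexOfTokenCount marks the start of the suffix.
    public static string SuffixFrom(string text, int index) => text.Substring(index);
}
```

Applied to `"Hello, World!"` with the offsets from the sample output, `SliceByOffsets` reproduces the four tokens that the snippet lists. `PrefixUpTo(text, 5)` yields `Hello` and `SuffixFrom(text, 12)` yields `!`, the first and last tokens.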