Releases: huggingface/transformers.js
3.3.2
What's new?
- Add support for Helium and Glm in #1156 (minimal usage sketch after this list)
- Improve build process and fix usage with certain bundlers in #1158
- Auto-detect wordpiece tokenizer when model.type is missing in #1151
- Update Moonshine config values for transformers v4.48.0 in #1155
- Support simultaneous tensor op execution in WASM in #1162
- Update react tutorial sample code in #1152
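For the new Helium and Glm architectures, a minimal text-generation sketch. The model ID below is a placeholder for illustration; substitute any ONNX-converted Helium or Glm checkpoint from the Hub.
import { pipeline } from "@huggingface/transformers";
// NOTE: placeholder model ID -- substitute an actual ONNX-converted Helium or Glm checkpoint
const generator = await pipeline(
  "text-generation",
  "onnx-community/helium-1-preview-2b-ONNX",
  { dtype: "q4" },
);
const output = await generator("The capital of France is", { max_new_tokens: 20 });
console.log(output[0].generated_text);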
Full Changelog: 3.3.1...3.3.2
3.3.1
3.3.0
🔥 Transformers.js v3.3 — StyleTTS 2 (Kokoro) for state-of-the-art text-to-speech, Grounding DINO for zero-shot object detection
🤖 New models: StyleTTS 2, Grounding DINO
StyleTTS 2 for high-quality speech synthesis
See #1148 for more information and here for the list of supported models.
First, install the kokoro-js library, which uses Transformers.js, from NPM using:
npm i kokoro-js
You can then generate speech as follows:
import { KokoroTTS } from "kokoro-js";
const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});
const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
// Use `tts.list_voices()` to list all available voices
voice: "af_bella",
});
audio.save("audio.wav");
Grounding DINO for zero-shot object detection
See #1137 for more information and here for the list of supported models.
Example: Zero-shot object detection with onnx-community/grounding-dino-tiny-ONNX using the pipeline API.
import { pipeline } from "@huggingface/transformers";
const detector = await pipeline("zero-shot-object-detection", "onnx-community/grounding-dino-tiny-ONNX");
const url = "http://images.cocodataset.org/val2017/000000039769.jpg";
const candidate_labels = ["a cat."];
const output = await detector(url, candidate_labels, {
threshold: 0.3,
});
See example output
[
{ score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
{ score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
]
🛠️ Other improvements
- Add the RawAudio class by @Th3G33k in #682
- Update React guide for v3 by @sroussey in #1128
- Add option to skip special tokens in TextStreamer by @sroussey in #1139 (sketch after this list)
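A minimal sketch of the new TextStreamer option from #1139. The model ID and dtype below are illustrative choices; any text-generation model works.
import { AutoTokenizer, AutoModelForCausalLM, TextStreamer } from "@huggingface/transformers";
// NOTE: model ID and dtype are illustrative choices
const model_id = "HuggingFaceTB/SmolLM2-135M-Instruct";
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForCausalLM.from_pretrained(model_id, { dtype: "q4" });
// Stream generated text to the console, skipping the prompt and special tokens
const streamer = new TextStreamer(tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});
const inputs = tokenizer("Write a haiku about the ocean.");
await model.generate({ ...inputs, max_new_tokens: 64, streamer });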
Full Changelog: 3.2.4...3.3.0
3.2.4
What's new?
- Add support for visualizing self-attention heatmaps in #1117. Example code:
import { AutoProcessor, AutoModelForImageClassification, interpolate_4d, RawImage } from "@huggingface/transformers";

// Load model and processor
const model_id = "onnx-community/dinov2-with-registers-small-with-attentions";
const model = await AutoModelForImageClassification.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);

// Load image from URL
const image = await RawImage.read("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg");

// Pre-process image
const inputs = await processor(image);

// Perform inference
const { logits, attentions } = await model(inputs);

// Get the predicted class
const cls = logits[0].argmax().item();
const label = model.config.id2label[cls];
console.log(`Predicted class: ${label}`);

// Set config values
const patch_size = model.config.patch_size;
const [width, height] = inputs.pixel_values.dims.slice(-2);
const w_featmap = Math.floor(width / patch_size);
const h_featmap = Math.floor(height / patch_size);
const num_heads = model.config.num_attention_heads;
const num_cls_tokens = 1;
const num_register_tokens = model.config.num_register_tokens ?? 0;

// Visualize attention maps
const selected_attentions = attentions
  .at(-1) // we are only interested in the attention maps of the last layer
  .slice(0, null, 0, [num_cls_tokens + num_register_tokens, null])
  .view(num_heads, 1, w_featmap, h_featmap);

const upscaled = await interpolate_4d(selected_attentions, {
  size: [width, height],
  mode: "nearest",
});

for (let i = 0; i < num_heads; ++i) {
  const head_attentions = upscaled[i];
  const minval = head_attentions.min().item();
  const maxval = head_attentions.max().item();
  const image = RawImage.fromTensor(
    head_attentions
      .sub_(minval)
      .div_(maxval - minval)
      .mul_(255)
      .to("uint8"),
  );
  await image.save(`attn-head-${i}.png`);
}
- Add min, max, argmin, argmax tensor ops for dim=null (see the sketch after this list)
- Add support for nearest-neighbour interpolation in interpolate_4d
- Depth Estimation pipeline improvements (faster & returns resized depth map)
- TypeScript improvements by @ocavue and @shrirajh in #1081 and #1122
- Remove unused imports from tokenizers.js by @pratapvardhan in #1116
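A minimal sketch of the new tensor ops and nearest-neighbour interpolation, using toy tensors for illustration:
import { Tensor, interpolate_4d } from "@huggingface/transformers";

// min/max/argmin/argmax without a dim argument now reduce over all elements
const t = new Tensor("float32", new Float32Array([3, 1, 4, 1, 5, 9]), [2, 3]);
console.log(t.max().item());    // 9
console.log(t.argmin().item()); // 1

// Nearest-neighbour upscaling of a [batch, channels, height, width] tensor
const maps = new Tensor("float32", new Float32Array([0.1, 0.2, 0.3, 0.4]), [1, 1, 2, 2]);
const upscaled = await interpolate_4d(maps, { size: [4, 4], mode: "nearest" });
console.log(upscaled.dims); // [1, 1, 4, 4]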
New Contributors
- @shrirajh made their first contribution in #1122
- @pratapvardhan made their first contribution in #1116
Full Changelog: 3.2.3...3.2.4
3.2.3
What's new?
- Fix setting of model_file_name for image feature extraction pipeline in #1114. Thanks @xitanggg for reporting the issue!
- Add support for dinov2 with registers in #1110. Example usage:
import { pipeline } from '@huggingface/transformers';

// Create image classification pipeline
const classifier = await pipeline('image-classification', 'onnx-community/dinov2-with-registers-small-imagenet1k-1-layer');

// Classify an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
const output = await classifier(url);
console.log(output);
// [
//   { label: 'tabby, tabby cat', score: 0.8135351538658142 },
//   { label: 'tiger cat', score: 0.08967583626508713 },
//   { label: 'Egyptian cat', score: 0.06800546497106552 },
//   { label: 'radiator', score: 0.003501888597384095 },
//   { label: 'quilt, comforter, comfort, puff', score: 0.003408448537811637 },
// ]
Full Changelog: 3.2.2...3.2.3
3.2.2
3.2.1
What's new?
- Add support for ModernBert in #1104. Check out the blog post for more information! Example:
import { pipeline } from '@huggingface/transformers';

const pipe = await pipeline('fill-mask', 'answerdotai/ModernBERT-base');
const answer = await pipe('The capital of France is [MASK].');
console.log(answer);
Full Changelog: 3.2.0...3.2.1
3.2.0
🔥 Transformers.js v3.2 — Moonshine for real-time speech recognition, Phi-3.5 Vision for multi-frame image understanding and reasoning, and more!
🤖 New models: Moonshine, Phi-3.5 Vision, EXAONE
Moonshine for real-time speech recognition
Moonshine is a family of speech-to-text models optimized for fast and accurate automatic speech recognition (ASR) on resource-constrained devices. They are well-suited to real-time, on-device applications like live transcription and voice command recognition, and are perfect for in-browser usage (check out the online demo). See #1099 for more information and here for the list of supported models.
Example: Automatic speech recognition w/ Moonshine tiny.
import { pipeline } from "@huggingface/transformers";
const transcriber = await pipeline("automatic-speech-recognition", "onnx-community/moonshine-tiny-ONNX");
const output = await transcriber("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav");
console.log(output);
// { text: 'And so my fellow Americans ask not what your country can do for you as what you can do for your country.' }
See example using the MoonshineForConditionalGeneration API
import { MoonshineForConditionalGeneration, AutoProcessor, read_audio } from "@huggingface/transformers";
// Load model and processor
const model_id = "onnx-community/moonshine-tiny-ONNX";
const model = await MoonshineForConditionalGeneration.from_pretrained(model_id, {
dtype: "q4",
});
const processor = await AutoProcessor.from_pretrained(model_id);
// Load audio and prepare inputs
const audio = await read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav", 16000);
const inputs = await processor(audio);
// Generate outputs
const outputs = await model.generate({ ...inputs, max_new_tokens: 100 });
// Decode outputs
const decoded = processor.batch_decode(outputs, { skip_special_tokens: true });
console.log(decoded[0]);
// And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.
Phi-3.5 Vision for multi-frame image understanding and reasoning
Phi-3.5 Vision is a lightweight, state-of-the-art, open multimodal model that can be used for multi-frame image understanding and reasoning. See #1094 for more information and here for the list of supported models.
Examples:
See example code
Example: Single-frame (critique an image)
import {
AutoProcessor,
AutoModelForCausalLM,
TextStreamer,
load_image,
} from "@huggingface/transformers";
// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
dtype: {
vision_encoder: "q4", // 'q4' or 'q4f16'
prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
model: "q4f16", // 'q4f16'
},
});
// Load image
const image = await load_image("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/meme.png");
// Prepare inputs
const messages = [
{ role: "user", content: "<|image_1|>What's funny about this image?" },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
tokenize: false,
add_generation_prompt: true,
});
const inputs = await processor(prompt, image, { num_crops: 4 });
// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
skip_prompt: true,
skip_special_tokens: true,
});
// Generate response
const output = await model.generate({
...inputs,
streamer,
max_new_tokens: 256,
});
Or, decode the output at the end:
// Decode and display the answer
const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
const answer = processor.batch_decode(generated_ids, {
skip_special_tokens: true,
});
console.log(answer[0]);
Example: Multi-frame (summarize slides)
import {
AutoProcessor,
AutoModelForCausalLM,
TextStreamer,
load_image,
} from "@huggingface/transformers";
// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
dtype: {
vision_encoder: "q4", // 'q4' or 'q4f16'
prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
model: "q4f16", // 'q4f16'
},
});
// Load images
const urls = [
"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-1-2048.jpg",
"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-2-2048.jpg",
"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-3-2048.jpg",
];
const images = await Promise.all(urls.map(load_image));
// Prepare inputs
const placeholder = images.map((_, i) => `<|image_${i + 1}|>\n`).join("");
const messages = [
{ role: "user", content: placeholder + "Summarize the deck of slides." },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
tokenize: false,
add_generation_prompt: true,
});
const inputs = await processor(prompt, images, { num_crops: 4 });
// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
skip_prompt: true,
skip_special_tokens: true,
});
// Generate response
const output = await model.generate({
...inputs,
streamer,
max_new_tokens: 256,
});
EXAONE 3.5 for bilingual (English and Korean) text generation
EXAONE 3.5 is a collection of instruction-tuned bilingual (English and Korean) generative models, developed and released by LG AI Research. See #1084 for more information and here for the list of supported models.
Example: Text-generation w/ EXAONE-3.5-2.4B-Instruct:
import { pipeline } from "@huggingface/transformers";
// Create a text generation pipeline
const generator = await pipeline(
"text-generation",
"onnx-community/EXAONE-3.5-2.4B-Instruct",
{ dtype...
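The snippet above is truncated in the original notes; a rough end-to-end sketch of the same pipeline call (the dtype and generation settings below are assumptions, not from the release notes) might look like:
import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline (dtype value is an assumption)
const generator = await pipeline(
  "text-generation",
  "onnx-community/EXAONE-3.5-2.4B-Instruct",
  { dtype: "q4f16" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me about the first president of the United States." },
];

// Generate a response
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);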
3.1.2
🤖 New models
- Add support for PaliGemma (& PaliGemma2) in #1074
Example: Image captioning with onnx-community/paligemma2-3b-ft-docci-448.
import { AutoProcessor, PaliGemmaForConditionalGeneration, load_image } from '@huggingface/transformers';

// Load processor and model
const model_id = 'onnx-community/paligemma2-3b-ft-docci-448';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await PaliGemmaForConditionalGeneration.from_pretrained(model_id, {
  dtype: {
    embed_tokens: 'fp16', // or 'q8'
    vision_encoder: 'fp16', // or 'q4', 'q8'
    decoder_model_merged: 'q4', // or 'q4f16'
  },
});

// Prepare inputs
const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg';
const raw_image = await load_image(url);
const prompt = '<image>caption en'; // Caption the image in English
const inputs = await processor(raw_image, prompt);

// Generate a response
const output = await model.generate({
  ...inputs,
  max_new_tokens: 100,
});

const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
const answer = processor.batch_decode(
  generated_ids,
  { skip_special_tokens: true },
);
console.log(answer[0]);
// A side view of a light blue 1970s Volkswagen Beetle parked on a gray cement road. It is facing to the right. It has a reflection on the side of it. Behind it is a yellow building with a brown double door on the right. It has a white frame around it. Part of a gray cement wall is visible on the far left.
List of supported models: https://huggingface.co/models?library=transformers.js&other=paligemma
- Add support for I-JEPA in #1073
Example: Image feature extraction with onnx-community/ijepa_vith14_1k.
import { pipeline, cos_sim } from "@huggingface/transformers";

// Create an image feature extraction pipeline
const extractor = await pipeline(
  "image-feature-extraction",
  "onnx-community/ijepa_vith14_1k",
  { dtype: "q8" },
);

// Compute image embeddings
const url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg";
const url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg";
const output = await extractor([url_1, url_2]);
const pooled_output = output.mean(1); // Apply mean pooling

// Compute cosine similarity
const similarity = cos_sim(pooled_output[0].data, pooled_output[1].data);
console.log(similarity); // 0.5168613045518973
List of supported models: https://huggingface.co/models?library=transformers.js&other=ijepa
- Add support for OLMo2 in #1076 (minimal usage sketch below). List of supported models: https://huggingface.co/models?library=transformers.js&other=olmo2
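A minimal OLMo2 usage sketch. The model ID below is a placeholder; pick an actual checkpoint from the list linked above.
import { pipeline } from "@huggingface/transformers";

// NOTE: placeholder model ID -- substitute an OLMo2 checkpoint from the list above
const generator = await pipeline(
  "text-generation",
  "onnx-community/OLMo-2-1124-7B-Instruct",
  { dtype: "q4" },
);

const messages = [
  { role: "user", content: "Explain gravity to a five-year-old." },
];
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);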
🐛 Bug fixes
- Fix whisper timestamp extraction for tokenizers with added tokens by @aravindMahadevan in #804
- Add missing 'ready' status in the ProgressInfo type by @ocavue in #1070
🛠️ Other improvements
- Add function to apply mask to RawImage by @BritishWerewolf in #1020
- Bump versions + webpack improvements in #1075
🤗 New contributors
- @aravindMahadevan made their first contribution in #804
Full Changelog: 3.1.1...3.1.2
3.1.1
🤖 New models
- Add support for Idefics3 (SmolVLM) in #1059
import { AutoProcessor, AutoModelForVision2Seq, load_image } from "@huggingface/transformers";

// Initialize processor and model
const model_id = "HuggingFaceTB/SmolVLM-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16", // "fp32", "fp16", "q8"
    vision_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
    decoder_model_merged: "q4", // "q8", "q4", "q4f16"
  },
});

// Load images
const image1 = await load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg");
const image2 = await load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg");

// Create input messages
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "image" },
      { type: "text", text: "Can you describe the two images?" },
    ],
  },
];

// Prepare inputs
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1, image2], {
  // Set `do_image_splitting: true` to split images into multiple patches.
  // NOTE: This uses more memory, but can provide more accurate results.
  do_image_splitting: false,
});

// Generate outputs
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 500,
});
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// ' In the first image, there is a green statue of liberty on a pedestal in the middle of the water. The water is surrounded by trees and buildings in the background. In the second image, there are pink and red flowers with a bee on the pink flower.'
🐛 Bug fixes
- Fix repetition penalty logits processor in #1062 (usage sketch after this list)
- Fix optional chaining for batch size calculation in PreTrainedModel by @emojiiii in #1063
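A quick sketch showing the repetition penalty in action; the model ID, dtype, and penalty value are illustrative choices.
import { pipeline } from "@huggingface/transformers";

// NOTE: model ID and dtype are illustrative choices
const generator = await pipeline("text-generation", "onnx-community/Llama-3.2-1B-Instruct", { dtype: "q4" });

const output = await generator("Once upon a time,", {
  max_new_tokens: 64,
  repetition_penalty: 1.2, // values > 1 penalize repeated tokens
});
console.log(output[0].generated_text);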
📝 Documentation improvements
- Add an example and type enhancement for TextStreamer by @seonglae in #1066
- The smallest typo fix for webgpu.md by @JoramMillenaar in #1068
🛠️ Other improvements
- Only log warning if type not explicitly set to "custom" in #1061
- Improve browser vs. webworker detection in #1067
🤗 New contributors
- @emojiiii made their first contribution in #1063
- @seonglae made their first contribution in #1066
- @JoramMillenaar made their first contribution in #1068
Full Changelog: 3.1.0...3.1.1