Releases: huggingface/transformers.js

3.3.2

22 Jan 15:13
6f43f24

What's new?

  • Add support for Helium and Glm in #1156
  • Improve build process and fix usage with certain bundlers in #1158
  • Auto-detect wordpiece tokenizer when model.type is missing in #1151
  • Update Moonshine config values for transformers v4.48.0 in #1155
  • Support simultaneous tensor op execution in WASM in #1162
  • Update React tutorial sample code in #1152

Full Changelog: 3.3.1...3.3.2

3.3.1

15 Jan 15:36
e1753ac

What's new?

  • hotfix: Copy missing ort-wasm-simd-threaded.jsep.mjs to dist folder (#1150)

Full Changelog: 3.3.0...3.3.1

3.3.0

15 Jan 13:28
e00ff3b

🔥 Transformers.js v3.3 — StyleTTS 2 (Kokoro) for state-of-the-art text-to-speech, Grounding DINO for zero-shot object detection

🤖 New models: StyleTTS 2, Grounding DINO

StyleTTS 2 for high-quality speech synthesis

See #1148 for more information and here for the list of supported models.

First, install the kokoro-js library, which uses Transformers.js, from NPM using:

npm i kokoro-js

You can then generate speech as follows:

import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
audio.save("audio.wav");

Grounding DINO for zero-shot object detection

See #1137 for more information and here for the list of supported models.

Example: Zero-shot object detection with onnx-community/grounding-dino-tiny-ONNX using the pipeline API.

import { pipeline } from "@huggingface/transformers";

const detector = await pipeline("zero-shot-object-detection", "onnx-community/grounding-dino-tiny-ONNX");

const url = "http://images.cocodataset.org/val2017/000000039769.jpg";
const candidate_labels = ["a cat."];
const output = await detector(url, candidate_labels, {
  threshold: 0.3,
});

Example output:

[
  { score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
  { score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
]
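
The detector returns a plain array of `{ score, label, box }` objects, so post-processing needs no library code. Below is a minimal sketch, assuming the output shape shown above; `boxArea` and `filterDetections` are hypothetical helpers for illustration, not part of Transformers.js.

```javascript
// Hypothetical helpers (not part of Transformers.js) for post-processing
// the array returned by the zero-shot object detection pipeline.

// Area of a box in pixels, given { xmin, ymin, xmax, ymax }.
function boxArea({ xmin, ymin, xmax, ymax }) {
  return Math.max(0, xmax - xmin) * Math.max(0, ymax - ymin);
}

// Keep detections at or above `minScore`, sorted by descending score.
function filterDetections(detections, minScore) {
  return detections
    .filter((d) => d.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

// Example output copied from the release notes above.
const detections = [
  { score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
  { score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
];

const confident = filterDetections(detections, 0.4);
console.log(confident.length); // 1
console.log(boxArea(confident[0].box)); // 102606
```

Raising the `threshold` option in the pipeline call should have a similar effect to a stricter `minScore` here, but filtering afterwards lets you try several cut-offs without re-running the model.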

Full Changelog: 3.2.4...3.3.0

3.2.4

28 Dec 12:03
307a490

What's new?

  • Add support for visualizing self-attention heatmaps in #1117

    [Figure: the input image (Cats) and the attention maps for Attention Heads 0 to 5]

    Example code:

    import { AutoProcessor, AutoModelForImageClassification, interpolate_4d, RawImage } from "@huggingface/transformers";
    
    // Load model and processor
    const model_id = "onnx-community/dinov2-with-registers-small-with-attentions";
    const model = await AutoModelForImageClassification.from_pretrained(model_id);
    const processor = await AutoProcessor.from_pretrained(model_id);
    
    // Load image from URL
    const image = await RawImage.read("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg");
    
    // Pre-process image
    const inputs = await processor(image);
    
    // Perform inference
    const { logits, attentions } = await model(inputs);
    
    // Get the predicted class
    const cls = logits[0].argmax().item();
    const label = model.config.id2label[cls];
    console.log(`Predicted class: ${label}`);
    
    // Set config values
    const patch_size = model.config.patch_size;
    // pixel_values has dims [batch, channels, height, width]
    const [height, width] = inputs.pixel_values.dims.slice(-2);
    const h_featmap = Math.floor(height / patch_size);
    const w_featmap = Math.floor(width / patch_size);
    const num_heads = model.config.num_attention_heads;
    const num_cls_tokens = 1;
    const num_register_tokens = model.config.num_register_tokens ?? 0;
    
    // Visualize attention maps
    const selected_attentions = attentions
        .at(-1) // we are only interested in the attention maps of the last layer
        .slice(0, null, 0, [num_cls_tokens + num_register_tokens, null])
        .view(num_heads, 1, h_featmap, w_featmap);
    
    const upscaled = await interpolate_4d(selected_attentions, {
        size: [height, width],
        mode: "nearest",
    });
    
    for (let i = 0; i < num_heads; ++i) {
        const head_attentions = upscaled[i];
        const minval = head_attentions.min().item();
        const maxval = head_attentions.max().item();
        const image = RawImage.fromTensor(
            head_attentions
                .sub_(minval)
                .div_(maxval - minval)
                .mul_(255)
                .to("uint8"),
        );
        await image.save(`attn-head-${i}.png`);
    }
  • Add min, max, argmin, argmax tensor ops for dim=null

  • Add support for nearest-neighbour interpolation in interpolate_4d

  • Depth Estimation pipeline improvements (faster & returns resized depth map)

  • TypeScript improvements by @ocavue and @shrirajh in #1081 and #1122

  • Remove unused imports from tokenizers.js by @pratapvardhan in #1116
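
The loop in the attention-heatmap example above rescales each head's values to the 0 to 255 range before saving them as an image. The same min-max normalization can be sketched on a plain array, without any tensor library. `normalizeToUint8` is a hypothetical helper; the guard for a constant input and the explicit truncation are assumptions that do not appear in the example.

```javascript
// Hypothetical helper: min-max normalize values into the 0..255 range,
// following the same steps as sub_(min).div_(max - min).mul_(255) above.
function normalizeToUint8(values) {
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  const range = max - min;
  const out = new Uint8Array(values.length);
  for (let i = 0; i < values.length; ++i) {
    // Assumption: a constant input maps to all zeros instead of dividing by zero.
    // Assumption: fractional results are truncated toward zero.
    out[i] = range === 0 ? 0 : Math.floor(((values[i] - min) / range) * 255);
  }
  return out;
}

const pixels = normalizeToUint8([0, 0.5, 1]);
console.log(Array.from(pixels)); // [ 0, 127, 255 ]
```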

Full Changelog: 3.2.3...3.2.4

3.2.3

25 Dec 10:41
8e075f4

What's new?

  • Fix setting of model_file_name for image feature extraction pipeline in #1114. Thanks @xitanggg for reporting the issue!
  • Add support for dinov2 with registers in #1110. Example usage:
    import { pipeline } from '@huggingface/transformers';
    
    // Create image classification pipeline
    const classifier = await pipeline('image-classification', 'onnx-community/dinov2-with-registers-small-imagenet1k-1-layer');
    
    // Classify an image
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
    const output = await classifier(url);
    console.log(output);
    // [
    //   { label: 'tabby, tabby cat', score: 0.8135351538658142 },
    //   { label: 'tiger cat', score: 0.08967583626508713 },
    //   { label: 'Egyptian cat', score: 0.06800546497106552 },
    //   { label: 'radiator', score: 0.003501888597384095 },
    //   { label: 'quilt, comforter, comfort, puff', score: 0.003408448537811637 },
    // ]

Full Changelog: 3.2.2...3.2.3

3.2.2

23 Dec 15:05
da2c1e9

What's new?

  • Fix env.backends.onnx.wasm.proxy = true: Clone tensor if using onnx wasm proxy in #1108

Full Changelog: 3.2.1...3.2.2

3.2.1

19 Dec 17:02
074e97a

What's new?

  • Add support for ModernBert in #1104. Check out the blog post for more information!

    Example:

    import { pipeline } from '@huggingface/transformers';
    
    const pipe = await pipeline('fill-mask', 'answerdotai/ModernBERT-base');
    const answer = await pipe('The capital of France is [MASK].');
    console.log(answer);

Full Changelog: 3.2.0...3.2.1

3.2.0

15 Dec 18:11
610391d

🔥 Transformers.js v3.2 — Moonshine for real-time speech recognition, Phi-3.5 Vision for multi-frame image understanding and reasoning, and more!

🤖 New models: Moonshine, Phi-3.5 Vision, EXAONE

Moonshine for real-time speech recognition

Moonshine is a family of speech-to-text models optimized for fast and accurate automatic speech recognition (ASR) on resource-constrained devices. They are well-suited to real-time, on-device applications like live transcription and voice command recognition, and are perfect for in-browser usage (check out the online demo). See #1099 for more information and here for the list of supported models.

Example: Automatic speech recognition with Moonshine tiny.

import { pipeline } from "@huggingface/transformers";

const transcriber = await pipeline("automatic-speech-recognition", "onnx-community/moonshine-tiny-ONNX");
const output = await transcriber("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav");
console.log(output);
// { text: 'And so my fellow Americans ask not what your country can do for you as what you can do for your country.' }

Example using the MoonshineForConditionalGeneration API:

import { MoonshineForConditionalGeneration, AutoProcessor, read_audio } from "@huggingface/transformers";

// Load model and processor
const model_id = "onnx-community/moonshine-tiny-ONNX";
const model = await MoonshineForConditionalGeneration.from_pretrained(model_id, {
    dtype: "q4",
});
const processor = await AutoProcessor.from_pretrained(model_id);

// Load audio and prepare inputs
const audio = await read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav", 16000);
const inputs = await processor(audio);

// Generate outputs
const outputs = await model.generate({ ...inputs, max_new_tokens: 100 });

// Decode outputs
const decoded = processor.batch_decode(outputs, { skip_special_tokens: true });
console.log(decoded[0]);
// And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Phi-3.5 Vision for multi-frame image understanding and reasoning

Phi-3.5 Vision is a lightweight, state-of-the-art, open multimodal model that can be used for multi-frame image understanding and reasoning. See #1094 for more information and here for the list of supported models.

Examples:

Input: "What's funny about this image?"

Output: The humor in this image stems from the exaggerated depiction of human evolution, using the Shiba Inu dog breed to represent both ancient and modern humans. The left side shows a muscular, hunter-like figure labeled as 'Humans 100,000 years ago' with the caption 'me hungry me hunt mammoth,' suggesting a time when humans were physically robust and actively hunting. The right side contrasts this with a modern, slim Shiba Inu labeled as 'Humans today' with the caption 'why food delivery slow,' humorously commenting on the modern human's reliance on convenience and technology, such as food delivery services, rather than hunting for sustenance. The use of a dog, which is often associated with loyalty and companionship, adds a layer of irony and humor as it portrays humans in a more diminished, dependent state.

Input: "Summarize the deck of slides."

Output: To summarize, the slides are composed of these sections:

  • Introduction to Azure:

    The presentation introduces Microsoft Azure, a cloud computing platform. It highlights Azure's three service tiers: Hyper-scale, Enterprise, and Hybrid. The presenter is Dinesh Kumar Wickramasinghe, a Senior Software Engineer from CMS Private Limited in Sri Lanka.

  • Azure Overview:

    Azure is described as Microsoft's cloud computing platform, continuously expanding to meet current and future business challenges. It offers freedom to build, manage, and deploy applications on a global network using preferred tools and frameworks.

  • Cloud Computing Services:

    The presentation outlines three types of cloud computing services provided by Azure: Infrastructure-as-a-Service (IaaS) with a 'host' component, Platform-as-a-Service (PaaS) with a 'build' component, and Software-as-a-Service (SaaS) with a 'consume' component.

Example: Single-frame (critique an image)

import {
  AutoProcessor,
  AutoModelForCausalLM,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
  legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: {
    vision_encoder: "q4", // 'q4' or 'q4f16'
    prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
    model: "q4f16", // 'q4f16'
  },
});

// Load image
const image = await load_image("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/meme.png");

// Prepare inputs
const messages = [
  { role: "user", content: "<|image_1|>What's funny about this image?" },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
});
const inputs = await processor(prompt, image, { num_crops: 4 });

// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

// Generate response
const output = await model.generate({
  ...inputs,
  streamer,
  max_new_tokens: 256,
});

Or, decode the output at the end:

// Decode and display the answer
const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
const answer = processor.batch_decode(generated_ids, {
  skip_special_tokens: true,
});
console.log(answer[0]);
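
The slice in the snippet above starts at `inputs.input_ids.dims[1]`, the prompt length, so that only the tokens after the prompt are decoded. The idea can be sketched with plain arrays in place of tensors; `stripPrompt` is a hypothetical helper and the token ids are made up for illustration.

```javascript
// Hypothetical helper: drop the first `promptLength` tokens from each
// sequence in a batch, keeping only the newly generated tokens.
function stripPrompt(sequences, promptLength) {
  return sequences.map((ids) => ids.slice(promptLength));
}

// Made-up token ids: a 3-token prompt followed by 2 generated tokens.
const sequences = [[101, 7, 8, 42, 43]];
const generated = stripPrompt(sequences, 3);
console.log(generated); // [ [ 42, 43 ] ]
```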

Example: Multi-frame (summarize slides)

import {
  AutoProcessor,
  AutoModelForCausalLM,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
  legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: {
    vision_encoder: "q4", // 'q4' or 'q4f16'
    prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
    model: "q4f16", // 'q4f16'
  },
});

// Load images
const urls = [
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-1-2048.jpg",
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-2-2048.jpg",
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-3-2048.jpg",
];
const images = await Promise.all(urls.map(load_image));

// Prepare inputs
const placeholder = images.map((_, i) => `<|image_${i + 1}|>\n`).join("");
const messages = [
  { role: "user", content: placeholder + "Summarize the deck of slides." },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
});
const inputs = await processor(prompt, images, { num_crops: 4 });

// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

// Generate response
const output = await model.generate({
  ...inputs,
  streamer,
  max_new_tokens: 256,
});
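
The multi-frame prompt above relies on one numbered `<|image_N|>` placeholder per image, each followed by a newline, placed before the question. A small sketch of what that string looks like for three images; `buildImagePlaceholders` is a hypothetical helper that only repeats the `map`/`join` expression from the example.

```javascript
// Hypothetical helper: build one `<|image_N|>` placeholder per image
// (numbered from 1), each followed by a newline.
function buildImagePlaceholders(numImages) {
  return Array.from({ length: numImages }, (_, i) => `<|image_${i + 1}|>\n`).join("");
}

const content = buildImagePlaceholders(3) + "Summarize the deck of slides.";
console.log(JSON.stringify(content));
// "<|image_1|>\n<|image_2|>\n<|image_3|>\nSummarize the deck of slides."
```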

EXAONE 3.5 for bilingual (English and Korean) text generation

EXAONE 3.5 is a collection of instruction-tuned bilingual (English and Korean) generative models, developed and released by LG AI Research. See #1084 for more information and here for the list of supported models.

Example: Text generation with EXAONE-3.5-2.4B-Instruct:

import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/EXAONE-3.5-2.4B-Instruct",
  { dtype...
Read more

3.1.2

07 Dec 21:56
9914e7a

🤖 New models

  • Add support for PaliGemma (& PaliGemma2) in #1074

    Example: Image captioning with onnx-community/paligemma2-3b-ft-docci-448.

    import { AutoProcessor, PaliGemmaForConditionalGeneration, load_image } from '@huggingface/transformers';
    
    // Load processor and model
    const model_id = 'onnx-community/paligemma2-3b-ft-docci-448';
    const processor = await AutoProcessor.from_pretrained(model_id);
    const model = await PaliGemmaForConditionalGeneration.from_pretrained(model_id, {
        dtype: {
            embed_tokens: 'fp16', // or 'q8'
            vision_encoder: 'fp16', // or 'q4', 'q8'
            decoder_model_merged: 'q4', // or 'q4f16'
        },
    });
    
    // Prepare inputs
    const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg';
    const raw_image = await load_image(url);
    const prompt = '<image>caption en'; // Caption the image in English
    const inputs = await processor(raw_image, prompt);
    
    // Generate a response
    const output = await model.generate({
        ...inputs,
        max_new_tokens: 100,
    });
    
    const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
    const answer = processor.batch_decode(
        generated_ids,
        { skip_special_tokens: true },
    );
    console.log(answer[0]);
    // A side view of a light blue 1970s Volkswagen Beetle parked on a gray cement road. It is facing to the right. It has a reflection on the side of it. Behind it is a yellow building with a brown double door on the right. It has a white frame around it. Part of a gray cement wall is visible on the far left.

    List of supported models: https://huggingface.co/models?library=transformers.js&other=paligemma

  • Add support for I-JEPA in #1073

    Example: Image feature extraction with onnx-community/ijepa_vith14_1k.

    import { pipeline, cos_sim } from "@huggingface/transformers";
    
    // Create an image feature extraction pipeline
    const extractor = await pipeline(
      "image-feature-extraction",
      "onnx-community/ijepa_vith14_1k",
      { dtype: "q8" },
    );
    
    // Compute image embeddings
    const url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg";
    const url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg";
    const output = await extractor([url_1, url_2]);
    const pooled_output = output.mean(1); // Apply mean pooling
    
    // Compute cosine similarity
    const similarity = cos_sim(pooled_output[0].data, pooled_output[1].data);
    console.log(similarity); // 0.5168613045518973

    List of supported models: https://huggingface.co/models?library=transformers.js&other=ijepa

  • Add support for OLMo2 in #1076. List of supported models: https://huggingface.co/models?library=transformers.js&other=olmo2
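
The I-JEPA example above reduces each image's patch embeddings to a single vector with mean pooling and then compares the two vectors with cosine similarity. Both steps can be sketched on plain arrays. `meanPool` and `cosineSimilarity` are hypothetical helpers, not the library's `mean` or `cos_sim`, and the embeddings are made up.

```javascript
// Hypothetical helpers illustrating the two steps on plain arrays.

// Mean-pool a [num_patches][hidden_size] matrix into one [hidden_size] vector.
function meanPool(patches) {
  const size = patches[0].length;
  const pooled = new Array(size).fill(0);
  for (const row of patches) {
    for (let j = 0; j < size; ++j) pooled[j] += row[j] / patches.length;
  }
  return pooled;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; ++i) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up embeddings: two patches with two features for each image.
const imageA = meanPool([[1, 0], [3, 0]]); // [2, 0]
const imageB = meanPool([[0, 2], [0, 4]]); // [0, 3]
console.log(cosineSimilarity(imageA, imageA)); // 1
console.log(cosineSimilarity(imageA, imageB)); // 0
```

Identical directions score 1 and orthogonal ones score 0, so the 0.5168613045518973 reported above sits between those two reference points.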

🐛 Bug fixes

  • Fix whisper timestamp extraction for tokenizers with added tokens by @aravindMahadevan in #804
  • Add missing 'ready' status in the ProgressInfo type by @ocavue in #1070

Full Changelog: 3.1.1...3.1.2

3.1.1

03 Dec 11:49
2ee715c

🤖 New models

  • Add support for Idefics3 (SmolVLM) in #1059

    import {
      AutoProcessor,
      AutoModelForVision2Seq,
      load_image,
    } from "@huggingface/transformers";
    
    // Initialize processor and model
    const model_id = "HuggingFaceTB/SmolVLM-Instruct";
    const processor = await AutoProcessor.from_pretrained(model_id);
    const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
      dtype: {
        embed_tokens: "fp16", // "fp32", "fp16", "q8"
        vision_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
        decoder_model_merged: "q4", // "q8", "q4", "q4f16"
      }
    });
    
    // Load images
    const image1 = await load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg");
    const image2 = await load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg");
    
    // Create input messages
    const messages = [
      {
        role: "user",
        content: [
          { type: "image" },
          { type: "image" },
          { type: "text", text: "Can you describe the two images?" },
        ],
      },
    ];
    
    // Prepare inputs
    const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
    const inputs = await processor(text, [image1, image2], {
      // Set `do_image_splitting: true` to split images into multiple patches.
      // NOTE: This uses more memory, but can provide more accurate results.
      do_image_splitting: false,
    });
    
    // Generate outputs
    const generated_ids = await model.generate({
      ...inputs,
      max_new_tokens: 500,
    });
    const generated_texts = processor.batch_decode(
      generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
      { skip_special_tokens: true },
    );
    console.log(generated_texts[0]);
    // ' In the first image, there is a green statue of liberty on a pedestal in the middle of the water. The water is surrounded by trees and buildings in the background. In the second image, there are pink and red flowers with a bee on the pink flower.'

🐛 Bug fixes

  • Fix repetition penalty logits processor in #1062
  • Fix optional chaining for batch size calculation in PreTrainedModel by @emojiiii in #1063

🛠️ Other improvements

  • Only log warning if type not explicitly set to "custom" in #1061
  • Improve browser vs. webworker detection in #1067

Full Changelog: 3.1.0...3.1.1