From 9f74935329dee85acb0addd792db9c8b853047b5 Mon Sep 17 00:00:00 2001
From: Hamza Imran Saeed
Date: Thu, 30 Nov 2023 14:05:58 +0100
Subject: [PATCH] delete ipynb checkpoint

---
 ...ting_started_with_swebert-checkpoint.ipynb | 675 ------------------
 1 file changed, 675 deletions(-)
 delete mode 100644 serve-python/tests/model/notebooks/.ipynb_checkpoints/getting_started_with_swebert-checkpoint.ipynb

diff --git a/serve-python/tests/model/notebooks/.ipynb_checkpoints/getting_started_with_swebert-checkpoint.ipynb b/serve-python/tests/model/notebooks/.ipynb_checkpoints/getting_started_with_swebert-checkpoint.ipynb
deleted file mode 100644
index 2c4d984..0000000
--- a/serve-python/tests/model/notebooks/.ipynb_checkpoints/getting_started_with_swebert-checkpoint.ipynb
+++ /dev/null
@@ -1,675 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Swedish BERT models: SweBERT masked token prediction\n",
-    "\n",
-    "***\n",
-    "\n",
-    "## Introduction\n",
-    "\n",
-    "This example will check the accessibility of a pre-trained SweBERT model from Arbetsförmedlingen (Sweden's public employment agency), load the model and then perform a simple word prediction task: we remove one word from a sample sentence and predict which word should be in its place.\n",
-    "\n",
-    "THIS NOTEBOOK IS BASED ON THE ORIGINAL NOTEBOOK PUBLISHED BY AF-AI at https://github.com/af-ai-center/SweBERT.git\n",
-    "\n",
-    "We have modified the prediction example at the end to use TensorFlow instead of PyTorch.\n",
-    "\n",
-    "#### Note: Make sure to run this notebook in a virtual environment with the required packages (see README) installed\n",
-    "\n",
-    "***"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Setup"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 69,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import torch\n",
-    "import tensorflow as tf\n",
-    "from transformers import BertTokenizer, BertModel, TFBertModel, BertForMaskedLM\n",
-    "from tokenizers import BertWordPieceTokenizer\n",
-    "\n",
-    "import warnings; warnings.filterwarnings('ignore')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Choose SweBERT model"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We have to choose one of the available pretrained SweBERT models. For demonstration purposes, the base model is sufficient:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 70,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "pretrained_model_name = 'af-ai-center/bert-base-swedish-uncased'\n",
-    "# pretrained_model_name = 'af-ai-center/bert-large-swedish-uncased'"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Check SweBERT Model Accessibility"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "First, we are going to check that the chosen pretrained SweBERT model is accessible through the transformers library.\n",
-    "If it is, we should be able to instantiate a tokenizer and a (PyTorch/TensorFlow) model from it.\n",
-    "\n",
-    "Note that this may take a while the first time you run it as the model needs to be downloaded."
-   ]
-  },
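-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "If you prefer an explicit check with a readable failure message, a minimal sketch (not part of the original notebook) is to wrap the download in a `try/except`:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Optional sanity check (a sketch): try to fetch the tokenizer files and\n",
-    "# report a readable error if the model id is wrong or the hub is unreachable.\n",
-    "try:\n",
-    "    BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)\n",
-    "    print(f'{pretrained_model_name} is accessible')\n",
-    "except OSError as error:\n",
-    "    print(f'Could not load {pretrained_model_name}: {error}')"
-   ]
-  },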
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### a. Load a tokenizer"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 71,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "tokenizer = BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### b. Load a PyTorch model"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 72,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model = BertModel.from_pretrained(pretrained_model_name)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### c. (Load a TensorFlow model)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model = TFBertModel.from_pretrained(pretrained_model_name)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Masked Word Completion with SweBERT"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We are now going to apply the (PyTorch) SweBERT model to an example sentence, loosely following https://huggingface.co/transformers/quickstart.html#quick-tour-usage. To make BERT work with text strings we have to prepare the text a bit, mask one of the words (or 'tokens'), and finally use SweBERT to predict the masked word.\n",
-    "\n",
-    "We will:\n",
-    "1. Tokenize the example using BertTokenizer.\n",
-    "2. Tokenize the example using BertWordPieceTokenizer.\n",
-    "3. Mask one of the tokens.\n",
-    "4. Use SweBERT to predict back the masked token."
-   ]
-  },
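-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As an optional aside before the step-by-step walk-through (a sketch, not part of the original notebook): the transformers library also ships a `fill-mask` pipeline that bundles steps 1-4 into a single call. Note that the pipeline uses the model's default tokenizer settings, so the accent-handling caveat discussed under step 1 below applies here as well."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from transformers import pipeline\n",
-    "\n",
-    "# One-call alternative to steps 1-4 (a sketch, not the original flow):\n",
-    "# the pipeline tokenizes, runs the model and decodes the top candidates.\n",
-    "fill_mask = pipeline('fill-mask', model=pretrained_model_name)\n",
-    "for candidate in fill_mask('av alla städer i världen, är du den [MASK] som fått allt.'):\n",
-    "    print(candidate['token_str'], round(candidate['score'], 3))"
-   ]
-  },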
-  {
-   "cell_type": "code",
-   "execution_count": 73,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "'Av alla städer i världen, är du den stad som fått allt.'"
-      ]
-     },
-     "execution_count": 73,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "example = 'Av alla städer i världen, är du den stad som fått allt.'\n",
-    "example"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### 1. Tokenize the example using BertTokenizer"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The pretrained SweBERT models are uncased.\n",
-    "\n",
-    "In principle, we could account for this by instantiating the BertTokenizer (https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) with the parameter `do_lower_case=True`.\n",
-    "However, the BertTokenizer then does not handle the Swedish letters `å, ä & ö` properly (they get stripped of their accents and replaced by `a, a & o`).\n",
-    "\n",
-    "To avoid this problem, we instruct the bert_tokenizer not to lowercase automatically, and lowercase the text manually instead."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 74,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "bert_tokenizer = BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)"
-   ]
-  },
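-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "To see the problem for yourself, here is a small optional demonstration (a sketch; the variable name is ours): with `do_lower_case=True` the default accent stripping kicks in and the letters lose their marks."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Demonstration of the accent-stripping problem described above:\n",
-    "# 'å' and 'ä' come back as 'a', and 'ö' as 'o'.\n",
-    "lowercasing_tokenizer = BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=True)\n",
-    "print(lowercasing_tokenizer.tokenize('Åsa äter glass i Örebro'))"
-   ]
-  },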
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### a. lowercase"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 75,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "'av alla städer i världen, är du den stad som fått allt.'"
-      ]
-     },
-     "execution_count": 75,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "example_uncased = example.lower()\n",
-    "example_uncased"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### b. add special tokens"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The input of BERT models needs to be provided with the special tokens '[CLS]' and '[SEP]', which mark the beginning and end of a string:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 76,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "'[CLS] av alla städer i världen, är du den stad som fått allt. [SEP]'"
-      ]
-     },
-     "execution_count": 76,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "example_preprocessed = f'[CLS] {example_uncased} [SEP]'\n",
-    "example_preprocessed"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### c. tokenize\n",
-    "\n",
-    "Now we will tokenize the text, i.e. split it into individual word pieces and the special tokens."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 77,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "16 tokens:\n",
-      "['[CLS]', 'av', 'alla', 'städer', 'i', 'världen', ',', 'är', 'du', 'den', 'stad', 'som', 'fått', 'allt', '.', '[SEP]']\n"
-     ]
-    }
-   ],
-   "source": [
-    "tokens = bert_tokenizer.tokenize(example_preprocessed)\n",
-    "\n",
-    "print(f'{len(tokens)} tokens:')\n",
-    "print(tokens)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### d. convert tokens to ids"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 78,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[101, 1101, 1186, 3548, 1045, 1596, 1010, 1100, 1153, 1108, 1767, 1099, 1302, 1223, 1012, 102]\n"
-     ]
-    }
-   ],
-   "source": [
-    "indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokens)\n",
-    "print(indexed_tokens)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### 2. Tokenize the example using BertWordPieceTokenizer"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "An alternative is to use the BertWordPieceTokenizer from the tokenizers library (https://github.com/huggingface/tokenizers).\n",
-    "It handles the special Swedish letters properly if the parameters `lowercase=True` & `strip_accents=False` are used."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "bert_word_piece_tokenizer = BertWordPieceTokenizer(\"vocab_swebert.txt\", lowercase=True, strip_accents=False)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### a. lowercase, b. add special tokens, c. tokenize & d. convert tokens to ids"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "output = bert_word_piece_tokenizer.encode(example)  # attributes: output.ids, output.tokens, output.offsets"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "tokens_2 = output.tokens\n",
-    "\n",
-    "print(f'{len(tokens_2)} tokens:')\n",
-    "print(tokens_2)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "indexed_tokens_2 = output.ids\n",
-    "print(indexed_tokens_2)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# check that BertTokenizer & BertWordPieceTokenizer lead to the same results\n",
-    "assert tokens == tokens_2\n",
-    "assert indexed_tokens == indexed_tokens_2"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### 3. Mask one of the tokens"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 79,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "masked_index = 10  # 'stad'"
-   ]
-  },
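-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Hard-coding the position is fragile if the example sentence changes; as an optional aside (a sketch, not in the original notebook), the index can be derived from the token list instead:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Equivalent alternative (a sketch): look the target token up instead of\n",
-    "# counting positions by hand.\n",
-    "masked_index = tokens.index('stad')\n",
-    "print(masked_index)"
-   ]
-  },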
-  {
-   "cell_type": "code",
-   "execution_count": 81,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "['[CLS]', 'av', 'alla', 'städer', 'i', 'världen', ',', 'är', 'du', 'den', '[MASK]', 'som', 'fått', 'allt', '.', '[SEP]']\n"
-     ]
-    }
-   ],
-   "source": [
-    "tokens[masked_index] = '[MASK]'\n",
-    "print(tokens)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 82,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[101, 1101, 1186, 3548, 1045, 1596, 1010, 1100, 1153, 1108, 103, 1099, 1302, 1223, 1012, 102]\n"
-     ]
-    }
-   ],
-   "source": [
-    "# Mask token with BertTokenizer\n",
-    "indexed_tokens[masked_index] = bert_tokenizer.convert_tokens_to_ids('[MASK]')\n",
-    "print(indexed_tokens)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Mask token with BertWordPieceTokenizer\n",
-    "indexed_tokens[masked_index] = bert_word_piece_tokenizer.token_to_id('[MASK]')\n",
-    "print(indexed_tokens)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### 4. Use SweBERT to predict back the masked token"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 83,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Some weights of the model checkpoint at af-ai-center/bert-base-swedish-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']\n",
-      "- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).\n",
-      "- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
-     ]
-    }
-   ],
-   "source": [
-    "# instantiate model\n",
-    "model = BertForMaskedLM.from_pretrained(pretrained_model_name)\n",
-    "_ = model.eval()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 84,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "torch.Size([1, 16, 30522])\n"
-     ]
-    }
-   ],
-   "source": [
-    "# predict all tokens\n",
-    "with torch.no_grad():\n",
-    "    outputs = model(torch.tensor([indexed_tokens]))\n",
-    "\n",
-    "predictions = outputs[0]\n",
-    "print(predictions.shape)  # 1 example, 16 tokens, 30522 vocab size"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 85,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "tensor(1767)\n"
-     ]
-    }
-   ],
-   "source": [
-    "# show prediction for masked token's index\n",
-    "predicted_index = torch.argmax(predictions[0, masked_index])\n",
-    "print(predicted_index)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 86,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "stad\n"
-     ]
-    }
-   ],
-   "source": [
-    "# show prediction for masked token\n",
-    "predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]\n",
-    "print(predicted_token)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Making this prediction was our goal; now let's just confirm that it is the same as the word we masked in the beginning."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "assert predicted_token == 'stad'"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Conclusions\n",
-    "\n",
-    "- We have checked the accessibility of the SweBERT models through the transformers library.\n",
-    "- We have demonstrated a very simple model application, where the SweBERT model successfully predicts a masked token.\n",
-    "\n",
-    "For additional use cases and information, we refer to the documentation of the transformers library.\n",
-    "\n",
-    "## Next step\n",
-    "Now that we have a working model, we will create a model in STACKn and then deploy it to give it an endpoint."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Appendix"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now the big question: is the SweBERT model just trained on all Lasse Berghagen lyrics, and did we just make a travesty of 'Stockholm i mitt hjärta', or does the model evaluate potential substitutions for the masked word based on its training corpus of Swedish-language texts?\n",
-    "\n",
-    "Instead of looking only at the top model prediction, let's have a look at the top 5 predictions."
-   ]
-  },
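-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As an optional complement (a sketch, not in the original notebook), the raw logits can be turned into probabilities first, which makes the ranking below easier to interpret:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Softmax the logits so each candidate comes with a probability\n",
-    "# (the cells below rank by raw logits instead, which gives the same order).\n",
-    "probabilities = torch.nn.functional.softmax(predictions[0, masked_index], dim=-1)\n",
-    "top5 = torch.topk(probabilities, k=5)\n",
-    "for probability, index in zip(top5.values, top5.indices):\n",
-    "    print(tokenizer.convert_ids_to_tokens([int(index)])[0], round(probability.item(), 3))"
-   ]
-  },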
-  {
-   "cell_type": "code",
-   "execution_count": 88,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "tensor([1767, 1532, 4394, 1192, 1630])"
-      ]
-     },
-     "execution_count": 88,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# show top 5 predictions for masked token's index\n",
-    "predicted_index_top5 = torch.argsort(predictions[0, masked_index], descending=True)[:5]\n",
-    "predicted_index_top5"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 89,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "stad\n",
-      "enda\n",
-      "ort\n",
-      "första\n",
-      "person\n"
-     ]
-    }
-   ],
-   "source": [
-    "# show top 5 predictions for masked token\n",
-    "for predicted_index in predicted_index_top5:\n",
-    "    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]\n",
-    "    print(predicted_token)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Of these suggestions, the first three are actually fully reasonable, but both 'enda' and 'ort' would give a lyric very far in style and meaning from the original text."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.8.5"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}