**Transfer learning tutorial for NLP enthusiasts, part 2.**

This notebook shows basic usage examples of BERT-like models. No fine-tuning is needed here: we use the pretrained models as provided to:
- fill in a masked word,
- predict whether two sentences could be consecutive (that is, whether the second plausibly follows the first).


In [None]:
!pip install transformers datasets

import torch, os
from transformers import RobertaModel, AutoModel, PreTrainedTokenizerFast
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import pandas as pd
from datasets import Dataset, DatasetDict, load_dataset, load_metric
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
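# Load the Polish RoBERTa encoder and tokenizer from a local directory.
# Note: "./roberta" must already exist; it is created by the later cell
# that downloads and unzips roberta_base_transformers.zip.
model_dir = "./roberta"
rmodel = AutoModel.from_pretrained(model_dir)
rtokenizer = AutoTokenizer.from_pretrained(model_dir)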

Some weights of the model checkpoint at ./roberta were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at ./roberta and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and

In [None]:
from transformers import pipeline
# Fill-mask pipeline with an English BERT model; BERT's mask token is [MASK].
unmasker = pipeline('fill-mask', model='bert-large-uncased-whole-word-masking')
unmasker("Hello I'm a [MASK] model.")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.15813860297203064,
  'sequence': "hello i'm a fashion model.",
  'token': 4827,
  'token_str': 'fashion'},
 {'score': 0.10551052540540695,
  'sequence': "hello i'm a cover model.",
  'token': 3104,
  'token_str': 'cover'},
 {'score': 0.08340442180633545,
  'sequence': "hello i'm a male model.",
  'token': 3287,
  'token_str': 'male'},
 {'score': 0.036381796002388,
  'sequence': "hello i'm a super model.",
  'token': 3565,
  'token_str': 'super'},
 {'score': 0.03609578311443329,
  'sequence': "hello i'm a top model.",
  'token': 2327,
  'token_str': 'top'}]
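
Each entry above is a vocabulary token ranked by its probability at the masked position. As a rough sketch of that last step only (softmax over the vocabulary logits, then keep the top k), here is a toy version in plain Python. The vocabulary and logits are made up for illustration and are not real BERT values, and `top_k_fill` is a hypothetical helper, not part of `transformers`.

```python
import math

def top_k_fill(logits, vocab, k=5):
    """Softmax over vocabulary logits, then return the k most likely tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k highest-probability tokens, best first.
    ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"score": probs[i], "token": i, "token_str": vocab[i]} for i in ranked]

# Hypothetical toy vocabulary and logits (assumptions, not model output).
toy_vocab = ["fashion", "cover", "male", "banana"]
toy_logits = [2.0, 1.6, 1.4, -3.0]

toy_result = top_k_fill(toy_logits, toy_vocab, k=3)
for r in toy_result:
    print(f"{r['token_str']:>8}  {r['score']:.3f}")
```

The real pipeline does the same ranking over the full vocabulary, using the logits the model produces at the `[MASK]` position.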

In [None]:
unmasker("Warsaw is a capital city of [MASK].")

[{'score': 0.9998594522476196,
  'sequence': 'warsaw is a capital city of poland.',
  'token': 3735,
  'token_str': 'poland'},
 {'score': 0.00011802723747678101,
  'sequence': 'warsaw is a capital city of warsaw.',
  'token': 8199,
  'token_str': 'warsaw'},
 {'score': 5.752931429015007e-06,
  'sequence': 'warsaw is a capital city of lithuania.',
  'token': 9838,
  'token_str': 'lithuania'},
 {'score': 4.477401944313897e-06,
  'sequence': 'warsaw is a capital city of polish.',
  'token': 3907,
  'token_str': 'polish'},
 {'score': 2.492450903446297e-06,
  'sequence': 'warsaw is a capital city of belarus.',
  'token': 12545,
  'token_str': 'belarus'}]

In [None]:
unmasker("He was working as a [MASK].")

[{'score': 0.08348547667264938,
  'sequence': 'he was working as a waiter.',
  'token': 15610,
  'token_str': 'waiter'},
 {'score': 0.08295436948537827,
  'sequence': 'he was working as a mechanic.',
  'token': 15893,
  'token_str': 'mechanic'},
 {'score': 0.04259655624628067,
  'sequence': 'he was working as a detective.',
  'token': 6317,
  'token_str': 'detective'},
 {'score': 0.04058046266436577,
  'sequence': 'he was working as a courier.',
  'token': 18092,
  'token_str': 'courier'},
 {'score': 0.04013749212026596,
  'sequence': 'he was working as a carpenter.',
  'token': 10533,
  'token_str': 'carpenter'}]

In [None]:
unmasker("She was working as a [MASK].")

[{'score': 0.3991911709308624,
  'sequence': 'she was working as a waitress.',
  'token': 13877,
  'token_str': 'waitress'},
 {'score': 0.060626666992902756,
  'sequence': 'she was working as a nurse.',
  'token': 6821,
  'token_str': 'nurse'},
 {'score': 0.0557653084397316,
  'sequence': 'she was working as a secretary.',
  'token': 3187,
  'token_str': 'secretary'},
 {'score': 0.03929608687758446,
  'sequence': 'she was working as a teacher.',
  'token': 3836,
  'token_str': 'teacher'},
 {'score': 0.036160361021757126,
  'sequence': 'she was working as a courier.',
  'token': 18092,
  'token_str': 'courier'}]
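
The two prompts differ only in the pronoun, yet the predicted professions barely overlap. One simple way to quantify this is to compare the two top-5 lists directly. The sketch below uses the scores printed above, rounded to four decimals; `compare_predictions` is a hypothetical helper written for this notebook.

```python
def compare_predictions(preds_a, preds_b):
    """Return the tokens shared by two top-k lists and the probability mass of each list."""
    tokens_a = {p["token_str"] for p in preds_a}
    tokens_b = {p["token_str"] for p in preds_b}
    shared = sorted(tokens_a & tokens_b)
    mass_a = sum(p["score"] for p in preds_a)
    mass_b = sum(p["score"] for p in preds_b)
    return shared, mass_a, mass_b

# Top-5 outputs copied from the two cells above (scores rounded to 4 decimals).
he_preds = [
    {"token_str": "waiter", "score": 0.0835},
    {"token_str": "mechanic", "score": 0.0830},
    {"token_str": "detective", "score": 0.0426},
    {"token_str": "courier", "score": 0.0406},
    {"token_str": "carpenter", "score": 0.0401},
]
she_preds = [
    {"token_str": "waitress", "score": 0.3992},
    {"token_str": "nurse", "score": 0.0606},
    {"token_str": "secretary", "score": 0.0558},
    {"token_str": "teacher", "score": 0.0393},
    {"token_str": "courier", "score": 0.0362},
]

shared, he_mass, she_mass = compare_predictions(he_preds, she_preds)
print("shared tokens:", shared)
print(f"top-5 mass: he={he_mass:.4f}, she={she_mass:.4f}")
```

In a live session the same function could be called on the lists returned by `unmasker(...)`, since they use the same `token_str` and `score` keys.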

In [None]:
# Now on to Polish! Download and unpack the Polish RoBERTa base model.
!wget https://github.com/sdadas/polish-roberta/releases/download/models-v2/roberta_base_transformers.zip
!mkdir -p roberta
!unzip roberta_base_transformers.zip -d roberta

model_dir = "./roberta"
#rmodel = AutoModel.from_pretrained(model_dir)
#rtokenizer = AutoTokenizer.from_pretrained(model_dir)

# Build a fill-mask pipeline from the local model directory.
# Note that RoBERTa's mask token is <mask>, not [MASK].
unmasker = pipeline('fill-mask', model=model_dir)

In [None]:
unmasker("Stolicą świata jest <mask>.")

[{'score': 0.15230736136436462,
  'sequence': 'Stolicą świata jest Tokio.',
  'token': 16442,
  'token_str': 'Tokio'},
 {'score': 0.07548365741968155,
  'sequence': 'Stolicą świata jest Warszawa.',
  'token': 212,
  'token_str': 'Warszawa'},
 {'score': 0.07368221879005432,
  'sequence': 'Stolicą świata jest Waszyngton.',
  'token': 27554,
  'token_str': 'Waszyngton'},
 {'score': 0.0681820958852768,
  'sequence': 'Stolicą świata jest Paryż.',
  'token': 22131,
  'token_str': 'Paryż'},
 {'score': 0.057608917355537415,
  'sequence': 'Stolicą świata jest Londyn.',
  'token': 19148,
  'token_str': 'Londyn'}]

In [None]:
unmasker("Premier <mask> wygłosił przemówienie do narodu.")

[{'score': 0.05876970291137695,
  'sequence': 'Premier rządu wygłosił przemówienie do narodu.',
  'token': 421,
  'token_str': 'rządu'},
 {'score': 0.051944468170404434,
  'sequence': 'Premier Morawiecki wygłosił przemówienie do narodu.',
  'token': 43592,
  'token_str': 'Morawiecki'},
 {'score': 0.05135193094611168,
  'sequence': 'Premier ponownie wygłosił przemówienie do narodu.',
  'token': 875,
  'token_str': 'ponownie'},
 {'score': 0.03170759603381157,
  'sequence': 'Premier Mikołajczyk wygłosił przemówienie do narodu.',
  'token': 45009,
  'token_str': 'Mikołajczyk'},
 {'score': 0.03117522969841957,
  'sequence': 'Premier Jaruzelski wygłosił przemówienie do narodu.',
  'token': 39182,
  'token_str': 'Jaruzelski'}]

In [None]:
unmasker("On pracował wtedy jako <mask>.")

[{'score': 0.03927898034453392,
  'sequence': 'On pracował wtedy jako lekarz.',
  'token': 3323,
  'token_str': 'lekarz'},
 {'score': 0.034357137978076935,
  'sequence': 'On pracował wtedy jako.',
  'token': 2,
  'token_str': '</s>'},
 {'score': 0.03013218380510807,
  'sequence': 'On pracował wtedy jako nauczyciel.',
  'token': 3708,
  'token_str': 'nauczyciel'},
 {'score': 0.020018966868519783,
  'sequence': 'On pracował wtedy jako urzędnik.',
  'token': 14633,
  'token_str': 'urzędnik'},
 {'score': 0.019063251093029976,
  'sequence': 'On pracował wtedy jako kierowca.',
  'token': 10028,
  'token_str': 'kierowca'}]

In [None]:
unmasker("Ona pracowała wtedy jako <mask>.")

[{'score': 0.07240116596221924,
  'sequence': 'Ona pracowała wtedy jako nauczycielka.',
  'token': 33261,
  'token_str': 'nauczycielka'},
 {'score': 0.04464877024292946,
  'sequence': 'Ona pracowała wtedy jako.',
  'token': 2,
  'token_str': '</s>'},
 {'score': 0.0378703810274601,
  'sequence': 'Ona pracowała wtedy jako pielęgniarka.',
  'token': 28574,
  'token_str': 'pielęgniarka'},
 {'score': 0.014810452237725258,
  'sequence': 'Ona pracowała wtedy jako lekarz.',
  'token': 3323,
  'token_str': 'lekarz'},
 {'score': 0.010674428194761276,
  'sequence': 'Ona pracowała wtedy jako gospodyni.',
  'token': 40219,
  'token_str': 'gospodyni'}]
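
In both Polish profession prompts the second candidate is `</s>`, the end-of-sequence token, which yields a truncated sentence. A minimal sketch of filtering such candidates out is shown below, using the output for "On pracował wtedy jako <mask>." with scores rounded to four decimals. The helper `drop_special_tokens` and its default list of special tokens are assumptions based on the usual RoBERTa conventions.

```python
# Assumed RoBERTa-style special tokens; adjust for other tokenizers.
ROBERTA_SPECIAL = ("<s>", "</s>", "<pad>", "<unk>", "<mask>")

def drop_special_tokens(predictions, special=ROBERTA_SPECIAL):
    """Keep only predictions whose token is an ordinary vocabulary item."""
    return [p for p in predictions if p["token_str"] not in special]

# Output of the "On pracował wtedy jako <mask>." cell (scores rounded).
on_preds = [
    {"token_str": "lekarz", "score": 0.0393},
    {"token_str": "</s>", "score": 0.0344},
    {"token_str": "nauczyciel", "score": 0.0301},
    {"token_str": "urzędnik", "score": 0.0200},
    {"token_str": "kierowca", "score": 0.0191},
]

filtered = drop_special_tokens(on_preds)
print([p["token_str"] for p in filtered])
```

Filtering leaves four candidates instead of five, so in practice you may want to request more candidates from the pipeline before filtering.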

In [None]:
# Next sentence prediction (NSP): RoBERTa was pretrained without the NSP
# objective, so we use BERT instead.
from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForNextSentencePrediction.from_pretrained('bert-base-uncased')

# Encode both sentences as a single pair: [CLS] A [SEP] B [SEP]
encoding = tokenizer("I'm feeling a bit sick.",
                     "And so I will go to see my doctor.",
                     return_tensors='pt')

model(**encoding)
# logits[0, 0] scores "B continues A"; logits[0, 1] scores "B is a random sentence".
# If logits[0, 0] < logits[0, 1], the model considers the second sentence random.

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


NextSentencePredictorOutput([('logits',
                              tensor([[ 5.2402, -4.3147]], grad_fn=<AddmmBackward>))])
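
The raw logits are easier to read as probabilities. A minimal sketch, assuming the usual convention that index 0 means "continuation" and index 1 means "random": apply a softmax to the two logits printed above. The helper `nsp_probabilities` is hypothetical and uses plain Python rather than `torch` so that it runs without the model.

```python
import math

def nsp_probabilities(logits):
    """Softmax over the two NSP logits: [P(continuation), P(random)]."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above.
p_next, p_random = nsp_probabilities([5.2402, -4.3147])
print(f"P(continuation) = {p_next:.5f}, P(random) = {p_random:.5f}")
```

With a gap of more than nine between the two logits, almost all of the probability goes to "continuation".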

In [None]:
encoding = tokenizer("In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.",
                     "That means you have to slice it yourself.", 
                     return_tensors='pt')

model(**encoding)

NextSentencePredictorOutput([('logits',
                              tensor([[ 5.6860, -5.1236]], grad_fn=<AddmmBackward>))])

In [None]:
encoding = tokenizer("In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.",
                     "Polish people are really nice.", 
                     return_tensors='pt')

model(**encoding)

NextSentencePredictorOutput([('logits',
                              tensor([[-2.7396,  5.8566]], grad_fn=<AddmmBackward>))])

In [None]:
encoding = tokenizer("In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.",
                     "The sky is blue due to the shorter wavelength of blue light.", 
                     return_tensors='pt')

model(**encoding)

NextSentencePredictorOutput([('logits',
                              tensor([[-3.0729,  5.9056]], grad_fn=<AddmmBackward>))])
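
To summarize the four experiments, the sketch below picks the larger logit for each pair and maps it to a label. The logits are copied from the outputs above; the dictionary keys are shortened descriptions of the second sentence, and `nsp_label` is a hypothetical helper.

```python
NSP_LABELS = ("continuation", "random")  # index 0, index 1

def nsp_label(logits):
    """Return the label whose logit is largest."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return NSP_LABELS[best]

# Second sentence (shortened) -> logits printed in the cells above.
nsp_results = {
    "go to see my doctor": [5.2402, -4.3147],
    "slice it yourself": [5.6860, -5.1236],
    "Polish people are really nice": [-2.7396, 5.8566],
    "sky is blue": [-3.0729, 5.9056],
}

labels = {name: nsp_label(logits) for name, logits in nsp_results.items()}
for name, label in labels.items():
    print(f"{name:<32} {label}")
```

The two related follow-ups are classified as continuations, and the two unrelated ones as random.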