From imaging algorithms to quantum methods Seminar

Europe/Warsaw
https://cern.zoom.us/j/66151941204?pwd=n7upvvZYibexBhbtyn5kvTpy36L0Wo.1 (Zoom)

Konrad Klimaszewski (NCBJ), Wojciech Krzemień (NCBJ)

Participants:
Wojciech Krzemień (WK)
Konrad Klimaszewski (KK)
Dominik Strzelecki (DS)
Wojciech Wiślicki (WW)
Aleksander Ogonowski (AO)
Krzysztof Nawrocki (KN)
Lech Raczynski (LR)
Maciej Zajkowski (MZ)
Marcin Stolarski (MS)
Mateusz Bała (MB)
Michal Mazurek (MM)
Michał Obara (MO)
Roman Shopa (RS)

Questions/Remarks:

WW: How does the word embedding work? What is the metric of distance between words?
KK: The internal representation is not directly interpretable. The most common embedding method is word2vec.
KN: When calculating the distance, the context in which the word occurs is taken into account.
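
A minimal Python sketch of such a distance measure (cosine similarity) on made-up vectors; a trained word2vec model would supply the real embeddings:

import numpy as np

# Hypothetical 4-dimensional embeddings; a real word2vec model would
# provide learned vectors of dimension ~100-300.
embeddings = {
    "proton":  np.array([0.9, 0.1, 0.3, 0.0]),
    "neutron": np.array([0.8, 0.2, 0.4, 0.1]),
    "banana":  np.array([0.0, 0.9, 0.1, 0.7]),
}

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors; close to 1
    # for words that occur in similar contexts.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["proton"], embeddings["neutron"]))  # high
print(cosine_similarity(embeddings["proton"], embeddings["banana"]))   # low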

WK: If many attention blocks are used and the input is the same, what parameters differ between the blocks? Is there any randomness?
KK: For each head an additional learned linear mapping is applied to the Q, K, and V matrices:
head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
where e.g. W^K_i is a parameter matrix of dimensions d_model x d_k.
In practice d_v = d_k = d_model / h, where h is the number of heads.
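
A minimal numpy sketch of these per-head projections, following the "Attention Is All You Need" formulation; the random matrices below stand in for trained parameters:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
n, d_model, h = 5, 64, 8           # sequence length, model width, number of heads
d_k = d_v = d_model // h           # per-head width

X = rng.normal(size=(n, d_model))  # shared input: Q = K = V = X

heads = []
for i in range(h):
    # Each head has its own learned projections W^Q_i, W^K_i, W^V_i
    # of shape (d_model, d_k); random here in place of trained weights.
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_v))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# Concatenate the heads and apply the final output projection W^O.
W_O = rng.normal(size=(h * d_v, d_model))
output = np.concatenate(heads, axis=-1) @ W_O
print(output.shape)  # (5, 64)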

MM: For the Vision Transformer, will inference also be longer, or only the training phase, when using ViT-Large or ViT-Huge?
KK: The inference phase will also be longer; the computational cost grows with the model's depth and width, so both training and inference are affected.

MM: What was the main problem that ViT was validated on? Only classification, or also e.g. reconstruction of an object's position?
KK: Only classification.

WK: Is the B parameter in the Swin Transformer softmax equation related to the position of the patch within the window in a linear way? Is it OK that the patches are 2D and this parameter is just one number? I would expect diagonal patches to differ by the same "distance".
KK: Yes, it is done that way: each 2D relative offset between patches gets its own learned bias entry. In Swin Transformer V2 this is improved by replacing the directly optimized bias parameters with a small network that generates them.
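
A sketch of how this bias is constructed in Swin: the 2D relative offset (dy, dx) between two patches indexes a learned table of size (2M-1)^2, so diagonal offsets are distinguished even though each entry is a single number; the random table stands in for trained parameters:

import numpy as np

M = 7  # window size (M x M patches), as in Swin-T

# Coordinates of every patch inside the window.
coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
coords = coords.reshape(2, -1)                  # (2, M*M)

# Pairwise 2D relative offsets, each component in [-(M-1), M-1].
rel = coords[:, :, None] - coords[:, None, :]   # (2, M*M, M*M)
rel = rel + (M - 1)                             # shift to start from 0

# Flatten the 2D offset into a single table index: every distinct
# (dy, dx) pair, including diagonal ones, gets its own bias entry.
index = rel[0] * (2 * M - 1) + rel[1]           # (M*M, M*M)

# Learned bias table of size (2M-1)^2; random here instead of trained.
bias_table = np.random.default_rng(0).normal(size=(2 * M - 1) ** 2)
B = bias_table[index]                           # added to Q K^T / sqrt(d) before softmax
print(B.shape)                                  # (49, 49)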

MM: Where do you want to use those models?
KK: First, segmentation of industrial CT images using Swin-Unet. Second, an attempt to recover the full radiogram information from a reduced one, a type of upscaling. Third, possibly the classification of MRI images.

LR: How crucial and sensitive for the outcome is the choice of hyper-parameters (d or C), i.e. how many features are used?
KK: To be studied. It is hard to say: in the published studies several parameters are changed at the same time, so it is hard to disentangle the effect. However, the best models use embeddings with the largest dimension C (or d).


RS: Can we use attention weights pre-trained on standard images, like dogs or planes, for tomography images, which are more abstract? Doesn't that lose meaning?
KK: Pre-trained attention weights should still make sense. One can understand it as the model learning relative relations between parts of shapes. Fine-tuning on the new dataset is of course required.
KN: I wanted to add that such transformer models have already been used in practice, e.g. to search for dark galaxies.

RS: I wanted to ask not about abstract images, but about images transformed into other representations, e.g. via a Fourier transform.
KK: There are examples of cases where such other representations are used.

WK: For me it is almost a philosophical question: what invariants or common patterns exist among such very different types of images (dark galaxies, dogs, cats, tomography images) that make the pretraining efficient for all of them?

MS: How does it work that the transformer must be pre-trained on a large dataset, and later a small amount of fine-tuning is enough?
KN: Without fine-tuning the hallucination effect occurs very often.
KK: I think that if one uses large datasets, the model learns global features and relations without over-training. When fine-tuning on smaller datasets, those global relations are already present and the model can focus on the details of the new dataset.

MS: Why is there no overfitting in the second phase, when such a small amount of data (3000 samples) is used for fine-tuning?
KK: We are still relying on the "abstract" concepts learned in the pretraining phase. Fine-tuning doesn't destroy them, provided that it is done with care, e.g. using a smaller learning rate.
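
A minimal PyTorch sketch of this fine-tuning recipe (pretrained backbone, new head, reduced learning rate); the model name, class count, and learning rate are illustrative and assume the timm library is available:

import timm
import torch

# Load a ViT pretrained on a large dataset and replace its classification
# head with one sized for the small target dataset (hypothetical 3 classes).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=3)

# Optionally freeze the backbone so the "abstract" pretrained features are
# preserved and only the new head is trained at first.
for name, p in model.named_parameters():
    if "head" not in name:
        p.requires_grad = False

# Fine-tune with a much smaller learning rate than used in pretraining,
# so the global relations learned earlier are not destroyed.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)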

    • 10:00–11:00
      Recent Transformer based architectures for image analysis 1h

      Review of recent Transformer and beyond-Transformer architectures applicable to machine vision tasks:

      • Quick recap of the Transformer model
      • Vision Transformer (ViT)
      • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
      • Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images
      • Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation
      Speaker: Dr Konrad Klimaszewski (NCBJ)
    • 11:00–11:30
      Discussion 30m