## Title: Multi-Object Tracking and Label Fusion in Automotive Sensor Data
## Presenter: Piotr Kalaczyński
## Date: 12.01.2026
## Participants:
Wojciech Krzemień (WK)
Konrad Klimaszewski (KK)
Roman Shopa (RS)
Lech Raczyński (LR)
Piotr Kalaczyński (PK)
Aleksander Ogonowski (AO)
Mikołaj Mrozowski (MM2)
Arkadiusz Ćwiek (AĆ)
Michał Mazurek (MM)
Piotr Gawron (PG)
Wojciech Zdeb (WZ)
Mateusz Bała (MB)
## Discussion:
WK: You have two sets of input data, e.g. lidar and cameras. How do you treat them during training?
PK: We use two models, trained separately on the respective input data.
WK: It seems that lidar works better overall. Have you checked whether there is a subset of cases in which the camera-based model performs better than the lidar-based model?
PK: Such a subset is not visible in the available statistics.
WK: Right, that would rather require selecting events where, for example, the lidar-based model performs poorly and the camera-based model performs well.
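For reference, a minimal sketch of the per-event comparison suggested here, assuming per-frame quality scores (e.g. mean IoU of matched detections) are already available for both models; the function name and thresholds are illustrative, not part of the presented pipeline:

```python
# Hypothetical per-frame comparison of the lidar-based and camera-based models.
# Assumes per-frame scores (e.g. average IoU of matched detections) have already
# been computed for both models; the threshold values are illustrative only.

def find_camera_wins(lidar_scores, camera_scores, lidar_max=0.3, camera_min=0.6):
    """Return frame indices where the lidar model does poorly
    but the camera model does well."""
    selected = []
    for frame_id, (s_lidar, s_cam) in enumerate(zip(lidar_scores, camera_scores)):
        if s_lidar < lidar_max and s_cam > camera_min:
            selected.append(frame_id)
    return selected
```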
KK: A question about the KITTI dataset. Are these data related to Toyota? How good are the labels, especially for lidar? What was the annotation methodology?
PK: They are not perfect. Sometimes annotations are missing, for example for pedestrians. I believe the annotations were done manually, but I am not sure about the technical details.
KK: One idea would be to cross-check with camera data.
KK: Are both models YOLO-based?
PK: Yes.
KK: Has anyone tried combining lidar and camera data in a way similar to multispectral cameras (not by overlaying them, but by adding lidar as an additional channel)?
PK: Not to my knowledge, and we did not try it. YOLO would probably need to be modified.
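For illustration, a minimal PyTorch sketch of the "extra channel" idea: a projected lidar depth map is appended to the RGB image and the detector's first convolution is widened to accept it. This is generic code, not the YOLO modification itself; the function, layer sizes, and tensor shapes are assumptions.

```python
# Sketch of feeding RGB plus a projected lidar depth map as a 4-channel image
# into a detector backbone. Generic PyTorch; as noted above, YOLO itself would
# need to be modified to accept such input.
import torch
import torch.nn as nn

def widen_first_conv(conv: nn.Conv2d, extra_channels: int = 1) -> nn.Conv2d:
    """Replace a 3-channel stem convolution with one accepting 3 + extra channels,
    reusing the pretrained RGB weights and zero-initializing the new channel(s)."""
    new_conv = nn.Conv2d(conv.in_channels + extra_channels, conv.out_channels,
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight[:, :conv.in_channels] = conv.weight
        new_conv.weight[:, conv.in_channels:] = 0.0
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Example: a 4-channel input (RGB + depth), batch of 1, 384x1280 image (assumed size).
rgb_depth = torch.randn(1, 4, 384, 1280)
stem = widen_first_conv(nn.Conv2d(3, 32, 3, stride=2, padding=1))
features = stem(rgb_depth)
```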
KK: There is an issue with the frame rate of lidar. Is there a problem with synchronization between lidar and camera frame rates?
PK: Indeed, this is an issue. We only use frames where the sensors overlap in time, so some frames are discarded.
KK: The question is whether those additional frames could later help with tracking.
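As a sketch of how such sensor overlap can be determined, one option is nearest-timestamp pairing within a tolerance; the tolerance value below is an assumption, not taken from the actual pipeline described above:

```python
# Sketch of nearest-timestamp pairing between camera frames and lidar sweeps.
# Timestamps are in seconds; frames with no lidar sweep close enough in time
# are dropped, mirroring the "keep only overlapping frames" approach.
import bisect

def pair_frames(camera_ts, lidar_ts, max_dt=0.05):
    """For each camera timestamp, find the closest lidar timestamp.
    Camera frames with no lidar sweep within max_dt seconds are skipped."""
    lidar_ts = sorted(lidar_ts)
    pairs = []
    for i, t in enumerate(camera_ts):
        j = bisect.bisect_left(lidar_ts, t)
        candidates = [k for k in (j - 1, j) if 0 <= k < len(lidar_ts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(lidar_ts[k] - t))
        if abs(lidar_ts[best] - t) <= max_dt:
            pairs.append((i, best))
    return pairs
```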
KK: Label fusion — where exactly is the optimization performed?
PK: In enabling or disabling the connections.
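To make this concrete, a small illustrative QUBO is given below: binary variables switch candidate links between lidar and camera detections on or off, good links lower the energy, and links that reuse the same detection are penalized. This is a generic sketch of the idea, not the exact formulation used in the talk; scores, penalty, and conflicts are invented for the example.

```python
# Illustrative QUBO for "enabling/disabling connections" between detections
# from the two models: x_k = 1 keeps candidate link k, x_k = 0 discards it.
import itertools
import numpy as np

def build_qubo(link_scores, conflicts, penalty=2.0):
    """link_scores[k]: quality of candidate link k (e.g. IoU of the two boxes).
    conflicts: pairs (k, l) of links that share a lidar or camera detection."""
    n = len(link_scores)
    Q = np.zeros((n, n))
    for k, s in enumerate(link_scores):
        Q[k, k] = -s                      # enabling a good link lowers the energy
    for k, l in conflicts:
        Q[k, l] += penalty                # conflicting links raise the energy
    return Q

def brute_force_min(Q):
    """Exhaustive minimization of x^T Q x over binary vectors (small n only)."""
    n = Q.shape[0]
    best_x, best_e = None, np.inf
    for bits in itertools.product((0, 1), repeat=n):
        x = np.array(bits)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Example: three candidate links, where links 0 and 1 share a detection.
Q = build_qubo(link_scores=[0.9, 0.4, 0.7], conflicts=[(0, 1)])
print(brute_force_min(Q))   # keeps links 0 and 2, discards the conflicting link 1
```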
MM: I remember that in YOLOv3 there was a problem with localizing objects that are close to each other in space (e.g. a large number of pedestrians).
PK: This is exactly something we are targeting — overlapping objects. On this dataset, it seems to handle it reasonably well. I also tested a pretrained YOLO model on videos of a crowded street in Mumbai, and it performed surprisingly well.
RS: In all examples, the images are taken in good weather conditions. What about atmospheric effects, such as fog?
PK: In the tracking dataset, the scenes were mostly easier ones, but in general there are data with different weather conditions.
A single model is used for all conditions.
RS: What if we had data from CCD cameras?
PK: It is hard for me to say. This dataset dates back to 2011. We are potentially considering several newer datasets, but KITTI has been very well tested.
PG: The original motivation was that we were supposed to receive data, but we did not. What interests me most is the fusion approach and the application of QUBO.
WK: In the fusion model, how should we understand the local term in the Ising model — the self-interaction of spins? How do we map the problem of combining camera and lidar models onto this framework?
PG: This can be interpreted as information about how much we locally trust the classification result from a particular model (i.e. the bias).
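For the record, the generic Ising energy being referred to (standard notation, not necessarily the exact cost used in the fusion model) is

$$
E(s) = \sum_i h_i\, s_i + \sum_{i<j} J_{ij}\, s_i s_j, \qquad s_i \in \{-1, +1\},
$$

where the local field $h_i$ plays the role of the bias PG describes, i.e. how much the local classification result of a given model is trusted, and $J_{ij}$ couples related decisions.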