Restricted
Dissertation / PhD Thesis DKFZ-2025-02703

Adapting Procedural Video Understanding and Controllable Video Synthesis for Surgical Applications



2025

Dissertation, Technische Universität Dresden, 2025

Abstract: Applying Computer Vision methods to surgical applications faces many challenges, such as small datasets, strong reflections and deformations, a limited field of view and, notably, the fact that many relevant tasks require video data. Due to the complexity of video as a data modality, however, video-based methods in the general Computer Vision domain are often developed under relaxed conditions and are not directly applicable to surgical settings. This work aims to bridge the gap between general video-based methods and surgical applications at both the procedure and the scene level. At the procedure level, recognizing or anticipating surgical events facilitates timely, context-aware assistance and requires an understanding of the whole surgery. Existing approaches for general procedural video understanding, however, are not always suitable: they assume shorter, often trimmed videos, dense annotations or readily available off-the-shelf frame features for efficient temporal modeling, none of which can be assumed in surgical settings. At the scene level, video synthesis from surgical simulations offers potential to alleviate the lack of 3D ground truth in video datasets for 3D scene understanding. However, generating data that can be useful for downstream applications poses strict requirements, namely control, temporal consistency and unsupervised training. While these topics are often tackled individually in video synthesis research, fulfilling all three is highly challenging. In this work, three publications address domain-specific challenges of selected video-based tasks in these areas. The first publication challenges the predominance of 2-stage training approaches adopted from general procedural video understanding and shows the benefits of end-to-end learning in surgical workflow tasks.
The second publication proposes a novel task formulation for anticipating sparse events in untrimmed surgical videos, whereas previous formulations in general action anticipation assume either trimmed videos or dense annotations. The third publication combines the strengths of generative models and Neural Rendering to enable controllable synthesis of long-term consistent videos in an unsupervised setting. Solutions from general video-based Computer Vision are commonly adopted for similar surgical tasks, yet domain-specific properties of surgical video data often limit their applicability. This work highlights the importance of considering the unique properties of surgical video data when designing methods, training strategies or task formulations, and shows that simpler, more suitable solutions tailored to surgery often exist. These insights can also be beneficial beyond surgical applications, as they identify shortcomings of general video understanding methods in more realistic scenarios.


Note: Rivior, Dominik was a PhD student in our group (DD06). However, I can’t find the correct information for him in the system.

Contributing Institute(s):
  1. NCT DD Translationale Chirurgische Onkologie (DD06)
Research Program(s):
  1. 315 - Bildgebung und Radioonkologie (POF4-315)
  2. DFG project G:(GEPRIS)390696704 - EXC 2050: Centre for Tactile Internet with Human-in-the-Loop (CeTI) (390696704)


The record appears in these collections:
Relevant for Publication database
User submitted records

 Record created 2025-12-04, last modified 2025-12-04

