Dengxin Dai receives the Golden Owl Award

December 23rd, 2021

We are very proud to announce that Dr. Dengxin Dai has received the Golden Owl Award 2021 for exceptional teaching.

The Golden Owl honours excellent teachers. The Owl is awarded by VSETH, ETH Zurich’s student association.
All ETH members with a teaching assignment can be nominated for the Golden Owl. One lecturer […]


  • 13/05/2022:

    Dr. Dengxin Dai has joined the IJCV Editorial Board as Editor.

  • 13/05/2022:

    Lukas Hoyer has been honoured with the ETH Outstanding Master's Thesis Award under Dr. Dengxin Dai's supervision. Congratulations!

  • 11/03/2022:

    11 papers (3 Orals) accepted to CVPR 2022. Congratulations to all co-authors!

  • 11/03/2022:

    VAS has established new research projects with Toyota!

  • 11/03/2022:

    Dr. Dai is Area Chair of ECCV 2022.

  • 11/03/2022:

    Our work Binaural Soundnet has been accepted to TPAMI as is!

  • 17/01/2022:

    Our CVPR'22 Vision for All Seasons workshop has been accepted, with an excellent lineup of speakers, four challenges, and a call for papers.

  • 16/01/2022:

    Dengxin Dai has received academic gift funding from Facebook Reality Lab.

  • 16/01/2022:

    Four papers have been accepted to RAL: 3D LiDAR Semantic Segmentation, End2End LiDAR Beam Optimisation, 3D MOT, and Depth Estimation with HD Map Prior.

  • 31/10/2021:

    We are hiring PhD students and PostDocs!

Scientific Mission

The scientific mission of VAS is to develop robust and scalable perception systems for real-world applications. We focus on deep learning-based perception for autonomous systems such as autonomous driving. We are especially fascinated by scaling existing visual perception models to novel domains (e.g. adverse weather/lighting conditions, low-quality data), to more data modalities (e.g. LiDAR, Radar, Events, Audio, HD Maps), to unseen classes (e.g. rare classes), and to new tasks. The relevant research topics are summarised in the diagram shown on the right.


  • Authors: Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas and Luc Van Gool

    This work develops an approach for scene understanding purely based on binaural sounds.

    Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines are able to do the same with visual data already, less work has been done with sounds. This work develops an approach for scene understanding purely based on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, and the depth map of the scene. To this aim, we propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360-degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of multiple vision teacher methods and a sound student method -- the student method is trained to generate the same results as the teacher methods do. This way, the auditory system can be trained without using human annotations. To further boost the performance, we propose another novel auxiliary task, coined Spatial Sound Super-Resolution, to increase the directional resolution of sounds. We then formulate the four tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results show that 1) our method achieves good results for all four tasks, 2) the four tasks are mutually beneficial -- training them together achieves the best performance, 3) the number and orientation of microphones are both important, and 4) features learned from the standard spectrogram and features obtained by the classic signal processing pipeline are complementary for auditory perception tasks.
    Read More
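The cross-modal distillation recipe above can be caricatured in a few lines: a frozen vision "teacher" produces pseudo-labels for each frame, and an audio "student" is fit to reproduce them, so no human annotation is needed. Everything below (linear models, feature dimensions, noise level) is a hypothetical stand-in for illustration, not the paper's actual networks or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a frozen linear "vision teacher" and a linear
# "audio student"; the actual work uses deep networks for both.
n_frames, feat_dim, n_classes = 256, 32, 4
W_teacher = rng.normal(size=(feat_dim, n_classes))

def vision_teacher(frames):
    """Produce per-frame class probabilities (the pseudo-labels)."""
    logits = frames @ W_teacher
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

frames = rng.normal(size=(n_frames, feat_dim))               # visual features
spectrograms = frames + 0.1 * rng.normal(size=frames.shape)  # paired audio features

# Supervision transfer: teacher outputs become the student's targets.
targets = vision_teacher(frames)

# Train the student to reproduce the teacher's outputs (MSE, gradient descent).
W_student = np.zeros((feat_dim, n_classes))
for _ in range(500):
    residual = spectrograms @ W_student - targets
    W_student -= 0.1 * spectrograms.T @ residual / n_frames

final_err = np.mean((spectrograms @ W_student - targets) ** 2)
```

The same pattern extends to the multi-task setting: each vision teacher (segmentation, motion, depth) supplies targets for one head of the shared sound network.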
  • A novel UDA method, DAFormer, consisting of a Transformer encoder and a multi-level context-aware feature fusion decoder, improves the state of the art by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes.

    As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and then propose a novel UDA method, DAFormer, based on the benchmark results. The DAFormer network consists of a Transformer encoder and a multi-level context-aware feature fusion decoder. It is enabled by three simple but crucial training strategies to stabilize the training and to avoid overfitting DAFormer to the source domain: While the Rare Class Sampling on the source domain improves the quality of pseudo-labels by mitigating the confirmation bias of self-training towards common classes, the Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer significantly improves the state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes and enables learning even difficult classes such as train, bus, and truck well.
    Read More
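The Rare Class Sampling strategy mentioned above can be illustrated with a toy sketch: classes are drawn with a probability that grows as their pixel frequency shrinks (via a temperature-controlled softmax), and then a source image containing the drawn class is picked. The class frequencies, image contents, and temperature below are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up pixel frequencies for four classes; class 0 is common ("road"-like),
# class 3 is rare ("train"-like). Not the real source-dataset statistics.
class_freq = np.array([0.60, 0.25, 0.10, 0.05])

# Which classes each synthetic source image contains (toy data).
images = [{0, 1}, {0}, {0, 2}, {0, 1, 2}, {0, 3}, {0, 1, 3}]

T = 0.1  # temperature: smaller T skews sampling harder toward rare classes
p_class = np.exp((1.0 - class_freq) / T)
p_class /= p_class.sum()

def sample_image():
    # Draw a class (rare ones more likely), then an image containing it.
    c = int(rng.choice(len(class_freq), p=p_class))
    candidates = [i for i, cls in enumerate(images) if c in cls]
    return int(rng.choice(candidates))

counts = np.zeros(len(images))
for _ in range(10000):
    counts[sample_image()] += 1
# Images 4 and 5 (the only ones containing rare class 3) now dominate,
# so self-training sees the rare class often enough to learn it.
```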
  • This work presents a novel method for LiDAR-based 3D object detection in foggy weather by simulating fog effects in standard LiDAR data.

    This work addresses the challenging task of LiDAR-based 3D object detection in foggy weather. Collecting and annotating data in such a scenario is very time-, labor- and cost-intensive. In this paper, we tackle this problem by simulating physically accurate fog into clear-weather scenes, so that the abundant existing real datasets captured in clear weather can be repurposed for our task. Our contributions are twofold: 1) We develop a physically valid fog simulation method that is applicable to any LiDAR dataset. This unleashes the acquisition of large-scale foggy training data at no extra cost. These partially synthetic data can be used to improve the robustness of several perception methods, such as 3D object detection and tracking or simultaneous localization and mapping, on real foggy data. 2) Through extensive experiments with several state-of-the-art detection approaches, we show that our fog simulation can be leveraged to significantly improve the performance of 3D object detection in the presence of fog. Thus, we are the first to provide strong 3D object detection baselines on the Seeing Through Fog dataset. Our code is publicly available.
    Read More
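At a high level, a fog simulation of this kind attenuates each return according to a two-way Beer-Lambert law and replaces returns that fall below the sensor's noise floor with spurious backscatter points near the sensor. The sketch below is a heavily simplified toy version with made-up parameters (alpha, noise_floor, scatter ranges), not the paper's physically validated model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy point cloud: rows of (x, y, z, intensity).
points = rng.uniform([-50, -50, -2, 0.2], [50, 50, 3, 1.0], size=(1000, 4))

alpha = 0.06        # assumed fog attenuation coefficient [1/m]
noise_floor = 0.05  # assumed minimum detectable return intensity

def simulate_fog(pts):
    out = pts.copy()
    r = np.linalg.norm(out[:, :3], axis=1)
    # Two-way Beer-Lambert attenuation of each returned intensity.
    out[:, 3] *= np.exp(-2.0 * alpha * r)
    # Returns that drop below the noise floor are lost; fog backscatter
    # produces spurious low-intensity points much closer to the sensor.
    lost = out[:, 3] < noise_floor
    scale = rng.uniform(0.05, 0.3, size=lost.sum())
    out[lost, :3] *= scale[:, None]
    out[lost, 3] = rng.uniform(noise_floor, 2 * noise_floor, size=lost.sum())
    return out

foggy = simulate_fog(points)
```

Applied to an annotated clear-weather dataset, such a transform yields foggy training data while the 3D box labels carry over for free.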
  • ACDC is a large-scale dataset for training and testing semantic segmentation methods for four adverse visual conditions: fog, nighttime, rain, and snow.

    Level 5 autonomy for self-driving cars requires a robust visual perception system that can parse input images under any visual condition. However, existing semantic segmentation datasets are either dominated by images captured under normal conditions or are small in scale. To address this, we introduce ACDC, the Adverse Conditions Dataset with Correspondences for training and testing semantic segmentation methods on adverse visual conditions. ACDC consists of a large set of 4006 images which are equally distributed between four common adverse conditions: fog, nighttime, rain, and snow. Each adverse-condition image comes with a high-quality fine pixel-level semantic annotation, a corresponding image of the same scene taken under normal conditions, and a binary mask that distinguishes between intra-image regions of clear and uncertain semantic content. Thus, ACDC supports both standard semantic segmentation and the newly introduced uncertainty-aware semantic segmentation. A detailed empirical study demonstrates the challenges that the adverse domains of ACDC pose to state-of-the-art supervised and unsupervised approaches and indicates the value of our dataset in steering future progress in the field. Our dataset and benchmark are publicly available.
    Read More
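The binary invalid masks are what make uncertainty-aware evaluation possible: pixels flagged as having uncertain semantic content are simply excluded from scoring. A toy mIoU computation along those lines (with a made-up 3x3 label map, not ACDC's actual evaluation protocol) might look like:

```python
import numpy as np

# Toy 3x3 label maps: pixels flagged in the invalid mask have uncertain
# semantic content and are excluded from scoring.
gt      = np.array([[0, 0, 1], [1, 2, 2], [2, 2, 0]])
pred    = np.array([[0, 1, 1], [1, 2, 0], [2, 2, 0]])
invalid = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])

def miou(pred, gt, invalid, n_classes=3):
    valid = invalid == 0
    ious = []
    for c in range(n_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = (p | g).sum()
        if union:
            ious.append((p & g).sum() / union)
    return float(np.mean(ious))

# The two mispredicted pixels both fall inside uncertain regions, so the
# uncertainty-aware score forgives them while the standard score does not.
aware_score = miou(pred, gt, invalid)
standard_score = miou(pred, gt, np.zeros_like(invalid))
```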
  • A novel approach to learning end-to-end driving from a reinforcement learning expert that maps bird's-eye view images to continuous low-level actions, achieving state-of-the-art performance.

    End-to-end approaches to autonomous driving commonly rely on expert demonstrations. Although humans are good drivers, they are not good coaches for end-to-end algorithms that demand dense on-policy supervision. On the contrary, automated experts that leverage privileged information can efficiently generate large-scale on-policy and off-policy demonstrations. However, existing automated experts for urban driving make heavy use of hand-crafted rules and perform suboptimally even on driving simulators, where ground-truth information is available. To address these issues, we train a reinforcement learning expert that maps bird's-eye view images to continuous low-level actions. While setting a new performance upper-bound on CARLA, our expert is also a better coach that provides informative supervision signals for imitation learning agents to learn from. Supervised by our reinforcement learning coach, a baseline end-to-end agent with monocular camera input achieves expert-level performance. Our end-to-end agent achieves a 78% success rate while generalizing to a new town and new weather on the NoCrash-dense benchmark and state-of-the-art performance on the challenging public routes of the CARLA LeaderBoard.
    Read More
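The coach-student setup can be sketched in miniature: a privileged expert with full state access provides dense action labels, and the imitation agent, which only sees a noisier observation, is fit to those labels. The linear expert rule, noise level, and least-squares student below are hypothetical simplifications for illustration, not the paper's RL expert or end-to-end agent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical privileged expert: sees the exact state and emits the ideal
# continuous action via a fixed linear rule (a stand-in for the RL coach).
w_expert = np.array([0.8, -0.5, 0.3])

states = rng.normal(size=(2000, 3))
actions = states @ w_expert  # dense on-policy supervision, no human driver

# The imitation agent only sees a noisy "camera" observation of the state
# and is fit to the expert's action labels by least squares.
observations = states + 0.05 * rng.normal(size=states.shape)
w_student, *_ = np.linalg.lstsq(observations, actions, rcond=None)

student_err = np.mean((observations @ w_student - actions) ** 2)
```

Because the expert is automated, it can label arbitrarily many states on demand, which is exactly the advantage over human demonstrations that the abstract highlights.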