Decoupling Zero-Shot Semantic Segmentation
Jian Ding, Nan Xue, Gui-Song Xia and Dengxin Dai
Zero-shot semantic segmentation (ZS3) aims to segment novel categories that have not been seen during training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem, and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only with texts. While simple, the pixel-level ZS3 formulation shows limited capability to integrate vision-language models that are often pre-trained with image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping task that groups pixels into segments, and 2) a zero-shot classification task on segments. The former task does not involve category information and can be directly transferred to group pixels of unseen classes. The latter task operates at the segment level and provides a natural way to leverage large-scale vision-language models pre-trained with image-text pairs (e.g., CLIP) for ZS3. Based on this decoupled formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by large margins, e.g., 22 points on PASCAL VOC and 3 points on COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer
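To make the segment-level zero-shot classification idea concrete, the following minimal sketch classifies segment embeddings against CLIP text embeddings of class names. The class names, the prompt template, the ViT-B/32 backbone, and the randomly generated segment embeddings are illustrative assumptions, not our exact implementation.

```python
# Minimal sketch: segment-level zero-shot classification with CLIP text embeddings.
# Class names, prompt template, and segment embeddings are placeholders; in ZegFormer
# the segment embeddings come from the transformer decoder (see Figure 2).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["grass", "cow", "frisbee"]            # seen + unseen class names at inference
prompts = [f"a photo of a {c}." for c in class_names]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device)).float()  # (C, 512)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# N segment (semantic) embeddings projected into the CLIP text space; random here for illustration.
segment_emb = torch.randn(100, 512, device=device)
segment_emb = segment_emb / segment_emb.norm(dim=-1, keepdim=True)

# Scaled cosine similarity gives a per-segment classification score over all class names.
logits = 100.0 * segment_emb @ text_emb.t()          # (N, C)
probs = logits.softmax(dim=-1)
```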

Figure 1. ZS3 aims to train a model only on seen classes and generalize it to classes that have not been seen during training (unseen classes). Existing methods formulate it as a pixel-level zero-shot classification problem (b), and use semantic features from a language model to transfer knowledge from seen classes to unseen ones. In contrast, as in (c), we decouple ZS3 into two sub-tasks: 1) class-agnostic grouping and 2) segment-level zero-shot classification, which enables us to take full advantage of a pre-trained vision-language model.

Figure 2. The pipeline of our proposed ZegFormer for zero-shot semantic segmentation. We first feed N queries and feature maps to a transformer decoder to generate N segment embeddings. We then feed each segment embedding to a mask projection layer and a semantic projection layer to obtain a mask embedding and a semantic embedding. The mask embedding is multiplied with the output of the pixel decoder to generate a class-agnostic binary mask, while the semantic embedding is classified against the text embeddings. The text embeddings are generated by putting the class names into a prompt template and feeding them to the text encoder of a vision-language model. During training, only the seen classes are used to train the segment-level classification head. During inference, the text embeddings of both seen and unseen classes are used for segment-level classification. We obtain two segment-level classification scores, one from the semantic segment embeddings and one from the image embeddings, and fuse these two scores as the final class prediction for each segment.
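The sketch below illustrates the decoupled head described in this caption: a transformer decoder produces segment embeddings, which are projected into a mask embedding (for class-agnostic grouping) and a semantic embedding (for segment-level classification against text embeddings). Dimensions, layer counts, the logit scale, and the query design are assumptions for illustration, and the fusion with CLIP image-embedding scores is omitted; this is not the authors' exact implementation.

```python
# Illustrative sketch of a ZegFormer-style decoupled segmentation head.
# Hyperparameters and layer choices are assumptions; fusion with image-embedding
# scores (described in the caption) is not shown.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledSegHead(nn.Module):
    def __init__(self, d_model=256, d_text=512, num_queries=100):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.mask_proj = nn.Linear(d_model, d_model)   # -> mask embeddings
        self.sem_proj = nn.Linear(d_model, d_text)     # -> semantic embeddings (text space)

    def forward(self, img_feats, pixel_feats, text_emb):
        # img_feats:   (B, HW, d_model) flattened feature map fed to the decoder
        # pixel_feats: (B, d_model, H, W) output of the pixel decoder
        # text_emb:    (C, d_text) L2-normalized text embeddings of class names
        q = self.queries.weight.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        seg_emb = self.decoder(q, img_feats)                        # (B, N, d_model)

        # 1) class-agnostic grouping: one binary mask logit map per segment embedding
        mask_emb = self.mask_proj(seg_emb)                          # (B, N, d_model)
        masks = torch.einsum("bnd,bdhw->bnhw", mask_emb, pixel_feats)

        # 2) segment-level zero-shot classification against text embeddings
        sem_emb = F.normalize(self.sem_proj(seg_emb), dim=-1)       # (B, N, d_text)
        cls_logits = 100.0 * sem_emb @ text_emb.t()                 # (B, N, C)
        return masks, cls_logits
```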

Table 1. Comparison with previous GZS3 methods on PASCAL VOC and COCO-Stuff. "Seen", "Unseen", and "Harmonic" denote the mIoU of seen classes, the mIoU of unseen classes, and their harmonic mean, respectively.
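The harmonic mean reported in Table 1 is the standard combination of the seen and unseen mIoU; a one-line helper (hypothetical, for illustration):

```python
def harmonic_miou(seen_miou: float, unseen_miou: float) -> float:
    """Harmonic mean of seen-class and unseen-class mIoU, as in Table 1."""
    return 2 * seen_miou * unseen_miou / (seen_miou + unseen_miou)
```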

Figure 3. Results on COCO-Stuff, using the 171 class names in COCO-Stuff to generate text embeddings. ZegFormer-seg (our decoupled formulation of ZS3) segments unseen categories better than SPNet-FPN (pixel-level zero-shot classification).

Figure 4. Results on COCO-Stuff, using 847 class names from ADE20k-Full to generate text embeddings. SPNet-FPN (the pixel-level zero-shot classification baseline) is very unstable when the number of unseen classes is large. The set of 847 class names also provides richer information than the set of 171 class names in COCO-Stuff. For example, in the yellow box in the second row, a segment predicted as "garage" is labeled as "house" in COCO-Stuff.
Publication:
Decoupling Zero-Shot Semantic Segmentation
Jian Ding, Nan Xue, Gui-Song Xia, Dengxin Dai
CVPR 2022