Language Guided Image Segmentation

Learning Object Interactions and Descriptions for Semantic Image Segmentation [1]

(a) visualizes an image in IDW and its raw description, retrieved using 'sheep' and 'human' as keywords. We observe that the description contains details that are unimportant for object segmentation (e.g. 'long blonde hair wearing hat happy time'), misses important details (e.g. the number of people and sheep), and has grammatical errors. As a side-by-side comparison, (b) shows an image and its per-pixel annotation from VOC12.

Recent deep convolutional networks (CNNs) have achieved great success in many computer vision tasks because of their compelling learning capacity and the presence of large-scale labeled data. However, as obtaining per-pixel annotations is expensive, the performance of CNNs in semantic image segmentation is not fully exploited. This work significantly increases the segmentation accuracy of CNNs by learning from an Image Descriptions in the Wild (IDW) dataset. Unlike previous image captioning datasets, where captions were manually and densely annotated, images and their descriptions in IDW are automatically downloaded from the Internet without any manual cleaning or refinement. An IDW-CNN is proposed to jointly train on IDW and an existing image segmentation dataset such as PASCAL VOC 2012 (VOC). It has two appealing properties. First, knowledge from the two datasets can be fully explored and transferred between them to improve performance. Second, segmentation accuracy on VOC increases consistently as more data is selected from IDW.
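As a rough illustration of how a single network can be trained jointly on a per-pixel labeled dataset (VOC) and a web-description dataset (IDW), the sketch below shares a backbone between a segmentation head and a word-presence head, so gradients from both datasets update the same features. The ResNet-50 backbone, head designs, and loss choices are assumptions for illustration only; they are not the actual IDW-CNN architecture, which also models object interactions.

```python
# Hedged sketch (not the authors' IDW-CNN): joint training on a per-pixel
# labeled dataset (e.g. VOC) and a weakly described dataset (e.g. IDW) with
# a shared backbone and two task-specific heads. Module and loss names are
# illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class JointSegDescNet(nn.Module):
    def __init__(self, num_classes=21, vocab_size=1000):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # shared features
        self.seg_head = nn.Conv2d(2048, num_classes, kernel_size=1)   # per-pixel logits
        self.desc_head = nn.Linear(2048, vocab_size)                  # word-presence logits

    def forward(self, x):
        feat = self.backbone(x)                       # B x 2048 x h x w
        seg = self.seg_head(feat)                     # B x C x h x w (upsampled before the loss)
        desc = self.desc_head(feat.mean(dim=(2, 3)))  # B x vocab_size
        return seg, desc

def joint_step(model, voc_batch, idw_batch, optimizer):
    """One optimization step mixing both datasets; each branch supervises the
    shared backbone, which is how knowledge can transfer between the datasets."""
    img_v, labelmap = voc_batch        # per-pixel ground truth from VOC (long tensor)
    img_w, word_targets = idw_batch    # multi-hot words parsed from web descriptions (float)
    seg_v, _ = model(img_v)
    _, desc_w = model(img_w)
    seg_v = nn.functional.interpolate(seg_v, size=labelmap.shape[-2:],
                                      mode='bilinear', align_corners=False)
    loss = nn.functional.cross_entropy(seg_v, labelmap, ignore_index=255) \
         + nn.functional.binary_cross_entropy_with_logits(desc_w, word_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this setup, the description loss on IDW images back-propagates into the shared backbone, which is the mechanism by which the extra web data can improve segmentation on VOC.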

Deep Dual Learning for Semantic Image Segmentation [2]

Comparisons of recent semi-supervised learning settings. $I$, $L$, and $T$ denote an image, a per-pixel labelmap, and a vector of image-level tags respectively, where the labelmap $L$ can be missing in training. (a) treats $L$ as a missing label in multitask learning, where its gradient is not computed in back-propagation (BP). (b) regards $L$ as a latent variable that can be inferred from the tags and used as ground truth in BP. We propose (c), which infers the missing label $L$ not only by recovering clean tags $\hat{T}$, but also by reconstructing the image to capture accurate object shapes and boundaries.

Deep neural networks have advanced many computer vision tasks because of their compelling capacity to learn from large amounts of labeled data. However, their performance is not fully exploited in semantic image segmentation, where the scale of the training set is limited because per-pixel labelmaps are expensive to obtain. To reduce labeling effort, a natural solution is to collect additional images from the Internet that are associated with image-level tags. Unlike existing works that treated labelmaps and tags as independent supervision, we present a novel learning setting, namely dual image segmentation (DIS), which consists of two complementary learning problems that are solved jointly. One predicts labelmaps and tags from images, and the other reconstructs the images from the predicted labelmaps. DIS has three appealing properties. 1) Given an image with tags only, its labelmap can be inferred by leveraging the image and tags as constraints. The estimated labelmaps, which capture accurate object classes and boundaries, are used as ground truths in training to boost performance. 2) DIS is able to clean noisy tags. 3) DIS significantly reduces the number of per-pixel annotations required in training, while still achieving state-of-the-art performance.
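A minimal sketch of the dual-learning idea, assuming a PyTorch implementation: one network predicts a labelmap and image-level tags, a second network reconstructs the image from the predicted labelmap, and tag-only images contribute a pseudo labelmap restricted to their tagged classes. The module definitions and the simple argmax pseudo-label rule are illustrative assumptions, not the exact DIS networks or inference procedure from the paper.

```python
# Hedged sketch of the dual-learning idea (not the authors' exact DIS networks):
# a segmentation net predicts a labelmap and tags, and a second net reconstructs
# the image from the predicted labelmap, so tag-only images still provide a
# training signal. All module names are illustrative assumptions.
import torch
import torch.nn as nn

class SegTagNet(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(64, num_classes, 1)   # per-pixel labelmap logits
        self.tag_head = nn.Linear(64, num_classes)      # image-level tag logits

    def forward(self, x):
        f = self.encoder(x)
        return self.seg_head(f), self.tag_head(f.mean(dim=(2, 3)))

class Reconstructor(nn.Module):
    """Maps a (soft) labelmap back to an RGB image."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, seg_logits):
        return self.decoder(seg_logits.softmax(dim=1))

def dis_loss(seg_net, recon_net, image, tags, labelmap=None):
    """tags: multi-hot float vector (B x C); labelmap: long tensor (B x H x W) or None."""
    seg_logits, tag_logits = seg_net(image)
    recon = recon_net(seg_logits)
    # Reconstruction ties the predicted labelmap to object shapes and boundaries.
    loss = nn.functional.mse_loss(recon, image) \
         + nn.functional.binary_cross_entropy_with_logits(tag_logits, tags)
    if labelmap is not None:
        # Fully annotated image: supervise with the real per-pixel labelmap.
        loss = loss + nn.functional.cross_entropy(seg_logits, labelmap, ignore_index=255)
    else:
        # Tag-only image: use the current prediction, restricted to the tagged
        # classes, as a pseudo labelmap (a rough stand-in for the paper's inference step).
        allowed = tags.view(tags.size(0), -1, 1, 1)                 # B x C x 1 x 1 class mask
        masked = seg_logits.masked_fill(allowed == 0, float('-inf'))
        pseudo = masked.argmax(dim=1).detach()
        loss = loss + nn.functional.cross_entropy(seg_logits, pseudo)
    return loss
```

In this sketch the reconstruction loss is what forces the predicted labelmap to respect object shapes and boundaries, while the tag loss constrains which classes may appear; both signals are available even when no per-pixel labelmap exists.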


  1. Guangrun Wang, Ping Luo, Liang Lin, Xiaogang Wang. Learning Object Interactions and Descriptions for Semantic Image Segmentation. CVPR, 2017.
  2. Ping Luo, Guangrun Wang, Liang Lin, Xiaogang Wang. Deep Dual Learning for Semantic Image Segmentation. ICCV, 2017.