Abstract:
The recent development of deep representation learning has demonstrated wide applicability to traditional vision tasks such as classification and detection. However, there has been little investigation into how to build a deep learning framework in a weakly supervised setting. In this paper, we model deep learning within a weakly supervised (multiple instance learning, MIL) framework. In our setting, each image follows a dual multi-instance assumption: its object proposals and its possible text annotations can each be regarded as an instance set. We design effective systems that exploit the MIL property with deep learning strategies from both ends, and we jointly learn the relationship between object and annotation proposals. Extensive experiments demonstrate that our weakly supervised deep learning framework not only achieves convincing performance on vision tasks including classification and image annotation, but also extracts reasonable region-keyword pairs with little supervision, both on widely used benchmarks such as PASCAL VOC and MIT Indoor Scene 67 and on a dataset with image- and patch-level annotations.
Introduction:
Deep learning, a recent breakthrough in artificial intelligence, has been successfully applied to multiple fields including speech recognition and visual recognition, mostly with full supervision. A typical deep learning architecture for visual recognition builds upon the convolutional neural network (CNN). Given large-scale training data and high-performance computational infrastructure, deep learning has achieved tremendous improvements in visual recognition across thousands of categories. While deep learning shows superior performance on fully supervised tasks, research on learning deep representations with weak supervision is still in its early stages; that is, human labels still play a key role in these popular frameworks. This runs counter to the very nature of large-scale web and real-world data: big data is largely data with no labels or noisy labels. The emergence of image search engines like Google, social networks like Facebook, and photo- and video-sharing sites like Flickr provides vision researchers with abundant visual data; unfortunately, strong labels for these images are in much shorter supply. Therefore, unsupervised or weakly supervised methods are particularly attractive, as they can better utilize these large-scale web resources. Weakly supervised learning can, in general, be viewed as a mechanism for learning from sparse or noisy labels. As web data usually comes with high diversity but much noise, weakly supervised methods have been successfully applied to learn effective visual representations for classification, detection, and segmentation using weak labels alone.
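To make the multi-instance setting concrete, here is a minimal sketch of the standard MIL assumption the paper builds on: a bag (e.g. an image) is positive for a label if at least one of its instances (e.g. object proposals) is positive, so only bag-level ("weak") labels are needed. The scores and names below are illustrative, not taken from the paper's actual model.

```python
import numpy as np

def bag_score(instance_scores):
    """Aggregate per-instance scores into a bag-level score via
    max-pooling, the most common MIL pooling rule: the bag is as
    positive as its most positive instance."""
    return float(np.max(instance_scores))

# Three hypothetical object-proposal scores for one label (e.g. "dog").
proposal_scores = np.array([0.1, 0.85, 0.3])

# The image (bag) is predicted positive because one proposal scores
# high, even though no proposal-level labels were ever provided.
print(bag_score(proposal_scores))  # 0.85
```

In a deep MIL framework, `instance_scores` would come from a CNN applied to each proposal, and the max-pooled bag score would be compared against the image-level label during training.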