As the amount of user-generated content on the internet grows, it becomes ever more important to develop vision systems that learn directly from weakly annotated and noisy data. We leverage a large-scale collection of user-generated content, comprising images, tags, and titles/captions of furniture inventory from an e-commerce website, to discover and categorize learnable visual attributes. Furniture categories have long been the quintessential example of why computer vision is hard, and we make one of the first attempts to understand them through a large-scale weakly annotated dataset. We focus on a handful of furniture categories that are associated with a large number of fine-grained attributes. We propose a set of localized feature representations built on top of state-of-the-art computer vision representations originally designed for fine-grained object categorization. We report a thorough empirical characterization of the visual identifiability of various fine-grained attributes using these representations and show encouraging results on finding iconic images and on multi-attribute prediction.
Discovering visual knowledge from weakly labeled data is crucial for scaling up computer vision recognition systems, since it is expensive to obtain fully labeled data for a large number of concept categories, whereas weakly labeled data can be collected from the Internet cheaply and in massive quantities.
In this paper we propose a scalable approach to discovering visual concepts from weakly labeled image collections, learning thousands of visual concept detectors. We then show that the learned detectors can be applied to recognize concepts at the image level and to detect concepts at the image-region level accurately.
Under domain-selected supervision, we further evaluate the learned concepts for scene recognition on the SUN database and for object detection on Pascal VOC 2007. The learned concepts show promising performance compared to fully supervised and weakly supervised methods.