Deep learning networks have become one of the most promising architectures for image parsing tasks. Although existing deep networks use the global and local contextual information of images to learn coarse features individually, they do not adapt automatically to the contextual properties of scenes. In this work, we present a visual and contextual feature-based deep network for image parsing. The main novelty is a 3-layer architecture that incorporates contextual information, in which each layer is trained independently and the layers are then integrated. The network combines contextual features with visual features to predict class labels using class-specific classifiers. The contextual features encode prior information learned from the co-occurrence of object labels, both within a whole scene and between neighboring superpixels. The class-specific classifiers address the imbalance of data across object categories and learn the coarse features of each category individually. A series of weak classifiers combined with boosting algorithms is investigated as classifiers alongside the aggregated contextual features. Experiments on the benchmark Stanford Background dataset show that the proposed architecture achieves the highest average accuracy and comparable global accuracy.
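To make the co-occurrence priors concrete, the following is a minimal sketch of how label co-occurrence could be counted at the two levels the abstract mentions: within a whole scene and between neighboring superpixels. It is an illustration under stated assumptions, not the network's actual implementation: the function names, the list-based label format, and the edge-list adjacency representation are all hypothetical, and the normalization to relative frequencies is one simple choice among several.

```python
from collections import Counter
from itertools import combinations


def scene_cooccurrence(scenes):
    """Count how often two distinct labels appear in the same scene.

    `scenes` is a list of label lists, one per image (assumed format).
    """
    counts = Counter()
    for labels in scenes:
        # Each unordered pair of distinct labels is counted once per scene.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


def neighbor_cooccurrence(superpixel_labels, adjacency):
    """Count label pairs across adjacent superpixels in one image.

    `superpixel_labels[i]` is the label of superpixel i and `adjacency`
    is a list of (i, j) index pairs for touching superpixels (assumed format).
    """
    counts = Counter()
    for i, j in adjacency:
        # Sort so that (sky, tree) and (tree, sky) share one key.
        a, b = sorted((superpixel_labels[i], superpixel_labels[j]))
        counts[(a, b)] += 1
    return counts


def normalize(counts):
    """Turn raw pair counts into relative frequencies (a simple prior)."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}
```

For example, `scene_cooccurrence([["sky", "tree", "road"], ["sky", "tree"]])` counts the pair `("sky", "tree")` twice, and `normalize` converts such counts into frequencies that could be appended to the visual features of a superpixel before classification.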
Funding
Category 1 - Australian Competitive Grants (this includes ARC, NHMRC)