Inspired by the way humans understand places, we present a novel indoor place perception network that addresses two limitations of existing methods: (1) they recognize indoor places using only the image features of object regions, and (2) they give insufficient consideration to semantic information about object attributes and states. By using multi-modal information that combines images and natural language, the proposed method can comprehensively express the attributes, states, and relationships of objects, all of which benefit indoor place understanding and recognition. Specifically, we first present a natural language generation framework based on a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to imitate the process of place understanding. Next, a Convolutional Auto-Encoder (CAE) and a mixed CNN-LSTM are proposed to extract image features and semantic features, respectively. Then, two fusion strategies, feature-level fusion and object-level fusion, are designed to integrate the different types of features and the features from different objects. Finally, the category of the indoor place is recognized from the fused information. Comprehensive experiments on public datasets verify the effectiveness of the proposed place perception method based on linguistic cues.
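To make the two fusion stages concrete, the following is a minimal sketch and not the network described above. It assumes that feature-level fusion concatenates the image and semantic feature vectors of a single object, that object-level fusion averages the fused vectors over all objects in a scene, and that the place category is chosen by a linear scorer. These fusion rules, the function names, the toy feature values, and the weights are all hypothetical and chosen only for illustration; in the proposed method the features would come from the CAE and the CNN-LSTM.

```python
from typing import List, Sequence


def feature_level_fusion(image_feat: Sequence[float],
                         semantic_feat: Sequence[float]) -> List[float]:
    """Fuse one object's two modalities by concatenation (assumed rule)."""
    return list(image_feat) + list(semantic_feat)


def object_level_fusion(object_feats: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-object vectors into one scene vector by averaging (assumed rule)."""
    if not object_feats:
        raise ValueError("at least one object is required")
    dim = len(object_feats[0])
    if any(len(f) != dim for f in object_feats):
        raise ValueError("all object vectors must have the same length")
    n = len(object_feats)
    return [sum(f[i] for f in object_feats) / n for i in range(dim)]


def classify(place_feat: Sequence[float],
             weights: Sequence[Sequence[float]],
             labels: Sequence[str]) -> str:
    """Score each place category with a dot product and return the best label."""
    scores = [sum(w * x for w, x in zip(row, place_feat)) for row in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]


# Toy scene with two detected objects (hypothetical feature values).
fused_objects = [
    feature_level_fusion([1.0, 0.0], [0.0, 1.0]),
    feature_level_fusion([3.0, 0.0], [0.0, 1.0]),
]
scene_feat = object_level_fusion(fused_objects)

# Hypothetical linear classifier: one weight row per place category.
labels = ["kitchen", "bedroom"]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0]]
predicted = classify(scene_feat, weights, labels)
```

In this sketch the order of the two stages matters: modalities are merged per object first, so that each object's attributes and state stay attached to its appearance before the objects are pooled into a single scene descriptor.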