We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. Through a unified framework, PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in open-world scenarios. Specifically, we propose a Q-Former to construct the hierarchical relationship between objects and parts, parsing every object into its corresponding semantic parts. By incorporating a large amount of object-level data, the hierarchical relationships can be extended, enabling PartGLEE to recognize a rich variety of parts. We conduct comprehensive studies to validate the effectiveness of our method: PartGLEE achieves state-of-the-art performance across various part-level tasks and obtains competitive results on object-level tasks. Our further analysis indicates that the hierarchical cognitive ability of PartGLEE facilitates detailed image comprehension in mLLMs.
1. We construct the hierarchical relationship between objects and parts via the Q-Former, enabling part segmentation to benefit from various object-level datasets.
2. We propose a unified pipeline for hierarchical detection and segmentation, in which we first recognize objects and then parse them into their corresponding semantic parts. This design enables us to jointly detect and segment both object-level and part-level instances.
3. We standardize the annotation granularity across various part-level datasets by incorporating corresponding object-level annotations, thereby supplying the hierarchical correspondences that current part-level datasets lack and promoting the development of vision foundation models.
PartGLEE comprises an image encoder, a Q-Former, two independent decoders, and a text encoder, as illustrated in the Figure below. We propose a Q-Former to establish the hierarchical relationship between objects and parts. A set of parsing queries is initialized in the Q-Former to interact with each object query, thus parsing objects into their corresponding parts. Our proposed Q-Former functions as a decomposer, extracting and representing parts from object queries. Hence, by training jointly on extensive object-level datasets and limited hierarchical datasets that contain object-part correspondences, our Q-Former obtains strong generalization ability to parse any novel object into its constituent parts.
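The decomposer role of the Q-Former can be sketched as follows. This is a minimal illustrative implementation, not the authors' exact architecture: the class name `PartQFormer`, the number of parsing queries, and the choice of a standard transformer decoder are all assumptions made for clarity. The key idea shown is that a shared set of learnable parsing queries cross-attends to each object query independently, so every object is parsed into the same number of candidate part queries.

```python
import torch
import torch.nn as nn


class PartQFormer(nn.Module):
    """Illustrative sketch (hypothetical) of a Q-Former decomposer:
    learnable parsing queries cross-attend to each object query,
    emitting one part query per parsing query, per object."""

    def __init__(self, dim=256, num_parsing_queries=8, num_layers=3, num_heads=8):
        super().__init__()
        # Shared, learnable parsing queries (assumed count; not from the paper)
        self.parsing_queries = nn.Parameter(torch.randn(num_parsing_queries, dim))
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, object_queries):
        # object_queries: (batch, num_objects, dim), e.g. from an object decoder
        b, n, d = object_queries.shape
        # Treat each object query as its own memory so parts are
        # parsed per object rather than across the whole set.
        memory = object_queries.reshape(b * n, 1, d)
        tgt = self.parsing_queries.unsqueeze(0).expand(b * n, -1, -1)
        part_queries = self.decoder(tgt, memory)  # (b*n, num_parsing, dim)
        # -> (batch, num_objects, num_parsing_queries, dim)
        return part_queries.reshape(b, n, -1, d)
```

Because the parsing queries are shared across all objects and trained jointly with object-level data, the same decomposer can, in principle, be applied to novel object categories at inference time, which matches the generalization behavior described above.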
Object-level and Part-level tasks. To endow our model with robust generalization capability, we perform joint training on various datasets and evaluate its performance on both object-level and part-level tasks. We compare our model with specialist and generalist models to evaluate its performance on object-level data. Additionally, we compare it with VLPart to assess its performance on part-level datasets and the effectiveness of the joint training process on both types of datasets. After joint training, PartGLEE significantly outperforms VLPart on both object-level and part-level tasks, while achieving performance on object-level tasks comparable to previous state-of-the-art methods.
Performance on Traditional Object-Level Tasks. To illustrate the versatility and effectiveness of our model, we further compare it with recent specialist and generalist models on traditional object-level tasks. The results show that our model achieves state-of-the-art performance on part-level tasks while maintaining competitive performance on object-level tasks.