Class activation mapping (CAM) was proposed to highlight the class-related activation regions of an image classification network: feature positions related to a specific object class are activated and receive higher scores, while other regions are suppressed and receive lower scores. In specific visual tasks, CAM can be used to infer object bounding boxes in weakly supervised object localization (WSOL) and to generate pseudo-masks of training images in weakly supervised semantic segmentation (WSSS). Obtaining high-quality CAMs is therefore important for improving the recognition performance of weakly supervised pixel-wise dense prediction tasks.
To address this need, a research team led by Yanpeng SUN published their new research on 15 Feb 2025 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
This work aims to design a simple yet efficient method to expand CAM. It starts from a hypothesis about the classification network: to improve the probability of identifying objects, the network learns similar representations for pixels in the feature map that belong to the same category. To verify this assumption, as shown in Figure 1, we randomly select one pixel from the feature maps generated by different backbone stages and visualize its correlations with all other pixels. It can be observed that, as the network deepens, the correlation between pixels of the same category on the feature map becomes stronger. These visualization results support the hypothesis, and the semantic correlation between pixels is defined as semantic structure information. Here, "structural information" refers to the description of relationships between objects. Notably, we use feature points from various locations to compute and assess semantic correlations, which gives a more precise picture of the semantic associations between objects. By analyzing and capturing these structural cues, we can better understand the semantic correlations and the general structure among the objects.
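The correlation check described above can be illustrated with a minimal sketch. It assumes cosine similarity as the correlation measure and a random array standing in for a real backbone feature map; the function name `pixel_correlation` and the chosen pixel are hypothetical, and the measure used in the paper may differ.

```python
import numpy as np

def pixel_correlation(features, row, col, eps=1e-8):
    """Cosine similarity between one pixel's feature vector and every pixel.

    features: array of shape (C, H, W), e.g. the output of one backbone stage.
    Returns an (H, W) map with values in [-1, 1].
    Assumption: cosine similarity is used as the correlation measure.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    # Normalise each pixel's feature vector to unit length.
    flat = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + eps)
    query = flat[:, row * w + col]
    # Dot product of the query pixel with all pixels, reshaped to a map.
    return (query @ flat).reshape(h, w)

# Stand-in for a real feature map (assumption: random values, 16 channels).
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 4, 4))
corr = pixel_correlation(features, row=1, col=2)
```

Running the same function on feature maps from shallower and deeper stages, and plotting the resulting maps, would reproduce the kind of comparison that Figure 1 is described as showing.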
The research proposes a semantic structure aware inference (SSA) model that leverages semantic structure information at different scales to generate high-quality CAMs and thereby improve the recognition performance of downstream tasks. SSA is applied only at model inference and incurs no training cost. The overall network architecture is shown in Figure 2. First, a seed CAM is obtained from a standard image classification network. Then, the proposed semantic structure modeling module (SSM) is deployed on different backbone stages to generate semantic relevance representations. Next, the resulting structured feature representations are used to polish the seed CAM via a dot product operation. Finally, the polished CAMs from the different backbone stages are fused into the final CAM. To the best of our knowledge, this is the first work to improve the quality of CAM at the model inference step without introducing parameters. Experimental results on both WSOL and WSSS demonstrate that SSA can achieve new state-of-the-art performance.
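The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the semantic structure is approximated by a row-normalised, non-negative cosine-similarity matrix, the fusion is a plain average, resizing is nearest-neighbour, and all function names (`semantic_structure`, `polish_cam`, `ssa_inference`) are hypothetical.

```python
import numpy as np

def resize_nearest(cam, size):
    """Nearest-neighbour resize of a 2-D map to size=(H, W)."""
    h, w = cam.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return cam[np.ix_(rows, cols)]

def semantic_structure(features, eps=1e-8):
    """(HW, HW) pairwise relevance matrix for a (C, H, W) feature map.

    Assumption: clipped cosine similarity, normalised so each row sums to 1.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    flat = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + eps)
    sim = np.maximum(flat.T @ flat, 0.0)
    return sim / (sim.sum(axis=1, keepdims=True) + eps)

def polish_cam(seed_cam, features):
    """Polish the seed CAM with one stage's structure via a dot product."""
    _, h, w = features.shape
    cam = resize_nearest(seed_cam, (h, w)).reshape(-1)
    return (semantic_structure(features) @ cam).reshape(h, w)

def ssa_inference(seed_cam, stage_features, out_size):
    """Fuse polished CAMs from several stages (assumption: simple mean)."""
    polished = [resize_nearest(polish_cam(seed_cam, f), out_size)
                for f in stage_features]
    fused = np.mean(polished, axis=0)
    fused = fused - fused.min()
    return fused / (fused.max() + 1e-8)  # rescale to [0, 1]

# Stand-in inputs (assumption: random values instead of real network outputs).
rng = np.random.default_rng(0)
seed_cam = rng.random((4, 4))
stage_features = [rng.standard_normal((8, 8, 8)),
                  rng.standard_normal((16, 4, 4))]
final_cam = ssa_inference(seed_cam, stage_features, out_size=(8, 8))
```

Nothing in this sketch is learned, which mirrors the release's point that the refinement happens at inference without training cost or extra parameters.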
Future work will prioritize enhancing the generalization ability of semantic structure information. This involves developing methods to refine and augment the representation of semantic structures within our model. By improving the model's capacity to generalize this crucial information, we aim to boost overall performance and robustness in diverse scenarios.
DOI: 10.1007/s11704-024-3571-9