Abstract:
Deep learning has significantly improved object detection in remote sensing images. However, high-resolution remote sensing images remain challenging: geographical environments are complex, ground features are mixed and dynamic, and phenomena such as "spectral variability" or "spectral confusion" often lead to misdetection of similar objects or missed detection of small objects. This paper proposes a multimodal object detection model for high-resolution remote sensing images that combines CLIP and YOLOv8 and enhances the understanding of image scenes by incorporating text information from visual foundation models. Experimental results show that the proposed YOLO-CLIP model has a distinct advantage over models such as YOLOv8 in distinguishing similar objects, significantly improves the detection of small targets, and generalizes better.
Key words: remote sensing, object detection, deep learning, visual foundation models