A Multimodal Vision–Language Framework for Intelligent Detection and Semantic Interpretation of Urban Waste
Journal
Informatics
Date Issued
2026-04-03
Author(s)
Jonuzi, Verda Misimi
DOI
10.3390/informatics13040057
Abstract
Urban waste management remains a significant challenge for achieving environmental sustainability and advancing smart-city infrastructures. This study proposes a multimodal vision–language framework that integrates real-time object detection with automated semantic interpretation and structured semantic analysis for intelligent urban waste monitoring. A custom dataset comprising 2247 manually annotated images was constructed from publicly available sources (TrashNet and TACO), enabling robust multi-class detection across six waste categories. Two state-of-the-art object detectors, YOLOv8m and YOLOv10m, were trained and evaluated on a fixed 70/15/15 train–validation–test split. Under this configuration, YOLOv8m achieved an mAP@50 of 90.5% and an mAP@50–95 of 87.1%, slightly outperforming YOLOv10m (89.5% and 86.0%, respectively); YOLOv8m also delivered higher inference throughput, reaching 120 FPS versus 105 FPS for YOLOv10m. To estimate performance stability across data partitions more reliably, stratified 5-fold cross-validation was conducted: YOLOv8m achieved an average Precision of 0.9324 and an average mAP@50–95 of 0.9315 ± 0.0575 across folds, indicating generally stable performance while also revealing variability associated with dataset heterogeneity. Beyond object detection, the framework integrates MiniGPT-4 to generate context-aware textual descriptions of detected waste items, thereby enhancing semantic interpretability and user engagement. GPT-5 Vision is further incorporated as an auxiliary module for structured semantic classification and category suggestion: it analyzes object crops and multi-class scenes and produces constrained JSON-formatted outputs containing category labels, concise descriptions, and recyclability indicators. Overall, the proposed YOLOv8–MiniGPT-4–GPT-5 Vision pipeline demonstrates that combining accurate real-time detection with multimodal semantic reasoning can improve interpretability and support interactive, semantically enriched waste analysis in smart-city and environmental monitoring scenarios.
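Illustrative Detection Sketch
As an illustration of the detection stage summarized above, the following minimal Python sketch trains and evaluates a YOLOv8m model with the Ultralytics API. The `waste.yaml` dataset file, epoch count, image size, and batch size are illustrative assumptions, not settings reported in the paper.

```python
# Minimal sketch of the detection stage, assuming a dataset YAML ("waste.yaml")
# that points to the 70/15/15 train/val/test splits and lists the six waste
# categories. Hyperparameters here are illustrative, not taken from the paper.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # COCO-pretrained YOLOv8m weights

# Fine-tune on the custom waste dataset (TrashNet + TACO images).
model.train(data="waste.yaml", epochs=100, imgsz=640, batch=16)

# Evaluate on the held-out test split; mAP@50 and mAP@50-95 are the
# headline metrics quoted in the abstract.
metrics = model.val(data="waste.yaml", split="test")
print(f"mAP@50:    {metrics.box.map50:.3f}")
print(f"mAP@50-95: {metrics.box.map:.3f}")
```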

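Illustrative Output-Schema Sketch
The abstract states that the GPT-5 Vision module emits constrained JSON with category labels, concise descriptions, and recyclability indicators. The sketch below shows one way such a contract could be enforced with Pydantic; the field names and the six-class taxonomy are assumptions (TrashNet-style labels), not a schema given in the paper.

```python
# Minimal sketch of the constrained JSON contract for the auxiliary semantic
# module. Field names and the six-class taxonomy below are assumptions
# (TrashNet-style labels); the abstract only specifies category labels,
# concise descriptions, and recyclability indicators.
from typing import Literal

from pydantic import BaseModel

WasteCategory = Literal["cardboard", "glass", "metal", "paper", "plastic", "trash"]

class WasteItemAnalysis(BaseModel):
    category: WasteCategory  # suggested category for an object crop
    description: str         # concise, context-aware description
    recyclable: bool         # recyclability indicator

# Validate a hypothetical model response against the schema:
raw = (
    '{"category": "plastic", '
    '"description": "Crushed PET bottle lying on the pavement.", '
    '"recyclable": true}'
)
item = WasteItemAnalysis.model_validate_json(raw)
print(item.model_dump())
```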