Spectroscopy Letters, 2026 (SCI-Expanded, Scopus)
The nondestructive separation of fertile and cytoplasmic male sterile (CMS) maize kernels is vital for both specialized line development and industrial hybrid seed production. This study investigates the combined impact of low-level data fusion (LL-DF), data augmentation (DA), and spectral pretreatments on the performance of near-infrared (NIR) spectroscopy-based classification. To overcome the biological constraint of limited seed availability, a physically grounded DA framework was employed, using controlled spectral transformations to enrich the training space and improve model resilience against real-world distortions. A systematic benchmark of 48 modeling scenarios compared individual spectral modes against fused datasets across four pretreatment levels (raw, first derivative (FD), standard normal variate (SNV), and FD+SNV) and two machine learning architectures (support vector machine (SVM) and XGBoost). Absorbance-based models provided the most stable average test accuracy (75.1%), whereas the LL-DF approach enabled the identification of “champion” configurations reaching a peak diagnostic accuracy of 87.9%. Specifically, XGBoost paired with FD+SNV-pretreated fusion data demonstrated superior sensitivity in identifying sterile seed samples. Variable importance analysis confirmed that combining internal biochemical signatures (absorbance) with structural indicators (reflectance) provides a more comprehensive spectral fingerprint than single-mode analysis. Furthermore, reduced models using only the top 10% of informative variables maintained high diagnostic integrity (81.8% accuracy), demonstrating the viability of cost-effective, high-throughput seed sorting based on spectral analysis. These findings suggest that integrating LL-DF and DA strategies provides a reliable, scalable framework for CMS sorting, particularly in early-stage breeding programs where seed availability is a critical constraint.
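The pretreatment, fusion, and augmentation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions: the first derivative is a plain finite difference (a Savitzky-Golay derivative may have been used instead), FD is applied before SNV, each spectral block is pretreated separately before concatenation, and the augmentation transformations (multiplicative scaling, baseline offset, additive noise) and their magnitudes are hypothetical stand-ins for the "controlled spectral transformations" that the abstract does not specify. The synthetic arrays are placeholders, not study data.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def first_derivative(spectra):
    """Finite-difference first derivative along the wavelength axis (assumed form of FD)."""
    return np.diff(np.asarray(spectra, dtype=float), axis=1)

def low_level_fusion(absorbance, reflectance):
    """Low-level fusion: concatenate the two feature blocks sample by sample."""
    return np.hstack([absorbance, reflectance])

def augment(spectra, rng, n_copies=2, scale_sd=0.02, offset_sd=0.01, noise_sd=0.001):
    """Hypothetical augmentation: random scaling, baseline offset, and additive noise."""
    spectra = np.asarray(spectra, dtype=float)
    n_samples, n_vars = spectra.shape
    copies = []
    for _ in range(n_copies):
        scale = 1.0 + rng.normal(0.0, scale_sd, size=(n_samples, 1))
        offset = rng.normal(0.0, offset_sd, size=(n_samples, 1))
        noise = rng.normal(0.0, noise_sd, size=(n_samples, n_vars))
        copies.append(spectra * scale + offset + noise)
    return np.vstack(copies)

# Synthetic placeholder spectra: 5 kernels x 50 wavelengths per mode.
rng = np.random.default_rng(0)
absorbance = rng.random((5, 50))
reflectance = rng.random((5, 50))

# FD+SNV on each block, then low-level fusion into one feature matrix.
fused = low_level_fusion(
    snv(first_derivative(absorbance)),
    snv(first_derivative(reflectance)),
)

# Augmented copies of the raw absorbance block (training data only).
augmented = augment(absorbance, rng)

print(fused.shape)      # (5, 98)
print(augmented.shape)  # (10, 50)
```

In a workflow of this kind, augmentation would normally be applied to the training split only, so that test accuracy reflects performance on unmodified spectra. The fused matrix could then be passed to any classifier, such as an SVM or a gradient-boosted tree model.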