Synergistic effects of data fusion, augmentation, and pretreatment on the discrimination of fertile and sterile maize kernels using NIR (near infrared) spectroscopy


Bachiyska B., Petrovska N., KAHRIMAN F.

Spectroscopy Letters, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1080/00387010.2026.2676999
  • Journal Name: Spectroscopy Letters
  • Indexes Covering the Journal: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Applied Science & Technology Source, Chemical Abstracts Core, Chimica, Compendex, INSPEC, Academic Search Ultimate (EBSCO), Engineering Source (EBSCO)
  • Keywords: Classification, machine learning, seed production, Zea mays
  • Affiliated with Çanakkale Onsekiz Mart University: Yes

Abstract

The nondestructive separation of fertile and Cytoplasmic Male Sterile (CMS) maize kernels is vital for both specialized line development and industrial hybrid seed production. This study investigates the synergistic impact of low-level data fusion (LL-DF), data augmentation (DA), and spectral pretreatments on near-infrared (NIR) spectroscopy-based classification performance. To overcome the biological constraints of limited seed availability, a physically grounded DA framework was employed, utilizing controlled spectral transformations to enrich the training space and enhance model resilience against real-world distortions. A systematic benchmarking of 48 modeling scenarios compared individual spectral modes against fused datasets across four pretreatment levels (Raw, FD, SNV, and FD+SNV) and two machine learning architectures (SVM and XGBoost). Results revealed that while absorbance-based models provided the most stable average test accuracy (0.751), the LL-DF approach enabled the identification of “champion” configurations reaching a peak diagnostic accuracy of 87.9%. Specifically, XGBoost paired with FD+SNV preprocessed fusion data demonstrated superior sensitivity in identifying sterile seed samples. Variable importance analysis confirmed that the synergy between internal biochemical signatures (absorbance) and structural indicators (reflectance) provides a more comprehensive spectral fingerprint than single-mode analysis. Furthermore, reduced models utilizing only the top 10% of informative variables maintained high diagnostic integrity (81.8% accuracy), proving the viability of cost-effective, high-throughput seed sorting based on spectral analyses. These findings suggest that integrating LL-DF and DA strategies provides a reliable, scalable framework for CMS sorting, particularly in early-stage breeding programs where seed availability is a critical constraint.
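
The pretreatment and low-level fusion steps named in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. It assumes that FD denotes a first derivative (approximated here by a simple adjacent-point difference), that SNV denotes standard normal variate scaling, that FD+SNV applies FD first and SNV second, and that low-level fusion means concatenating the pretreated absorbance and reflectance vectors of one kernel. The function names and the toy spectra are hypothetical.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own mean and std."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1)
    return [(x - m) / s for x in spectrum]


def first_derivative(spectrum):
    """Adjacent-point difference as a crude stand-in for a first derivative."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def pretreat(spectrum, level):
    """Apply one of the four pretreatment levels listed in the abstract."""
    if level == "Raw":
        return list(spectrum)
    if level == "FD":
        return first_derivative(spectrum)
    if level == "SNV":
        return snv(spectrum)
    if level == "FD+SNV":
        # Assumption: derivative first, then SNV; the abstract does not state the order.
        return snv(first_derivative(spectrum))
    raise ValueError(f"unknown pretreatment level: {level}")


def fuse_low_level(absorbance, reflectance):
    """Low-level fusion (assumed): concatenate both blocks into one feature vector."""
    return list(absorbance) + list(reflectance)


# Hypothetical toy spectra for a single kernel (five wavelengths each).
absorbance = [0.10, 0.20, 0.40, 0.70, 1.10]
reflectance = [0.90, 0.80, 0.60, 0.30, 0.10]

fused = fuse_low_level(
    pretreat(absorbance, "FD+SNV"),
    pretreat(reflectance, "FD+SNV"),
)
```

Each fused vector would then serve as one row of the feature matrix passed to a classifier such as SVM or XGBoost. With real spectra, an array library and a smoothed derivative filter would normally replace the plain lists and the adjacent-point difference used here.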