An Assistant System for Speaker and Sentiment Recognition Using RAM and a Hybrid AI Model


Bozyiğit F., Aygün İ., Sağlam O., Özcan E., Borandağ E., Karasulu B.

Electronics (Switzerland), vol. 15, no. 8, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 15, Issue: 8
  • Publication Date: 2026
  • DOI: 10.3390/electronics15081731
  • Journal Name: Electronics (Switzerland)
  • Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC
  • Keywords: ASR, assistant systems, AutoML, DL, emotion recognition, feature selection, microservice, ML, RAM, speech recognition, CNN
  • Affiliated with Çanakkale Onsekiz Mart Üniversitesi: Yes

Abstract

In the age of remote communication and digital archiving, automated analysis of voice data has become increasingly important across a variety of application areas. Despite significant advances in Automatic Speech Recognition, integrating speaker recognition, textual sentiment analysis, and acoustic sentiment detection within a unified real-time processing pipeline remains a challenging task. Current approaches are often limited to monolithic designs or operate in batch-processing modes, which restricts their scalability and real-time applicability. To address this gap, this work proposes a novel feature selection method called RAM, along with a hybrid decision-level merging approach that combines Conv1D CNN and AutoML-based models. The proposed hybrid framework enables independent model training and integrates their probabilistic outputs through a weighted merging strategy to improve performance. Furthermore, a scalable microservice-based software architecture has been developed to support real-time processing, feature selection, and model deployment. This design enhances system modularity, flexibility, and integration capability in practical applications. Experimental results show that when the proposed RAM method is used in conjunction with the hybrid AI model, it achieves over 97% accuracy in speaker recognition and over 82% accuracy in emotion classification, even with short audio samples. These findings demonstrate that the proposed approach provides a robust and efficient solution for real-time speech analysis tasks.
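To make the decision-level merging step more concrete, the following is a minimal sketch of weighted late fusion of class probabilities from two independently trained models, in the spirit described in the abstract. The weight values, class count, and function name are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def weighted_fusion(cnn_probs: np.ndarray, automl_probs: np.ndarray,
                    w_cnn: float = 0.6, w_automl: float = 0.4) -> np.ndarray:
    """Combine per-class probabilities from two independently trained models.

    Hypothetical weights; the paper's actual weighting scheme may differ.
    """
    fused = w_cnn * cnn_probs + w_automl * automl_probs
    # Renormalize so the fused scores remain a valid probability distribution.
    return fused / fused.sum(axis=-1, keepdims=True)

# Example: three emotion classes, outputs from the two models for one clip.
cnn_out = np.array([0.70, 0.20, 0.10])      # Conv1D CNN output (hypothetical)
automl_out = np.array([0.50, 0.35, 0.15])   # AutoML model output (hypothetical)
print(weighted_fusion(cnn_out, automl_out)) # fused class probabilities
```

Because each model is trained separately and only probability vectors are exchanged, such a fusion step fits naturally behind a microservice boundary, which is consistent with the modular architecture described above.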