IEEE Transactions on Affective Computing, 2026 (SCI-Expanded, Scopus)
Clinical records are inaccessible due to legal restrictions, and genuine suicide notes are scarce, limiting the availability of verified suicidal expression data. Consequently, most Natural Language Processing (NLP) studies rely on widely accessible social media datasets containing expert annotations or heuristic labels. However, it remains unclear how expressions of suicidal ideation in these datasets relate to the ideation expressed in genuine suicide notes. In this study, we address this gap by comparing manually annotated social media datasets in English and Turkish, as well as automatically labeled English datasets, directly against genuine suicide notes using a combination of linguistic and statistical analyses, classification models, and zero-shot embedding-based representations. Results show that expert-labeled social media data exhibit limited overlap with suicide notes in both languages while remaining distinguishable from them, whereas automatically labeled data diverge substantially from the notes in all analyses. These findings indicate that the annotation method critically shapes the signals learned by computational models and that social media data capture different stages or forms of ideation from those found in suicide notes. Overall, our findings emphasize the importance of careful dataset evaluation and caution against misinterpreting model performance in suicide risk detection research, given the shared objective of supporting global suicide prevention efforts.
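To illustrate the kind of zero-shot embedding-based comparison the abstract refers to, the sketch below embeds two sets of texts with a pretrained sentence encoder and measures their pairwise cosine similarity. It is a minimal example, not the paper's pipeline: it assumes the sentence-transformers library and the public all-MiniLM-L6-v2 model, and the text strings are placeholders rather than data from the study.

```python
# Minimal sketch: zero-shot embedding comparison between two text collections.
# Assumes the sentence-transformers library; all texts here are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any multilingual encoder could be swapped in

social_media_posts = [
    "placeholder post annotated as expressing suicidal ideation",
    "placeholder post annotated as not expressing suicidal ideation",
]
suicide_notes = [
    "placeholder text standing in for a genuine suicide note",
]

# Encode both collections into a shared embedding space without any task-specific training.
post_emb = model.encode(social_media_posts, convert_to_tensor=True, normalize_embeddings=True)
note_emb = model.encode(suicide_notes, convert_to_tensor=True, normalize_embeddings=True)

# Pairwise cosine similarities: row i, column j = similarity of post i to note j.
similarities = util.cos_sim(post_emb, note_emb)
print("Mean post-to-note similarity:", similarities.mean().item())
```

In practice, a distributional comparison (e.g., the full similarity matrix or per-dataset similarity distributions) is more informative than a single mean, since it can expose whether only a subset of posts resembles the notes.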