A new study shows that artificial intelligence can match human clinicians at identifying anxiety and depression symptoms in written text—achieving roughly 74% accuracy, the same level at which expert clinicians agree with each other.1 This represents an important proof-of-concept: the model performed at the practical limit of the task itself, since even trained professionals don’t perfectly agree on whether isolated sentences indicate symptoms.
The lack of objective tests or biomarkers for mental health disorders continues to limit the scalability of mental health services. Many systems rely on triage models that assign patients to appropriate levels of care, but without definitive tests, triage typically requires time-intensive one-on-one evaluations that are difficult to scale. Researchers publishing in NPP — Digital Psychiatry and Neuroscience explored whether a fine-tuned transformer model could help bridge this gap.1
What did researchers do?
The team trained a RoBERTa-based transformer model on approximately 3,600 sentences from online mental health forums, each labeled by an expert clinician as symptomatic or not. To expand their limited dataset, they used back-translation—converting sentences to other languages and back to English—which preserved meaning while creating additional training examples. They then tested the model on nearly 900 sentences from actual psychotherapy sessions, independently labeled by two different clinicians.
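The back-translation step described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the `translate` function is a stub standing in for whatever machine-translation model or API is used, and the pivot languages are assumptions.

```python
def translate(sentence: str, src: str, dst: str) -> str:
    """Placeholder for a real machine-translation model. A real model
    would return a translation; this stub returns the text unchanged
    so the augmentation flow below is runnable for illustration."""
    return sentence

def back_translate(sentence: str, pivot: str = "de") -> str:
    """English -> pivot language -> English, which in practice yields
    a paraphrase that preserves the sentence's meaning."""
    pivoted = translate(sentence, src="en", dst=pivot)
    return translate(pivoted, src=pivot, dst="en")

def augment(dataset: list[tuple[str, int]],
            pivots: tuple[str, ...] = ("de", "fr")) -> list[tuple[str, int]]:
    """Each labeled sentence keeps its clinician-assigned label across
    every round trip, multiplying the training examples."""
    augmented = list(dataset)
    for sentence, label in dataset:
        for pivot in pivots:
            augmented.append((back_translate(sentence, pivot), label))
    return augmented

data = [("I can't stop worrying about everything.", 1),
        ("We went hiking on Saturday.", 0)]
print(len(augment(data)))  # 2 originals + 2 pivots each = 6
```

The key property is that the label travels with the paraphrase for free, which is what makes back-translation attractive when expert-labeled data is scarce.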

Figure: The training dataset is augmented and then used to fine-tune a transformer model, which is optimized through hyperparameter tuning before the final model is obtained.
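The hyperparameter-tuning stage of that pipeline can be sketched as a simple grid search. The search space and scoring function below are invented for illustration (the paper's actual search space and validation metric are not given here); `fine_tune_and_score` is a deterministic toy stand-in for fine-tuning the transformer on the augmented data and scoring it on a validation split.

```python
import itertools

# Hypothetical search space; the study's actual values are not specified here.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "batch_size": [16, 32],
    "epochs": [3, 5],
}

def fine_tune_and_score(params: dict) -> float:
    """Toy stand-in for fine-tuning and validation scoring: rewards more
    epochs and a smaller learning rate, purely for illustration."""
    return 0.70 + 0.01 * params["epochs"] - 1000 * params["learning_rate"]

def best_config(space: dict) -> tuple[dict, float]:
    """Exhaustively evaluate every combination and keep the best score."""
    keys = list(space)
    best, best_score = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        params = dict(zip(keys, values))
        score = fine_tune_and_score(params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score

config, score = best_config(SEARCH_SPACE)
```

In a real run, each call to the scoring function is an expensive fine-tuning job, which is why practitioners often swap exhaustive grid search for random or Bayesian search over the same space.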
The key finding
The model’s performance—74-75% accuracy—matched the 76% agreement rate between the two human raters. Remarkably, the model achieved this using only written text, without access to tone of voice, facial expressions, or other cues clinicians typically rely on. Importantly, mental status exams aren’t usually done this way—clinicians assess through conversation with full context, not isolated sentences. But for the specific task of flagging symptom-relevant language in written communication, the model performed comparably to experts.
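The comparison underlying that finding is simple to state precisely: the model's accuracy against a human rater is the same quantity as the two raters' agreement with each other, so the inter-rater figure acts as a ceiling. The sketch below uses invented labels, not the study's data.

```python
def agreement(a: list[int], b: list[int]) -> float:
    """Fraction of sentences on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Illustrative symptomatic (1) / not symptomatic (0) labels per sentence.
rater_1 = [1, 0, 1, 1, 0, 0, 1, 0]
rater_2 = [1, 0, 1, 0, 0, 0, 1, 1]
model   = [1, 0, 1, 1, 0, 1, 1, 1]

human_ceiling = agreement(rater_1, rater_2)  # inter-rater agreement
model_score = agreement(model, rater_1)      # model vs. one expert
print(human_ceiling, model_score)  # 0.75 0.75
```

When the model's score reaches the inter-rater ceiling, as in this toy example, further accuracy gains are hard to interpret, since the "ground truth" itself disagrees at that rate.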
Potential applications
While not a diagnostic tool, this approach could support mental health care delivery in several ways. In high-volume telehealth or stepped-care programs with asynchronous communication, such models could help identify clinically important language patterns quickly and consistently. They might assist with triage decisions, tracking symptom changes over time, or flagging concerning language in patient messages between appointments—all while helping stretch limited clinical resources to serve more patients.
Learn how Trayt Health’s data team can support your initiatives.
