Data source
Data from the Project for Objective Measures Using Computational Psychiatry Technology (PROMPT)30 were used in this study. PROMPT was a prospective, multicenter, observational study conducted in Japan to identify objective markers of mood and neurocognitive disorders using voice and speech, body movement, facial expression, and daily activity data. Participants were recruited from the departments of psychiatry at 10 medical institutions, and the ethics committee of each institution, including that of Keio University School of Medicine, approved the study. All participants provided written informed consent before participating in this study, which was designed in accordance with the ethical principles of the Declaration of Helsinki. The recruitment period was March 9, 2016 to March 31, 2019. This study used data from patients diagnosed with either major or mild neurocognitive disorder according to the criteria of the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), and from participants recruited as cognitively healthy controls (CHCs). CHCs were screened for a history of mental disorders using the Mini-International Neuropsychiatric Interview (MINI), and those with a history of psychiatric disorders were excluded. People with apparent speech problems, including aphasia and dysarthria, were also excluded.
Participants had a 10-minute unstructured conversation with a psychiatrist or psychologist, such as an interview about their mood or daily life. During this time, their speech was recorded with a microphone. After the interview, the Clinical Dementia Rating (CDR), Mini-Mental State Examination (MMSE), Wechsler Memory Scale-Revised Logical Memory, and Geriatric Depression Scale (GDS) were administered. If the participant agreed, a similar interview was conducted and the above data were collected again after a minimum interval of 4 weeks.
In the present study, we analyzed the data described above. To eliminate the effect of depressive symptoms on cognitive function, data were excluded from the analysis if a participant's GDS score was 10 or greater. Additionally, data from participants younger than 45 years, data with missing conversational or rating data, and data from participants who spoke with a strong dialect were excluded from the analysis.
In this study, we sought to develop a system capable of detecting dementia. We therefore attempted to build a machine learning model to distinguish between dementia and non-dementia, the latter including CHC and MCI. Dementia and non-dementia were determined by three neuropsychological tests, namely the CDR, MMSE, and Logical Memory II. The threshold for the Logical Memory II test depended on educational background: 2 points or less for subjects with 0-9 years of education, 4 points or less for subjects with 10-15 years of education, and 8 points or less for subjects with 16 or more years of education. Dementia was defined as (1) CDR ≥ 1 and MMSE ≤ 23; (2) CDR ≥ 1, MMSE ≥ 24, and Logical Memory II below threshold; or (3) CDR = 0.5, MMSE ≤ 23, and Logical Memory II below threshold. Non-dementia (including MCI) was defined as CDR ≤ 0.5 and MMSE ≥ 24. If a participant's scores matched none of these patterns, the data were labeled as dementia or non-dementia based on the clinical diagnosis. The labeling procedure based on the results of the neuropsychological tests is presented in the supplementary table.
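The labeling rules above can be sketched as follows. This is a minimal illustration with our own function names; the thresholds and conditions follow the text, and the `None` case corresponds to deferring to the clinical diagnosis:

```python
def lm2_threshold(years_of_education: int) -> int:
    """Education-adjusted Logical Memory II cut-off from the text."""
    if years_of_education <= 9:
        return 2
    elif years_of_education <= 15:
        return 4
    else:
        return 8

def label(cdr: float, mmse: int, lm2: int, edu_years: int):
    """Return 'dementia', 'non-dementia', or None (defer to clinical diagnosis)."""
    below = lm2 <= lm2_threshold(edu_years)  # Logical Memory II below threshold
    if cdr >= 1 and mmse <= 23:
        return "dementia"                    # rule (1)
    if cdr >= 1 and mmse >= 24 and below:
        return "dementia"                    # rule (2)
    if cdr == 0.5 and mmse <= 23 and below:
        return "dementia"                    # rule (3)
    if cdr <= 0.5 and mmse >= 24:
        return "non-dementia"                # includes MCI and CHC
    return None                              # other patterns: clinical diagnosis
```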
To improve machine learning accuracy, we decided to use data reflecting typical symptoms for training. Therefore, only data in the following categories were used not only as test data but also as training data: dementia with CDR ≥ 1, MMSE ≤ 23, and Logical Memory II below threshold; non-dementia (MCI) with CDR = 0.5, MMSE ≥ 24, and Logical Memory II below threshold; and non-dementia (CHC) with CDR = 0, MMSE ≥ 24, and Logical Memory II above threshold. Data that did not meet these criteria were used only as test data.
In this study, data were acquired from the same participant multiple times, so the same participant could be in different states at different acquisitions (e.g., after conversion from MCI to dementia). Thus, each data point was labeled using the result of the cognitive evaluation performed at the time the conversational data were recorded.
From the recorded data, only the subject's speech, including fillers, was transcribed into text and compiled into a single document. This document was transformed into a vector of 150-dimensional features using a previously reported method31. In the present study, we set the negative sampling value to 5 and the number of dimensions to 150, and thus obtained a 150-dimensional document vector from the morphemes. Additionally, the same method was used to create a 50-dimensional vector using part-of-speech bi-grams as input features, for a total of 200 dimensions for morphemes and parts of speech.
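As an illustration of the feature layout, the sketch below shows how part-of-speech bi-gram tokens might be formed and how the two document vectors are concatenated into a 200-dimensional representation. The helper names are our own; the actual 150- and 50-dimensional embeddings are produced by the method of ref. 31, which is not reproduced here:

```python
# Sketch only: the real vectors come from the embedding method of ref. 31.

def pos_bigrams(pos_tags):
    """Join consecutive part-of-speech tags into bi-gram tokens,
    which serve as the input "words" for the 50-dim vector."""
    return [f"{a}_{b}" for a, b in zip(pos_tags, pos_tags[1:])]

def concat_features(morpheme_vec, pos_vec):
    """Concatenate the morpheme-based and POS-bigram-based document
    vectors into one 200-dimensional feature vector."""
    assert len(morpheme_vec) == 150 and len(pos_vec) == 50
    return list(morpheme_vec) + list(pos_vec)
```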
Machine learning procedure
In this study, we constructed a DNN-based prediction model that distinguished two classes, dementia and non-dementia. The DNN model was built using Python 3.6 and the TensorFlow 2.2.0 library as a five-layer neural network consisting of an input layer, three hidden layers, and an output layer. The hyperparameters were optimized using Optuna 2.0.0. Leave-one-out cross-validation (LOOCV) was used for model construction and performance evaluation. Because multiple data acquisitions could be obtained from the same subject in this study, there was a risk that voice data from the same subject could be used in both the validation and training sets, which would inflate the apparent accuracy. To avoid this effect, we added a process to exclude textual data from subjects who had provided validation data from being used as training data. The details are as follows. The architecture of the machine learning and validation methods is depicted in Fig. 4.
(i) Extract one test datum from the full dataset.
(ii) From the remaining data, exclude data from the same subject as the test datum and data that do not meet the training criteria.
(iii) Randomly split the remaining data such that the ratio of dementia to non-dementia is held constant and the ratio of training data to validation data is 3:1. To account for the effect of random splitting, create 10 pairs of training and validation datasets with different splits.
(iv) Build 10 prediction models with the 10 pairs of training and validation datasets.
(v) Repeat steps (i) to (iv) for the number of samples.
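The subject-exclusion and stratified splitting steps above can be sketched as follows, assuming each sample is a dict with `subject`, `label`, and `train_ok` fields (our own data structure, not the study's code):

```python
import random

def splits_for_test_sample(samples, test, n_repeats=10, train_frac=0.75, seed=0):
    """Return n_repeats (train, validation) splits that exclude all data
    from the test subject, keep only training-eligible data, and hold the
    dementia/non-dementia ratio constant at a 3:1 train:validation split."""
    rng = random.Random(seed)
    # Steps (i)-(ii): drop the test subject's data and non-eligible data.
    pool = [s for s in samples
            if s["subject"] != test["subject"] and s["train_ok"]]
    result = []
    for _ in range(n_repeats):
        train, val = [], []
        # Step (iii): split each class separately to keep the class ratio.
        for label in sorted({s["label"] for s in pool}):
            group = [s for s in pool if s["label"] == label]
            rng.shuffle(group)
            cut = round(len(group) * train_frac)
            train += group[:cut]
            val += group[cut:]
        result.append((train, val))
    return result
```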
For each test datum, the prediction was determined by the votes of the 10 prediction models. The vote-count threshold that determined the prediction of the whole model was chosen to maximize accuracy. Accuracy, sensitivity, and specificity in this setting were used as evaluation indices for the prediction models. For AUC calculation, receiver operating characteristic (ROC) curves were created for the 10 models used for voting, and the average AUC was calculated.
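The vote-threshold selection can be sketched as follows (a toy illustration with our own variable names, where 1 = dementia and each test datum has 10 binary model votes):

```python
def vote_predict(model_votes, threshold):
    """Final prediction: dementia (1) if at least `threshold` of the
    10 models vote dementia, else non-dementia (0)."""
    return int(sum(model_votes) >= threshold)

def best_threshold(all_votes, labels):
    """Try every vote-count threshold from 1 to 10 and keep the one
    that maximizes accuracy over the test data, as described above."""
    best_t, best_acc = 1, -1.0
    for t in range(1, 11):
        acc = sum(vote_predict(v, t) == y
                  for v, y in zip(all_votes, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```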
As a sub-analysis, we also assessed the prediction accuracy when the data were split into two groups by sex and by age (≥ 75 vs. < 75 years).
Relationship between number of letters and prediction accuracy
To examine the effect of utterance length on prediction accuracy, we prepared text data of different document lengths, in units of 100 letters from the beginning of each text, and converted each of them into a 200-dimensional vector. Predictions were made on these vectors using the models built with LOOCV, and the relationship between document length and prediction accuracy was assessed. To predict each vector, we used the model that had been built for the corresponding original, untruncated document.
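A minimal sketch of the truncation, assuming the final partial segment (the full text) is also included when the document length is not a multiple of 100 letters:

```python
def length_truncated_texts(text, step=100):
    """Prefixes of the document in `step`-letter increments from the
    beginning; the last prefix is the full text when its length is
    not a multiple of `step` (our assumption)."""
    return [text[:n] for n in range(step, len(text) + step, step)]
```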
Verification of vectorization and machine learning algorithms
To compare our vectorization and machine learning procedures with other methods, we calculated the prediction accuracy using TF-IDF and BERT for vectorization and using naive Bayes, logistic regression, support vector machine, and XGBoost for machine learning. In the XGBoost model, 10 pairs of training and validation sets were created for each test datum extracted by LOOCV, and voting prediction using the 10 models was performed. We also performed vectorization using TF-IDF and Japanese BERT, and calculated the voting prediction accuracy using the 10 DNN-trained models.
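One of the comparison baselines might look like the sketch below, pairing scikit-learn's TF-IDF vectorizer with logistic regression. The documents and labels here are illustrative toy data only, not study data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the transcribed interview documents.
docs = [
    "memory loss forgot names",
    "memory problems forgot words",
    "walked park talked family",
    "enjoyed garden talked friends",
]
labels = [1, 1, 0, 0]  # 1 = dementia, 0 = non-dementia (toy labels)

# TF-IDF vectorization followed by a logistic regression classifier,
# one of the comparison combinations named in the text.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)
preds = clf.predict(docs)
```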