T1D Risk Score: MicroRNA Biomarker Discovery

Study population

The methods for plasma collection across several collaborative centers were checked to be as uniform as possible, and all plasma samples were freeze-thawed ≤1 time before this analysis. Details of the samples (number, age, sex) included in different datasets of this study are provided in Supplementary Table 2.

The wet lab discovery set of samples included developing (fetal) human pancreatic tissues (n = 55), adult human islets and nonpancreatic tissues (total, n = 189), plasma from individuals within 96 h of acute presentation with T1D (n = 5) and age, as well as sex-matched healthy controls (n = 5). The miRNAs identified from the literature are provided in Fig. 1b and Supplementary Table 1.

The plasma samples for model development were from individuals without T1D (controls), or siblings (genetic risk) of individuals with T1D (siblings) or individuals with T1D (T1D) from four contexts: AUS (total n= 728, T1D n = 519, controls n= 209); DNK (Total n= 608, T1D n = 316, siblings n = 292), hkg (total n= 238, t1d n = 120, controls n = 118) and IND (total n= 630, t1d n = 497, controls n = 133), as presented in Fig. 2A. None of the siblings had stage 3 T1D. Most of these were stage 1 T1D and ~2% of the siblings developed T1D at ~12 years from sample collection.

Diabetes diagnosis was carried out pragmatically for all study participants through their teams of clinicians. The diagnostic criteria for T1D varied across the four geographical contexts but adhered to established standards: AUS: diagnosis was based on the presence of autoantibodies and clinical presentation, including diabetic ketoacidosis, with genetic analyses conducted if the patient presented under 1 year of age; DNK: the diagnostic criteria were similar to those used in AUS, with the inclusion of autoantibody data for patients with T1D and their siblings; HKG: diagnosis was based on the presence of diabetic ketoacidosis or the need for continuous insulin use within 1 year of diagnosis. Because of the low prevalence of anti-GAD and anti-islet cell antigen antibodies among Chinese patients with T1D⁶⁶autoantibodies were not a prerequisite for defining T1D; IND: diagnosis involved testing for glucose and C-peptide levels as well as islet autoantibodies, including GAD65, IA2 and ZNT8.

The performance of the risk score in a separate case-control validation dataset (total n = 662, with T1D n = 298, without T1D n = 364) from AUS, CAN, IND, NZL and the USA was assessed; 4% of participants without T1D in this validation dataset had gestational diabetes and none had T2D.

The eDRS4C models (1,000, 10,000 and 100,000) developed through this study were tested on plasma samples from CAN islet transplant recipients (31 recipients before transplantation, 1 h after transplantation and 24 h after transplantation samples) to assess future T1D status. PREDICT T1D miRNAs were also assessed for their capacity to stratify drug efficacy in the imatinib intervention study¹⁴where individuals with recent-onset T1D were treated with imatinib (n = 43) or placebo (n = 21) for 26 weeks¹⁴where outcomes were assessed at the end of 1 year.

Subanalyses for benchmarking the PREDICT T1D signature were carried out on the DNK cohort with T1D (n = 96 of 316) and siblings where data for GRS (n = 163 of 292) and autoantibodies (n = 168 of 292) were available. Duration of diabetes and age at onset in the HKG cohort of individuals with T1D (n = 101 of 120), and the DNK cohort of participants with T1D (n = 237 of 316), were analyzed. Subanalyses for discriminating T1D from T2D were carried out on participants with T2D from HKG (n= 119) and out (n = 12) with the participants with T1D as described in the ‘T1D versus T2D discrimination’ section below.

Ethical approval

Each study population was recruited and consented (oral and written consent) by the lead collaborators at each site according to the Declaration of Helsinki and approved by their respective institutional human ethics committees. Since samples have been collected at different times from 2000 until before 2024, the relevant version of the Declaration of Helsinki at the time of approvals for collection was followed; for most samples, this was version 2013. Ethical approvals for the use of developing human pancreas, adult islets and nonpancreatic tissues were obtained from the human ethics committees of the National Centre for Cell Science (India) and Sydney Local Health District, as detailed in our earlier publications^27,67,68. Profiling of circulating miRNAs was approved through the human ethics committees at St Vincent’s Hospital Melbourne (HREC A052-10), Sydney Local Health District University of Sydney (HREC/13/RPAH/83) and Western Sydney University (H14213).

miRNA profiling

Tissue and plasma samples were collected using standardized procedures, received coded (re-identifiable) and stored frozen at −80 °C until use. Plasma samples were thawed and a 100-µl aliquot was used for RNA isolation using an automated QIAcube HT platform (QIAGEN)^26,69. Samples with extensive (visible) hemolysis were excluded. Initially, 500 μl of TRIzol reagent (Thermo Fisher Scientific), 10 µg of glycogen, spike-in miRNA control (ath-miR-172a) and 100 µl chloroform (all from Merck Millipore) were added to each sample. After centrifugation, the aqueous phase was transferred to an RNeasy column (RNeasy HT Kit, QIAGEN) for total RNA isolation. For the solid tissue samples (islet, non-islet pancreas and developing pancreas) a TRIzol-based manual RNA isolation workflow was implemented^23,27,68. RNA was quantitated on a NanoDrop spectrophotometer (Thermo Fisher Scientific); 10 ng or 100 ng of RNA was used for reverse transcription using the microRNA RT kit (Thermo Fisher Scientific), spike-in miRNA control (ath-miR-159a) and Megaplex RT primer pools A and B (discovery samples profiled for 754 miRNAs). Custom reverse transcription primers were used for all samples profiled for the signature of 50 selected miRNAs and six (negative, internal and spike-in) control miRNAs. Preamplification and TaqMan-based miRNA quantitative PCR (qPCR) were performed using the low-sample input protocol recommended by the manufacturer and described earlier²⁶ with TaqMan OpenArray Human miRNA inventoried (754 miRNAs) or custom panels (specifically designed for the 56-miRNA panel) on the QuantStudio 12K Flex platform (Thermo Fisher Scientific). MiRNAs for hemolysis and repeats were assessed independently. Raw data files obtained after the qPCR were checked for sample loading and uploaded to the ThermoConnect tool for global normalization (for the discovery set). MiRNA data that did not pass quality control (with an amplification score of less than 1.24 or a cycle value confidence interval (CI) of less than 0.6) were converted to undetectable (Ct = 39). Other samples assessed on custom panels (training-testing, validation, T1D therapy) were normalized using spike-in controls added during RNA isolation and reverse transcription. Normalized Ct values were converted to fold-over-detectable (FOD) or abundance values⁷⁰ and were used in further analyses.

Data analysis

Discovery set

The relative abundance of the miRNAs across two groups (islet versus non-islet; T1D versus controls) in the discovery samples was calculated using ({2}^{{DeltaDelta}{rm{C}}_{rm{t}}}) (FOD difference) and significantly different miRNAs (P ≤ 0.05) were identified using a Welch’s t-test. Although lenient, this analytical approach selected a set of 50 most significant miRNAs from our discovery cohort (n = 254 wet lab discovery samples and n = 2,583 participants from previous studies) that were tested and validated in a multicontext set of samples using Random Forest machine learning workflow. Random forest algorithm (used in developing the DRS4C) is inherently robust in feature selection. Additionally, testing these selected features (miRNAs) in the test dataset (n = 661), a separate multicontext validation set (n = 662) and in two longitudinal T1D clinical studies and trials further endorsed the validity of our selected miRNA signature. Methods to address class imbalance, for example, using the synthetic minority over-sampling technique (SMOTE) and sampling bias (for example, bootstrapping) were used to retain the desired analytical rigor.

Training and testing datasets

Samples from four different contexts

The miRNA abundance of the 50 PREDICT T1D miRNAs was compared between participants with T1D and controls (for AUS, HKG and IND), or participants with T1D and siblings (for DNK) and for all contexts together, using a Kruskal–Wallis test followed by an uncorrected Dunn’s test comparing the mean ranks.

Model training and testing

This analysis was performed on samples from the four different contexts using the 50 PREDICT T1D miRNA signature data along with the participant’s age at sample collection as continuous independent variables. Participants’ T1D status was used as a categorical dependent variable, where ‘controls/siblings’ was set as the reference/control group and ‘T1D’ was set as a case group. Datasets from each context were individually randomized, aggregating to 70% training (n = 1,543) and 30% testing (n = 661) samples (Fig. 2A) using the sample.split() command in the caTools (v.1.18.2) R package⁷¹. The samples selected in the training and testing datasets were kept the same throughout the analyses in R and Python and for independent validation (see below). Classification models were built on the training datasets of four contexts together using the random forest workflow, using the packages randomForest (v.4.7-1.1) and Caret (v.6.0-94)⁷² and r (V.4.2.2)⁷³ and RStudio (v.2023.12.1-402)⁷⁴. To settle class imbalances in the training datasets, four different sampling methods, that is, up, down, SMOTE and random over-sampling examples, were assessed with three different cross-validations (fivefold, tenfold or 50-fold). Fifty-fold cross-validation, using SMOTE sampling and 2,000 decision trees, was selected as the optimally tuned workflow for eDRS4C-100,000, based on accuracy, the AUC, sensitivity, specificity, and F₁ score (scripts made available via GitHub/islet biology). Predicted probabilities were calculated for individuals in the training datasets using the predict() command from the package stats in R. The ROC curve analyses, AUCs and CIs were computed using the roc() and ci.auc() commands from the pROC package in R to obtain threshold values. The best threshold value for classification was determined using the Youden index method, using the coords() command from the pROC package in R. Individuals with a predicted probability greater than the Youden index threshold were classified as T1D, while those with lower probabilities were classified as non-T1D. The importance of each feature and variable used in the prediction model was assessed using MDG, which was computed using the importance() command in randomForest^72,75. Model performance was also independently validated in Python. The PyCaret (https://pycaret.org/) framework in the Jupyter Notebook environment or Visual Studio release 15.9.67 was used for model training, tuning and evaluation. The trained model was evaluated on the testing dataset using several performance metrics provided by the PyCaret and scikit-learn libraries such as accuracy, precision, recall, F₁ score and AUC. AUC plots and detailed classification metrics were generated using the evaluate_model() function in PyCaret and scikit-learn. The validation was performed using StratifiedKFold cross-validation with tenfold cross-validation. To visualize the impact of each feature on the predictions made by the model, SHAP values were computed and visualized. A SHAP explainer object was created for the trained random forest model using the explainer class of the shap package and any outliers were removed. The AUC and model accuracies were not notably different with the outliers removed or retained. SHAP values were visualized using the shap.plots.beeswarm function.

GAI-based workflows

T1D-free (control) participants from the training datasets of four contexts (AUS, DNK, HKG, IND) were used to generate three different synthetic datasets of age and miRNA expression profiles for 1,000, 10,000 or 100,000 synthetic individuals using the SingleTableMetadata module from the SDV library. A Gaussian Copula Synthesizer model from the SDV library was used. The model was then trained on the original control dataset to capture the underlying dependencies and distributions of the data. Synthetic data for the control dataset were generated using the trained model, ensuring compliance with the specified constraints, including age limitations between 2 and 64 years. The three negative control miRNAs were confirmed and set to be undetectable in the synthetic dataset. The suitability of synthetic data for downstream tasks such as model training and evaluation was confirmed with TableEvaluator. These synthetic controls were pooled with the original real-life training samples from the four contexts (AUS, DNK, HKG, IND) to create an augmented training dataset for the generation of eDRS4C models. Random forest models were trained on these three augmented datasets using the same tuning parameters mentioned above.

Parallelization and independent data validation

All analyses were independently validated by collaborating data scientists using different platforms. To reduce computational time in model training, we used the doParallel package (v.1.0.17) and its dependencies in R to parallelize model training across 20 central processing units (CPUs) on an Ubuntu server (total memory = 510 GB, CPUs = 62). A cluster of 20 CPUs (except for the eDRS4C-100,000 model, where the cluster was of six CPUs) was created using makePSOCKCluster(), which was registered using the registerDoParallel() function. Upon completion of model training, the connections to these CPUs were closed using the stopCluster() function. Output agreement was validated across at least three users, who worked on separate workstations using R or Python.

Validation dataset

The different models of eDRS4C (from the 1,000, 10,000 and 100,000 synthetic sample sets) were used to classify individuals as T1D and non-T1D in an independent validation dataset (n = 662). Predicted probabilities and the class for this dataset were calculated using the methodologies described above.

T1D therapy dataset

Islet transplant samples

The eDRS4C risk score was used to predict T1D management (need for exogenous insulin) based on the eDRS4C (100,000) before transplantation, and 1 h and 24 h after transplantation. Based on exogenous insulin requirement at 1-month after transplantation, baseline clinical parameters were compared using a Welch’s t-test. Dimensionality reduction was performed on all available baseline clinical parameters in R using the uwot package (v.0.2.2) (UMAP = 20 threads, seed = 1), and data were plotted using the ggplot2 package (v.3.5.1).

Imatinib trial

Baseline miRNA expression of the 50 PREDICT T1D miRNAs was used in a dimensionality reduction UMAP algorithm using the Entropy criterion and Minkowski metric for a best-fit method with the BioVinci data visualization package (BioTuring). We also assessed the performance of an eDRS4C-100,000 model (with or without age) based on the key miRNAs associated with imatinib therapy (Fig. 5C) at the study baseline to predict responders (UQ) and nonresponders (LQ) to therapy. Baseline clinical parameters and miRNAs were compared across UQ and LQ participants using a Welch’s t-test. Dimensionality reduction was performed on all available baseline clinical parameters in R using the uwot package (v.0.2.2) (UMAP = 20 threads, seed = 1), and data were plotted using the ggplot2 package (v.3.5.1).

T1D versus T2D discrimination

The FOD values of all 50 miRNAs and age (in years) for individuals with T1D (n = 1,314) or with T2D (n = 131) were imported into RStudio using the package openxlsx. The distributions of each miRNA were compared between individuals using a Wilcoxon rank-sum test; the adjusted Pvalues were generated using Bonferroni correction (package rstatix). Data were visualized as sina plots using the packages ggforce (v.0.4.2) and ggprism (v.1.0.5). The eDRS4C-100,000 risk score was used to stratify participants with T1D from those with T2D. Two additional risk scores were derived using these datasets wherein a control versus T2D (training dataset n = 527 controls, n = 92 T2D; testing dataset n = 364 controls and n = 39 T2D) or a T1D versus T2D (training dataset n = 1,016 T1D and n = 92 T2D; testing dataset n = 298 T1D and n = 39 T2D) risk score was derived using all 50 miRNAs and age at sampling using the random forest algorithm; SHAP values were plotted as mentioned above.

Genomic location and collinearity

The genomic location spanning from the position of the starting base pair for each miRNA (based on assay IDs from Thermo Fisher Scientific) and verified through the NCBI Gene database (Supplementary Table 4) was used to mark the start of the gene encoding the corresponding miRNAs. The shinyCircos⁷⁶ tool (https://venyao.xyz/shinycircos/) was used to visualize the genomic locations for the 50 PREDICT T1D miRNAs. Links connecting these miRNAs represent their context-specific coexpression (Spearman rank correlation coefficient greater than 0.9 and P < 0.05), which is colored separately for each context. The width of each connecting line is arbitrarily set for visual enhancement.

Autoantibody measurements

Autoantibodies were measured using labeled recombinant proteins according to standard radioimmunoassay protocols described elsewhere⁷⁷.

Genotyping and GRS

Participants were genotyped on the Illumina Infinium Global Screening Array (v.24 and v2.0). Before quality control, the Infinium Global Screening Array had 722,659 single-nucleotide polymorphisms (SNPs). After quality control and frequency pruning (–geno 0.02 –maf 0.05 –hwe 0.001 –mind 0.02) using PLINK v.1.90, the genotyping rate was 0.99 and included 298,994 SNPs. The T1D GRS (GRS1) was calculated using 29 of 30 T1D-associated SNPs, as identified earlier⁷⁸. For the missing SNP, rs11594656, a proxy SNP analysis was performed using linkage disequilibrium patterns in the European population through LDlink⁷⁹. However, no suitable proxies with r² > 0.8 were identified, leading to the exclusion of this SNP from the analysis. The GRS for the remaining 27 non-HLA SNPs was calculated using the score function in PLINK. For the two HLA SNPs, rs2187668 and rs7454108, which capture the DR3 and DR4–DQ8 haplotypes, genotype weights for the highest-risk heterozygous combinations were assigned based on evidence from a previous study⁵. These weights were incorporated into the final aggregate GRS.

Anomaly detection

The isolation forest anomaly detection machine learning approach was used to further stratify individuals at high risk for T1D into progressors and non-progressors. Isolation forests were generated for existing risk factors (including age, autoantibody data and GRS) or for the top ten features (age and top nine miRNAs; Fig. 2D). MiRNA FOD data and other values (age in years, autoantibody data and GRS) were used as obtained. The entire pipeline was implemented in R. Dimensionality reduction was carried out using the uwot package (v.0.2.2). Isolation forests were built using the isotree package (v.0.6.1-1); the anomaly cutoff was set at 0.9 (borderline 10% of anomalous data points). Data were plotted using the ggplot2 package (v.3.5.1) and forests were visualized as contour plots.

PCA, correlation and Sankey plots

The PCA was carried out in R using the prcomp() command in the stats package. PCA plots were generated using the ggplot2 and ggfortify packages in R. Correlation plots were created using Prism v.10.4.1 (GraphPad Software). Sankey plots were created using SankeyMATIC (https://sankeymatic.com/).

Sodium nitroprusside exposure

Human cadaveric islets (n = 6 different biological preparations) were resuspended in serum-free defined Connaught Medical Research Laboratories medium containing 2% fatty-acid-free BSA within 24–60 h of isolation. Islets were then exposed to different concentrations (0 mM, 1 mM and 10 mM) of freshly prepared sodium nitroprusside for 24 h. Experimental culture supernatants (without contamination from islet cells) were collected for miRNA analysis. RNA isolation and miRNA qPCR with reverse transcription were performed as described above. Normalized Ct values were used to perform supervised clustering analyses with the ComplexHeatmap package in R based on Euclidean distance.

Pathway analysis

MiRNAs that distinguished the UQ of drug responders from the LQ of drug responders (Fig. 5C) in the imatinib trial were assessed for targeting pathways using miRBase, AmiGO2 and miRPathDB. We evaluated the four key miRNAs (miR-24-3p, miR-25-3p, miR-27b-3p and miR-92a-3p) using miRPathDB 2.0, which offers comprehensive resources for miRNA target pathway analyses. These four miRNAs targeted distinct sets of genes with as little as 2.7% overlap between miR-27b-3p and miR-24-3p targeting pathways (Jaccard coefficient = 0.027), to a higher overlap (Jaccard index = 0.252) between miR-25-3p and miR-92a-3p. As the drug used in the imatinib trial targets the tyrosine kinase pathway, the analysis focused on tyrosine kinase and some of the other relevant pathways, as illustrated in Fig. 5e. All computationally identified genes targeted by these miRNAs were compared to the known imatinib targets^{37,38,39,40,41,42,43,44,45}.

Further details of the products and software package versions are provided in Supplementary Tables 10 and 11respectively.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.