Retrieved Studies and Model Characteristics for Cancer Prediction Using Electronic Health Records

Retrieved Studies

Our systematic literature review identified 1214 studies across six databases, of which 414 were duplicates. An additional 61 studies were sourced through reference and citation searches. After rigorous screening and assessment, 35 studies were selected for the final review. The flowchart depicting the selection process is available in Figure 2. Figure 3 illustrates the number of studies published by year up to August 9, 2024.

Fig. 2. Flow diagram for study identification and selection, developed using the PRISMA template.

Study Characteristics

Study Setting and Population

The studies analyzed populations from various countries, with the majority from the USA (54%), followed by Taiwan (14%) and the Netherlands (11%). Denmark, Sweden, South Korea, Israel, and Singapore each contributed a smaller number of studies. One study did not specify the population's origin. The studies used data from single-center (14%), multi-center (57%), and nationwide (25%) settings.

Study designs included case-control (26%), nested case-control (17%), and cohort (57%) studies. Data primarily came from both primary and secondary care settings (66%), while some studies focused on primary (11%) or secondary care (20%) exclusively.

Outcomes and Prediction Tasks

The primary outcomes were cancer risk prediction (26%), cancer detection or early detection (57%), metastasis prediction (3%), and recurrence prediction (11%). The most frequently studied cancers were pancreatic and colorectal cancer (26% each), lung cancer (17%), and liver and gastric cancer (9% each), with a smaller number of studies addressing breast, skin, leukaemia, and oesophageal cancer.

Clinical Features

Demographics, diagnoses, laboratory tests, and prescriptions were the most frequently used clinical features, appearing in 71%, 63%, 63%, and 51% of studies, respectively. Other features included symptoms, referrals, procedures, free text notes, lifestyle factors, images, tumor staging, and histological features. Figure 4 details the frequency of these features in the 35 studies.

Model Characteristics

Methods for incorporating temporal data into predictive models were categorized into two main approaches. The first approach, feature engineering, involved creating meaningful features from longitudinal data to input into AI models. The second approach directly used temporal sequences as input, often employing models specifically designed to handle sequential data.

Feature Engineering for Representing Sequential Data

Seventeen studies adopted feature engineering. Techniques included trend analysis and slope calculations, absolute changes in value between specific time points, and more advanced methods such as pattern mining and wavelet/quantum FFT transformations. Unsupervised learning techniques, such as autoencoders, were also used to learn representations from patient trajectories.
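As an illustration of the simpler techniques above, the following minimal sketch derives a slope and an absolute change from one patient's repeated measurements. The function name, the feature names, and the example values are hypothetical and are not taken from any reviewed study; the slope is an ordinary least-squares fit over days since the first measurement, which is only one of several ways a trend could be computed.

```python
from datetime import date


def trend_features(observations):
    """Summarize one patient's series of (date, value) pairs for a single test.

    Assumes a non-empty list. Returns hypothetical feature names chosen
    for illustration only.
    """
    obs = sorted(observations)  # order by measurement date
    days = [(d - obs[0][0]).days for d, _ in obs]
    values = [v for _, v in obs]
    n = len(obs)
    mean_t = sum(days) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in days)
    # Least-squares slope in units per day; set to 0.0 when all
    # measurements share one date and no trend can be estimated.
    if var_t > 0:
        cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(days, values))
        slope = cov / var_t
    else:
        slope = 0.0
    return {
        "slope_per_day": slope,
        "abs_change": values[-1] - values[0],  # last minus first value
        "last_value": values[-1],
    }


# Made-up example series for one patient and one laboratory test.
series = [
    (date(2020, 1, 1), 5.0),
    (date(2020, 1, 11), 6.0),
    (date(2020, 1, 21), 7.0),
]
features = trend_features(series)
```

The resulting dictionary can be concatenated with static features such as demographics before being passed to a conventional classifier.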

Models Taking Sequential Data as Direct Input

Eighteen studies fed sequential data directly into the model, primarily using deep learning. Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs) were popular choices, with GRUs particularly favored. Transformer architectures also gained prominence for their ability to capture temporal dynamics efficiently.
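To make the recurrent approach concrete, the sketch below runs a single GRU unit over a short sequence, using a scalar input and a scalar hidden state. It is a toy illustration only: the weights in `params` are hypothetical rather than learned, real models use weight matrices and vector states, and the sign convention for the update gate differs between libraries.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU update for scalar input x and scalar hidden state h_prev.

    p is a dict of hypothetical scalar weights and biases.
    """
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate scales how much history is used.
    n = math.tanh(p["wn"] * x + p["un"] * (r * h_prev) + p["bn"])
    # Blend the candidate with the previous state via the update gate.
    return (1.0 - z) * n + z * h_prev


def encode_sequence(xs, p, h0=0.0):
    """Fold a sequence of visits into one summary value (final hidden state)."""
    h = h0
    for x in xs:
        h = gru_step(x, h, p)
    return h


# Hypothetical, untrained parameters chosen only so the example runs.
params = {
    "wz": 0.5, "uz": 0.1, "bz": 0.0,
    "wr": 0.3, "ur": 0.2, "br": 0.0,
    "wn": 0.8, "un": 0.4, "bn": 0.0,
}
summary = encode_sequence([0.2, 0.4, 0.9], params)
```

In a full model, the final hidden state would be passed to an output layer that produces the predicted probability of the outcome.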

Addressing missing data was a significant challenge. Common strategies included zero-padding sequences and forward-filling, in which the most recent observed value is carried forward to fill subsequent gaps.
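The two strategies can be sketched in a few lines. This is a minimal illustration with hypothetical function names; it assumes missing values are marked as `None`, that values missing before the first observation fall back to a default, and that padding is appended at the end of each sequence (some implementations pad at the start instead).

```python
def forward_fill(values, default=0.0):
    """Replace each None with the most recent observed value.

    Entries that are missing before any observation use `default`.
    """
    filled = []
    last = default
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def zero_pad(sequences, pad_value=0.0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]


filled = forward_fill([None, 1.2, None, None, 3.4])
padded = zero_pad([[1.0, 2.0, 3.0], [4.0]])
```

Padding makes variable-length patient histories fit a fixed-size batch, while forward-filling assumes a measurement remains valid until the next one is recorded.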

Prediction Windows

The prediction windows varied significantly across studies, with most risk prediction models using a 36-month window. Metastasis and recurrence models explored different windows, with some extending up to six months and focusing on specific clinical events.
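One common way to operationalize a prediction window is sketched below: a patient is labeled positive if the diagnosis falls within the window after an index date, and only records up to the index date are used as inputs. This is an assumption made for illustration; the reviewed studies define their windows in different ways, and the function names here are hypothetical.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping to the end of the month."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def label_within_window(index_date, diagnosis_date, window_months=36):
    """Return 1 if the diagnosis occurs after the index date and no later
    than `window_months` afterwards, else 0 (including no diagnosis)."""
    if diagnosis_date is None:
        return 0
    window_end = add_months(index_date, window_months)
    return int(index_date < diagnosis_date <= window_end)


def records_before_index(records, index_date):
    """Keep only (date, code) records observed on or before the index date,
    so that information from the prediction window cannot leak into inputs."""
    return [(d, code) for d, code in records if d <= index_date]


index = date(2020, 1, 1)
label_in = label_within_window(index, date(2022, 6, 1))
label_out = label_within_window(index, date(2023, 6, 1))
```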

Explainability

Twenty-two studies focused on model explainability, enhancing transparency and interpretability. Key techniques included feature importances, Local Interpretable Model-agnostic Explanations (LIME), attention mechanisms, integrated gradients, and Shap