Study design
All participants provided written informed consent. The study was approved by the Institutional Review Boards of Mass General Brigham/Brigham and Women’s Hospital (BWH) and the FAA Civil Aerospace Medical Institute (CAMI), was protected under an NIH Certificate of Confidentiality (CC-OD-21–2237), and was registered at ClinicalTrials.gov (ID NCT04211506).
Intensive outpatient screening of participants was conducted to verify they were healthy, free of medication and substance use, and abstaining from caffeine and other stimulants. Recruitment targeted adults 20–45 years old with a body mass index of 18.5–29.9 kg/m². Potential participants had a physical exam, review of self-reported medical history, and psychological screening [28]; those with reported personal or immediate family history of psychiatric disorders were excluded. Self-reported habitual sleep durations of 7–9 h nightly were required, and potential participants were excluded if they reported recent frequent night shift work (e.g., working > 3 nights per week between 01:00 and 06:00 h within the past 3 months). Potential participants at higher risk of sleep disorders were not enrolled, based on scores on the Athens Insomnia Scale [29, 30], the Berlin Questionnaire for Obstructive Sleep Apnea [31], and a 5-question assessment for Restless Leg Syndrome. A home sleep test was conducted with a Nox-T3 portable monitor (Nox Medical, Reykjavik, Iceland), and potential participants were excluded if results indicated an apnea–hypopnea index of ≥ 15 or a periodic limb movement index ≥ 20. Finally, potential participants representing extreme morning or evening chronotypes were excluded based on scores < 31 (evening type) or ≥ 69 (morning type) on the Horne-Östberg Morningness-Eveningness Questionnaire [32].
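The numeric screening thresholds described above can be summarized as a simple eligibility check. The following is a minimal sketch for illustration only, not the study's screening software: the class and field names are hypothetical, and the questionnaire-based criteria (Athens Insomnia Scale, Berlin Questionnaire, Restless Leg Syndrome assessment, psychological screening) are omitted because their scoring cutoffs are not stated here.

```python
from dataclasses import dataclass


@dataclass
class Screening:
    """Hypothetical container for one candidate's numeric screening values."""
    age_years: float
    bmi: float               # kg/m^2
    habitual_sleep_h: float  # self-reported nightly sleep duration
    ahi: float               # apnea-hypopnea index from the home sleep test
    plmi: float              # periodic limb movement index
    meq_score: int           # Morningness-Eveningness Questionnaire score


def exclusion_reasons(s: Screening) -> list:
    """Return the numeric criteria from the text that the candidate fails."""
    reasons = []
    if not 20 <= s.age_years <= 45:
        reasons.append("age outside 20-45 years")
    if not 18.5 <= s.bmi <= 29.9:
        reasons.append("BMI outside 18.5-29.9 kg/m^2")
    if not 7 <= s.habitual_sleep_h <= 9:
        reasons.append("habitual sleep outside 7-9 h")
    if s.ahi >= 15:
        reasons.append("AHI >= 15")
    if s.plmi >= 20:
        reasons.append("PLMI >= 20")
    if s.meq_score < 31 or s.meq_score >= 69:
        reasons.append("extreme chronotype")
    return reasons
```

An empty list means the candidate passes these numeric checks; the remaining criteria would still need to be assessed separately.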
To reduce the potential for recent outpatient sleep loss to influence study results, immediately prior to beginning the 10-day inpatient study all participants were asked to complete at least 1 week with 8 h of Time In Bed (TIB) each night followed by at least 1 week of 10 h TIB each night. Adherence was monitored via actigraphy (CamNTech Motion Watch 8), time-stamped call-ins at bedtime and wake time, and a daily sleep diary.
Participants passing outpatient screening procedures were admitted to the Center for Clinical Investigation at the Brigham and Women’s Hospital for the inpatient 10-day study. Throughout the study, the participant lived in a private study room complete with bed, bathroom, and computers for testing. The study rooms were designed to minimize noise and light from adjacent areas of the facility and were maintained at temperatures approximately 71–77 degrees Fahrenheit. Fluorescent room lighting targeting intensities < 100 lx was controlled by the investigators with overhead fixtures during scheduled time awake, and all room lights were turned out during TIB. There were no windows in the study rooms. To minimize external influences during the inpatient study, participants were blinded to their group assignment and knowledge of time of day was restricted, as was access to clocks, the internet, live television or radio, and contact with non-hospital personnel. After the first two acclimation days with ad lib meals, meals typically were provided every 4 h during time awake in accordance with a weight-maintenance diet based on anticipated caloric need as calculated using the Mifflin St. Jeor equation (activity factor 1.6) [33].
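For reference, the caloric estimate mentioned above can be sketched as follows. This assumes the commonly published form of the Mifflin-St Jeor equation (weight in kg, height in cm, age in years) multiplied by the stated activity factor of 1.6; the function names are illustrative, and the exact implementation used by the study dietitians is not described here.

```python
def mifflin_st_jeor_rmr(weight_kg: float, height_cm: float,
                        age_years: float, sex: str) -> float:
    """Resting metabolic rate (kcal/day), commonly published Mifflin-St Jeor form."""
    base = 10.0 * weight_kg + 6.25 * height_cm - 5.0 * age_years
    if sex == "male":
        return base + 5.0
    if sex == "female":
        return base - 161.0
    raise ValueError("sex must be 'male' or 'female'")


def daily_energy_need(weight_kg: float, height_cm: float, age_years: float,
                      sex: str, activity_factor: float = 1.6) -> float:
    """Estimated daily caloric need: resting rate times the activity factor."""
    return mifflin_st_jeor_rmr(weight_kg, height_cm, age_years, sex) * activity_factor
```

For a hypothetical 30-year-old male weighing 70 kg at 175 cm, the resting rate is 1648.75 kcal/day, giving roughly 2638 kcal/day at an activity factor of 1.6.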
A total of 73 participants began the inpatient study, and 59 completed all 10 days. Participants were randomized without replacement to one of four study conditions that differed in the experimental segment treatment: a control with 8 h TIB nightly (CT, 14 participants), sleep restriction with 5 h TIB during daytime across 5 days (DR, 16 participants), sleep restriction with 5 h TIB during nighttime across 5 nights (NR, 14 participants), and acute total sleep deprivation with no sleep across two nights (TD, 15 participants) (Fig. 1). In all four study conditions, participants began with an acclimation segment on the day of admission (day 1), a nighttime sleep opportunity with 12 h TIB, a second acclimation day, and then 8 h TIB. This was followed by a baseline day (day 3), the experimental segment (CT, DR, NR, or TD treatment), and a recovery segment incorporating 8–10 h TIB prior to discharge. Participants were continuously monitored during the inpatient phase to improve compliance by onsite technicians, who entered the room if a participant failed to respond to tasks or as otherwise needed. Technicians also frequently entered the room for reasons such as sample collection and meal delivery.
Fig. 1 Schematic of study design. Overview of the schedule for inpatient study days 3–10, with darker shading indicating scheduled sleep Time In Bed (TIB) and colors reflecting baseline (green), experimental (purple), and recovery (yellow) study segments. Neurobehavioral performance test batteries were conducted every 4 h except during scheduled TIB, with the approximately 45-min test battery beginning at the Relative Clock Hours indicated with tick marks on the x-axis (i.e., 03:15, 07:15, 11:15, 15:15, 19:15, and 23:15 h).
All times reported are based on Relative Clock Hour (RCH), with the actual time of scheduled inpatient study events adjusted for each participant based on the average midpoint of their outpatient self-reported sleep–wake schedule. Time intervals between scheduled events were the same for all participants in each of the four condition groups (Additional file 1: Table S1); for example, neurobehavioral test batteries were scheduled 4 h apart during time awake. However, events such as the 08:00 (RCH) blood draw on baseline day 3 might occur at slightly different real-world actual clock times across participants.
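The relationship between Relative Clock Hour and real-world clock time can be illustrated with a fixed per-participant offset. This is only a sketch under stated assumptions: the reference sleep midpoint of 04:00 RCH and the function names are hypothetical, since the exact anchoring rule is not given in the text.

```python
# ASSUMPTION: a habitual sleep midpoint of 04:00 corresponds to zero offset.
REFERENCE_MIDPOINT_MIN = 4 * 60


def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(minutes: int) -> str:
    """Convert minutes after midnight to 'HH:MM', wrapping around 24 h."""
    minutes %= 24 * 60
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def rch_to_actual(rch: str, habitual_midpoint: str) -> str:
    """Shift an RCH event time by the participant's sleep-midpoint offset."""
    offset = to_minutes(habitual_midpoint) - REFERENCE_MIDPOINT_MIN
    return to_hhmm(to_minutes(rch) + offset)
```

Under this assumption, a participant whose habitual sleep midpoint is 03:30 would have the 08:00 (RCH) blood draw at 07:30 actual clock time, while intervals between events stay identical across participants.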
Sleep and neurobehavioral performance monitoring
During the night before baseline day 3 and during the two recovery nights after the experimental segment of DR, NR, and TD conditions (for CT, the final two study nights), polysomnography (PSG) monitoring was conducted using the Vitaport digital ambulatory EEG recording system (Vitaport-3, TEMEC Technologies B.V., Heerlen, The Netherlands). The PSG recording involved electroencephalogram (EEG), bilateral electrooculogram (EOG), chin electromyogram (EMG; mentalis, submentalis), and two-lead electrocardiogram (ECG). Visual scoring was conducted by a registered PSG technician following the American Academy of Sleep Medicine criteria version 2.6 [34]. In addition, within approximately 5–10 min of the conclusion of each TIB sleep opportunity, participants provided self-assessments of their prior night’s sleep and estimated total sleep duration.
At 4-h intervals during time awake from baseline day 3 until discharge from the inpatient study, participants completed an approximately 45-min computer-based neurobehavioral performance test battery of objective and subjective tests. Participants performed approximately 3–5 practice tests during the first two days of acclimation to the inpatient study (not analyzed), which were intended to reduce learning effects during data collection days 3–10. Subjective assessments were conducted using scales such as the Karolinska Sleepiness Scale (KSS) [35], a 9-point Likert scale of sleepiness. Another subjective element was the Performance Effort and Evaluation Rating Scale (PEERS) [36], consisting of responses on a Visual Analog Scale (VAS) rating their estimate of how well they performed on the neurobehavioral test battery (from extremely good to poor); how much effort they had to expend; and whether they could have performed better if they had tried harder. Participants also used bipolar visual analog scales to rate their feelings of being alert-sleepy, calm-excited, happy-sad, groggy-clearheaded, and unmotivated-motivated [37].
The neurobehavioral performance test battery also encompassed objective evaluations, such as the 90-s version of the Digit Symbol Substitution Test (DSST). The DSST gauges cognitive function by requiring participants to match symbols with corresponding digits [38]. Another assay was the Stroop color-word test (STROOP) [39] of executive functioning or reasoning, during which participants were asked to respond to font color while ignoring a written word (which spelled one of four colors). Due to the unintentional study enrollment of one red-green color-blind participant, STROOP data were analyzed in two ways: once with all four colors omitting this participant, and again for all participants based on only portions of the test with blue-yellow colors. In another test called the Matrix Reasoning Test (MRT), problem-solving skills were evaluated on trials of varying difficulty. For this, participants were required to determine which matrix completes a set of eight other matrices based on 1–3 relational pattern changes or transformations [40]. Memory and recall were tested via the Face-name task (FACE), in which participants were asked to recall face-name pairs presented approximately 33 min previously [41]. Attention was assessed with the 10-min Psychomotor Vigilance Test (PVT) of reaction time, encompassing measures of the speed of response and lapses (i.e., failure to respond to a stimulus within 0.5 s) [42]. Spatial coordination was assayed with the Unstable Track assay (TRACK), in which participants attempted to keep a moving cursor in the middle of the screen between two vertical lines [43]. Spatial orientation and response to feedback stimuli were probed with the Comparative Visual Search (CVS) [44], in which participants scanned the computer display for a mismatch between copy or mirror images consisting of distributions of right and left-oriented triangles.
Finally, a risk-taking assessment was performed with the Balloon Analog Risk Task (BART), during which participants were asked to maximize their reward by choosing the number of times to inflate an animated balloon, potentially increasing their artificial reward with each inflation at the risk of loss if the balloon popped [45].
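As an illustration of the PVT lapse definition given above (failure to respond within 0.5 s), the following minimal sketch counts lapses from a list of reaction times. The function name and the use of `None` for a missed stimulus are hypothetical, and the mean reciprocal reaction time shown as a speed measure is an assumed, commonly used summary rather than one confirmed by the text.

```python
from typing import List, Optional, Tuple


def pvt_summary(reaction_times_s: List[Optional[float]],
                lapse_threshold_s: float = 0.5) -> Tuple[int, Optional[float]]:
    """Return (lapse count, mean response speed in 1/s) for one PVT session.

    A trial is a lapse if there was no response (None) or the response
    took longer than the threshold.
    """
    lapses = sum(1 for rt in reaction_times_s
                 if rt is None or rt > lapse_threshold_s)
    answered = [rt for rt in reaction_times_s if rt is not None]
    # ASSUMPTION: speed summarized as the mean of 1/RT over answered trials.
    mean_speed = (sum(1.0 / rt for rt in answered) / len(answered)
                  if answered else None)
    return lapses, mean_speed
```

With this definition a response at exactly 0.5 s is not a lapse, while a response at 0.75 s or a missed stimulus is.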
From the 59 participants who completed the study (see Results), there were a total of 2039 neurobehavioral performance test battery data collections attempted every 4 h during time awake on study days 3–10. Subdivided among the 4 study condition groups, this consisted of 434 data collections attempted for CT (14 participants, 31 timepoints each), 576 for DR (16 participants; 36 timepoints each), 504 for NR (14 participants, 36 timepoints each), and 525 for TD (15 participants, 35 timepoints). A small number of data collections were omitted from final analysis due to technical issues, responses omitted by the subject, or otherwise deemed unreliable. The most omissions were for the STROOP due to one subject seeming to have misunderstood the instructions and a second who was colorblind, resulting in 39 (results on blue-yellow tests only) to 75 (all colors) timepoints being omitted. For other neurobehavioral performance test battery endpoints, fewer than 10 of the 2039 possible timepoints were omitted. Neurobehavioral performance test battery results and data dictionary descriptions (including specification of omitted timepoints) have been posted for public access at the National Center for Biotechnology Information database of genotypes and phenotypes (see Availability of Data and Materials).
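The attempted data collection totals reported above follow directly from the group sizes and timepoints per participant, as this short check shows (the variable names are illustrative).

```python
# (participants, timepoints per participant) for each condition group,
# taken from the counts reported in the text.
groups = {"CT": (14, 31), "DR": (16, 36), "NR": (14, 36), "TD": (15, 35)}

# Attempted data collections per group: 434, 576, 504, and 525.
attempted = {name: n * t for name, (n, t) in groups.items()}

total_participants = sum(n for n, _ in groups.values())  # 59
total_attempted = sum(attempted.values())                # 2039

# Share of timepoints omitted for STROOP under each analysis approach.
stroop_omitted_fraction = {
    "blue-yellow only": 39 / total_attempted,
    "all colors": 75 / total_attempted,
}
```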
Sample collection and sequencing
Whole blood consisting of a 2.5 mL draw was collected into PAXgene® Blood RNA tubes (BD Biosciences Catalog No. 762165) during inpatient study days 3–10, with approximately 26–31 timepoints per participant. For all participants blood draws began in the morning of inpatient study day 3 (baseline segment) at 08:00 (RCH) (Additional file 1: Table S1). For CT participants, draws continued every 4 h during time awake through 16:00 (RCH) prior to departure on day 10, for a total of 31 draws. For TD participants, blood draws continued every 4 h during time awake through 20:00 (RCH) on study day 7 (recovery segment), followed by single timepoint morning blood draws at 08:00 (RCH) on study days 8–10, for a total of 27 draws. For NR and DR participants, blood draws continued every 4 h during time awake for study days 3–4, 6, and 8–10, for a total of 26 draws. Blood was not drawn during study days 5 and 7 for NR and DR participants. For all participants, after each draw the PAXgene® Blood RNA tube was gently inverted by hand 10 times, allowed to sit at room temperature for approximately 6 h, frozen at −20 degrees Celsius for approximately 25 h, and then transferred to −80 degrees Celsius until extraction. Blood draws were taken via intravenous lines immediately following the neurobehavioral performance test battery and (typically) preceding a meal. Study staff were trained such that in the rare event of a failed collection with the intravenous line draw, the blood draw could be attempted again with a butterfly needle (not analyzed).
Total ribonucleic acid (RNA) was extracted from PAXgene® Blood RNA tubes by either the FAA Civil Aerospace Medical Institute (CAMI) or the Baylor College of Medicine Human Genome Sequencing Center (Baylor). Extractions by CAMI were performed on approximately three samples per participant (roughly ten percent) for periodic quality checks of sample material prior to submission of the remaining sample set to Baylor. Timepoints for CAMI extractions were selected in a stratified pseudo-random fashion. The CAMI extractions were conducted on the QIAcube Connect with an automated spin-column approach using the PAXGene Blood miRNA kit (Qiagen 763134), with final RNA elution into nuclease-free water. The remaining samples were extracted by Baylor using the Chemagic Prime Total RNA Blood 4 k kit (PerkinElmer, catalog #CMG-1484) and the Magnetic Bead technology Chemagic Prime 8 platform. Baylor prepared libraries from all RNA (i.e., both samples extracted by Baylor and CAMI) using the Illumina TruSeq Stranded Total RNA with Ribo-Zero Globin kit, followed by sequencing targeting 100 million forward plus reverse 150 base pair paired-end reads, as previously described [26]. Sequences were analyzed by CAMI, as described below.
Analyses
Phenotypic data exploration involved the generation of plots in CRAN R versions 4.3.3 and 4.4.1 [46] with ggplot2 version 3.5.1 [47], and model runs to test for associations with the study condition groups. For metrics in the 45-min neurobehavioral performance test battery, linear models were run with the ‘lmer’ command of the lmerTest package version 3.1-3 [48] if qq-plots showed an approximately normal distribution (default settings, with fit by REML and t-tests using Satterthwaite’s method). Otherwise, generalized linear models in the lme4 package version 1.1-35.5 [49] were run with the glmer command using Gamma (continuous) or Poisson (count) distributions, log link, default settings, and fit by maximum likelihood (Laplace Approximation). Other packages used included psych version 2.4.6.26 [50], tidyverse version 2.0.0 [51], gtools version 3.9.5 [52], dplyr [53], and optimx 2023-10.21 [54]. Models contained a random term for the participant and fixed terms for the study condition group as well as for the cumulative number of hours since midnight (RCH) at the outset of baseline study day 3. This cumulative hours covariate was centered and scaled with the ‘scale’ command to improve model convergence. In addition, sleep staging data from scored polysomnographic recordings were assessed using CRAN R version 4.3.3 with the stats package version 4.3.3 [46], along with lmerTest [48], lme4 [49], gtools version 3.9.5 [52], data.table version 1.16.0, dplyr [53], tidyverse version 2.0.0 [51], MplusAutomation version 1.1.1 [55], lavaan version 0.6-18 [56], and psych version 2.4.6.26 [50]. Specifically, MANOVA tests were performed to test for differences among condition groups in the percentage of time spent in Rapid Eye Movement (REM) sleep and in Non-Rapid Eye Movement (NREM) sleep stages 1, 2, and 3 during the baseline and first two recovery nights.
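The centering and scaling of the cumulative hours covariate can be illustrated outside of R. The sketch below mirrors the default behavior of R's ‘scale’ command (subtract the mean, divide by the sample standard deviation with an n − 1 denominator); it is a Python illustration with a hypothetical function name, not the code used in the study.

```python
from statistics import mean, stdev


def center_and_scale(values):
    """Subtract the mean and divide by the sample SD (n - 1 denominator)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Example: cumulative hours since midnight (RCH) at the outset of day 3
# for four consecutive test batteries spaced 4 h apart.
cumulative_hours = [8, 12, 16, 20]
scaled_hours = center_and_scale(cumulative_hours)
```

After scaling, the covariate has mean 0 and standard deviation 1, which keeps it on a scale comparable to the other model terms and can help mixed-model optimizers converge.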
In addition to phenotypic analyses, transcriptomic sequences were analyzed. De-multiplexed fastq.gz raw RNA sequence files underwent quality checks using FastQC v0.12.1 [57] and multiqc v1.14 [58], and were mapped against the T2T-CHM13v2.0 reference genome [59] to generate expression counts at the gene level, with cloud pipeline execution by the Department of Transportation – Secure Data Commons using Amazon Linux platforms. This pipeline involved the use of CutAdapt v4.3 for the removal of the Illumina TruSeq adapters from raw reads, discarding sequence reads shorter than 50 bases, and trimming low-quality bases with the flag --nextseq-trim=20 [60]. Trimming was followed by paired-read alignment using STAR v2.7.10b with read length set to 150 bases during the generation of genome indices, and the --outMultimapperOrder Random flag for random output of multimapping reads [61]. Subsequently, featureCounts v2.0.5 was used for strand-specific paired-read generation of expression counts [62]. Chimeric fragments aligned to different chromosomes were discarded by setting the -C flag, and the -d 50 flag was used to reinforce 50 bases as a minimum read length.
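The trimming, alignment, and counting steps described above can be sketched as command lines assembled in Python. This is an illustrative outline rather than the pipeline actually run: all file names, the genome directory, the adapter sequences, and the strandedness value passed to featureCounts are placeholders or assumptions, and only the flags --nextseq-trim=20, the 50-base minimum length, --outMultimapperOrder Random, -C, and -d 50 come from the text. The remaining input and output options are standard options of each tool added so the commands are complete.

```python
import shlex


def cutadapt_cmd(r1, r2, out1, out2, adapter_r1, adapter_r2):
    """Adapter removal, NextSeq-style quality trimming, 50-base minimum length."""
    return ["cutadapt", "-a", adapter_r1, "-A", adapter_r2,
            "--minimum-length", "50", "--nextseq-trim=20",
            "-o", out1, "-p", out2, r1, r2]


def star_cmd(genome_dir, r1, r2, prefix):
    """Paired-read alignment with random ordering of multimapping reads."""
    return ["STAR", "--genomeDir", genome_dir,
            "--readFilesIn", r1, r2,
            "--readFilesCommand", "zcat",
            "--outMultimapperOrder", "Random",
            "--outSAMtype", "BAM", "SortedByCoordinate",
            "--outFileNamePrefix", prefix]


def featurecounts_cmd(bam, gtf, out, strandedness):
    """Strand-specific fragment counting; -C drops chimeric fragments."""
    # ASSUMPTION: the strandedness value (0, 1, or 2) is not given in the text.
    return ["featureCounts", "-p", "--countReadPairs", "-s", str(strandedness),
            "-C", "-d", "50", "-a", gtf, "-o", out, bam]


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    steps = [
        cutadapt_cmd("s_R1.fastq.gz", "s_R2.fastq.gz",
                     "trim_R1.fastq.gz", "trim_R2.fastq.gz",
                     "ADAPTER_R1_PLACEHOLDER", "ADAPTER_R2_PLACEHOLDER"),
        star_cmd("chm13_index", "trim_R1.fastq.gz", "trim_R2.fastq.gz", "s_"),
        featurecounts_cmd("s_Aligned.sortedByCoord.out.bam",
                          "chm13.gtf", "s_counts.txt", 2),
    ]
    for step in steps:
        print(shlex.join(step))
```

Building each command as a list of arguments avoids shell-quoting problems if the commands are later passed to a process runner.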
Gene expression models largely used default settings, with exceptions specified below, using the limma v. 3.60.4 package of CRAN R v. 4.4.1. Genes with low expression were filtered out (i.e., models only analyzed genes that had at least 1 count per million in as many samples as there were participants), followed by trimmed mean of M values normalization [63]. Linear modeling of each gene was conducted with the voom approach [64], specifically using the function voomLmFit and the Benjamini and Hochberg method to generate False Discovery Rate (FDR) adjusted P-values for multiple testing. Genes were identified as differentially expressed if models yielded an FDR < 0.05 for the factor of interest.
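The low-expression filter and the Benjamini and Hochberg adjustment described above can be illustrated with a small standalone sketch. It is not the limma code used in the study: the function names are hypothetical, and counts per million are computed here from raw library sizes (total counts per sample), which is an assumption.

```python
def keep_gene(gene_counts, library_sizes, n_participants, min_cpm=1.0):
    """Keep a gene if CPM >= min_cpm in at least n_participants samples."""
    n_pass = sum(1 for count, lib in zip(gene_counts, library_sizes)
                 if count / lib * 1e6 >= min_cpm)
    return n_pass >= n_participants


def benjamini_hochberg(pvalues):
    """Return FDR-adjusted P-values in the original order of the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differentially_expressed(pvalues, fdr_cutoff=0.05):
    """Indices of genes whose adjusted P-value is below the cutoff."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues))
            if q < fdr_cutoff]
```

For example, raw P-values of 0.01, 0.04, 0.03, and 0.005 adjust to 0.02, 0.04, 0.04, and 0.02, so all four would pass an FDR cutoff of 0.05.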
In all gene expression linear models, participant was encoded as a random effect by specifying participant as a blocking variable, and all other terms were additive fixed effects. The cumulative number of hours since midnight (RCH) at the outset of baseline study day 3 (without centering or scaling) was encoded as a numeric covariate in an effort to account for the potential effects of increasing duration at the inpatient facility and repeat test administration. Based on principal component plots suggesting impacts of biological sex and RNA extraction method, these elements were incorporated in models with binary factor terms (male or female, FAA or Baylor). Models were run once on all 59 participants with a factor term to differentiate study groups (CT, DR, NR, TD), and again separately on each condition group with a dataset limited to the 14–16 participants in the group. In each model the final term of interest was either hours of wakefulness or a single neurobehavioral performance metric from the 45-min test battery encoded as a covariate (e.g., PVT lapses). Hours of wakefulness was defined as a count of the total number of scheduled hours of wakefulness since the last TIB sleep episode ended (e.g., for an assay at noon (RCH), if the participant’s most recent TIB ended at 08:00 RCH it was considered 4 h of wakefulness). Each neurobehavioral performance metric (and hours of wakefulness) was modeled separately. Running separate models on each neurobehavioral performance metric of interest on all 59 participants and again separately on each of the four condition groups could be criticized as a form of multiple testing. However, the approach was taken in this exploratory study to allow comparisons of genes differentially expressed relative to the neurobehavioral performance factor of interest in all versus just one sleep condition (CT, DR, NR, or TD) and to maximize the possibility of discovering biomarker candidates.
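The hours of wakefulness covariate defined above can be computed from the scheduled end times of TIB episodes. The sketch below is illustrative, with a hypothetical function name, and expresses all times as cumulative hours since midnight (RCH) at the outset of baseline study day 3.

```python
def hours_of_wakefulness(assay_time_h, tib_end_times_h):
    """Scheduled hours awake since the most recent TIB episode ended.

    Both arguments are in cumulative hours since midnight (RCH) at the
    outset of baseline study day 3.
    """
    previous_ends = [t for t in tib_end_times_h if t <= assay_time_h]
    if not previous_ends:
        raise ValueError("no TIB episode ended before this assay")
    return assay_time_h - max(previous_ends)
```

An assay at noon (12 h) after a TIB ending at 08:00 (8 h) gives 4 h of wakefulness, matching the example in the text. Under a hypothetical schedule with no intervening sleep, an assay 48 h after the last TIB ended would be assigned 48 h.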
Finally, gene lists were submitted to QIAGEN Ingenuity Pathway Analysis for a Core Analysis – Expression Analysis [65] to explore molecular pathways and functions. Log2 fold-change values from limma models were used as the data type expression log ratio in Ingenuity Pathway Analysis and selected as the measurement type for Core Analysis runs. All genes passing the low-expression cutoff in limma (≥ 1 count per million in as many samples as participants) were used as the background reference set, and an FDR < 0.05 for the neurobehavioral factor of interest (e.g., PVT lapses) was selected as the filter cutoff to identify the foreground differentially expressed list. Settings were left at default, except for limiting species to mammals and excluding endogenous chemicals from interaction networks.
