Biospecimens and quality control
Table of Contents
- Biospecimens and quality control
- Pathology review
- Sample and library preparation for 10x single-cell multiome ATAC and gene expression
- Nuclei isolation
- Library preparation
- Multiome (snATAC-seq and snRNA-seq) sequencing
- Library preparation exclusion criteria and quality control
- scMultiome data pre-processing
- Inter-patient analysis
- Intra-patient analyses
- Pseudobulk peak calling
- Population annotations
- Assessing the degree of phenotypic heterogeneity in locus-specific populations
- CNV inference
- Assessing the degree of clonal architecture heterogeneity in populations
- Differentially accessible regions and transcripts identification and analyses
- Ambient RNA correction
- Validation of gene signatures in an external single-cell RNAseq dataset
- Reporting summary
Patients
Biospecimens were collected from five patients diagnosed with prostate cancer as part of the German ICGC cohort. Written informed patient consent and an ethical vote (Institutional Reviewing Board) were obtained according to the ICGC guidelines (see The local ethics board commission in Hamburg approved the study (PV3552 and PV4679), and the study was conducted following ethical guidelines (Declaration of Helsinki). No statistical methods were used to determine sample size since this study was exploratory.
Inclusion criteria were: primary prostate cancers, high-risk, histologically adenocarcinomas, multiple biopsies from a locally advanced specimen with a unifocal disease, and matching pelvic lymph node metastases.
Patients did not receive any neo-adjuvant radiotherapy, ADT, or chemotherapy before the prostatectomy with pelvic lymphadenectomy, except for one patient (ICGC_PCAL25) who received neoadjuvant hormone therapy with LH-RH agonist 6 weeks before surgery. One patient (ICGC_PCAL03) had a known oligometastatic disease stage at the time of operation (M1b stadium in conventional imaging with bone metastases).
All prostate specimens were analyzed according to a standard procedure, including a complete embedding of the entire prostate for histological analysis72. Clinical metadata were retrieved from the patient’s records, and follow-up data were collected with questionnaires.
Pathology review
Specialized pathologists dissected each prostate immediately after surgery as described before27. In brief, dissection followed a predefined scheme to represent the position of each block relative to the entire prostate. This procedure resulted in 60–150 pieces of tissues depending on the size of the prostate. After dissection, each tissue block was placed on a separately labeled cork plate covered with a special compound for cryopreservation (OCT) before the tissue was frozen to −20 °C and subsequently stored at −80 °C. Accordingly, the dissection of the lymph node packages proceeded. For each patient, 10 distinct tumor biopsies were taken, and matching lymph node metastases were included in this study (Supplementary Dataset 1). In total, we had a sample collection of 66 samples from 5 different high-risk locally advanced prostate cancer patients (50 primary gland biopsies and 16 matching lymph node biopsies). Gleason grading per individual biopsy was reported (Supplementary Dataset 1). Tumor cell content exceeded 80% for each sample according to pathology reviews.
Sample and library preparation for 10x single-cell multiome ATAC and gene expression
Single-cell sequencing can be prone to batch effects. Given the complexity of our cohort (multiple patients, samples, and tissues), we implemented the following strategy to ensure we are minimizing potential confounding variables of wet-lab library preparation batches and sequencing batches outlined in (Supplementary Dataset 1). Most library-preparation batches (4 samples at a time) incorporated samples from a distinct patient while including randomly chosen samples from another patient. We aimed to randomize samples derived from the same tissue sites across batches as well, such that not all metastatic samples were processed together. This approach was chosen to reduce library preparation batch effects and to better reveal biological variability from technical noise in downstream analyses.
Nuclei isolation
The nuclei were isolated following the adapted protocol for clinically relevant frozen tissues described previously73. We kept the samples on dry ice until use and weighed them individually before starting. The range in weight was between 7 and 24 mg. To perform single-cell multiome ATAC + gene expression analysis, we added 0.2 U/µl of Protector RNAse-Inhibitor (Sigma Aldrich PN-3335399001) to the homogenization buffer (according to the 3’GEX assay). In brief, we thawed the frozen tissue in a pre-chilled douncer with 1 ml homogenization buffer for 5 min. We then dounced with pestle A for 10 strokes on ice and continued with pestle B for another 10 strokes. We next filtered the solution through a 40 µm tip into a prechilled 1.5 ml Eppendorf tube and spun the nuclei (5 min, 500 × g4 °C) in a fixed-angle centrifuge. Then, we removed all but the pellet with 50 µl of supernatant and re-suspended it in 350 µl HB to have a total volume of 400 µl. We transferred it to a 2 ml Lo-Bind Eppendorf tube and added 1 volume (400 µl) of 50% iodixanol solution to give a final concentration of 25% iodixanol and mixed by pipetting. Following, we slowly layered 600 µl of 30% iodixanol on top of 600 µl prepared 40% iodixanol and put 800 µl of 25% iodixanol/sample mixture carefully on top. After using a swinging bucket centrifuge (spinning for 20 min at 3000 × g with the brake off), we collected 200 µl of nuclei from the 30%–40% interface into a chilled 1.5 ml tube. We then aliquoted 32,500 nuclei (to aim eventually for 10,000 nuclei) into a 1.5 ml tube. We again spun (10 min, 600 × g4 °C) and discarded the supernatant. We next resuspended the pellet in ~6 µl diluted nuclei buffer + RNase inhibitor (see user guide, Chromium Next GEM Single Cell Multiome ATAC + gene expression from 10x).
Library preparation
Single-nucleus libraries were generated using the 10x Chromium single-cell Instrument and NextGEM Single Cell multiome ATAC + gene expression kit (10x Genomics) according to the manufacturer’s protocol (Chromium Next GEM single cell multiome ATAC + gene expression Reagent Kits User Guide; CG000338 Rev E). To minimize PCR biases during ATAC library construction, we performed an extra short quantitative PCR (qPCR) to assess how many cycles are needed to amplify our libraries without reaching saturation (1:72 °C for 5 min, 2:98 °C for 30 s, 3:98 °C 10 s, 4:63 °C for 30 s, 5:72 °C for 1 min, 6: go to 3–20 times). At the end of the qPCR run, we determined the number of cycles that should be used by looking at the plot of fluorescence vs. the cycle numbers to find out what the maximum fluorescence was, and at what cycle number ¼ of that maximum fluorescence was reached and continued with the ATAC library construction by PCR with 7–9 cycles.
Multiome (snATAC-seq and snRNA-seq) sequencing
Final libraries were quantified using the Qubit dsDNA HS Assay Kit (Life Technologies). The fragment size of every library was analyzed using the Bioanalyzer high-sensitivity chip. All 10x scMultiome ATAC and gene expression libraries were sequenced on the NovaSeq 6000 instruments (Illumina) according to the manufacturer’s protocol (Chromium Next GEM Single Cell Multiome ATAC + gene expression reagent kits user guide; CG000338 Rev E). BCL files were converted into fastq using the command cellranger_arc mkfastq with the default parameter. The generated fastq files were processed using cellranger-arc v2.0.0 count function for read trimming, alignment, duplicate marking (ATAC), and UMI counting (GEX) (Cell Ranger ARC, 10x Genomics). The reads were aligned to the pre-built human reference genome GRCh38—v2020-A-2.0.0 (3 May 2021) provided by 10x Genomics.
Library preparation exclusion criteria and quality control
Nuclei were evaluated after staining with 4′,6-diamidino-2-phenylindole (DAPI) and visualized under a high-powered fluorescent microscope (minimum magnification: 40x). Only samples containing high-quality nuclei, defined as clump-free, debris-free, and exhibiting intact membranes, were selected for downstream experimental procedures to ensure data integrity.
After ATAC-library amplification, a quality control (QC) step was performed using qPCR to assess chromatin accessibility. Fold enrichment was calculated by comparing two accessible chromatin regions to two inaccessible chromatin regions. A minimum fold enrichment of 10-fold in accessible regions was required for sample inclusion. The following primer pairs were used:
• Accessible regions (POS)
-
∘ GAPDH: (f) 5′-gccaatctCagtCCTTCCC-3′, (R) 5′-‘-ggackgcctactttt-3′
-
∘ KAT6B: (F) 5′-GAAGAGGCGGACCCAGCGGT-3′, (R) 5′-TTCCTGCCGGTCATCTCGCTT-3′
• Inaccessible regions (NEG)
-
∘ SLC22A3: (F) 5′-GGAGAGGGTGGACAGATTGA-3′, (R) 5′-TCAGCCTTGCTGCTACAGTG-3′
-
∘ QML93: (F) 5′-CACTGGTTGTCTTTGCAGGA-3′, (R) 5′-CCTGGGTCATATTGGGACAC-3′
Samples failing to meet these criteria—such as those with poor nuclei integrity or insufficient fold enrichment in accessible over inaccessible chromatin regions—were excluded from further analysis.
Therefore following samples were excluded: ICGC_PCAL03_Tu5, ICGC_PCAL03_Tu6, ICGC_PCAL03_Tu8, ICGC_PCAL03_Tu9, ICGC_PCAL10_LN3, ICGC_PCAL10_Tu8, ICGC_PCAL10_Tu9, ICGC_PCAL25_LN1, ICGC_PCAL25_LN2, ICGC_PCAL25_Tu2, ICGC_PCAL25_Tu3, ICGC_PCAL25_Tu4, ICGC_PCAL25_Tu5, ICGC_PCAL25_Tu7, ICGC_PCAL25_Tu9, ICGC_PCAL25_Tu10, ICGC_PCAL34_Tu2, ICGC_PCAL34_Tu3, ICGC_PCAL34_Tu4, ICGC_PCAL34_Tu7, ICGC_PCAL34_Tu10, ICGC_PCAL37_Tu6.
Samples with fewer than 1500 cells remaining after cellranger-arc v2.0.0 and QC (described below) were also removed: ICGC_PCAL03_LN2 (838 cells post QC), ICGC_PCAL10_Tu7 (52 cells post QC), ICGC_PCAL10_Tu10 (1339 cells post QC), ICGC_PCAL34_Tu1 (319 cells post QC), ICGC_PCAL37_LN1 (670 cells post QC), ICGC_PCAL37_Tu1 (317 cells post QC). In total, 38 samples (27 individual primary gland tumor biopsies and 11 matching lymph node biopsies) passed all QC criteria for inclusion in the analyses presented in this manuscript (Supplementary Dataset 1 batches and QC).
scMultiome data pre-processing
For each sample, output files from Cell Ranger ARC were processed with Seurat and Signac, with the following versions: R = 4.1.0, Signac = 1.4.0, Seurat = 4.1.0. Cells were filtered if one of the following criteria was met: nFeature_ATAC > 500, nFeature_RNA > 500, percent.mt <20, TSS.enrichment > 2. Cells were removed if both of the following conditions were met: nFeature_RNA > 8000 and nFeature_ATAC > 80000. Notably, we computed TSS enrichment using the Signac function (TSSEnrichment) and found that median TSS scores reported by ArchR on the same cells are on average, 2.5× higher than those reported by Signac, showcased by recomputing TSS scores using ArchR and Signac across 11 samples from ICGC_PCAL34 (Supplementary Dataset 1). 99% with TSS over 2 in Signac had TSS over 4 using ArchR, and the remaining 1% had a median of Signac TSS 3.6. Lastly, we did not opt for doublet correction and chose to filter on the top percentile of counts for RNA and ATAC, as doublet correction methods on RNA and ATAC individually did not produce results that agreed across samples74,75. Lastly, ATAC data from each sample were then reprocessed using a binned genome approach with 5 kb window sizes across hg38 using the Signac FeatureMatrix function. Downstream analyses were performed on the resulting Seurat object, containing an RNA and tiled genome matrix and were merged across patients.
Inter-patient analysis
Inter-patient data analysis was done using Python (version 3.9.13). Due to the large size of the patient-wide dataset containing all samples, we transitioned from using R to Python-based packages. Subsequent to QC steps and cell filtering described above, Seurat RNA and ATAC assays from each patient were converted to a Python-readable format using R packages MuDataSeurat and SeuratDisk. We then used the Muon package to load and merge each assay across all samples76 and proceeded with downstream processing.
For RNA, we removed mitochondrial and ribosomal transcripts prior to cluster/population calling. We had already used mitochondrial-based metrics to remove low-quality cells. However, mitochondrial and ribosomal reads can represent technical variability unrelated to the biological variability of interest77. Thus, for a more conservative approach in understanding inter-patient heterogeneity, we decided against the inclusion of mitochondrial and ribosomal reads across patients to minimize potential technical variation across patients influencing population-calling (clustering) results. We used scanpy default functions to normalize and scale data (normalize_total), retrieve and highly variable genes (highly_variable_genes), perform dimensionality reduction with PCA, and build the nearest-neighbor graph with defaults.
For ATAC, we normalized with TF-IDF and retrieved the top 5000 highly variable features, performed dimensionality reduction with LSI, and built the nearest neighbor graph on the top 50 LSI dimensions, excluding dimensions with higher than 0.7 Pearson correlation coefficient with depth (nCount_ATAC). We then used the Muon implementation of the weighted-nearest neighbor approach to build an integrated graph of the RNA and ATAC data, which captures the complementary information from both gene expression and chromatin accessibility in each cell28,76. We used Leiden clustering resolution with 1.3 on the obtained multi-modal graph. After clustering, we called peaks for each population using the MACS3 algorithm and built an inter-patient-wide feature set as described below (pseudobulk population calling). We then repeated dimensionality reduction steps for ATAC, recomputed WNN, clustering, and population assignments. Lastly, we did not proceed with batch correction, as nonmalignant cell populations showed intermixing across patients. This is indicative of the clustering algorithms preserving genuine biological differences. Batch correction could be potentially detrimental to removing meaningful differences across malignant epithelial populations78.
Intra-patient analyses
All intra-patient analyses were performed in R version = 4.1.0, Signac = 1.4.0, and Seurat = 4.1.0. For prostate-specific analyses (Fig. 2), we merged samples coming from the primary tumor sections within a prostate gland of a given patient. For intra-patient analyses across tissues (Fig. 4), we merged all samples from metastatic lymph nodes and primary prostate tumor sections. For each merged object, we processed RNA, ATAC, and the subsequent Multiome object as described above, but used the equivalent R-based Seurat/Signac implementations. For RNA processing, we used SCTransform to normalize the filtered count matrix and PCA to run louvain graph-based clustering on the first 50 principal components, and for ATAC, we used RunTFIDF to normalize the data, found highly variable regions using FindTopFeaturesand RunSVD to project ATAC data to lower dimensions. We built the nearest neighbor graph on the top 50 PCA and top 50 LSI dimensions, excluding dimensions with higher than 0.7 correlation with ATAC depth (ATAC_nCount). The weighted nearest neighbor graph was built using FindMultiModalNeighbors. Populations were called on this graph using FindClusters for each patient with multiple resolutions, we tested and annotated clusters across resolutions 0.4, 0.6, and 0.8; and found 0.4 to cover the major cell types across all patients.
Pseudobulk peak calling
We wrote a Snakemake pipeline for calling and merging peaks for any group of cells in our dataset. First, we generated a fragment TSV file for every cell in a given group (in this case, our called populations) per sample. Second, MACS3 was then used to call peaks on the fragment files with the following parameters: -f BED -q 0.01 -g 2.7e9 –keep-dup all –nomodel –nolambda –shift −75 –extsize 150 –bdg –call-summits79 and peaks were iteratively merged80. We then re-computed the FeatureMatrix in Signac and either transferred to Python for inter-patient analysis or added them to the Seurat objects for intra-glandular/intra-patient analyses. Third, the new FeatureMatrices were used to obtain signal over promoter and non-promoter elements and compute ChromVar-based transcription factor enrichment scores29. For each peak catalog, we assigned whether a peak was a promoter element if it was within +−1 kb TSS in gencode transcript annotations, otherwise, the peak was annotated as non-promoter, which was used in subsequent Pearson correlation analyses described below. We used a merged peakset across all samples for inter-patient and patient-specific peaksets for intra-patient analyses.
Population annotations
We used a multi-faceted manual and automated approach to go about population annotations, independently utilizing RNA and ATAC data. For every population, we computed the marker genes and chromVar transcription factor motif enrichments by looking at the top upregulated genes relative to the entire sample of interest (Seurat’s FindAllMarkers). For every cell in our dataset, we computed the singleR assignments using the publicly available default datasets and a prostate-specific dataset81,82. We also computed each cell’s ModuleScore across gene expression markers reported in relevant papers24,82,83,84 using Seurat’s AddModuleScorewith additional cell-cycle markers obtained from We then visualized all singleR’s cell type assignment proportions per population and visualized ModuleScores to facilitate manual cell type assignments. For every population, we checked the alignment of the highly expressed genes with expected genes per cell type with literature search and available atlases (proteinatlas.org). We also utilized ATAC-based transcription factor-deviation scores per population, as ATAC signal tends to be less prone to biological variability such as cell cycle, and TF-motif enrichment computations rely on more global genome-wide statistics. We cross-referenced TFs based on their enrichment relative to the expected average (which would contain more malignant epithelial cell TFs) and annotated populations according to TFs. Furthermore, more than one expert was exposed to the population markers and visualizations and independently annotated populations according to the steps above to ensure agreement. This rigorous, multi-layered approach provided confidence that our findings represent biological cell types within and across patient samples.
Assessing the degree of phenotypic heterogeneity in locus-specific populations
We assessed heterogeneity between populations in two ways. First, we reported how inter-mixed populations were based on variables of interest, such as patient, tumor locus, and tissue, using Shannon entropy. Additionally, to compare scores across patients when reporting the median across patients, we normalized the scores to be between 0 and 1 when 0 indicating no entropy and 1 indicating maximum entropy relative that can be achieved given the number of loci and cells per loci. Second, we derived Pearson correlation values on DESeq2 normalized pseudobulked signal across the three feature sets of non-promoter DNA elements, as well as promoter elements and gene expression. We obtained normalized pseudobulked values per population by summing raw counts of each population according to the three feature sets and normalizing them using DESeq2 vst85.
To assess how transcription factor families drive variance, we conducted PCA on median chromVar motif deviations per population. We then binned the feature loadings into 30 bins across 633 motifs, covering the top and bottom ~5th percentiles of features contributing to a given axis of variation.
To contextualize biological programmes driving intra-prostate tumor heterogeneity, we used the SCENIC+ pipeline on tumors from ICGC_PCAL34 and ICGC_PCAL0337. Pre-processing of ATAC data for the SCENIC+ pipeline was done using the polars_1xx branch of the pycisTopic package for scalability purposes. ScRNA preprocessing and the SCENIC+ pipeline were run according to recommendations highlighted at eRegulons specificity scores for each malignant epithelial cluster were calculated using sceniecplus.RSS.regulon_specificity_scores() to select the top 10 most specific and confident eRegulons per malignant epithelial cluster. The genes of the highly specific eRegulons were used to create gene lists for pathway enrichment using the gseapy Python package.
CNV inference
We used the NUMBAT algorithm39,86 to perform CNV inference. We performed NUMBAT per sample to ensure tumor vs. normal calls were properly cross-referenced with our RNA/ATAC-based cell type annotations. We then used NUMBAT on a patient-wide basis to get a sense of all CNV segments with higher accuracy. However, due to the large number of cells from each patient, we kept only the cells annotated as malignant epithelial and had to further subsample the number of cells used in the per-patient analysis as follows, for the pipeline to run. To allow computational feasibility, we kept a maximum of 3000 cells per sample in ICGC_PCAL03, ICGC_PCAL10, ICGC_PCAL37, and 6000 malignant epithelial cells per sample in ICGC_PCAL34 and ICGC_PCAL25. Lymphocyte populations from ICGC_PCAL10 and ICGC_PCAL37 were used as background, since they had the largest number of cells from the TME. For visualization purposes (Fig. 2b), we used the most common CNV call in all cells assigned to a given clone. To compute the difference between clones, we used the Hamming metric, which measures the distance between two vectors of equal length. The CNV calls per cell were then used to compute this metric between all cells and clones, to broadly assess the difference between clones in a given patient.
Assessing the degree of clonal architecture heterogeneity in populations
We assessed how similar the clonal makeup of each population was according to the Jensen–Shannon (JS) divergence metric. We used JS divergence as it provides a symmetrical and finite measurement of the difference between two probability distributions. For every malignant epithelial population with more than 200 cells, we computed the proportion of cells that were assigned to each clone, and the JS distance was calculated per population pairs.
Differentially accessible regions and transcripts identification and analyses
To identify differentially accessible peaks and transcripts, we pseudobulked raw counts from malignant cells from a given patient per sample across all features. Only cells annotated as malignant epithelial were kept. We then used DESeq2’s ~patient+tissue to retrieve differential peaks and transcripts of primary prostate and metastatic lymph nodes, FDR thresholded at <0.1. We then computed transcription factor DNA motif enrichment scores using HOMER on differentially accessible peaks with absolute LFC > 1 and q < 0.187.
Ambient RNA correction
To gain confidence in our differential analyses, we repeated our pipeline but used ambient RNA corrected count matrices using CellBender43. We optimized parameters per sample, except for the FPR, which we had to keep consistent across all, according to the authors (FPR = 0.05). Overall, CellBender retained a median of 40% more cells than CellRanger and did not have an apparent benefit to the malignant epithelial cell expression patterns (Supplementary Fig. 8a, b). We kept only the cells that passed QC according to our scMultiome data pre-processing as outlined, and repeated our differential analysis as described above.
Validation of gene signatures in an external single-cell RNAseq dataset
To validate the AUCell enrichment scores of gene signatures, we utilized an external dataset47. Specifically, we downloaded the Seurat object (primecut.integrated.geneexp.seurat.rds) from the publicly available repository hosted at This dataset contains gene expression profiles of castration-sensitive prostate cancer (mCSP) cells from various metastatic sites. We subset the dataset to exclude TME cell types, isolating a total of 3976 malignant epithelial mCSPC cells. These cells were distributed across four metastatic sites: lymph nodes (n = 1096), bone (n = 1263), lung (n = 641), and liver (n = 976). We calculated AUCell enrichment scores for our lymph node up- and down-regulated genes across the treatment stages in all metastatic sites to assess the dynamics of our gene signatures in response to treatment. We calculated AUCell enrichment scores for lymph node up- and downregulated genes across treatment stages in all metastatic sites to assess gene signature dynamics. Scores were normalized between 0 and 1, and a Kruskal–Wallis test determined whether at least one treatment group differed significantly. Pairwise comparisons were conducted using the Wilcoxon rank-sum test. This approach allowed us to investigate both site-specific and systemic changes in gene signature dynamics during mCSPC progression and treatment.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
