BreastMark: Transcriptomic Datasets and Breast Cancer Outcome
Methods
Gene Expression Data
Gene expression data sets were downloaded from the Gene Expression Omnibus or authors' websites in the form of raw data files, where possible. Only breast cancer datasets with survival information and at least 48 patients were included. Large datasets were chosen for this analysis so as to avoid the sampling effects associated with small datasets. A cut-off of 48 was chosen as all smaller breast cancer datasets either lacked detailed clinical data or had too few samples (approximately 30 samples or less). In total, 4,738 samples across 26 datasets incorporating 12 different microarray platforms were utilised to develop the BreastMark system (Table 1). Table 2 contains a breakdown of the clinical information available with each dataset. Where raw data were not available, the normalised data as published by the original authors were used. In the case of the raw data for the Affymetrix datasets (.cel files), gene expression values were called using the GeneChip (GC) robust multichip average method and data were quantile normalised using the Bioconductor package, affy. For the dual-channel platforms, data were loess normalised using the Bioconductor package limma. Hybridisation probes were mapped to Entrez gene IDs to gene centre the data. The Entrez gene IDs corresponding to the array probes were obtained using Biomart and the Bioconductor annotation libraries. Probes that hit multiple genes were filtered out. If there were multiple probes for the same gene, the probe values were averaged for that gene. This resulted in expression data for a total of 20,017 Entrez gene IDs across 4,738 samples.
microRNA Expression Data
miRNAs are frequently located within the introns of protein coding genes and in exons of non-coding transcripts. miRNA expression can be detected using conventional microarrays through host gene expression for intragenic miRNAs or by direct probe matching for intergenic miRNAs. A total of 1,987 samples were processed on U133A Affymetrix arrays, while 973 were processed on U133plus2 Affymetrix arrays (2,960 in total). U133A and U133plus2 microarrays have 22,277 probe sets in common. Using this information, it is possible to infer the expression of 341 miRNAs across 2,960 samples (based on miRBase version 13.0, Ensembl version 54_36p). As with the gene centred data, this information was also combined with the available clinical data for survival analysis.
Breast Cancer Subtypes
The R package genefu was used to classify the 4,739 breast cancer samples into the luminal A, luminal B, HER2, normal-like and basal molecular subtypes using the ssp2003, ssp2006, and pam50, classifiers.
Survival Analysis
We have combined detailed clinical data from each of the studies used here, including one or more of DFS, DDFS and OS. The software allows for each of these three survival end points to be analysed separately. Median expression was used to dichotomise the data, allowing stratification into high and low groups within each of the 26 individual datasets. Once a sample was assigned to a particular group, the 26 datasets were combined and a global pooled survival analysis was performed in real-time. It is important to treat each dataset separately when determining which group a sample belongs to, as the expression of these genes will vary greatly across the different experiments/platforms. In essence, each dataset is split into high and low in singularity to negate study-specific effects. Survival curves are based on Kaplan-Meier estimates and the log-rank P-value is shown for difference in survival. Cox regression analysis was used to calculate hazard ratios. The R package 'survival' was used to calculate and plot the Kaplan-Meier survival curve. All calculations were carried out in the R statistical environment.
Software Parameters
The software incorporates all the clinical data made available by the original authors. This allows the data to be analysed based on one or more common clinical parameters including patient age, tumour size, lymph node status, tamoxifen treatment, chemotherapy treatment, ER status, HER2 status, PR status and tumour grade. The software also allows the upper or lower quartiles of the expression of the gene of interest to be used to determine high and low groups within each of the 26 individual datasets.
Web Server
The interface is available on a publicly accessible web server and is updated quarterly. The software uses CGI to link the web server with the R/perl based algorithm. All calculations are carried out in real-time.
Validation of BreastMark Using the Oncotype DX Gene Signature
The 21 gene signature used by oncotype DX in predicting patient prognosis was downloaded from the original paper. This panel of prospectively selected genes comprises 16 prognostic genes normalised relative to the expression of five reference genes. The 16 prognostic genes are broken down into five categories: proliferation, invasion, HER2, estrogen and 'other'. The likelihood of breast cancer relapse in patients was based on a recurrence score (RS) algorithm constructed and tested on a cohort of 668 patient samples. The higher the RS, the poorer the patient outcome observed. This algorithm weights each of the five categories based on the influence they have on disease recurrence. For example, the proliferation group is weighted most highly and, therefore, the expression of these genes influences the RS the most. Each of the 16 oncogenes were queried in our dataset to test the effect each gene has on survival using the three above-mentioned survival end-points for prognosis, namely DFS, DDFS and OS. It is expected that the genes with the greatest influence on the RS would have the highest hazard ratios and the lowest P-values. Sample numbers will vary depending on the number of platforms with expression information available for a particular gene.
Validation of BreastMark Using the MammaPrint Gene Signature
The 70-gene prognostic signature was downloaded from the original paper along with their correlation with prognosis. It was possible to obtain unique Entrez gene IDs for 61 of these genes (there is more than one copy of PEC1, IGFBP5 and DIAPH3 (three) in the 70 gene list and five others have no Entrez gene ID). As with the oncotype DX signature, each gene was analysed separately within our datasets using the three survival endpoints, DFS, DDFS and OS. Although looking at these genes individually does not represent the full power of this prognostic signature, this dataset should still be enriched for prognostic markers. Additionally, the positive and negative correlation coefficients published by the original authors should be consistent with our observed hazard ratios of less than or greater than 1, respectively. Sample numbers will vary depending on the number of platforms with expression information for a particular Entrez Gene ID.
Receptor Tyrosine Kinases
We compiled a list of 58 RTKs from the literature. Using our algorithm, we identified which of the RTKs were associated with survival within the basal molecular subtype using the ssp2003, ssp2006 and PAM50 molecular classifiers (see above). DFS, DDFS and OS were used as the survival endpoints. A P-value of < 0.05 in a minimum of two out of three classifiers was considered significant. The data were dichotomised using three cut-offs, median expression, greater than the 75th percentile referred to as the 'high' cut-off and less than the 25th percentile referred to as the 'low' cut-off.