A dataset of chromosomal instability gene signature scores in normal and cancer cells from the human breast

About this collection


1 digital object.

Cite This Work

Baba, Shahnawaz A.; Labhsetwar, Shreyas; Klemke, Richard; Desgrosellier, Jay S. (2023). A dataset of chromosomal instability gene signature scores in normal and cancer cells from the human breast. UC San Diego Library Digital Collections.


These data show the relative amount of chromosomal instability (CIN) in a diverse array of human breast cell types, including non-transformed mammary epithelial cells as well as cancer cell lines. Additional data are provided from human embryonic and mesenchymal stem cells. To produce this dataset, we compared a published CIN gene signature against publicly available datasets containing gene expression information for each cell. We then analyzed these data with the Python GSEAPY software package, which yielded a CIN enrichment score for each cell. These data are useful for comparing the relative amounts of CIN in different breast cell types, including cells representing the major clinical subtypes (ER/PR+, HER2+ and triple-negative) as well as the intrinsic breast cancer subtypes (Luminal B, HER2+, Basal-like and Claudin-low). Our dataset has great potential for reuse given the recent surge of interest in the role of CIN in breast cancer. The large size of the dataset, coupled with the diversity of the cell types represented, provides numerous possibilities for future comparisons.
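To make the idea of a per-sample enrichment score concrete, the following is a minimal pure-Python sketch of a single-sample enrichment statistic in the style of single-sample GSEA. It is not the GSEAPY implementation and does not reproduce its normalization options. The function name `ssgsea_score`, the default weighting exponent of 0.25, and the toy gene names and values are all assumptions made for illustration; none of them come from the dataset.

```python
def ssgsea_score(expression, gene_set, alpha=0.25):
    """Rank-based enrichment score for one sample (illustrative sketch).

    expression: dict mapping gene symbol -> expression value for one sample.
    gene_set:   set of gene symbols forming the signature.
    alpha:      exponent applied to the rank of each signature gene (assumed).
    """
    # Order genes from highest to lowest expression.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    hits = [gene in gene_set for gene in ordered]
    n_hits = sum(hits)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must overlap the sample partially")

    # Signature genes are weighted by rank**alpha; the top gene has rank n.
    weights = [(n - i) ** alpha if hit else 0.0 for i, hit in enumerate(hits)]
    total_weight = sum(weights)
    miss_step = 1.0 / (n - n_hits)

    # Walk down the ranked list, accumulating the difference between the
    # weighted running fraction of hits and the running fraction of misses.
    p_hit = 0.0
    p_miss = 0.0
    score = 0.0
    for weight, hit in zip(weights, hits):
        if hit:
            p_hit += weight / total_weight
        else:
            p_miss += miss_step
        score += p_hit - p_miss
    return score


# Toy sample with hypothetical gene names and values.
sample = {"GENE_A": 10.0, "GENE_B": 8.0, "GENE_C": 5.0, "GENE_D": 2.0, "GENE_E": 1.0}
high = ssgsea_score(sample, {"GENE_A", "GENE_B"})  # signature genes rank at the top
low = ssgsea_score(sample, {"GENE_D", "GENE_E"})   # signature genes rank at the bottom
```

In this sketch a signature whose genes sit near the top of a sample's ranked expression list receives a positive score, and one whose genes sit near the bottom receives a negative score, which is the sense in which scores can be compared across cell types.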

Creation Date
  • 2018 to 2023
Date Issued
  • 2023
Principal Investigator
  • Jay S. Desgrosellier

FASTQ files were converted to gene-expression matrices, and the files were processed to remove all header information and retain only the data. The CIN gene signature was acquired from Bakhoum et al. The CIN scores were obtained by examining enrichment for the CIN-associated gene signature in each cell type represented in the sequencing datasets, according to Barbie et al. To generate the CIN scores for each cell type, we analyzed the data with the Python GSEAPY library. First, input files were read using Python’s Pandas library and joined with each other using the ID & amp columns before deleting any unnecessary columns. Ensembl gene IDs were mapped to their HGNC symbols using Python’s BioMart API. Any Ensembl ID that did not have a corresponding HGNC symbol was dropped. This produced a data frame with HGNC symbols as rows, samples as columns, and feature counts as values; the final data comprised 36866 rows and 106 columns before being fed into GSEAPY. To determine the enrichment scores (ES), we passed this data frame, along with the CIN gene set, to Single Sample GSEA. The experiment was repeated with a normalized version of the data frame, but the normalized enrichment scores (NES) were identical to the ES. GSEAPY output was then converted to Excel format and saved as the final results files.
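The preprocessing steps described above (join the input tables, map gene IDs to symbols, drop unmapped IDs) might look roughly like the following pandas sketch. It is an illustration under stated assumptions, not the authors' script: the function name `build_symbol_matrix`, the column name `gene_id`, the placeholder IDs and symbols, and the use of a plain dictionary in place of a BioMart query are all hypothetical.

```python
import pandas as pd


def build_symbol_matrix(tables, id_to_symbol, id_col="gene_id"):
    """Join count tables on a shared ID column and re-index by gene symbol.

    tables:       list of DataFrames, each with `id_col` plus sample columns.
    id_to_symbol: dict mapping gene ID -> gene symbol (stand-in for BioMart).
    """
    # Join the tables on the shared gene ID column.
    merged = tables[0]
    for table in tables[1:]:
        merged = merged.merge(table, on=id_col, how="inner")

    # Map IDs to symbols; IDs missing from the lookup become NaN and are dropped.
    merged = merged.assign(symbol=merged[id_col].map(id_to_symbol))
    merged = merged.dropna(subset=["symbol"])

    # Result: symbols as rows, samples as columns, counts as values.
    return merged.drop(columns=[id_col]).set_index("symbol")


# Toy inputs with placeholder identifiers (not real Ensembl IDs or HGNC symbols).
batch_1 = pd.DataFrame({"gene_id": ["ID_1", "ID_2", "ID_3"], "sample_a": [5, 0, 12]})
batch_2 = pd.DataFrame({"gene_id": ["ID_1", "ID_2", "ID_3"], "sample_b": [7, 3, 9]})
lookup = {"ID_1": "SYMBOL_1", "ID_3": "SYMBOL_3"}  # ID_2 has no symbol

matrix = build_symbol_matrix([batch_1, batch_2], lookup)
```

The resulting `matrix` has the layout the text describes for the enrichment step: one row per gene symbol and one column per sample, with unmapped IDs removed.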


Shahnawaz A. Baba was responsible for conceptualization of the study. Shahnawaz A. Baba, Shreyas Labhsetwar, Richard Klemke, and Jay S. Desgrosellier were responsible for developing the methodology. Shreyas Labhsetwar and Richard Klemke were responsible for formal analysis and software development. Jay S. Desgrosellier was Principal Investigator.


Tobacco-Related Disease Research Program [Grant #T32IR4741 (to J.S.D.)]; and the California Breast Cancer Research Program [Grant #B28IB5479 (to J.S.D.)].



Language
  • English
Related Resources