Using machine learning to identify cancer-driving mutations

In a recent study published in the journal Nucleic Acids Research, researchers examine whether machine learning can locate pan-cancer mutational hotspots at persistent CCCTC-binding factor (P-CTCF) binding sites (P-CTCF-BSs).

Cancer and CTCF

CTCF is a protein that regulates transcription and nuclear architecture, and mutations in its binding sites in non-coding DNA can affect how it functions. Persistent CTCF binding sites (P-CTCF-BSs) exhibit binding conservation and resistance to CTCF knockdown.

Higher binding strength, constitutive binding, and enrichment at chromatin loop anchors and topologically associating domain (TAD) boundaries set these sites apart. Mutations in CTCF binding sites can activate oncogenes, although few such mutations have been identified.

About the study

In the current study, researchers created CTCF-In-Silico Investigation of PersisTEnt Binding (INSITE), a computational technique that can predict the persistence of CTCF binding after knockdown in cancer cells.

The machine learning tool CTCF-INSITE evaluates both genetic and epigenetic features that explain the persistence of CTCF binding. The researchers computed persistence measures for Encyclopedia of DNA Elements (ENCODE) CTCF ChIP-sequencing data from various tissue types, then determined the mutational load at P-CTCF binding sites using International Cancer Genome Consortium (ICGC) sequences from matched cancers. High-coverage whole-genome sequencing (WGS) data from the Platinum Genomes project and data from the National Center for Biotechnology Information (NCBI) were also used.

Working with CTCF ChIP-seq data from the IMR-90, MCF7, and LNCaP cell lines, derived from lung tissue, breast cancer, and prostate adenocarcinoma, respectively, the researchers screened cohorts with fewer mutations per individual. The interquartile range (IQR) method was used to identify and remove outliers, and 24 cohorts totaling 3,218 patients were available for the analysis.
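
As a rough illustration of IQR-based outlier removal, the sketch below drops values that fall more than 1.5 IQRs outside the quartiles. The article does not say which quantity the filter was applied to or which multiplier was used, so the per-patient mutation counts, the function name remove_iqr_outliers, and the factor k=1.5 are all assumptions made for this example.

```python
from statistics import quantiles

def remove_iqr_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical mutation counts per patient (made up for illustration).
mutation_counts = [120, 135, 150, 160, 170, 180, 195, 210, 5000]
kept = remove_iqr_outliers(mutation_counts)
print(kept)
```

With these made-up numbers, the single hypermutated value is discarded and the other eight are kept.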

Mutations from cohorts of the same cancer type were then combined, yielding 12 cancer types. Genomic features, chromatin interactions, binding affinity, replication timing, constitutive binding, and conservation scores were examined for IMR-90, LNCaP, and MCF7 cells.

Random forest modeling was used because it predicted CTCF binding in silico more accurately than linear regression models. The data were split 9:1 into training and testing datasets.
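
A 9:1 split can be sketched in a few lines of standard-library Python. This shows only the data partitioning, not the random forest itself, and the site labels, the seed, and the function name split_train_test are hypothetical. In practice, a library such as scikit-learn would typically handle both the split and the model.

```python
import random

def split_train_test(records, test_fraction=0.1, seed=0):
    """Shuffle the records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical binding-site identifiers (placeholders, not real data).
sites = [f"site_{i}" for i in range(1000)]
train_sites, test_sites = split_train_test(sites)
print(len(train_sites), len(test_sites))
```

Shuffling before splitting avoids holding out a block of sites that happen to be adjacent in the input order.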

Binding motif analysis was also used to locate the binding site within each ChIP-seq peak, which ranged from 200 to 2,000 base pairs (bp). Every region of a ChIP-seq peak was then assigned a motif score.
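
One common way to assign motif scores along a peak is to slide a position weight matrix across the sequence and compute a log-odds score for each window. The sketch below assumes that approach; the article does not describe the scoring method actually used. The 4-bp matrix TOY_MOTIF is invented for illustration and is not the real CTCF motif, and only the forward strand of an uppercase A/C/G/T sequence is scored.

```python
import math

# Hypothetical 4-bp position probability matrix (NOT the real CTCF motif).
TOY_MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency

def motif_scores(sequence, motif=TOY_MOTIF):
    """Return a log-odds score for every motif-length window."""
    width = len(motif)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(math.log2(column[base] / BACKGROUND)
                          for column, base in zip(motif, window)))
    return scores

peak = "TTGACGTCCA"  # toy sequence standing in for a ChIP-seq peak
scores = motif_scores(peak)
best_start = max(range(len(scores)), key=scores.__getitem__)
print(best_start, peak[best_start:best_start + len(TOY_MOTIF)])
```

The highest-scoring window is then a candidate for the binding site inside the peak.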

Each patient's trinucleotide mutational context was determined, and the mutational burden was compared between P- and L-CTCF-BSs. Aggregating these results produced a background mutation rate of CTCF-BSs for each cancer type. The researchers also carried out gene set enrichment analysis (GSEA) and used fluorescence polarization DNA binding (FPDB) assays to test how mutations affected CTCF binding.
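
To make the idea of a context-aware background rate concrete, the sketch below counts how often each trinucleotide is mutated relative to how often it occurs, then sums those rates over a region to get an expected mutation count. This is a simplified illustration and not the study's procedure: the sequences, positions, and function names are hypothetical, and reverse-complement contexts are not merged as they usually would be.

```python
from collections import Counter

def context_at(sequence, pos):
    """Trinucleotide centered on a 0-based position."""
    return sequence[pos - 1:pos + 2]

def context_rates(reference, mutated_positions):
    """Mutations per occurrence for each trinucleotide in the reference."""
    interior = range(1, len(reference) - 1)  # positions with both flanks
    occurrences = Counter(context_at(reference, i) for i in interior)
    mutated = Counter(context_at(reference, p)
                      for p in mutated_positions if p in interior)
    return {ctx: mutated[ctx] / n for ctx, n in occurrences.items()}

def expected_mutations(region, rates):
    """Sum the context-specific rates over every trinucleotide in a region."""
    return sum(rates.get(context_at(region, i), 0.0)
               for i in range(1, len(region) - 1))

reference = "ACGTCGATCGA"  # toy reference sequence
rates = context_rates(reference, [2, 5, 8])  # hypothetical mutated positions
expected = expected_mutations("TCGA", rates)
print(rates, expected)
```

Comparing the observed count at a binding site with an expectation of this kind indicates whether the site is mutated more often than its sequence composition alone would predict.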

Study findings

In prostate and breast tumors, P-CTCF binding sites had much higher mutation rates than all other CTCF binding sites. In all 12 cancer types studied, predicted P-CTCF binding sites had a significantly higher mutational load. Mutations likely to have a functional effect on CTCF binding and chromatin looping were also substantially more abundant at P-CTCF binding sites.

In vitro assays showed that cancer mutations at P-CTCF binding sites that were predicted to be disruptive reduced CTCF binding. Mutations were more common at P-CTCF binding sites than at L-CTCF binding sites across the 12 cancer types. P-CTCF binding site mutations were linked to loop disruption, implying that they contribute to three-dimensional genomic dysregulation in cancer.

Binding affinity is critical for the persistence of P-CTCF-BSs, particularly at chromatin loop anchors, late-replicating regions, and TAD boundaries. Furthermore, the persistence of chromatin loops at these sites points to their durability.

The researchers found significant allelic imbalances in binding at 91 sites, with mutations reducing binding affinity. Gene sets downregulated by ultraviolet (UV) light were enriched in breast cancer, whereas epithelial-to-mesenchymal transition gene sets were enriched in prostate cancer. P-CTCF-BSs had a higher mutation rate and were significantly more likely to carry disruptive mutations than L-CTCF binding sites.


The study identified a novel subclass of cancer-specific CTCF-BS DNA alterations and sheds light on their critical role in pan-cancer genomic architecture. CTCF-INSITE revealed considerable enrichment of mutations at P-CTCF-BSs across cancer types. These mutations are thought to be functional because they may disrupt chromatin loops and because they reduced CTCF binding in in vitro assays.

The heightened mutational signal at P-CTCF binding sites should help researchers investigate the mutational characteristics of various forms of cancer. The predictive power of CTCF-INSITE also points to promising CTCF-BS candidates for experimental modification, which researchers should prioritize in order to better understand the origins of cancer.
