Breast cancer is one of the most prevalent cancers among women worldwide. Numerous factors predict the disease, including inherited genetics, reproductive factors, and lifestyle choices.
Earlier research has highlighted etiological differences between pre- and post-menopausal breast cancers. More recently, researchers have applied a range of methods to predict breast cancer in women more accurately.
Background
Machine learning (ML) techniques can model complex non-linear relationships and analyze very large sets of predictors. Although ML has been used in earlier studies to estimate the risk of breast cancer, those studies did not identify new predictors.
Hypothesis-free methods can be used to find new breast cancer predictors thanks to the United Kingdom Biobank (UKB), a large and comprehensive cohort. Polygenic risk scores (PRS), a relatively recent development built on genome-wide association studies (GWAS), summarize the combined effect of thousands of genetic variants linked to a particular disease or trait.
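As a rough illustration of how such a score is built, a PRS is commonly computed as a weighted sum of risk-allele counts, with the weights taken from GWAS effect sizes. The sketch below assumes that form; the variant names, weights, and the function `polygenic_risk_score` are invented for illustration and do not come from any published score.

```python
# Hypothetical per-variant effect sizes (log odds ratios) from a GWAS.
# Variant names and weights are invented for illustration only.
weights = {"variant_a": 0.12, "variant_b": -0.05, "variant_c": 0.30}


def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2 copies per variant)."""
    return sum(weights[v] * dosages.get(v, 0) for v in weights)


# One hypothetical person: two copies of variant_a, one of variant_b, none of variant_c.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
score = polygenic_risk_score(person, weights)
print(round(score, 2))
```

In practice, raw scores like this are usually standardized against a reference population before being used as a predictor.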
PRS can identify people at high risk of disease; in coronary artery disease, for example, statins can then be prescribed to them early. Notably, PRS improved the accuracy of established risk tools for coronary artery disease, such as the Framingham risk score.
Breast cancer PRS have previously been integrated into risk prediction models such as the Tyrer-Cuzick model and the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA). However, analyses of the relationship between PRS and phenotypic traits in breast cancer, including gene-environment interactions, have produced contradictory results.
About the study
A recent study published in Scientific Reports used machine learning (ML) techniques for feature selection and Cox models for risk prediction. Its main objective was to show how ML-based feature selection can support traditional statistical methods.
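For readers unfamiliar with Cox models, the standard proportional hazards form is given here as general background. It is the textbook formulation, not a specification taken from the study. In it, $h_0(t)$ is the baseline hazard, $x$ is a participant's feature vector (for example, the features retained by the ML step), and $\beta$ holds the fitted coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\mathrm{HR}_j = \exp(\beta_j)
```

Each $\exp(\beta_j)$ is the hazard ratio for a one-unit increase in feature $j$, with the other features held fixed.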
SHapley Additive exPlanations (SHAP) feature dependence plots were used to investigate potential interactions between phenotypic characteristics and PRS. The study used data from UKB, which has more than 500,000 participants from England, Wales, and Scotland. Baseline data were gathered through physical examinations, questionnaires, biological samples, and verbal interviews with a qualified nurse.
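To give a sense of what SHAP values measure, the sketch below computes exact Shapley values for a single sample by enumerating every feature subset. It is a toy under stated assumptions: the two-feature `toy_model`, its coefficients, and the all-zero baseline are hypothetical, and SHAP tooling for tree models uses faster algorithms with background data rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample; absent features take baseline values."""
    n = len(x)

    def value(subset):
        # Features in `subset` use the sample's values, the rest use the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # size of the coalition that excludes feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                phi[i] += weight * (value(set(s) | {i}) - value(set(s)))
    return phi


# Hypothetical risk model: a PRS term, a phenotype term, and their interaction.
def toy_model(z):
    prs, phenotype = z
    return 2.0 * prs + 1.0 * phenotype + 0.5 * prs * phenotype


phi = shapley_values(toy_model, x=[1.0, 2.0], baseline=[0.0, 0.0])
print(phi)
```

The attributions sum to the difference between the model output for the sample and for the baseline. A dependence plot then charts one feature's attribution against its value across many samples, which is how an interaction with a second feature such as a PRS can become visible.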
Because of the etiological heterogeneity related to menopausal status noted above, the analysis was restricted to post-menopausal women aged 40 to 69 at baseline. Incident breast cancer was identified using International Classification of Diseases codes, and PRS313 and PRS120k were included as candidate genetic features.
Study findings
The study included 104,313 participants, 4,010 of whom developed breast cancer during the 11.9-year follow-up period. Combining machine learning (ML) with conventional statistical methods used in cancer epidemiology identified several known and previously unknown risk variables for post-menopausal breast cancer.
Age at menopause, testosterone levels, and age were the three known risk variables. In addition, five new predictors were discovered among blood counts, urine biomarkers, and blood biochemistry.
The incidence of post-menopausal breast cancer was substantially correlated with the newly discovered predictors. To determine whether these are breast cancer risk factors that may be modifiable in the future, more study is required.
Body mass index (BMI) was not selected by the XGBoost model; a more precise body composition measure was chosen instead, suggesting that such measures are key indicators of breast cancer risk. The basal metabolic rate was also found to be a significant predictor of breast cancer, in contrast to a prior study that found no such association.
Plasma urea, a blood biomarker of kidney function, has also been linked to breast cancer. This is the first time that plasma phosphate, sodium, or creatinine has been linked to breast cancer.
According to the agnostic ML models, the two polygenic risk scores were the strongest predictors. Cox regressions confirmed that PRS are important predictors of post-menopausal breast cancer.
Conclusions
In the present investigation, five statistically significant novel associations with post-menopausal breast cancer were found among blood counts, blood biochemistry, and urine biomarkers. Discrimination performance was maintained after these five novel features were added to the baseline Cox model. Furthermore, SHAP values identified the two pre-specified PRS as the most important features.
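The summary above does not say which discrimination metric was used. A common choice for Cox models is Harrell's concordance index (C-index), so the sketch below assumes that metric. The follow-up times, event flags, and risk scores are made up, and pairs with tied follow-up times are simply skipped for brevity.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the higher-risk person has the event first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had the event strictly before j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Made-up data: years to event or censoring, event flag (1 = diagnosed), model risk score.
times = [2.0, 5.0, 7.0, 11.9]
events = [1, 0, 1, 0]
risk = [0.9, 0.4, 0.3, 0.5]
c = concordance_index(times, events, risk)
print(c)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Comparing this statistic before and after adding features is one way to check whether discrimination is maintained.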
These results encourage further investigation into the use of more precise anthropometric measurements to improve breast cancer prediction. External validation of the results is a crucial step before adoption in clinical practice.