BMC Medical Informatics and Decision Making (Aug 2005)
Exploring cancer register data to find risk factors for recurrence of breast cancer – application of Canonical Correlation Analysis
Abstract

Background: A common approach in exploring register data is to find relationships between outcomes and predictors by using multiple regression analysis (MRA). If there is more than one outcome variable, the analysis must be repeated for each outcome and the results combined in some arbitrary fashion. In contrast, Canonical Correlation Analysis (CCA) can analyze multiple outcomes at the same time. One essential outcome after breast cancer treatment is recurrence of the disease. It is important to understand the relationship between different predictors and recurrence, including the time interval until recurrence. This study describes the application of CCA to find important predictors for two outcomes in breast cancer patients, loco-regional recurrence and distant metastasis, and to reduce the number of variables in the sets of predictors and outcomes without decreasing the predictive strength of the model.

Methods: Data for 637 malignant breast cancer patients admitted in the south-east region of Sweden were analyzed. Using CCA and examining the structure coefficients (loadings), relationships between tumor specifications and the two outcomes during different time intervals were analyzed, and a correlation model was built.

Results: The analysis successfully detected known predictors for breast cancer recurrence during the first two years and distant metastasis 2–4 years after diagnosis. Nottingham Histologic Grading (NHG) was the most important predictor, while age of the patient at the time of diagnosis was not an important predictor.

Conclusion: In cancer registers with high dimensionality, CCA can be used to identify the importance of risk factors for breast cancer recurrence. By reducing the number of variables to the important ones, the technique can yield a model ready for further processing by data mining methods.
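To illustrate how structure coefficients (loadings) are read off a CCA, the following is a minimal sketch rather than the authors' implementation. It assumes NumPy, uses the standard QR-plus-SVD route to the canonical correlations, and runs on synthetic data: the columns named `grade`, `size`, and `age` and the two outcome columns are hypothetical stand-ins, not values from the register. A loading is the correlation between an original variable and a canonical variate, so variables with loadings near zero are candidates for removal.

```python
import numpy as np


def cca_loadings(X, Y):
    """Return canonical correlations and structure coefficients (loadings)."""
    n = X.shape[0]
    # Standardize each column: zero mean, unit sample variance.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    # Orthonormal bases for the column space of each variable set.
    Qx, _ = np.linalg.qr(Xs)
    Qy, _ = np.linalg.qr(Ys)
    # The singular values of Qx^T Qy are the canonical correlations.
    U, corrs, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    # Canonical variates: zero-mean, unit-norm columns.
    x_variates = Qx @ U
    y_variates = Qy @ Vt.T
    # Loadings: correlation of each original variable with each variate.
    x_load = Xs.T @ x_variates / np.sqrt(n - 1)
    y_load = Ys.T @ y_variates / np.sqrt(n - 1)
    return corrs, x_load, y_load


# Synthetic data only (assumption): outcomes depend on grade and size, not age.
rng = np.random.default_rng(0)
n = 637
grade = rng.normal(size=n)
size = rng.normal(size=n)
age = rng.normal(size=n)
X = np.column_stack([grade, size, age])
Y = np.column_stack([
    0.8 * grade + 0.3 * size + rng.normal(size=n),  # hypothetical outcome 1
    0.6 * grade + 0.2 * size + rng.normal(size=n),  # hypothetical outcome 2
])

corrs, x_load, y_load = cca_loadings(X, Y)
print("canonical correlations:", corrs)
print("predictor loadings on first variate:", x_load[:, 0])
```

In this synthetic setup the loading for `age` on the first canonical variate should be close to zero while `grade` loads strongly. That is the pattern a loadings-based analysis would use to drop a variable, under the assumptions stated above.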