Survival analysis is a statistical framework used to model time‑to‑event outcomes, where the primary variable of interest is the time until an event such as death, recurrence, or disease progression occurs. A defining feature of survival data is the presence of censoring, which happens when the event has not been observed for some individuals by the end of the study period. Because traditional regression methods cannot properly account for censored observations, survival analysis provides a more appropriate and flexible approach for analyzing clinical and epidemiological data, including breast cancer studies where follow‑up times vary and not all patients experience the event of interest (Kleinbaum & Klein, 2012).
The development of modern survival analysis methods began with the introduction of the Kaplan–Meier estimator (Kaplan & Meier, 1958), which allowed researchers to estimate survival probabilities non‑parametrically. However, while the Kaplan–Meier method is useful for describing survival patterns and comparing groups, it does not allow for the inclusion of multiple covariates. This limitation led to the creation of the Cox proportional hazards model (Cox, 1972), a semi‑parametric regression model that relates covariates to the hazard of an event without requiring the specification of the baseline hazard function. The Cox model quickly became one of the most widely used tools in medical research because of its flexibility and interpretability.
In breast cancer research, the Cox proportional hazards model is especially valuable because it allows investigators to examine how clinical, demographic, and biological factors influence the risk of recurrence or mortality over time. Variables such as tumor stage, hormone receptor status, treatment type, and patient age can be incorporated into the model to estimate their effects on the hazard of an event. The model’s ability to handle censored data and adjust for multiple predictors makes it a powerful method for understanding patient outcomes and identifying prognostic factors. The following sections describe the Cox model in more detail, including its assumptions, mathematical formulation, limitations, and implementation in R.
2. Methods
The goal of this analysis was to evaluate how tumor stage at diagnosis influences survival time among breast cancer patients. Guided by the hypothesis that higher tumor stage is associated with significantly worse survival, we used a combination of non‑parametric and semi‑parametric survival analysis techniques to model time‑to‑event outcomes. The dataset included demographic, clinical, and tumor‑specific variables such as age, race, AJCC 6th edition stage, tumor size, tumor differentiation, estrogen and progesterone receptor status, regional nodes examined, regional nodes positive, survival months, and event status. The analytical approach consisted of constructing Kaplan–Meier survival curves to visualize unadjusted survival differences across tumor stages, performing log‑rank tests to compare survival distributions, and fitting a Cox proportional hazards model to quantify the association between tumor stage and the hazard of death while adjusting for relevant covariates. All analyses were conducted in R, and model assumptions were evaluated to ensure the validity and interpretability of the results.
2.1 Functions and Equations
Kaplan–Meier Estimation
The survival function, \( S(t) \), quantifies the probability that an individual survives beyond a specified time \( t \). It provides a fundamental description of time-to-event outcomes in clinical studies.
The Kaplan–Meier estimator is used to estimate the survival function non-parametrically:
\[
\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
\]
where:
\(d_i\) denotes the number of deaths at time \(t_i\)
\(n_i\) corresponds to the number of individuals at risk immediately prior to \(t_i\)
Kaplan–Meier curves allow for visual comparison of survival distributions across categorical groups, such as tumor stage. Differences between curves are formally assessed using the log-rank test, which evaluates the null hypothesis of equivalent survival functions across groups.
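To make the roles of \(d_i\) and \(n_i\) concrete, the following is a minimal sketch that computes the product-limit estimate by hand and compares it with survfit(). The eight observations are hypothetical toy values chosen for illustration, not records from the study dataset.

```r
library(survival)

# Hypothetical toy data: follow-up time in months and event indicator
# (1 = death observed, 0 = censored).
time   <- c(2, 3, 3, 5, 8, 8, 9, 12)
status <- c(1, 1, 0, 1, 1, 1, 0, 1)

# Distinct times at which at least one death occurred (the t_i)
event_times <- sort(unique(time[status == 1]))

# d_i: deaths at t_i; n_i: individuals still at risk just before t_i
d <- sapply(event_times, function(t) sum(time == t & status == 1))
n <- sapply(event_times, function(t) sum(time >= t))

# Product-limit estimate: cumulative product of (1 - d_i / n_i)
km_manual <- cumprod(1 - d / n)

# The same quantity from survfit(), read off at the event times
fit <- survfit(Surv(time, status) ~ 1)
km_survfit <- summary(fit, times = event_times)$surv

print(data.frame(t = event_times, n = n, d = d, S = km_manual))
```

The two vectors agree, which shows that survfit() implements the same product over event times; censored observations contribute only by leaving the risk set.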
Cox proportional hazards model
To account for multiple covariates simultaneously, we applied the Cox proportional hazards model, a semi‑parametric method widely used in clinical survival analysis. The model relates the hazard at time \(t\) to a set of predictor variables through the following function:
\[
h(t \mid X) = h_0(t)\exp(\beta X)
\]
where:
\(h_0(t)\) is the baseline hazard function representing the hazard when all covariates are zero
\(X\) denotes the vector of predictor variables (e.g., AJCC stage, age, tumor size, tumor differentiation)
\(e^{\beta}\) is the hazard ratio (HR) quantifying the effect of each covariate on the hazard
Interpretation:
HR > 1 indicates an increased risk of the event associated with the covariate
HR < 1 indicates a decreased risk of the event
This model allows adjustment for confounding factors and provides insight into the relative contribution of each clinical feature to overall survival.
Hazard Ratio
The hazard ratio represents the multiplicative change in the hazard associated with a one‑unit increase in a covariate. Values greater than 1 indicate increased risk, while values less than 1 indicate reduced risk. \[
HR = e^{\beta}
\]
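This identity follows from the model itself, assuming the standard Cox form \(h(t \mid X) = h_0(t)\exp(\beta X)\). As a short sketch, compare two individuals who differ by one unit in a single covariate \(X_j\) and share all other covariate values (the subscripts \(j\) and \(k\) are notation introduced here for illustration):

\[
\frac{h(t \mid X_j + 1)}{h(t \mid X_j)}
= \frac{h_0(t)\exp\left(\beta_j (X_j + 1) + \sum_{k \ne j} \beta_k X_k\right)}{h_0(t)\exp\left(\beta_j X_j + \sum_{k \ne j} \beta_k X_k\right)}
= e^{\beta_j}
\]

The baseline hazard \(h_0(t)\) cancels, so the ratio does not depend on \(t\). This is the sense in which the hazards are proportional.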
Predicted Survival Function
The predicted survival function shows how an individual’s survival probability is derived from the baseline survival curve and adjusted according to their covariate values. \[
\hat{S}(t \mid X) = \left[\hat{S}_0(t)\right]^{\exp(\beta X)}
\]
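As a worked illustration with hypothetical numbers (not estimates from this dataset), suppose the baseline survival at 60 months is \(\hat{S}_0(60) = 0.90\). A covariate profile with \(\exp(\beta X) = 2\) lowers the predicted survival, while a profile with \(\exp(\beta X) = 0.5\) raises it:

\[
\hat{S}(60 \mid X) = 0.90^{2} = 0.81, \qquad \hat{S}(60 \mid X) = 0.90^{0.5} \approx 0.949
\]

Because the covariates act as an exponent on a probability between 0 and 1, a relative hazard above 1 always moves the curve below the baseline and a relative hazard below 1 moves it above.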
2.2 Limitations
Although survival analysis is powerful, it has several limitations:
Assumption violations: The Cox model’s proportional hazards assumption may not always hold (Therneau & Grambsch, 2000).
Censoring issues: If censoring is related to the event, estimates may be biased (Hernán, 2010).
Competing risks: When individuals can experience different types of events, standard survival methods may overestimate event probabilities.
Small sample sizes: Rare events or small cohorts can reduce model stability.
Time‑dependent bias: Misclassifying exposure time can lead to immortal time bias.
These limitations highlight the importance of carefully checking assumptions and choosing appropriate models.
2.3 Assumptions
Proportional hazards assumption
Accurate survival measurement
Independent censoring
Correct model specification
If violated, stratified or time-dependent models may be considered.
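One way this check and remedy might look in practice is sketched below. It uses the lung dataset that ships with the survival package as a stand-in, so the variables (age, sex, ph.karno) are assumptions for illustration and are not the covariates of this study. Stratifying on sex is shown only to demonstrate the syntax, not because that variable is known to violate the assumption.

```r
library(survival)

# Stand-in data: survival::lung codes status as 1 = censored, 2 = dead.
# Recode so that 1 = death (event) and 0 = censored.
d <- lung
d$event <- as.integer(d$status == 2)

# Cox model with three covariates
fit <- coxph(Surv(time, event) ~ age + sex + ph.karno, data = d)

# Test of proportional hazards based on scaled Schoenfeld residuals;
# the table has one row per term plus a GLOBAL row
zph <- cox.zph(fit)
print(zph$table)

# If a categorical covariate appears to violate the assumption, one option
# is to stratify on it: each stratum gets its own baseline hazard and no
# coefficient is estimated for the stratifying variable
fit_strat <- coxph(Surv(time, event) ~ age + ph.karno + strata(sex), data = d)
print(names(coef(fit_strat)))
```

A small p-value in a row of the table suggests that the effect of that term changes over time. Stratification removes the term from the hazard ratio output, so it suits covariates that are adjusted for but are not of primary interest.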
3. Dataset and Analytical Workflow
Analyses were conducted using a real-world breast cancer dataset containing demographic, clinical, and tumor-specific variables. Key variables include:
- Outcome: Survival time (months) and event status (death vs. censored)
- Primary predictor: AJCC 6th edition tumor stage
- Covariates: Age, race, tumor size, estrogen receptor status, and tumor differentiation
The analytical workflow comprised the following steps:
1. Load and inspect the dataset for completeness and consistency.
2. Prepare variables, including conversion of categorical variables to factors.
3. Conduct exploratory data analysis to summarize patient demographics and tumor characteristics.
4. Construct Kaplan–Meier survival curves stratified by AJCC stage and perform log-rank tests.
5. Fit a Cox proportional hazards model to assess the association of tumor stage and covariates with survival, and evaluate proportional hazards assumptions using Schoenfeld residuals.
6. Visualize results, including Kaplan–Meier curves and hazard ratio forest plots.
3.1 Overview of AJCC 6th Edition Breast Cancer Staging
The AJCC 6th edition staging system classifies breast cancer using:
T – Tumor size
N – Regional lymph node involvement
M – Presence of distant metastasis
These components are combined to assign an overall stage from I to IV.
Stage I
Localized tumors (≤ 2 cm) with minimal or no lymph node involvement.
Typically associated with excellent survival outcomes.
Stage II
Larger tumors and/or limited regional lymph node involvement.
Generally favorable survival but more variable than Stage I.
Stage III
Locally advanced disease involving multiple lymph nodes or adjacent structures.
Significantly lower survival compared to Stages I and II.
3.2 Tumor Differentiation
Poorly differentiated: Highly abnormal; more aggressive behavior.
Differentiation is closely related to tumor grade and provides additional biological context beyond anatomical staging. Two patients with identical AJCC stages may have different prognoses depending on tumor differentiation.
Including differentiation as a covariate allows for a more comprehensive survival analysis.
3.3 Analytical Steps
Create survival object using Surv()
Fit Kaplan–Meier curves using survfit()
Fit Cox model using coxph()
Extract hazard ratios and 95% CIs
Test proportional hazards using cox.zph()
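The steps above can be sketched end to end as follows. This is a minimal illustration on the lung dataset bundled with the survival package, used here as a stand-in; the grouping variable (sex) and covariate (age) are assumptions for the example and take the place of AJCC stage and the study covariates.

```r
library(survival)

# Stand-in data: survival::lung codes status as 1 = censored, 2 = dead.
# Recode so that 1 = death (event) and 0 = censored.
d <- lung
d$event <- as.integer(d$status == 2)
d$sex   <- factor(d$sex, levels = c(1, 2), labels = c("Male", "Female"))

# Step 1: survival object pairing follow-up time with the event indicator
surv_obj <- Surv(d$time, d$event)

# Step 2: Kaplan-Meier curves by group, with a log-rank test via survdiff()
km_fit  <- survfit(Surv(time, event) ~ sex, data = d)
logrank <- survdiff(Surv(time, event) ~ sex, data = d)

# Step 3: Cox model with the grouping variable and one covariate
cox_fit <- coxph(Surv(time, event) ~ sex + age, data = d)

# Step 4: hazard ratios with 95% confidence intervals
hr_table <- cbind(HR = exp(coef(cox_fit)), exp(confint(cox_fit)))
print(round(hr_table, 3))

# Step 5: proportional hazards test
ph_test <- cox.zph(cox_fit)
print(ph_test)
```

Keeping the grouping variable as a factor means coxph() reports one hazard ratio per non-reference level, which is the form a forest plot expects.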
3.4 Software
All analyses were conducted in R using:
Packages: survival, survminer, dplyr, ggplot2, tidyr
Functions: Surv(), survfit(), survdiff(), coxph(), summary(), exp(coef()), exp(confint()), cox.zph(), plot(cox.zph()), ggsurvplot(), ggforest(), factor(), mutate(), select()
Code
library(survival)
library(survminer)
library(dplyr)
library(ggplot2)
library(gt)
library(tidyr)
library(tidyverse)

data <- read.csv("breast_cancer_data.csv")

### cleaning data
# Keep only the variables used in the analysis
cleaned_data <- data %>%
  select(
    Age, Race, Marital.Status, X6th.Stage, differentiate, Tumor.Size,
    Estrogen.Status, Progesterone.Status, Regional.Node.Examined,
    Reginol.Node.Positive, Survival.Months, Status
  )

# Rename the stage column and correct the misspelled node column
cleaned_data <- cleaned_data %>%
  rename(
    AJCC_Stage = X6th.Stage,
    Regional.Node.Positive = Reginol.Node.Positive
  )

# Recode categorical variables as numeric codes
cleaned_data <- cleaned_data %>%
  mutate(
    # Event indicator for Surv(): 1 = death (event), 0 = alive (censored)
    Status = case_when(
      Status == "Dead" ~ 1,
      Status == "Alive" ~ 0
    ),
    Race = case_when(
      Race == "White" ~ 1,
      Race == "Black" ~ 2,
      Race == "Other" ~ 3
    ),
    differentiate = case_when(
      differentiate == "Well differentiated" ~ 1,        ## grade 1
      differentiate == "Moderately differentiated" ~ 2,  ## grade 2
      differentiate == "Poorly differentiated" ~ 3,      ## grade 3
      differentiate == "Undifferentiated" ~ 4
    ),
    Estrogen.Status = case_when(
      Estrogen.Status == "Positive" ~ 1,
      Estrogen.Status == "Negative" ~ 0
    ),
    Progesterone.Status = case_when(
      Progesterone.Status == "Positive" ~ 1,
      Progesterone.Status == "Negative" ~ 0
    ),
    Marital.Status = case_when(
      Marital.Status == "Married" ~ 1,
      Marital.Status == "Single " ~ 2,
      Marital.Status == "Divorced" ~ 3,
      Marital.Status == "Widowed" ~ 4,
      Marital.Status == "Separated" ~ 5
    )
  )

#------------------
# Table 1: plain-language description of each variable
variable_table <- data.frame(
  Variable = c(
    "Age", "Race", "6th Stage", "Differentiate", "Tumor Size",
    "Survival Months", "Estrogen Status", "Progesterone Status",
    "Regional Nodes Examined", "Regional Nodes Positive", "Status"
  ),
  Definition = c(
    "This variable is the patient’s age at diagnosis.",
    "This variable is the patient’s self-identified racial category.",
    "This variable is the cancer stage based on the AJCC 6th Edition.",
    "This variable is the tumor grade based on how abnormal the cells appear.",
    "This variable is the measured size of the primary tumor.",
    "This variable is the number of months from diagnosis to last follow-up or death.",
    "This variable is an indicator of estrogen receptor expression.",
    "This variable is an indicator of progesterone receptor expression.",
    "This variable is the number of lymph nodes examined.",
    "This variable is the number of lymph nodes found positive for cancer.",
    "This variable is the patient’s vital status at last follow-up."
  ),
  stringsAsFactors = FALSE
)

variable_table %>%
  gt() %>%
  tab_header(title = "Table 1. Variable Description") %>%
  tab_footnote(
    footnote = "Each variable in the dataset, accompanied by a qualitative description."
  )
Table 1. Variable Description

| Variable | Definition |
| --- | --- |
| Age | This variable is the patient’s age at diagnosis. |
| Race | This variable is the patient’s self-identified racial category. |
| 6th Stage | This variable is the cancer stage based on the AJCC 6th Edition. |
| Differentiate | This variable is the tumor grade based on how abnormal the cells appear. |
| Tumor Size | This variable is the measured size of the primary tumor. |
| Survival Months | This variable is the number of months from diagnosis to last follow-up or death. |
| Estrogen Status | This variable is an indicator of estrogen receptor expression. |
| Progesterone Status | This variable is an indicator of progesterone receptor expression. |
| Regional Nodes Examined | This variable is the number of lymph nodes examined. |
| Regional Nodes Positive | This variable is the number of lymph nodes found positive for cancer. |
| Status | This variable is the patient’s vital status at last follow-up. |

Each variable in the dataset, accompanied by a qualitative description.
Explanation (Table 1): This table provides a description of all variables in the dataset, helping viewers understand what each variable represents prior to analysis.
Explanation (Table 2): This table shows the coding scheme used for categorical variables, which is critical for interpreting the results of survival analysis and Cox regression.
Explanation (Table 3): This table summarizes baseline characteristics of the cohort, including age, race, tumor stage, hormone status, and survival times. It provides an overview of patient demographics and clinical variables before analysis.
4. Analysis and Results
4.1 Distribution/Counts
Code
# Histogram of age at diagnosis in 5-year bins
ggplot(cleaned_data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", color = "white") +
  labs(title = "Age Distribution", x = "Age", y = "Count")
Explanation (Figure 1): This histogram shows the distribution of patient ages at diagnosis. Most patients cluster in middle age, which can influence survival outcomes.
Code
# Histogram of primary tumor size in 5 mm bins
ggplot(cleaned_data, aes(x = Tumor.Size)) +
  geom_histogram(binwidth = 5, fill = "tomato", color = "white") +
  labs(title = "Tumor Size Distribution", x = "Tumor Size (mm)", y = "Count")
Explanation (Figure 2): This histogram illustrates the distribution of primary tumor sizes. Larger tumor size at diagnosis may be associated with worse survival.
Code
# Bar chart of estrogen receptor status (numeric code shown as a factor)
ggplot(cleaned_data, aes(x = factor(Estrogen.Status))) +
  geom_bar(fill = "purple") +
  labs(
    title = "Estrogen Receptor Status",
    x = "Status (1=Positive, 0=Negative)",
    y = "Count"
  )
Explanation (Figure 3): This bar chart shows the number of patients with positive versus negative estrogen receptor status, an important predictor of treatment response and survival.
Code
# Bar chart of patient counts by AJCC stage
ggplot(cleaned_data, aes(x = AJCC_Stage)) +
  geom_bar(fill = "steelblue") +
  labs(title = "AJCC Stage Distribution", x = "AJCC Stage", y = "Count")
Explanation (Figure 4): This figure displays the distribution of tumor stages at diagnosis. Stage is a key predictor of survival and is central to our analysis.
4.2 Cox Proportional Hazards Model
Fit Cox model adjusting for covariates
Report hazard ratios and confidence intervals
Evaluate proportional hazards assumption
Code
# -----------------------------
# Fit Cox proportional hazards model
# -----------------------------
cox_model <- coxph(
  Surv(Survival.Months, Status) ~ AJCC_Stage + Age + Race + differentiate +
    Tumor.Size + Estrogen.Status,
  data = cleaned_data
)

# -----------------------------
# View Cox model summary
# -----------------------------
summary(cox_model)

# -----------------------------
# Forest plot of hazard ratios
# -----------------------------
ggforest(
  cox_model,
  data = cleaned_data,
  main = "Figure 7. Hazard Ratios for Breast Cancer Survival"
)
Explanation (Figure 7): This forest plot visualizes hazard ratios for AJCC stage from the Cox proportional hazards model, adjusting for age, race, tumor grade, tumor size, and estrogen receptor status. HR > 1 indicates increased risk of death. The proportional hazards assumption is tested to validate model reliability.
4.3 Kaplan-Meier Survival Analysis
Compare survival across AJCC stages
Present Kaplan–Meier curves
Report log-rank test results
Code
library(survival)
library(survminer)

# -----------------------------
# Create survival object
# -----------------------------
surv_object <- Surv(cleaned_data$Survival.Months, cleaned_data$Status)

# -----------------------------
# Fit Kaplan–Meier model by AJCC stage
# -----------------------------
km_fit <- survfit(surv_object ~ AJCC_Stage, data = cleaned_data)

# -----------------------------
# Kaplan–Meier survival curves by stage
# -----------------------------
ggsurvplot(
  km_fit,
  data = cleaned_data,
  risk.table = TRUE,
  risk.table.height = 0.5,
  risk.table.fontsize = 4,
  pval = TRUE,       # adds the log-rank p-value to the plot
  conf.int = TRUE,
  legend.title = "AJCC Stage",
  legend = "right",
  xlab = "Months",
  ylab = "Survival Probability",
  palette = "Dark2",
  title = "Figure 5. Kaplan–Meier Survival by AJCC Stage",
  risk.table.y.text.col = TRUE,
  risk.table.y.text = FALSE
)
Explanation (Figure 5): This figure illustrates differences in patient survival probabilities over time across AJCC cancer stages to assess the association between stage at diagnosis and survival outcomes.
Explanation (Figure 6): This figure shows survival probabilities over time by estrogen receptor status to evaluate the relationship between hormone receptor expression and patient survival.
5. Discussion
The purpose of this study was to evaluate whether tumor stage at diagnosis influences survival time among breast cancer patients. Although the hypothesis proposed that higher tumor stage would be associated with significantly worse survival, the multivariable Cox proportional hazards model did not support this expectation. After adjusting for demographic and tumor‑specific characteristics, AJCC stage was not a significant independent predictor of mortality in this cohort. This suggests that the prognostic effect traditionally attributed to stage may be attenuated when other tumor characteristics, such as differentiation and hormone receptor status, are considered simultaneously.
Instead, tumor differentiation and estrogen receptor status emerged as the only significant predictors of survival. Poorer differentiation was associated with increased hazard of death, consistent with its role as an indicator of more aggressive tumor biology. Estrogen receptor status also demonstrated a significant association with mortality, highlighting the importance of hormonal pathways in shaping disease behavior and treatment response. These findings underscore that biological features of the tumor may exert a stronger influence on survival than anatomical stage alone, at least within this dataset.
The lack of significance for age, race, tumor size, and stage does not diminish their clinical relevance but suggests that their effects may be mediated through or overshadowed by other tumor‑specific factors. It is also possible that unmeasured variables, such as treatment type, comorbidities, or socioeconomic factors, contributed to the observed patterns. Importantly, the proportional hazards assumption was satisfied, indicating that the model appropriately captured the relationships between covariates and survival over time.
Overall, these results highlight the multifactorial nature of breast cancer prognosis. While tumor stage remains a cornerstone of clinical decision‑making, this analysis demonstrates that stage alone may not fully explain survival differences once biological characteristics are taken into account. The findings reinforce the importance of comprehensive tumor profiling and individualized risk assessment in modern oncology. By examining multiple predictors simultaneously, the Cox model provided a nuanced understanding of survival patterns and contributed valuable insight into the complex interplay of clinical and pathological factors in breast cancer outcomes.
6. Conclusion
This study examined whether tumor stage at diagnosis influences survival time among breast cancer patients using a Cox proportional hazards model. Although the original hypothesis proposed that higher tumor stage would be associated with significantly worse survival, the multivariable analysis did not support this expectation. After adjusting for demographic and tumor‑specific characteristics, AJCC stage was not a significant independent predictor of mortality in this cohort. Instead, tumor differentiation and estrogen receptor status emerged as the primary factors associated with survival, suggesting that biological features of the tumor may play a more prominent role than stage alone in determining patient outcomes.
These findings highlight the complexity of breast cancer prognosis and underscore the importance of evaluating multiple clinical and pathological variables simultaneously. While tumor stage remains a clinically meaningful descriptor of disease extent, its prognostic value may be diminished when other tumor characteristics are taken into account. The results emphasize the need for comprehensive tumor profiling and individualized risk assessment rather than reliance on stage as the sole indicator of survival likelihood.
Overall, this analysis contributes to a more nuanced understanding of survival patterns in breast cancer and demonstrates the utility of the Cox proportional hazards model for evaluating multifactorial clinical datasets. The study also reinforces the importance of continued research into the biological and treatment‑related factors that shape patient outcomes, particularly when traditional predictors such as stage do not behave as expected in adjusted models.
References
Abadi, A., Yavari, P., Dehghani-Arani, M., Alavi-Majd, H., Ghasemi, E., Amanpour, F., & Bajdik, C. (2014). Cox models survival analysis based on breast cancer treatments. Iranian Journal of Cancer Prevention, 7(3), 124–129.
Ali, S., Hamam, D., Liu, X., & Lebrun, J.-J. (2022). Terminal differentiation and anti-tumorigenic effects of prolactin in breast cancer. Frontiers in Endocrinology, 13, 993570. https://doi.org/10.3389/fendo.2022.993570
Anderson, W. F., Rosenberg, P. S., Prat, A., Perou, C. M., & Sherman, M. E. (2019). How many etiological subtypes of breast cancer: Two, three, four, or more? Journal of the National Cancer Institute, 111(3), 258–269.
Bewick, V., Cheek, L., & Ball, J. (2004). Statistics review 12: Survival analysis. Critical Care, 8(5), 389–394. https://doi.org/10.1186/cc2955
Bland, J. M., & Altman, D. G. (2004). The logrank test. BMJ, 328(7447), 1073.
Bradburn, M. J., Clark, T. G., Love, S. B., & Altman, D. G. (2003). Survival analysis part II: Multivariate data analysis—An introduction to concepts and methods. British Journal of Cancer, 89(3), 431–436. https://doi.org/10.1038/sj.bjc.6601119
Breast cancer stages | Understanding breast cancer staging. (2025, December 2). Cancer.gov. https://www.cancer.gov/types/breast/stages
Bustan, M. N., Aidid, M., & Gobel, F. A. (2018, June). Cox proportional hazard survival analysis to inpatient breast cancer cases. In Journal of Physics: Conference Series (Vol. 1028, No. 1). IOP Publishing.
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B, 34(2), 187–220.
Deo, S. V., Deo, V., & Sundaram, V. (2021). Survival analysis—Part 2: Cox proportional hazards model. Indian Journal of Thoracic and Cardiovascular Surgery, 37(2), 229–233. https://doi.org/10.1007/s12055-020-01108-7
George, B., Seals, S., & Aban, I. (2014). Survival analysis and regression models. Journal of Nuclear Cardiology, 21(4), 686–694.
Hernán, M. A. (2010). The hazards of hazard ratios. Epidemiology, 21(1), 13–15.
Hosmer, D. W., Lemeshow, S., & May, S. (2008). Applied survival analysis: Regression modeling of time-to-event data (2nd ed.). Wiley.
Howlader, N., Cronin, K. A., Kurian, A. W., & Andridge, R. (2018). Differences in breast cancer survival by molecular subtypes in the United States. Cancer Epidemiology, Biomarkers & Prevention, 27(6), 619–626. https://doi.org/10.1158/1055-9965.EPI-17-0627
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481.
Kleinbaum, D. G., & Klein, M. (2012). Survival analysis: A self-learning text (3rd ed.). Springer.
Koh, J., & Kim, M. J. (2019). Introduction of a new staging system of breast cancer for radiologists: An emphasis on the prognostic stage. Korean Journal of Radiology, 20(1), 69–82. https://doi.org/10.3348/kjr.2018.0231
Smith, T., Smith, B., & Ryan, M. A. (2003, March). Survival analysis using Cox proportional hazards modeling for single and multiple event time data. In Proceedings of the Twenty-Eighth Annual SAS Users Group International Conference (Paper 254-28).
Su, P. F., Lin, C. C. K., Hung, J. Y., & Lee, J. S. (2022). The proper use and reporting of survival analysis and Cox regression. World Neurosurgery, 161, 303–309.
Therneau, T. M., & Grambsch, P. M. (2000). Modeling survival data: Extending the Cox model. Springer.
Wei, L. J. (1992). The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11(14–15), 1871–1879.
Xu, M., Shan, D., Zhang, R., Li, J., Guo, L., Chen, X., & Qu, J. (2025). Differentiation of breast cancer subtypes and correlation with biological status using functional magnetic resonance imaging: Comparison with amide proton transfer-weighted imaging and diffusion-weighted imaging. Quantitative Imaging in Medicine and Surgery, 15(7), 6102–6117. https://doi.org/10.21037/qims-24-2174