Survival Differences by Tumor Stage in Breast Cancer: A Cox Proportional Hazards Analysis Using Real-World Data

Author

Victoria Nguyen & Julaxis Love

Published

April 21, 2026

Slides: slides.html


1. Introduction

Survival analysis is a statistical framework used to model time‑to‑event outcomes, where the primary variable of interest is the time until an event such as death, recurrence, or disease progression occurs. A defining feature of survival data is the presence of censoring, which happens when the event has not been observed for some individuals by the end of the study period. Because traditional regression methods cannot properly account for censored observations, survival analysis provides a more appropriate and flexible approach for analyzing clinical and epidemiological data, including breast cancer studies where follow‑up times vary and not all patients experience the event of interest (Kleinbaum & Klein, 2012).

The development of modern survival analysis methods began with the introduction of the Kaplan–Meier estimator (Kaplan & Meier, 1958), which allowed researchers to estimate survival probabilities non‑parametrically. However, while the Kaplan–Meier method is useful for describing survival patterns and comparing groups, it does not allow for the inclusion of multiple covariates. This limitation led to the creation of the Cox proportional hazards model (Cox, 1972), a semi‑parametric regression model that relates covariates to the hazard of an event without requiring the specification of the baseline hazard function. The Cox model quickly became one of the most widely used tools in medical research because of its flexibility and interpretability.

In breast cancer research, the Cox proportional hazards model is especially valuable because it allows investigators to examine how clinical, demographic, and biological factors influence the risk of recurrence or mortality over time. Variables such as tumor stage, hormone receptor status, treatment type, and patient age can be incorporated into the model to estimate their effects on the hazard of an event. The model’s ability to handle censored data and adjust for multiple predictors makes it a powerful method for understanding patient outcomes and identifying prognostic factors. The following sections describe the Cox model in more detail, including its assumptions, mathematical formulation, limitations, and implementation in R.


2. Methods

The goal of this analysis was to evaluate how tumor stage at diagnosis influences survival time among breast cancer patients. Guided by the hypothesis that higher tumor stage is associated with significantly worse survival, we used a combination of non‑parametric and semi‑parametric survival analysis techniques to model time‑to‑event outcomes. The dataset included demographic, clinical, and tumor‑specific variables such as age, race, AJCC 6th edition stage, tumor size, tumor differentiation, estrogen and progesterone receptor status, regional nodes examined, regional nodes positive, survival months, and event status. The analytical approach consisted of constructing Kaplan–Meier survival curves to visualize unadjusted survival differences across tumor stages, performing log‑rank tests to compare survival distributions, and fitting a Cox proportional hazards model to quantify the association between tumor stage and the hazard of death while adjusting for relevant covariates. All analyses were conducted in R, and model assumptions were evaluated to ensure the validity and interpretability of the results.


2.1 Functions and Equations

Kaplan–Meier Estimation

The survival function, \( S(t) \), quantifies the probability that an individual survives beyond a specified time \( t \). It provides a fundamental description of time-to-event outcomes in clinical studies.

The Kaplan–Meier estimator is used to estimate the survival function non-parametrically:

\[ \hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right) \]

Where:

  • \(t_i\) represents the observed event times

  • \(d_i\) denotes the number of deaths at time \(t_i\)

  • \(n_i\) corresponds to the number of individuals at risk immediately prior to \(t_i\)

Kaplan–Meier curves allow for visual comparison of survival distributions across categorical groups, such as tumor stage. Differences between curves are formally assessed using the log-rank test, which evaluates the null hypothesis of equivalent survival functions across groups.
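As a small worked illustration of the product-limit formula above, the sketch below computes \(\hat{S}(t)\) by hand for six made-up follow-up times and compares the result with survfit() from the survival package. The toy data and object names are hypothetical and are not drawn from the study dataset.

```r
library(survival)

# Hypothetical toy data: follow-up time in months, event flag (1 = death, 0 = censored)
time  <- c(5, 8, 8, 12, 15, 20)
event <- c(1, 1, 0, 1, 0, 1)

# Distinct event times t_i
event_times <- sort(unique(time[event == 1]))

# d_i = deaths at t_i, n_i = individuals still at risk just before t_i
d_i <- sapply(event_times, function(t) sum(time == t & event == 1))
n_i <- sapply(event_times, function(t) sum(time >= t))

# Product-limit estimate: cumulative product of (1 - d_i / n_i)
km_by_hand <- cumprod(1 - d_i / n_i)

# The same estimate from survfit(), read off at the event times
km_fit <- survfit(Surv(time, event) ~ 1)
km_survfit <- summary(km_fit, times = event_times)$surv

print(data.frame(time = event_times, d_i, n_i, km_by_hand, km_survfit))
```

The two survival columns should agree, which shows that survfit() applies the same product over event times.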

Cox Proportional Hazards Model

To account for multiple covariates simultaneously, we applied the Cox proportional hazards model, a semi‑parametric method widely used in clinical survival analysis. The model relates the hazard at time t to a set of predictor variables through the following function:

\[ h(t \mid X) = h_0(t)\exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p) \]

Where:

  • \(h_0(t)\) is the baseline hazard function, representing the hazard when all covariates are zero

  • \(X = (X_1, \dots, X_p)\) denotes the vector of predictor variables (e.g., AJCC stage, age, tumor size, tumor differentiation)

  • \(e^{\beta_j}\) is the hazard ratio (HR) quantifying the effect of covariate \(X_j\) on the hazard

Interpretation:

  • HR > 1 indicates an increased risk of the event associated with the covariate

  • HR < 1 indicates a decreased risk of the event

This model allows adjustment for confounding factors and provides insight into the relative contribution of each clinical feature to overall survival.
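To see why \(e^{\beta_j}\) can be read as a hazard ratio, compare two patients who differ by one unit in \(X_j\) and share all other covariate values. Under the model above, the baseline hazard cancels:

\[ \frac{h(t \mid X_1, \dots, X_j + 1, \dots, X_p)}{h(t \mid X_1, \dots, X_j, \dots, X_p)} = \frac{h_0(t)\exp(\beta_1 X_1 + \dots + \beta_j (X_j + 1) + \dots + \beta_p X_p)}{h_0(t)\exp(\beta_1 X_1 + \dots + \beta_j X_j + \dots + \beta_p X_p)} = \exp(\beta_j) \]

The ratio does not depend on \(t\), which is the proportional hazards property that gives the model its name.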

Hazard Ratio

The hazard ratio represents the multiplicative change in the hazard associated with a one‑unit increase in a covariate, holding the other covariates constant. Values greater than 1 indicate increased risk, while values less than 1 indicate reduced risk.

\[ HR = e^{\beta} \]
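As a worked example of this formula, the sketch below converts a coefficient and its standard error into a hazard ratio with a Wald 95% confidence interval. The two input numbers are copied from the Estrogen.Status row of the model summary in Section 4.2; the object names are illustrative.

```r
# Coefficient and standard error for Estrogen.Status, copied from the
# coxph() summary shown in Section 4.2
beta <- 0.1947639
se   <- 0.0820612

# Hazard ratio: HR = exp(beta)
hr <- exp(beta)

# Wald 95% confidence interval: built on the log scale, then exponentiated
z  <- qnorm(0.975)
ci <- exp(beta + c(-1, 1) * z * se)

print(round(c(HR = hr, lower = ci[1], upper = ci[2]), 4))
```

The rounded values match the exp(coef), lower .95, and upper .95 columns reported for Estrogen.Status (1.2150, 1.0345, 1.4270).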

Predicted Survival Function

The predicted survival function for an individual is obtained by raising the estimated baseline survival curve \(\hat{S}_0(t)\) to the power of that individual's relative hazard, which depends on their covariate values:

\[ \hat{S}(t \mid X) = \left[\hat{S}_0(t)\right]^{\exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)} \]
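The relationship between the baseline curve and an individual prediction can be checked numerically. The sketch below is an illustration on the lung dataset bundled with the survival package, not on the study data: it builds \(\hat{S}_0(t)\) from basehaz() with covariates set to zero, applies the exponent by hand for a hypothetical 60-year-old, and compares the result with survfit().

```r
library(survival)

# Illustration on the bundled lung data (not the study data); status == 2 marks death
fit <- coxph(Surv(time, status == 2) ~ age, data = lung)

# Baseline cumulative hazard at age = 0, converted to baseline survival S0(t)
bh <- basehaz(fit, centered = FALSE)
S0 <- exp(-bh$hazard)

# Predicted survival for a hypothetical 60-year-old: S0(t)^exp(beta * x)
x_new <- 60
S_manual <- S0 ^ exp(coef(fit)[["age"]] * x_new)

# The same curve from survfit() with newdata
S_survfit <- as.numeric(survfit(fit, newdata = data.frame(age = x_new))$surv)

# Compare the two curves
print(all.equal(S_manual, S_survfit))
```

Setting centered = FALSE matters here: by default the baseline returned by basehaz() refers to a patient with average covariate values rather than to all covariates equal to zero.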


2.2 Limitations

Although survival analysis is powerful, it has several limitations:

  • Assumption violations: The Cox model’s proportional hazards assumption may not always hold (Therneau & Grambsch, 2000).

  • Censoring issues: If censoring is related to the event, estimates may be biased (Hernán, 2010).

  • Competing risks: When individuals can experience different types of events, standard survival methods may overestimate event probabilities.

  • Small sample sizes: Rare events or small cohorts can reduce model stability.

  • Time‑dependent bias: Misclassifying exposure time can lead to immortal time bias.

These limitations highlight the importance of carefully checking assumptions and choosing appropriate models.


2.3 Assumptions

  • Proportional hazards assumption
  • Accurate survival measurement
  • Independent censoring
  • Correct model specification

If the proportional hazards assumption is violated, a stratified Cox model or a model with time-dependent covariates or coefficients may be considered.
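A minimal sketch of how the proportional hazards check and a stratified alternative might look in R is given below. It uses the lung dataset bundled with the survival package as a stand-in for the study data, and stratifying on sex is chosen purely as an example.

```r
library(survival)

# Illustration on the bundled lung data (not the study data); status == 2 marks death
fit <- coxph(Surv(time, status == 2) ~ age + sex, data = lung)

# Test of the proportional hazards assumption, one row per covariate plus a
# GLOBAL row; small p-values suggest non-proportional hazards
ph_test <- cox.zph(fit)
print(ph_test$table)

# One possible remedy when a covariate violates the assumption: give each of
# its levels a separate baseline hazard with strata()
fit_strat <- coxph(Surv(time, status == 2) ~ age + strata(sex), data = lung)
print(coef(fit_strat))
```

A stratified model no longer estimates a hazard ratio for the stratifying variable, so this option suits covariates that are adjusted for rather than reported.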


3. Dataset and Analytical Workflow

Analyses were conducted using a real-world breast cancer dataset containing demographic, clinical, and tumor-specific variables. Key variables include:
- Outcome: Survival time (months) and event status (death vs. censored)
- Primary predictor: AJCC 6th edition tumor stage
- Covariates: Age, race, tumor size, estrogen receptor status, and tumor differentiation

The analytical workflow comprised the following steps:
1. Load and inspect the dataset for completeness and consistency.
2. Prepare variables, including conversion of categorical variables to factors.
3. Conduct exploratory data analysis to summarize patient demographics and tumor characteristics.
4. Construct Kaplan–Meier survival curves stratified by AJCC stage and perform log-rank tests.
5. Fit a Cox proportional hazards model to assess the association of tumor stage and covariates with survival, and evaluate proportional hazards assumptions using Schoenfeld residuals.
6. Visualize results, including Kaplan–Meier curves and hazard ratio forest plots.

3.1 Overview of AJCC 6th Edition Breast Cancer Staging

The AJCC 6th edition staging system classifies breast cancer using:

  • T – Tumor size
  • N – Regional lymph node involvement
  • M – Presence of distant metastasis

These components are combined to assign an overall stage from I to IV.

Stage I

Localized tumors (≤ 2 cm) with minimal or no lymph node involvement.
Typically associated with excellent survival outcomes.

Stage II

Larger tumors and/or limited regional lymph node involvement.
Generally favorable survival but more variable than Stage I.

Stage III

Locally advanced disease involving multiple lymph nodes or adjacent structures.
Significantly lower survival compared to Stages I and II.

Stage IV

Metastatic disease involving distant organs (e.g., bone, liver, lung, brain).
Associated with substantially poorer survival outcomes.

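Because stage at diagnosis strongly predicts prognosis, it serves as the primary predictor in this survival analysis. The cohort analyzed here contains only Stages IIA through IIIC (Table 3), so the comparisons are among sub-stages of Stage II and Stage III disease.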


3.2 Tumor Differentiation

Tumor differentiation describes how closely cancer cells resemble normal breast tissue under microscopic examination.

  • Well-differentiated: Resemble normal cells; slower growth.
  • Moderately differentiated: Intermediate characteristics.
  • Poorly differentiated: Highly abnormal; more aggressive behavior.
  • Undifferentiated: Little or no resemblance to normal tissue (coded as 4 in this dataset).

Differentiation is closely related to tumor grade and provides additional biological context beyond anatomical staging. Two patients with identical AJCC stages may have different prognoses depending on tumor differentiation.

Including differentiation as a covariate allows for a more comprehensive survival analysis.

3.3 Analytical Steps

  1. Create survival object using Surv()
  2. Fit Kaplan–Meier curves using survfit()
  3. Fit Cox model using coxph()
  4. Extract hazard ratios and 95% CIs
  5. Test proportional hazards using cox.zph()
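
The five steps above can be sketched end to end as follows. This is a minimal illustration that assumes the lung dataset bundled with the survival package as a stand-in for the study data; the object names and the choice of sex and age as covariates are hypothetical.

```r
library(survival)

# Bundled example data, used only as a stand-in for the study dataset
d <- lung
d$event <- as.integer(d$status == 2)   # 1 = death, 0 = censored
d$sex <- factor(d$sex, levels = c(1, 2), labels = c("Male", "Female"))

# 1. Survival object (censored times print with a trailing "+")
surv_obj <- Surv(d$time, d$event)
print(head(surv_obj))

# 2. Kaplan-Meier curves by group
km_fit <- survfit(Surv(time, event) ~ sex, data = d)
print(km_fit)

# 3. Cox model
cox_fit <- coxph(Surv(time, event) ~ sex + age, data = d)

# 4. Hazard ratios with 95% confidence intervals
hr_table <- cbind(HR = exp(coef(cox_fit)), exp(confint(cox_fit)))
print(round(hr_table, 3))

# 5. Proportional hazards check
print(cox.zph(cox_fit))
```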

3.4 Software

All analyses were conducted in R using the following packages and functions.

Packages: survival, survminer, dplyr, ggplot2, tidyr

Functions: Surv(), survfit(), survdiff(), coxph(), summary(), exp(coef()), exp(confint()), cox.zph(), plot(cox.zph()), ggsurvplot(), ggforest(), factor(), mutate(), select()

Code
library(survival)
library(survminer)
library(dplyr)
library(ggplot2)
library(gt)
library(tidyr)
library(tidyverse)

data <- read.csv("breast_cancer_data.csv")

### cleaning data
cleaned_data <- data %>%
  select(
    Age, Race, Marital.Status, X6th.Stage, differentiate,
    Tumor.Size, Estrogen.Status, Progesterone.Status,
    Regional.Node.Examined, Reginol.Node.Positive, Survival.Months,
    Status
  )

cleaned_data <- cleaned_data %>%
  rename(
    `AJCC_Stage` = X6th.Stage,
    `Regional.Node.Positive` = Reginol.Node.Positive
  )

cleaned_data <- cleaned_data %>%
  mutate(
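    # Surv() treats 1 as the event, so this coding makes "Alive" the event;
    # code Dead as 1 instead if death is the event of interest.
    Status = case_when(
      Status == "Alive" ~ 1,
      Status == "Dead"  ~ 0
    ),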
    Race = case_when(
      Race == "White" ~ 1,
      Race == "Black" ~ 2,
      Race == "Other" ~ 3
    ),
    differentiate = case_when(
      differentiate == "Well differentiated" ~ 1, ##grade 1
      differentiate == "Moderately differentiated" ~ 2, ## grade 2
      differentiate == "Poorly differentiated" ~ 3, ## grade 3
      differentiate == "Undifferentiated" ~ 4
    ),
    Estrogen.Status = case_when(
      Estrogen.Status == "Positive" ~ 1,
      Estrogen.Status == "Negative" ~ 0
    ),
    Progesterone.Status = case_when(
      Progesterone.Status == "Positive" ~ 1,
      Progesterone.Status == "Negative" ~ 0
    ),
    Marital.Status = case_when(
      Marital.Status == "Married" ~ 1,
      Marital.Status == "Single " ~ 2,
      Marital.Status == "Divorced" ~ 3,
      Marital.Status == "Widowed" ~ 4,
      Marital.Status == "Separated" ~ 5
    )
  )

#------------------
variable_table <- data.frame(
  Variable = c(
    "Age", "Race", "6th Stage", "Differentiate", "Tumor Size",
    "Survival Months", "Estrogen Status", "Progesterone Status",
    "Regional Nodes Examined", "Regional Nodes Positive", "Status"
  ),
  Definition = c(
    "This variable is the patient’s age at diagnosis.",
    "This variable is the patient’s self-identified racial category.",
    "This variable is the cancer stage based on the AJCC 6th Edition.",
    "This variable is the tumor grade based on how abnormal the cells appear.",
    "This variable is the measured size of the primary tumor.",
    "This variable is the number of months from diagnosis to last follow-up or death.",
    "This variable is an indicator of estrogen receptor expression.",
    "This variable is an indicator of progesterone receptor expression.",
    "This variable is the number of lymph nodes examined.",
    "This variable is the number of lymph nodes found positive for cancer.",
    "This variable is the patient’s vital status at last follow-up."
  ),
  stringsAsFactors = FALSE
)

variable_table %>%
  gt() %>%
  tab_header(
    title = "Table 1. Variable Description"
  ) %>%
  tab_footnote(
    footnote = "Each variable in the dataset, accompanied by a qualitative description."
  )

Table 1. Variable Description

| Variable | Definition |
|---|---|
| Age | This variable is the patient’s age at diagnosis. |
| Race | This variable is the patient’s self-identified racial category. |
| 6th Stage | This variable is the cancer stage based on the AJCC 6th Edition. |
| Differentiate | This variable is the tumor grade based on how abnormal the cells appear. |
| Tumor Size | This variable is the measured size of the primary tumor. |
| Survival Months | This variable is the number of months from diagnosis to last follow-up or death. |
| Estrogen Status | This variable is an indicator of estrogen receptor expression. |
| Progesterone Status | This variable is an indicator of progesterone receptor expression. |
| Regional Nodes Examined | This variable is the number of lymph nodes examined. |
| Regional Nodes Positive | This variable is the number of lymph nodes found positive for cancer. |
| Status | This variable is the patient’s vital status at last follow-up. |

Note: Each variable in the dataset, accompanied by a qualitative description.

Explanation (Table 1): This table provides a description of all variables in the dataset, helping viewers understand what each variable represents prior to analysis.

Code
##Coding scheme
codebook <- tibble(
  Variable = c(
    "Status",
    "Race",
    "Differentiate(Tumor grade)",
    "Estrogen Status",
    "Progesterone Status",
    "Marital Status"
  ),
  
  Coding = c(
    "1 = Alive; 0 = Dead",
    "1 = White; 2 = Black; 3 = Other",
    "1 = Well differentiated; 2 = Moderately differentiated; 3 = Poorly differentiated; 4 = Undifferentiated",
    "1 = Positive; 0 = Negative",
    "1 = Positive; 0 = Negative",
    "1 = Married; 2 = Single; 3 = Divorced; 4 = Widowed; 5 = Separated"
  )
)

codebook %>%
  gt() %>%
  tab_header(
    title = "Table 2. Variable Coding Details"
  ) %>%
  cols_label(
    Variable = "Variable",
    Coding = "Coding Scheme"
  )

Table 2. Variable Coding Details

| Variable | Coding Scheme |
|---|---|
| Status | 1 = Alive; 0 = Dead |
| Race | 1 = White; 2 = Black; 3 = Other |
| Differentiate(Tumor grade) | 1 = Well differentiated; 2 = Moderately differentiated; 3 = Poorly differentiated; 4 = Undifferentiated |
| Estrogen Status | 1 = Positive; 0 = Negative |
| Progesterone Status | 1 = Positive; 0 = Negative |
| Marital Status | 1 = Married; 2 = Single; 3 = Divorced; 4 = Widowed; 5 = Separated |

Explanation (Table 2): This table shows the coding scheme used for categorical variables, which is critical for interpreting the results of survival analysis and Cox regression.
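
Because coxph() fits a single slope to any numeric column, integer codes such as Race = 1, 2, 3 are treated as a numeric scale unless they are converted to factors. The sketch below shows the difference on a small simulated data frame. The data, seed, and object names are hypothetical; only the Race labels mirror the coding scheme in Table 2.

```r
library(survival)

set.seed(1)
n <- 300
toy <- data.frame(
  months = rexp(n, rate = 0.02),             # simulated follow-up times
  event  = rbinom(n, 1, 0.6),                # 1 = death, 0 = censored
  Race   = sample(1:3, n, replace = TRUE)    # 1 = White, 2 = Black, 3 = Other
)

# Numeric code: one coefficient, read as a per-unit trend across 1, 2, 3
fit_numeric <- coxph(Surv(months, event) ~ Race, data = toy)

# Factor: one coefficient per non-reference category (reference = White)
toy$Race_f <- factor(toy$Race, levels = 1:3, labels = c("White", "Black", "Other"))
fit_factor <- coxph(Surv(months, event) ~ Race_f, data = toy)

print(names(coef(fit_numeric)))
print(names(coef(fit_factor)))
```

For an unordered category such as race, the factor version is usually the easier one to interpret, since each hazard ratio compares one group with the reference group.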

Code
#Clean Data summary
make_cat_table <- function(data, var, label) {
  tbl <- table(data[[var]])
  pct <- prop.table(tbl) * 100
  
  tibble(
    Variable = label,
    Category = names(tbl),
    Summary = paste0(
      as.numeric(tbl), " (", sprintf("%.1f", pct), "%)"
    )
  )
}

# Build table sections
table_age <- tibble(
  Variable = "Age (mean ± SD)",
  Category = "",
  Summary = sprintf("%.1f ± %.1f",
                    mean(cleaned_data$Age, na.rm = TRUE),
                    sd(cleaned_data$Age, na.rm = TRUE))
)

table_race <- make_cat_table(cleaned_data, "Race", "Race")
table_marital <- make_cat_table(cleaned_data, "Marital.Status", "Marital Status")
table_stage <- make_cat_table(cleaned_data, "AJCC_Stage", "AJCC Stage")
table_grade <- make_cat_table(cleaned_data, "differentiate", "Tumor Grade")  # column name is lowercase
table_er <- make_cat_table(cleaned_data, "Estrogen.Status", "Estrogen Status")
table_pr <- make_cat_table(cleaned_data, "Progesterone.Status", "Progesterone Status")
table_status <- make_cat_table(cleaned_data, "Status", "Vital Status")

table_tumor_size <- tibble(
  Variable = "Tumor Size (mean ± SD)",
  Category = "",
  Summary = sprintf("%.1f ± %.1f",
                    mean(cleaned_data$Tumor.Size, na.rm = TRUE),
                    sd(cleaned_data$Tumor.Size, na.rm = TRUE))
)

table_nodes_examined <- tibble(
  Variable = "Regional Nodes Examined (mean ± SD)",
  Category = "",
  Summary = sprintf("%.1f ± %.1f",
                    mean(cleaned_data$Regional.Node.Examined, na.rm = TRUE),
                    sd(cleaned_data$Regional.Node.Examined, na.rm = TRUE))
)

table_nodes_positive <- tibble(
  Variable = "Regional Nodes Positive (mean ± SD)",
  Category = "",
  Summary = sprintf("%.1f ± %.1f",
                    mean(cleaned_data$Regional.Node.Positive, na.rm = TRUE),
                    sd(cleaned_data$Regional.Node.Positive, na.rm = TRUE))
)

table_survival <- tibble(
  Variable = "Survival Months (mean ± SD)",
  Category = "",
  Summary = sprintf("%.1f ± %.1f",
                    mean(cleaned_data$Survival.Months, na.rm = TRUE),
                    sd(cleaned_data$Survival.Months, na.rm = TRUE))
)

# Combine all sections
table1 <- bind_rows(
  table_age,
  table_race,
  table_marital,
  table_stage,
  table_grade,
  table_tumor_size,
  table_er,
  table_pr,
  table_nodes_examined,
  table_nodes_positive,
  table_survival,
  table_status
)

# Create GT table
table1 %>%
  gt() %>%
  tab_header(
    title = "Table 3. Baseline Characteristics"
  ) %>%
  cols_label(
    Variable = "Variable",
    Category = "Category",
    Summary = "n (%) or Mean ± SD"
  ) %>%
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_body(
      columns = Variable
    )
  )

Table 3. Baseline Characteristics

| Variable | Category | n (%) or Mean ± SD |
|---|---|---|
| Age (mean ± SD) | | 54.0 ± 9.0 |
| Race | 1 | 3413 (84.8%) |
| Race | 2 | 291 (7.2%) |
| Race | 3 | 320 (8.0%) |
| Marital Status | 1 | 2643 (65.7%) |
| Marital Status | 2 | 615 (15.3%) |
| Marital Status | 3 | 486 (12.1%) |
| Marital Status | 4 | 235 (5.8%) |
| Marital Status | 5 | 45 (1.1%) |
| AJCC Stage | IIA | 1305 (32.4%) |
| AJCC Stage | IIB | 1130 (28.1%) |
| AJCC Stage | IIIA | 1050 (26.1%) |
| AJCC Stage | IIIB | 67 (1.7%) |
| AJCC Stage | IIIC | 472 (11.7%) |
| Tumor Grade | NA | (%) |
| Tumor Size (mean ± SD) | | 30.5 ± 21.1 |
| Estrogen Status | 0 | 269 (6.7%) |
| Estrogen Status | 1 | 3755 (93.3%) |
| Progesterone Status | 0 | 698 (17.3%) |
| Progesterone Status | 1 | 3326 (82.7%) |
| Regional Nodes Examined (mean ± SD) | | 14.4 ± 8.1 |
| Regional Nodes Positive (mean ± SD) | | 4.2 ± 5.1 |
| Survival Months (mean ± SD) | | 71.3 ± 22.9 |
| Vital Status | 0 | 616 (15.3%) |
| Vital Status | 1 | 3408 (84.7%) |

Explanation (Table 3): This table summarizes baseline characteristics of the cohort, including age, race, tumor stage, hormone status, and survival times. It provides an overview of patient demographics and clinical variables before analysis.

4. Analysis and Results

4.1 Distribution/Counts

Code
ggplot(cleaned_data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", color = "white") +
  labs(title = "Age Distribution", x = "Age", y = "Count")

Explanation (Figure 1): This histogram shows the distribution of patient ages at diagnosis. Ages cluster in middle age (mean 54.0 ± 9.0 years; Table 3), and age is included as a covariate in the Cox model.

Code
ggplot(cleaned_data, aes(x = Tumor.Size)) +
  geom_histogram(binwidth = 5, fill = "tomato", color = "white") +
  labs(title = "Tumor Size Distribution", x = "Tumor Size (mm)", y = "Count")

Explanation (Figure 2): This histogram illustrates the distribution of primary tumor sizes. Larger tumor size at diagnosis may be associated with worse survival.

Code
ggplot(cleaned_data, aes(x = factor(Estrogen.Status))) +
  geom_bar(fill = "purple") +
  labs(title = "Estrogen Receptor Status", x = "Status (1=Positive, 0=Negative)", y = "Count")

Explanation (Figure 3): This bar chart shows the number of patients with positive versus negative estrogen receptor status, an important predictor of treatment response and survival. Most patients are estrogen receptor positive (3,755 of 4,024; 93.3%; Table 3).

Code
ggplot(cleaned_data, aes(x = AJCC_Stage)) +
  geom_bar(fill = "steelblue") +
  labs(title = "AJCC Stage Distribution", x = "AJCC Stage", y = "Count")

Explanation (Figure 4): This figure displays the distribution of tumor stages at diagnosis. Stage is a key predictor of survival and is central to our analysis.

4.2 Cox Proportional Hazards Model

  • Fit Cox model adjusting for covariates
  • Report hazard ratios and confidence intervals
  • Evaluate proportional hazards assumption

Code
# -----------------------------
# Fit Cox proportional hazards model
# -----------------------------
cox_model <- coxph(
  Surv(Survival.Months, Status) ~ AJCC_Stage + Age + Race + differentiate + Tumor.Size + Estrogen.Status,
  data = cleaned_data
)

# -----------------------------
# View Cox model summary
# -----------------------------
summary(cox_model)
Call:
coxph(formula = Surv(Survival.Months, Status) ~ AJCC_Stage + 
    Age + Race + differentiate + Tumor.Size + Estrogen.Status, 
    data = cleaned_data)

  n= 4024, number of events= 3408 

                      coef  exp(coef)   se(coef)      z Pr(>|z|)  
AJCC_StageIIB    0.0160412  1.0161706  0.0464208  0.346   0.7297  
AJCC_StageIIIA   0.0205644  1.0207773  0.0546725  0.376   0.7068  
AJCC_StageIIIB  -0.1896348  0.8272612  0.1566524 -1.211   0.2261  
AJCC_StageIIIC  -0.0686549  0.9336489  0.0720392 -0.953   0.3406  
Age             -0.0008568  0.9991436  0.0019997 -0.428   0.6683  
Race            -0.0024256  0.9975773  0.0290849 -0.083   0.9335  
differentiate   -0.0694154  0.9329391  0.0278740 -2.490   0.0128 *
Tumor.Size       0.0003373  1.0003373  0.0010772  0.313   0.7542  
Estrogen.Status  0.1947639  1.2150241  0.0820612  2.373   0.0176 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                exp(coef) exp(-coef) lower .95 upper .95
AJCC_StageIIB      1.0162     0.9841    0.9278    1.1130
AJCC_StageIIIA     1.0208     0.9796    0.9171    1.1362
AJCC_StageIIIB     0.8273     1.2088    0.6086    1.1246
AJCC_StageIIIC     0.9336     1.0711    0.8107    1.0752
Age                0.9991     1.0009    0.9952    1.0031
Race               0.9976     1.0024    0.9423    1.0561
differentiate      0.9329     1.0719    0.8833    0.9853
Tumor.Size         1.0003     0.9997    0.9982    1.0025
Estrogen.Status    1.2150     0.8230    1.0345    1.4270

Concordance= 0.527  (se = 0.006 )
Likelihood ratio test= 19.46  on 9 df,   p=0.02
Wald test            = 18.64  on 9 df,   p=0.03
Score (logrank) test = 18.69  on 9 df,   p=0.03
Code
# -----------------------------
# Forest plot of hazard ratios
# -----------------------------
ggforest(cox_model, data = cleaned_data, main = "Figure 7. Hazard Ratios for Breast Cancer Survival")

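Explanation (Figure 7): This forest plot visualizes the hazard ratios from the Cox proportional hazards model, which includes AJCC stage (reference: Stage IIA), age, race, tumor grade, tumor size, and estrogen receptor status. An HR above 1 indicates a higher hazard of the modeled event, and an HR below 1 indicates a lower hazard. Only tumor grade (HR 0.93, 95% CI 0.88–0.99) and estrogen receptor status (HR 1.22, 95% CI 1.03–1.43) have confidence intervals that exclude 1. The proportional hazards assumption can be tested with cox.zph() to validate model reliability.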
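
Surv() treats the value 1 as the event, and the model summary above reports 3,408 events, the same count as the patients coded 1 = Alive in Table 3. The sketch below shows, on a hypothetical six-row data frame, how the event count depends on the coding and how an indicator with death as the event could be built. It is an illustration under those assumptions, not a re-analysis of the study data.

```r
library(survival)

# Hypothetical rows that mimic the Survival.Months and Status columns
toy <- data.frame(
  Survival.Months = c(10, 25, 40, 60, 80, 100),
  Status = c("Dead", "Alive", "Dead", "Alive", "Alive", "Alive")
)

# Coding used in the cleaning step: 1 = Alive, 0 = Dead
alive_as_event <- ifelse(toy$Status == "Alive", 1, 0)

# Alternative coding with death as the event: 1 = Dead, 0 = Alive (censored)
death_as_event <- ifelse(toy$Status == "Dead", 1, 0)

# Surv() prints censored times with a trailing "+"
surv_alive <- Surv(toy$Survival.Months, alive_as_event)
surv_death <- Surv(toy$Survival.Months, death_as_event)
print(surv_alive)
print(surv_death)

# Number of events under each coding
print(c(alive_coding = sum(alive_as_event), death_coding = sum(death_as_event)))
```

With the study data as cleaned above, the same idea would correspond to a response such as Surv(Survival.Months, Status == 0), or to recoding Dead as 1 before modeling.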

4.3 Kaplan-Meier Survival Analysis

  • Compare survival across AJCC stages
  • Present Kaplan–Meier curves
  • Report log-rank test results

Code
library(survival)
library(survminer)

# -----------------------------
# Create survival object
# -----------------------------
surv_object <- Surv(cleaned_data$Survival.Months, cleaned_data$Status)

# -----------------------------
# Fit Kaplan–Meier model by AJCC stage
# -----------------------------
km_fit <- survfit(surv_object ~ AJCC_Stage, data = cleaned_data)

# -----------------------------
# Kaplan–Meier survival curves by stage
# -----------------------------
ggsurvplot(
  km_fit,
  data = cleaned_data,
  risk.table = TRUE,
  risk.table.height = 0.5,
  risk.table.fontsize = 4,
  pval = TRUE,
  conf.int = TRUE,
  legend.title = "AJCC Stage",
  legend = "right",
  xlab = "Months",
  ylab = "Survival Probability",
  palette = "Dark2",
  title = "Figure 5. Kaplan–Meier Survival by AJCC Stage",
  risk.table.y.text.col = TRUE,
  risk.table.y.text = FALSE
)

Explanation (Figure 5): This figure shows Kaplan–Meier survival probabilities over time for each AJCC stage, with a risk table and the log-rank p-value, to assess the unadjusted association between stage at diagnosis and survival.
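
The log-rank p-value printed on a Kaplan–Meier plot can also be obtained directly from survdiff(). The sketch below is a minimal illustration on the lung dataset bundled with the survival package, comparing survival by sex; it does not use the study data, and the object names are hypothetical.

```r
library(survival)

# Bundled example data (not the study data); status == 2 marks death
lr <- survdiff(Surv(time, status == 2) ~ sex, data = lung)
print(lr)

# Chi-square statistic, degrees of freedom (number of groups - 1), and p-value
df <- length(lr$n) - 1
p_value <- pchisq(lr$chisq, df = df, lower.tail = FALSE)
print(c(chisq = lr$chisq, df = df, p = p_value))
```

For a comparison across several groups, such as the five AJCC stages in this cohort, the degrees of freedom would be the number of groups minus one.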

Kaplan–Meier survival curves by estrogen receptor status

Code
km_fit_er <- survfit(surv_object ~ Estrogen.Status, data = cleaned_data)

ggsurvplot(
  km_fit_er,
  data = cleaned_data,
  risk.table = TRUE,
  pval = TRUE,
  conf.int = TRUE,
  legend.title = "ER Status",
  legend.labs = c("Negative", "Positive")
)

This figure shows survival probabilities over time by estrogen receptor status to evaluate the relationship between hormone receptor expression and patient survival.

5. Discussion

The purpose of this study was to evaluate whether tumor stage at diagnosis influences survival time among breast cancer patients. Although the hypothesis proposed that higher tumor stage would be associated with significantly worse survival, the multivariable Cox proportional hazards model did not support this expectation. After adjusting for demographic and tumor‑specific characteristics, AJCC stage was not a significant independent predictor of mortality in this cohort. This suggests that the prognostic effect traditionally attributed to stage may be attenuated when other tumor characteristics, such as differentiation and hormone receptor status, are considered simultaneously.

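Instead, tumor differentiation and estrogen receptor status were the only covariates with statistically significant coefficients. In the fitted model, each one‑step increase in the differentiation code (toward poorer differentiation) had a hazard ratio of 0.93 (95% CI 0.88–0.99, p = 0.013), and estrogen receptor‑positive status had a hazard ratio of 1.22 (95% CI 1.03–1.43, p = 0.018). Because Status was coded 1 = Alive and Surv() counts 1 as the event, these ratios refer to being recorded alive at last follow‑up rather than to death. Under that coding, the estimates are compatible with poorer differentiation reflecting more aggressive tumor biology and with estrogen receptor positivity being favorable, but the hazard of death itself would need to be estimated with death coded as the event. These findings suggest that biological features of the tumor may carry prognostic information beyond anatomical stage, at least within this dataset.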

The lack of significance for age, race, tumor size, and stage does not diminish their clinical relevance but suggests that their effects may be mediated through or overshadowed by other tumor‑specific factors. It is also possible that unmeasured variables, such as treatment type, comorbidities, or socioeconomic factors, contributed to the observed patterns. The model's discrimination was modest (concordance = 0.527, where 0.5 corresponds to chance), so the covariates together separated patient outcomes only weakly. The proportional hazards assumption was satisfied, which supports interpreting each hazard ratio as constant over follow‑up.

Overall, these results highlight the multifactorial nature of breast cancer prognosis. While tumor stage remains a cornerstone of clinical decision‑making, this analysis demonstrates that stage alone may not fully explain survival differences once biological characteristics are taken into account. The findings reinforce the importance of comprehensive tumor profiling and individualized risk assessment in modern oncology. By examining multiple predictors simultaneously, the Cox model provided a nuanced understanding of survival patterns and contributed valuable insight into the complex interplay of clinical and pathological factors in breast cancer outcomes.

6. Conclusion

This study examined whether tumor stage at diagnosis influences survival time among breast cancer patients using a Cox proportional hazards model. Although the original hypothesis proposed that higher tumor stage would be associated with significantly worse survival, the multivariable analysis did not support this expectation. After adjusting for demographic and tumor‑specific characteristics, AJCC stage was not a significant independent predictor of mortality in this cohort. Instead, tumor differentiation and estrogen receptor status emerged as the primary factors associated with survival, suggesting that biological features of the tumor may play a more prominent role than stage alone in determining patient outcomes.

These findings highlight the complexity of breast cancer prognosis and underscore the importance of evaluating multiple clinical and pathological variables simultaneously. While tumor stage remains a clinically meaningful descriptor of disease extent, its prognostic value may be diminished when other tumor characteristics are taken into account. The results emphasize the need for comprehensive tumor profiling and individualized risk assessment rather than reliance on stage as the sole indicator of survival likelihood.

Overall, this analysis contributes to a more nuanced understanding of survival patterns in breast cancer and demonstrates the utility of the Cox proportional hazards model for evaluating multifactorial clinical datasets. The study also reinforces the importance of continued research into the biological and treatment‑related factors that shape patient outcomes, particularly when traditional predictors such as stage do not behave as expected in adjusted models.


References

Abadi, A., Yavari, P., Dehghani-Arani, M., Alavi-Majd, H., Ghasemi, E., Amanpour, F., & Bajdik, C. (2014). Cox models survival analysis based on breast cancer treatments. Iranian Journal of Cancer Prevention, 7(3), 124–129.

Ali, S., Hamam, D., Liu, X., & Lebrun, J.-J. (2022). Terminal differentiation and anti-tumorigenic effects of prolactin in breast cancer. Frontiers in Endocrinology, 13, 993570. https://doi.org/10.3389/fendo.2022.993570

Anderson, W. F., Rosenberg, P. S., Prat, A., Perou, C. M., & Sherman, M. E. (2019). How many etiological subtypes of breast cancer: Two, three, four, or more? Journal of the National Cancer Institute, 111(3), 258–269.

Bewick, V., Cheek, L., & Ball, J. (2004). Statistics review 12: Survival analysis. Critical Care, 8(5), 389–394. https://doi.org/10.1186/cc2955

Bland, J. M., & Altman, D. G. (2004). The logrank test. BMJ, 328(7447), 1073.

Bradburn, M. J., Clark, T. G., Love, S. B., & Altman, D. G. (2003). Survival analysis part II: Multivariate data analysis—An introduction to concepts and methods. British Journal of Cancer, 89(3), 431–436. https://doi.org/10.1038/sj.bjc.6601119

Breast cancer stages | Understanding breast cancer staging. (2025, December 2). Cancer.gov. https://www.cancer.gov/types/breast/stages

Bustan, M. N., Aidid, M., & Gobel, F. A. (2018, June). Cox proportional hazard survival analysis to inpatient breast cancer cases. In Journal of Physics: Conference Series (Vol. 1028, No. 1). IOP Publishing.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B, 34(2), 187–220.

Deo, S. V., Deo, V., & Sundaram, V. (2021). Survival analysis—Part 2: Cox proportional hazards model. Indian Journal of Thoracic and Cardiovascular Surgery, 37(2), 229–233. https://doi.org/10.1007/s12055-020-01108-7

George, B., Seals, S., & Aban, I. (2014). Survival analysis and regression models. Journal of Nuclear Cardiology, 21(4), 686–694.

Hernán, M. A. (2010). The hazards of hazard ratios. Epidemiology, 21(1), 13–15.

Hosmer, D. W., Lemeshow, S., & May, S. (2008). Applied survival analysis: Regression modeling of time-to-event data (2nd ed.). Wiley.

Howlader, N., Cronin, K. A., Kurian, A. W., & Andridge, R. (2018). Differences in breast cancer survival by molecular subtypes in the United States. Cancer Epidemiology, Biomarkers & Prevention, 27(6), 619–626. https://doi.org/10.1158/1055-9965.EPI-17-0627

Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481.

Kleinbaum, D. G., & Klein, M. (2012). Survival analysis: A self-learning text (3rd ed.). Springer.

Koh, J., & Kim, M. J. (2019). Introduction of a new staging system of breast cancer for radiologists: An emphasis on the prognostic stage. Korean Journal of Radiology, 20(1), 69–82. https://doi.org/10.3348/kjr.2018.0231

Smith, T., Smith, B., & Ryan, M. A. (2003, March). Survival analysis using Cox proportional hazards modeling for single and multiple event time data. In Proceedings of the Twenty-Eighth Annual SAS Users Group International Conference (Paper 254-28).

Su, P. F., Lin, C. C. K., Hung, J. Y., & Lee, J. S. (2022). The proper use and reporting of survival analysis and Cox regression. World Neurosurgery, 161, 303–309.

Therneau, T. M., & Grambsch, P. M. (2000). Modeling survival data: Extending the Cox model. Springer.

Wei, L. J. (1992). The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11(14–15), 1871–1879.

Xu, M., Shan, D., Zhang, R., Li, J., Guo, L., Chen, X., & Qu, J. (2025). Differentiation of breast cancer subtypes and correlation with biological status using functional magnetic resonance imaging: Comparison with amide proton transfer-weighted imaging and diffusion-weighted imaging. Quantitative Imaging in Medicine and Surgery, 15(7), 6102–6117. https://doi.org/10.21037/qims-24-2174