LASSO Regression - Spring 2026

Determining Predictors in Survival for Breast Cancer Patients

Author

Stephanie Hopkins & Aminata Gado Oumarou (Advisor: Dr. Cohen)

Published

March 22, 2026

Slides: slides.html

Introduction

The LASSO (Least Absolute Shrinkage and Selection Operator) regression model has become increasingly popular over the past 30 years. It was first introduced by Robert Tibshirani as a method for variable selection and regularization to improve prediction accuracy and interpretability. Since then, LASSO has been widely applied across many fields, including biology, computational social sciences, social media, the stock market, economics, political science, robotics, climatology, and pharmacology (Emmert-Streib and Dehmer 2019).

Regression is a fundamental task in supervised machine learning, used to model relationships between variables and address key challenges such as parameter estimation and variable selection (Muthukrishnan and Rohini 2016). As machine learning methods continue to grow in popularity and show strong results, their importance in research has increased. In medicine especially, predictive models developed through machine learning can help professionals distinguish between risk groups, design personalized treatment plans, and improve patient survival outcomes (Huang, Ding, and Li 2023).

LASSO regression works by minimizing prediction error while applying a constraint that shrinks some regression coefficients toward zero (Ranstam and Cook 2018). It does this by adding an L1 penalty term to the loss function, which constrains the sum of the absolute values of the coefficients (Ranstam and Cook 2018). This penalty forces weaker predictors to exactly zero, effectively performing automatic variable selection (Usman, SIS, and Alhaji 2021). While traditional LASSO in linear regression minimizes the residual sum of squares subject to an L1 constraint (Tibshirani 1996), the method can be extended to survival analysis through the Cox proportional hazards model. In survival analysis, the Cox proportional hazards model is widely used to examine the relationship between covariates and time-to-event outcomes, such as overall survival. Rather than minimizing residual error, the Cox model estimates regression coefficients by maximizing the partial likelihood function. The LASSO-Cox model incorporates an L1 penalty into this partial likelihood, allowing some regression coefficients to shrink exactly to zero while estimating hazard ratios (Tibshirani 1997). This approach reduces multicollinearity, improves estimation stability, and produces a sparse and interpretable model, which is especially valuable in high-dimensional clinical datasets (Shahraki, Salehi, and Zare 2015).

The goal of this analysis is to use LASSO-Cox regression to identify the factors most strongly associated with breast cancer survival time. Cancer continues to have a major impact worldwide. In 2025, an estimated 2,041,910 new cancer cases will be diagnosed in the United States, and approximately 618,120 people are expected to die from the disease (National Cancer Institute 2025). Breast cancer remains a significant public health concern. By 2019, more than 3.8 million individuals in the United States were living with breast cancer, and it is still the most commonly diagnosed cancer among women and the second most common cancer globally (Usman, SIS, and Alhaji 2021). Breast cancer begins in the breast’s tissues and develops in the epithelial cells of the ducts (85% of cases) and lobules (15%). Causes include hereditary factors, age, postmenopausal obesity, alcohol or smoking, and other factors (Hadba and Naser 2024). Identifying the clinical and demographic factors most strongly linked to survival can help improve prognostic models and support better clinical decision-making. Previous research has identified numerous factors associated with breast cancer prognosis and survival. These factors are commonly grouped into three categories: (1) demographic and genetic characteristics, such as age at diagnosis, marital status, and reproductive history; (2) clinicopathological factors, including tumor location, size, and histologic grade; and (3) treatment-related variables, such as drug therapy, radiotherapy, and chemotherapy (Teng et al. 2022).

To accomplish this, the analysis uses the Surveillance, Epidemiology, and End Results (SEER) dataset to determine the most important variables associated with breast cancer survivability. SEER has collected and maintained high-quality, validated data on mortality among cancer survivors, allowing for analysis of overall and cause-specific death patterns (Xu et al. 2022). The dataset includes variables such as age, race, marital status, T stage, N stage, 6th stage, grade, A stage, tumor size, estrogen status, progesterone status, regional nodes examined, regional nodes positive, survival months, and status (Sembay 2021).

Although LASSO regression has been valuable across many research areas, it is particularly useful in medical studies where identifying the most influential predictors is essential for improving patient outcomes. The LASSO-Cox model is especially valuable for identifying predictors of time-to-event outcomes in the presence of censored data. Using the SEER dataset, this study applies LASSO-Cox regression—an L1-penalized Cox proportional hazards model—for feature selection to identify key clinical variables associated with breast cancer survival while reducing overfitting and improving model stability (Lee et al. 2025).

Methods

Survival analysis involves the statistical modeling of the time until an event (in this case, death) occurs. Cox regression is a widely used and effective method for performing survival analysis. The selected demographic, clinical, and outcome variables were used as predictors in a LASSO-Cox regression model to identify factors associated with patient survival time.

Variable selection was performed using a Cox proportional hazards model with an L1 (LASSO) penalty. In standard linear regression, LASSO is defined as (Emmert-Streib and Dehmer 2019):

\[ \hat{\beta} = \underset{\beta}{\arg\min} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \sum_{j} \beta_j x_{ij} \right)^2 + \lambda \lVert \beta \rVert_1 \right\} \]

where \(y_{i}\) represents the observed outcome, \(x_{ij}\) denotes predictor values, \(\beta\) is the vector of regression coefficients, and \(\lambda\) is a tuning parameter that controls the strength of the penalty. The L1 penalty encourages sparsity by shrinking weaker coefficients exactly to zero, thereby performing automatic variable selection.
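The sparsity property can be illustrated with a small sketch on simulated linear-regression data (this is an illustration only, not part of the SEER analysis; the simulated coefficients are arbitrary):

```r
# Illustrative sketch: LASSO on simulated data, showing how the L1 penalty
# forces weak predictors exactly to zero. Simulated values are arbitrary.
library(glmnet)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 0, 0, 1.5, rep(0, 5))  # only three true signals
y <- x %*% beta_true + rnorm(n)

cvfit_demo <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 selects the LASSO penalty
coef(cvfit_demo, s = "lambda.min")        # weak predictors appear as "." (exact zeros)
```

Increasing λ zeroes out more coefficients, which is the same mechanism the LASSO-Cox model uses for variable selection below.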

For the LASSO-Cox model, coefficients are instead estimated by maximizing the penalized Cox partial likelihood:

\[ \hat{\beta}_{\text{LASSO-Cox}} = \underset{\beta}{\arg\max} \left\{ PL(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| \right\} \]

where PL(β) denotes the Cox partial likelihood function and λ is a positive tuning parameter that controls the amount of shrinkage (Shahraki, Salehi, and Zare 2015). The L1 penalty shrinks some regression coefficients to zero while estimating hazard ratios for the remaining predictors.

Assumptions:

  • Hazard ratios between individuals remain constant over time (proportional hazards assumption)

  • Survival times of different patients are independent from one another

  • Censoring is assumed to be non-informative, meaning that the reason for censoring is unrelated to the probability of the event
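The proportional hazards assumption can be checked empirically on an unpenalized Cox fit using Schoenfeld residuals via `cox.zph()` from the survival package. A sketch is below; it assumes `seer_data` has been loaded as in the analysis section, and the covariates shown are illustrative choices:

```r
# Sketch: test the proportional hazards assumption with Schoenfeld residuals.
# Assumes seer_data is loaded; covariates here are illustrative.
library(survival)
seer_data$event <- ifelse(seer_data$Status == "Dead", 1, 0)
cox_fit <- coxph(Surv(Survival.Months, event) ~ Age + Tumor.Size + Estrogen.Status,
                 data = seer_data)
cox.zph(cox_fit)  # small p-values flag covariates that may violate the PH assumption
```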

The analysis was conducted in R (Version 4.4.2) using the glmnet package, which is designed for fitting penalized regression models, including penalized Cox models. The strength of the penalty is controlled by λ. To select an appropriate value of λ, k-fold cross-validation was used. This procedure repeatedly partitions the data into training and validation sets to evaluate model performance and select the penalty that achieves an optimal balance between model complexity and predictive accuracy.

Predictors with non-zero coefficients in the final LASSO-Cox model are interpreted as significant variables associated with breast cancer survival.

Analysis and Results

Data Exploration and Visualization

This study uses breast cancer data from the Surveillance, Epidemiology, and End Results (SEER) program, which is maintained by the National Cancer Institute (NCI). The dataset, composed of 4,024 patients, includes demographic, clinical, and outcome variables for female patients diagnosed with infiltrating ductal and lobular breast carcinoma between 2006 and 2019. These variables are well suited for predictive modeling of patient survival outcomes. The dataset excludes patients with an unknown tumor size, an unknown number of regional lymph nodes (LNs) examined, or an unknown number of positive regional LNs, as well as patients who survived less than 1 month (Sembay 2021).

Table 1. Characteristics of the variables included in the study.
Demographics: Age at diagnosis, race, marital status (married, single, widowed, divorced, separated)
Tumor Characteristics: T stage (T1–T4), N stage (N1–N3), 6th AJCC stage (IIA–IIIC), tumor grade (I–IV), A stage (regional/distant), tumor size
Hormone Receptors: Estrogen status, progesterone status (positive/negative)
Lymph Nodes: Nodes examined, nodes positive
Outcome: Survival months, vital status (alive/dead)

Using the variables defined above, we examine their relationship to survival. To start, below is the distribution of age at the time of breast cancer diagnosis.

Code
#install.packages("survminer")
#install.packages("glmnet")
library(glmnet)
library(survminer)
library(survival)
library(ggplot2)
library(corrplot)
seer_data <- read.csv("SEER_Breast_Cancer_Data.csv")
Code
ggplot(seer_data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Age at Diagnosis",
       x = "Age (Years)",
       y = "Frequency") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Graph #1. Distribution of age at breast cancer diagnosis.

This histogram shows that patients range in age from roughly 30 to 70, with the majority of diagnoses occurring at age 50 or older. Because of the exclusions applied to this dataset, the age distribution could differ if the excluded patients were included. The next analysis shows a boxplot of survival by breast cancer stage.

Code
ggplot(seer_data, aes(x = X6th.Stage, y = Survival.Months)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Survival by AJCC Stage",
       x = "6th AJCC Stage",
       y = "Survival Months") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) 

Graph #2. Distribution of survival time across AJCC stages.

The boxplot shows a general downward trend in survival months across the 6th AJCC cancer stages (IIA, IIB, IIIA, IIIB, IIIC). The earlier stages (IIA, IIB, IIIA) show a higher median survival, while later stages (especially IIIC) show a lower median survival, suggesting that a higher-stage diagnosis is associated with shorter survival. However, median survival does not differ dramatically across stages, and the groups overlap considerably, so stage alone may not fully explain survival outcomes. The long whiskers indicate that survival varies widely within each stage, and other predictors likely influence survival.

Code
seer_data$event <- ifelse(seer_data$Status == "Dead", 1, 0)

fit <- survfit(Surv(Survival.Months, event) ~ 1, data = seer_data)

plot <- ggsurvplot(fit,
                   data = seer_data,
                   conf.int = TRUE,
                   censor = FALSE,
                   risk.table = TRUE,
                   size = 1.2, 
                   pval = FALSE,
                   xlab = "Time (Months)",
                   ylab = "Survival Probability",
                   title = "Kaplan–Meier Survival Curve",
                   palette = "blue")

plot$plot <- plot$plot +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

plot

Graph #3. Kaplan-Meier estimate of overall survival among breast cancer patients.

A Kaplan–Meier survival analysis was conducted to estimate the overall survival over a 100-month follow-up period. The survival curve indicates a gradual decline in survival probability over time, beginning at 100% at baseline and decreasing steadily to approximately 78% by 100 months. The number at risk decreased over time, from 4,024 participants at baseline to 444 at 100 months, reflecting both events and censoring. Overall, the results suggest relatively high survival across the study population with a slow and consistent reduction in survival probability over time.

Code
seer_data$event <- ifelse(seer_data$Status == "Dead", 1, 0)
seer_data$X6th.Stage <- as.factor(seer_data$X6th.Stage)

fit <- survfit(Surv(Survival.Months, event) ~ X6th.Stage, data = seer_data)

plot <- ggsurvplot(fit, 
                   data = seer_data, 
                   conf.int = TRUE, 
                   risk.table = TRUE, 
                   risk.table.height = .40, 
                   risk.table.fontsize = 3.5, 
                   pval = FALSE, 
                   xlab = "Time (Months)",
                   ylab = "Survival Probability",
                   title = "Kaplan–Meier Survival Curves by 6th Stage",
                   legend.labs = c("IIA","IIB","IIIA","IIIB","IIIC"))
plot$plot <- plot$plot + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

plot

Graph #4. Kaplan–Meier survival curves by 6th AJCC stage.

This graph shows Kaplan–Meier survival curves grouped by the 6th stage classification. The curves illustrate how survival probability changes over time for patients diagnosed at different cancer stages. Patients diagnosed at earlier stages (IIA and IIB) generally show higher survival probabilities throughout the follow-up period. In contrast, patients diagnosed at later stages (IIIA, IIIB, and IIIC) experience a faster decline in survival probability over time.

Among the later stages, stage IIIC appears to have the lowest survival probability, while stage IIIA and IIIB fall between the early and advanced stages. Overall, the curves suggest that patients diagnosed with more advanced stages tend to have lower survival probabilities compared to those diagnosed at earlier stages.

The risk table below the graph shows the number of patients remaining at risk at different time points during the study period.

Code
numeric_data <- seer_data[, c("Age","Tumor.Size","Regional.Node.Examined","Reginol.Node.Positive","Survival.Months")]
numeric_data <- data.frame(lapply(numeric_data, as.numeric))
cor_matrix <- cor(numeric_data, use = "complete.obs")
#print(cor_matrix)
corrplot(cor_matrix, method = "color", type = "upper", addCoef.col = "black",tl.col = "black",tl.srt = 45,number.cex = 0.7,
        title = "Correlation Between Continuous Variables", mar = c(0,0,2,0))

Graph #5: Correlation between continuous variables.

This heatmap illustrates the correlations between age, tumor size, regional nodes examined, regional nodes positive, and survival months. The strongest relationship is observed between the number of regional lymph nodes examined and the number of positive lymph nodes. With a correlation coefficient of r = 0.41, this indicates that patients with more lymph nodes examined tend to have more positive nodes detected. Tumor size and the number of positive lymph nodes show a small to moderate positive correlation (r = 0.24), suggesting that larger tumors may be associated with increased lymph node involvement. A weak negative correlation is observed between positive lymph nodes and survival months (r = -0.14), indicating that greater lymph node involvement may be associated with shorter survival time. There appears to be little to no relationship between age and survival months (r = -0.01) in this dataset.

Code
ggplot(seer_data, aes(x = Status, y = Tumor.Size, fill = Status)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3) +
  labs(
    title = "Distribution of Tumor Size by Survival Status",
    x = "Survival Status",
    y = "Tumor Size"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5)
  )

Graph #6: Distribution of Tumor Size by Survival Status

The boxplot shows an association between tumor size and survival status: patients who died tended to have larger tumors than those who survived. The Alive group shows a lower median tumor size, whereas the Dead group shows a higher median. There is overlap between the two groups, indicating that tumor size alone is not a strong predictor of survival in cancer patients. The wider spread and presence of several large outliers, particularly among deceased patients, suggest considerable variability and the influence of additional prognostic factors.

Modeling and Results

To identify key predictors of breast cancer survival among patients in the SEER dataset, a LASSO-penalized Cox proportional hazards model was used. The data were first prepared for use in the LASSO-Cox model.

Categorical predictors were converted into indicator (dummy) variables so they could be included in the model. Continuous variables were standardized automatically within the glmnet framework to ensure that no single variable dominated the L1 penalty applied by LASSO. Because the outcome of interest was time to event, a penalized Cox proportional hazards model was used. Observations with missing values were removed to maintain consistency in the analysis.

Code
#remove observations with missing values
vars <- c("Age", "Race", "Marital.Status", "T.Stage", "N.Stage", "X6th.Stage", "Grade", "A.Stage", "Tumor.Size", "Estrogen.Status", "Progesterone.Status", "Regional.Node.Examined", "Reginol.Node.Positive", "Survival.Months", "Status")

#Subset the dataset to include only the variables used in the survival analysis

model_data <- seer_data[, vars]

#Remove observations with missing values
model_data <- na.omit(model_data)

#Create an event indicator variable required for survival analysis
# 1 indicates the event occurred (death) and 0 indicates censored observations (alive)
model_data$event <- ifelse(model_data$Status == "Dead", 1, 0)

A survival object was created using the survival time and event indicator. Predictor variables were converted into a model matrix to provide numeric input suitable for the glmnet package, with categorical variables represented through dummy coding. A penalized Cox proportional hazards model (LASSO) was then fit using glmnet, which estimated regression coefficients while applying regularization to perform variable selection.

Code
y <- Surv(model_data$Survival.Months, model_data$event)

x <- model.matrix(event ~ . - event - Status - Survival.Months, data = model_data)[,-1]


#Fit a penalized cox regression model using glmnet
fit <- glmnet(x, y, family = "cox")
plot(fit)

The coefficient path plot above demonstrates how predictor effects evolve as the penalty parameter (λ) decreases. The x-axis, shown as -log(λ), represents the strength of the penalty: the left side corresponds to stronger regularization and a simpler model, and the right side to weaker regularization and a more complex model with additional variables. The y-axis displays the estimated coefficients, indicating the magnitude and direction of each predictor’s effect. A negative value represents a decreased hazard, whereas a positive value indicates an increased hazard. The numbers along the top of the plot give the number of non-zero coefficients at each value of λ, demonstrating that more variables enter the model as the penalty decreases.

The function ‘cv.glmnet()’ was used to determine the optimal value of the penalty parameter (λ) through k-fold cross-validation. This approach evaluates model performance across a sequence of λ values by estimating the cross-validated partial likelihood deviance. The resulting plot displays the mean cross-validation error and associated confidence intervals for each λ.

Code
#Perform cross-validation to identify the optimal lambda value
cvfit <- cv.glmnet(x, y, family = "cox", alpha = 1)

#Plot the cross-validation error across lambda values
plot(cvfit)

The optimal value of λ identified through cross-validation is determined by cvfit$lambda.min. The coefficients corresponding to the optimal λ value are then extracted to identify predictors with non-zero coefficients that remain in the final LASSO-Cox model.

Code
cvfit$lambda.min
[1] 0.001833867
Code
coef(cvfit, s = "lambda.min")
25 x 1 sparse Matrix of class "dgCMatrix"
                                                                         1
Age                                                            0.018722208
RaceOther (American Indian/AK Native, Asian/Pacific Islander) -0.622302957
RaceWhite                                                     -0.304224071
Marital.StatusMarried (including common law)                  -0.176789029
Marital.StatusSeparated                                        0.370771395
Marital.StatusSingle (never married)                           .          
Marital.StatusWidowed                                          .          
T.StageT2                                                      0.190835182
T.StageT3                                                      0.243913453
T.StageT4                                                      0.586023254
N.StageN2                                                      0.492727999
N.StageN3                                                      0.634816013
X6th.StageIIB                                                  0.146393769
X6th.StageIIIA                                                 .          
X6th.StageIIIB                                                 0.109644516
X6th.StageIIIC                                                 0.026364936
GradePoorly differentiated; Grade III                          0.339527448
GradeUndifferentiated; anaplastic; Grade IV                    1.044340329
GradeWell differentiated; Grade I                             -0.396577588
A.StageRegional                                               -0.129446176
Tumor.Size                                                     0.003085944
Estrogen.StatusPositive                                       -0.640058909
Progesterone.StatusPositive                                   -0.477898322
Regional.Node.Examined                                        -0.029231817
Reginol.Node.Positive                                          0.056460337

The following coefficients represent predictors with non-zero values selected by the LASSO-Cox proportional hazards model and are interpreted as key factors influencing survival outcomes.

Code
lasso_coef <- coef(cvfit, s = "lambda.min")
lasso_coef[lasso_coef != 0]
 [1]  0.018722208 -0.622302957 -0.304224071 -0.176789029  0.370771395
 [6]  0.190835182  0.243913453  0.586023254  0.492727999  0.634816013
[11]  0.146393769  0.109644516  0.026364936  0.339527448  1.044340329
[16] -0.396577588 -0.129446176  0.003085944 -0.640058909 -0.477898322
[21] -0.029231817  0.056460337

The final LASSO-penalized Cox model identified 22 predictors with non-zero coefficients at the optimal penalty parameter, highlighting the most influential variables associated with survival outcomes. The variables with positive coefficients were linked to an increased hazard, or worse survival, while negative coefficients were linked to a decreased hazard, or better survival.
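Because Cox coefficients are on the log-hazard scale, exponentiating them yields hazard ratios, which are often easier to interpret. A sketch, assuming `cvfit` from the cross-validation step above:

```r
# Convert the selected LASSO-Cox coefficients to hazard ratios via exp(beta).
# Assumes cvfit is the cv.glmnet object fit earlier in the analysis.
lasso_coef <- coef(cvfit, s = "lambda.min")
selected <- lasso_coef[lasso_coef != 0]
names(selected) <- rownames(lasso_coef)[as.vector(lasso_coef != 0)]
hazard_ratios <- exp(selected)
sort(hazard_ratios, decreasing = TRUE)  # HR > 1: increased hazard; HR < 1: decreased hazard
```

Note that penalized coefficients are shrunken toward zero, so these hazard ratios are biased toward 1 and should be read as relative rankings rather than unbiased effect estimates.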

Conclusion

  • The LASSO-Cox model selected 22 predictors with non-zero coefficients at the cross-validated λ. Undifferentiated (Grade IV) tumors, advanced T and N stage, and older age were associated with increased hazard, while positive estrogen or progesterone status was associated with decreased hazard.

  • These results suggest that tumor grade, stage, and hormone receptor status are key prognostic factors for breast cancer survival in the SEER cohort, and that sparse penalized models such as LASSO-Cox can support risk stratification and clinical decision-making.

ChatGPT was used to assist with grammar and vocabulary corrections.

References

Emmert-Streib, Frank, and Matthias Dehmer. 2019. “High-Dimensional LASSO-Based Computational Regression Models: Regularization, Shrinkage, and Selection.” Machine Learning and Knowledge Extraction 1 (1): 359–83.
Hadba, Wakaa Ali, and Hadeel Imad Naser. 2024. “Comparison of Lasso Logistic Regression, Artificial Neural Networks, and Support Vector Machine in Predicting Breast Cancer.”
Huang, Binjie, Feifei Ding, and Yumin Li. 2023. “A Practical Recurrence Risk Model Based on Lasso-Cox Regression for Gastric Cancer.” Journal of Cancer Research and Clinical Oncology 149 (17): 15845–54.
Lee, Tsair-Fwu, Jun-Ping Shiau, Chia-Hui Chen, Wen-Ping Yun, Cheng-Shie Wuu, Yu-Jie Huang, Shyh-An Yeh, Hui-Chun Chen, and Pei-Ju Chao. 2025. “A Machine Learning Model for Predicting Breast Cancer Recurrence and Supporting Personalized Treatment Decisions Through Comprehensive Feature Selection and Explainable Ensemble Learning.” Cancer Management and Research, 917–32.
Muthukrishnan, Ramakrishnan, and R Rohini. 2016. “LASSO: A Feature Selection Technique in Predictive Modeling for Machine Learning.” In 2016 IEEE International Conference on Advances in Computer Applications (ICACA), 18–20. IEEE.
National Cancer Institute. 2025. “Cancer Statistics.” https://www.cancer.gov/about-cancer/understanding/statistics.
Ranstam, Jonas, and Jonathan A Cook. 2018. “LASSO Regression.” Journal of British Surgery 105 (10): 1348–48.
Sembay, Zhandos. 2021. “Seer Breast Cancer Data.” https://zenodo.org/records/5120960.
Shahraki, Hadi Raeisi, Alireza Salehi, and Najaf Zare. 2015. “Survival Prognostic Factors of Male Breast Cancer in Southern Iran: A LASSO-Cox Regression Approach.” Asian Pacific Journal of Cancer Prevention 16 (15): 6773–77.
Teng, Jing, Honglei Zhang, Wuyi Liu, Xiao-Ou Shu, and Fei Ye. 2022. “A Dynamic Bayesian Model for Breast Cancer Survival Prediction.” IEEE Journal of Biomedical and Health Informatics 26 (11): 5716–27.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society Series B: Statistical Methodology 58 (1): 267–88.
———. 1997. “The Lasso Method for Variable Selection in the Cox Model.” Statistics in Medicine 16 (4): 385–95.
Usman, M, Doguwa SIS, and BB Alhaji. 2021. “Comparing the Prediction Accuracy of Ridge, LASSO and Elastic Net Regression Models with Linear Regression Using Breast Cancer Data.” Bayero Journal of Pure & Applied Sciences 14 (2).
Xu, Yuxin, Xiaojie Wang, Ying Huang, Daoxiong Ye, and Pan Chi. 2022. “A LASSO-Based Survival Prediction Model for Patients with Synchronous Colorectal Carcinomas Based on SEER.” Translational Cancer Research 11 (8): 2795.