
Survival Analysis Step 1: Main Cox Analysis
survival-analysis-step1.RmdWhat this step is for
This tutorial is the first survival-analysis step in
leo.ukb. Before we talk about subgroup differences,
interaction, or mediation, we need a clean answer to the first
question:
is there a main exposure-outcome association in the cohort?
That is the role of the main Cox model.
In practice, a strong survival-analysis workflow is easier to defend when it is split into stages:
- establish the main exposure-outcome association
- only then ask whether that association differs across groups
- only then ask whether part of that association may operate through a mediator
Step 1 is therefore the foundation for everything that comes later.
Start from the cohort question and analytic roles
We use survival::lung as a compact teaching example, but
we structure the data frame the same way we would for a real cohort
analysis.
lung_df <- dplyr::transmute(
survival::lung,
outcome = as.integer(status == 2),
outcome_censor = time / 365.25,
age = age,
sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
ecog_group = factor(
ph.ecog,
levels = 0:3,
labels = c("ECOG0", "ECOG1", "ECOG2", "ECOG3")
)
)Now check the analytic roles explicitly instead of jumping straight into model fitting.
| Variable | Type | Step 1 role |
|---|---|---|
outcome |
Binary event indicator | Incident outcome event |
outcome_censor |
Follow-up time in years | Incident outcome follow-up time |
age |
Continuous exposure | Main exposure in the first example |
sex |
Binary covariate | Basic demographic adjustment |
ecog_group |
4-level ECOG performance-status grouping | ECOG performance-status adjustment |
Inspect the analysis-ready data before fitting
Main Cox analysis is easier to trust when the cohort structure, event counts, and follow-up burden have already been checked.
dplyr::glimpse(lung_df)
#> Rows: 228
#> Columns: 5
#> $ outcome <int> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ outcome_censor <dbl> 0.8377823, 1.2457221, 2.7652293, 0.5749487, 2.4175222, …
#> $ age <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60,…
#> $ sex <fct> Male, Male, Male, Male, Male, Male, Female, Female, Mal…
#> $ ecog_group <fct> ECOG1, ECOG0, ECOG0, ECOG1, ECOG0, ECOG1, ECOG2, ECOG2,…| outcome | outcome_censor | age | sex | ecog_group |
|---|---|---|---|---|
| 1 | 0.8377823 | 74 | Male | ECOG1 |
| 1 | 1.2457221 | 68 | Male | ECOG0 |
| 0 | 2.7652293 | 56 | Male | ECOG0 |
| 1 | 0.5749487 | 57 | Male | ECOG1 |
| 1 | 2.4175222 | 60 | Male | ECOG0 |
| 0 | 2.7980835 | 74 | Male | ECOG1 |
event_summary <- data.frame(
Metric = c("Participants", "Events", "Event rate"),
Value = c(
nrow(lung_df),
sum(lung_df$outcome),
sprintf("%.1f%%", 100 * mean(lung_df$outcome))
)
)
knitr::kable(
event_summary,
caption = "Overall event burden"
)| Metric | Value |
|---|---|
| Participants | 228 |
| Events | 165 |
| Event rate | 72.4% |
| sex | outcome | n |
|---|---|---|
| Male | 0 | 26 |
| Male | 1 | 112 |
| Female | 0 | 37 |
| Female | 1 | 53 |
followup_summary <- summary(lung_df$outcome_censor)
knitr::kable(
data.frame(
Statistic = names(followup_summary),
Value = as.numeric(followup_summary)
),
caption = "Follow-up summary in years"
)| Statistic | Value |
|---|---|
| Min. | 0.0136893 |
| 1st Qu. | 0.4565366 |
| Median | 0.6995209 |
| Mean | 0.8356809 |
| 3rd Qu. | 1.0855578 |
| Max. | 2.7980835 |
These quick checks already tell us whether the event rate, subgroup size, and follow-up scale look plausible enough for a first-pass Cox analysis.
Choose the exposure scale before fitting
Before fitting anything, decide whether your exposure is being treated as continuous or categorical. This decision changes what the hazard ratio means.
scale_table <- data.frame(
Exposure_setup = c("Continuous exposure", "Categorical exposure"),
Typical_question = c(
"What is the change in hazard per one-unit increase?",
"How does each category compare with the reference level?"
),
Example_in_this_step = c("age", "ecog_group"),
Main_setting = c("Default", "x_exp_type = 'categorical'")
)
knitr::kable(
scale_table,
caption = "Choose the exposure scale before fitting the main Cox model"
)| Exposure_setup | Typical_question | Example_in_this_step | Main_setting |
|---|---|---|---|
| Continuous exposure | What is the change in hazard per one-unit increase? | age | Default |
| Categorical exposure | How does each category compare with the reference level? | ecog_group | x_exp_type = ‘categorical’ |
This is the practical rule:
- keep the exposure continuous if the scientific question is per-unit change
- treat it as categorical if the scientific question is level-versus-reference comparison
Define the adjustment hierarchy before fitting
In cohort studies, model labels should reflect the scientific
adjustment hierarchy rather than coding convenience alone.
leo_cox() accepts a named list, and those names are carried
directly into the final result table.
cox_models <- list(
"Crude" = NULL,
"Model A" = c("sex"),
"Model B" = c("sex", "ecog_group")
)
cox_models
#> $Crude
#> NULL
#>
#> $`Model A`
#> [1] "sex"
#>
#> $`Model B`
#> [1] "sex" "ecog_group"This habit makes the output read much more like a manuscript table from the beginning.
Main Cox example 1: continuous exposure
We first fit the age-outcome association.
res_cox <- leo_cox(
df = lung_df,
y_out = c("outcome", "outcome_censor"),
x_exp = "age",
x_cov = cox_models,
verbose = FALSE
)
res_cox
#> Exposure Outcome Case N Control N Person-years Class Crude HR
#> 1 age outcome 164 63 190.3409 Continuous 1.019
#> Crude 95% CI Crude P value Model A HR Model A 95% CI Model A P value
#> 1 1.001, 1.038 0.040 1.017 0.999, 1.036 0.061
#> Model B HR Model B 95% CI Model B P value
#> 1 1.011 0.993, 1.029 0.246How to read the main Cox result
This output is already paper-oriented:
-
Case N,Control N, andPerson-yearstell you how much information went into the fitted model -
Classtells you whether the exposure was treated as continuous or categorical - each model block gives
HR,95% CI, andP value
For a continuous exposure, the hazard ratio is interpreted per one-unit increase in the exposure. Here that means per one-year increase in age.
To get the tidy table for downstream programming, set
simplify = "tidy" or set simplify = FALSE and
use leo_cox_format():
res_cox_full <- leo_cox(
df = lung_df,
y_out = c("outcome", "outcome_censor"),
x_exp = "age",
x_cov = cox_models,
simplify = FALSE,
verbose = FALSE
)
age_tidy <- leo_cox_format(res_cox_full, style = "tidy")
age_model_b <- dplyr::filter(age_tidy, Model == "Model B")
age_model_b
#> Model Exposure Outcome Level Case N Control N Person-years HR
#> 1 Model B age outcome <NA> 164 63 190.3409 1.011
#> 95% CI P value Class
#> 1 0.993, 1.029 0.246 ContinuousIn this teaching example, the Model B estimate is
HR = 1.011, meaning that after adjustment for sex and ECOG
status, the estimated hazard is about 1.1% higher per one-year increase
in age. The confidence interval, however, includes 1, so the adjusted
evidence for a main age-outcome association is limited in this small
example.
That is exactly how a main Cox result should be interpreted:
- first the direction and size of the point estimate
- then the uncertainty
- then the scientific caution
Use tidy output when you need programmable rows
The wide table is convenient for direct reading. The tidy table is better for row-binding, plotting, or manuscript automation.
dplyr::select(
age_tidy,
Model, Exposure, Outcome, HR, `95% CI`, `P value`
)
#> Model Exposure Outcome HR 95% CI P value
#> 1 Crude age outcome 1.019 1.001, 1.038 0.040
#> 2 Model A age outcome 1.017 0.999, 1.036 0.061
#> 3 Model B age outcome 1.011 0.993, 1.029 0.246A useful rule is:
- use
simplify = "wide"(the default) orsimplify = "tidy"when you only need the result table - use
simplify = FALSEwhen you need fitted model objects orleo_cox_format()
You can also set p_fmt = "scientific" to show precise
p-values instead of <0.001.
Main Cox example 2: categorical exposure
The same workflow also handles categorical exposures.
res_cox_cat <- leo_cox(
df = lung_df,
y_out = c("outcome", "outcome_censor"),
x_exp = "ecog_group",
x_cov = list("Crude" = NULL, "Model A" = c("sex")),
x_exp_type = "categorical",
verbose = FALSE
)
res_cox_cat
#> Exposure Outcome Case N Control N Person-years Class
#> 1 ECOG0 (Ref) outcome 37 26 60.6926762 Categorical (4 levels)
#> 2 ECOG1 outcome 82 31 97.2813142 Categorical (4 levels)
#> 3 ECOG2 outcome 44 6 32.0438056 Categorical (4 levels)
#> 4 ECOG3 outcome 1 0 0.3230664 Categorical (4 levels)
#> Crude HR Crude 95% CI Crude P value Model A HR Model A 95% CI
#> 1 1.000 1.000, 1.000 1.000 1.000 1.000, 1.000
#> 2 1.446 0.980, 2.134 0.063 1.519 1.028, 2.246
#> 3 2.500 1.610, 3.883 <0.001 2.579 1.660, 4.007
#> 4 9.097 1.218, 67.937 0.031 7.756 1.037, 58.039
#> Model A P value
#> 1 1.000
#> 2 0.036
#> 3 <0.001
#> 4 0.046This table is read differently from the age model:
-
ECOG0is the reference level - each other row compares one ECOG level with
ECOG0 - the interpretation is therefore level-versus-reference rather than per-unit increase
ecog_tidy <- leo_cox(
df = lung_df,
y_out = c("outcome", "outcome_censor"),
x_exp = "ecog_group",
x_cov = list("Crude" = NULL, "Model A" = c("sex")),
x_exp_type = "categorical",
simplify = "tidy",
verbose = FALSE
)
dplyr::select(
ecog_tidy,
Model, Exposure, Level, HR, `95% CI`, `P value`
)
#> Model Exposure Level HR 95% CI P value
#> 1 Crude ECOG0 (Ref) ECOG0 1.000 1.000, 1.000 1.000
#> 2 Crude ECOG1 ECOG1 1.446 0.980, 2.134 0.063
#> 3 Crude ECOG2 ECOG2 2.500 1.610, 3.883 <0.001
#> 4 Crude ECOG3 ECOG3 9.097 1.218, 67.937 0.031
#> 5 Model A ECOG0 (Ref) ECOG0 1.000 1.000, 1.000 1.000
#> 6 Model A ECOG1 ECOG1 1.519 1.028, 2.246 0.036
#> 7 Model A ECOG2 ECOG2 2.579 1.660, 4.007 <0.001
#> 8 Model A ECOG3 ECOG3 7.756 1.037, 58.039 0.046In this example, the adjusted Model A hazard ratios suggest that worse performance status is associated with a higher incident hazard:
-
ECOG1versusECOG0:HR = 1.519 -
ECOG2versusECOG0:HR = 2.579 -
ECOG3versusECOG0:HR = 7.756
The last row also reminds us why raw output must be read critically:
ECOG3 has only one event in this small teaching data set,
so its confidence interval is extremely wide. Sparse cells are a
scientific warning, not just a cosmetic table issue.
How to interpret common Step 1 result patterns
For the main Cox model, the usual interpretation logic is:
- if the HR is clearly away from 1 and the confidence interval does not cross 1, the main association is both directionally clear and statistically supported
- if the HR is away from 1 but the confidence interval crosses 1, the data suggest a possible association, but the estimate is still imprecise
- if the point estimate is close to 1, the effect size may be small; however, the strength of statistical evidence still depends on the precision of the confidence interval, the sample size, and the scientific context
- for categorical exposures, always read both the effect size and the cell sizes; sparse strata can generate unstable large HRs
Step 1 is therefore not only about “running Cox”. It is about learning how to separate signal, uncertainty, and sparse-data artifacts.
How to write Step 1 results in a paper
For a continuous exposure, a concise Results-style sentence could be:
In the fully adjusted model, each one-year increase in age was associated with an estimated 1.1% higher incident hazard (
HR = 1.011,95% CI 0.993 to 1.029), although the confidence interval included the null.
For a categorical exposure, a corresponding sentence might be:
Compared with
ECOG0, worse performance status was associated with a higher incident hazard, with the strongest contrast observed forECOG2(HR = 2.579,95% CI 1.660 to 4.007) after adjustment for sex.
The scientific style is the same in both cases:
- identify the exposure contrast
- report
HR,95% CI, and model level - describe the estimate conservatively
What to take away from Step 1
After this step, you should already know:
- how to structure a survival-ready cohort data frame
- how to inspect event burden and follow-up before fitting
- how to choose between continuous and categorical exposure coding
- how to define an adjustment hierarchy before fitting
- how to interpret a main Cox table in manuscript language
Step 2 then asks a different question:
does the exposure effect differ across groups, and on which scale?
That belongs to subgroup and interaction analysis, not to the main Cox result itself.