Survival Analysis Step 1: Main Cox Analysis • leo.ukb

What this step is for

This tutorial is the first survival-analysis step in leo.ukb. Before we talk about subgroup differences, interaction, or mediation, we need a clean answer to the first question:

is there a main exposure-outcome association in the cohort?

That is the role of the main Cox model.

In practice, a strong survival-analysis workflow is easier to defend when it is split into stages:

establish the main exposure-outcome association
only then ask whether that association differs across groups
only then ask whether part of that association may operate through a mediator

Step 1 is therefore the foundation for everything that comes later.

Start from the cohort question and analytic roles

We use survival::lung as a compact teaching example, but we structure the data frame the same way we would for a real cohort analysis.

lung_df <- dplyr::transmute(
  survival::lung,
  outcome = as.integer(status == 2),
  outcome_censor = time / 365.25,
  age = age,
  sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
  ecog_group = factor(
    ph.ecog,
    levels = 0:3,
    labels = c("ECOG0", "ECOG1", "ECOG2", "ECOG3")
  )
)

Now check the analytic roles explicitly instead of jumping straight into model fitting.

Analytic roles in the Step 1 teaching data frame
Variable	Type	Step 1 role
`outcome`	Binary event indicator	Incident outcome event
`outcome_censor`	Follow-up time in years	Incident outcome follow-up time
`age`	Continuous exposure	Main exposure in the first example
`sex`	Binary covariate	Basic demographic adjustment
`ecog_group`	4-level ECOG performance-status grouping	ECOG performance-status adjustment

Inspect the analysis-ready data before fitting

Main Cox analysis is easier to trust when the cohort structure, event counts, and follow-up burden have already been checked.

dplyr::glimpse(lung_df)
#> Rows: 228
#> Columns: 5
#> $ outcome        <int> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ outcome_censor <dbl> 0.8377823, 1.2457221, 2.7652293, 0.5749487, 2.4175222, …
#> $ age            <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60,…
#> $ sex            <fct> Male, Male, Male, Male, Male, Male, Female, Female, Mal…
#> $ ecog_group     <fct> ECOG1, ECOG0, ECOG0, ECOG1, ECOG0, ECOG1, ECOG2, ECOG2,…

knitr::kable(utils::head(lung_df, 6))

outcome	outcome_censor	age	sex	ecog_group
1	0.8377823	74	Male	ECOG1
1	1.2457221	68	Male	ECOG0
0	2.7652293	56	Male	ECOG0
1	0.5749487	57	Male	ECOG1
1	2.4175222	60	Male	ECOG0
0	2.7980835	74	Male	ECOG1

event_summary <- data.frame(
  Metric = c("Participants", "Events", "Event rate"),
  Value = c(
    nrow(lung_df),
    sum(lung_df$outcome),
    sprintf("%.1f%%", 100 * mean(lung_df$outcome))
  )
)

knitr::kable(
  event_summary,
  caption = "Overall event burden"
)

Overall event burden
Metric	Value
Participants	228
Events	165
Event rate	72.4%

knitr::kable(
  dplyr::count(lung_df, sex, outcome),
  caption = "Event counts by sex"
)

Event counts by sex
sex	outcome	n
Male	0	26
Male	1	112
Female	0	37
Female	1	53

followup_summary <- summary(lung_df$outcome_censor)
knitr::kable(
  data.frame(
    Statistic = names(followup_summary),
    Value = as.numeric(followup_summary)
  ),
  caption = "Follow-up summary in years"
)

Follow-up summary in years
Statistic	Value
Min.	0.0136893
1st Qu.	0.4565366
Median	0.6995209
Mean	0.8356809
3rd Qu.	1.0855578
Max.	2.7980835

These quick checks already tell us whether the event rate, subgroup size, and follow-up scale look plausible enough for a first-pass Cox analysis.

Choose the exposure scale before fitting

Before fitting anything, decide whether your exposure is being treated as continuous or categorical. This decision changes what the hazard ratio means.

scale_table <- data.frame(
  Exposure_setup = c("Continuous exposure", "Categorical exposure"),
  Typical_question = c(
    "What is the change in hazard per one-unit increase?",
    "How does each category compare with the reference level?"
  ),
  Example_in_this_step = c("age", "ecog_group"),
  Main_setting = c("Default", "x_exp_type = 'categorical'")
)

knitr::kable(
  scale_table,
  caption = "Choose the exposure scale before fitting the main Cox model"
)

Choose the exposure scale before fitting the main Cox model
Exposure_setup	Typical_question	Example_in_this_step	Main_setting
Continuous exposure	What is the change in hazard per one-unit increase?	age	Default
Categorical exposure	How does each category compare with the reference level?	ecog_group	x_exp_type = ‘categorical’

This is the practical rule:

keep the exposure continuous if the scientific question is per-unit change
treat it as categorical if the scientific question is level-versus-reference comparison

Define the adjustment hierarchy before fitting

In cohort studies, model labels should reflect the scientific adjustment hierarchy rather than coding convenience alone. leo_cox() accepts a named list, and those names are carried directly into the final result table.

cox_models <- list(
  "Crude" = NULL,
  "Model A" = c("sex"),
  "Model B" = c("sex", "ecog_group")
)

cox_models
#> $Crude
#> NULL
#> 
#> $`Model A`
#> [1] "sex"
#> 
#> $`Model B`
#> [1] "sex"        "ecog_group"

This habit makes the output read much more like a manuscript table from the beginning.

Main Cox example 1: continuous exposure

We first fit the age-outcome association.

res_cox <- leo_cox(
  df = lung_df,
  y_out = c("outcome", "outcome_censor"),
  x_exp = "age",
  x_cov = cox_models,
  verbose = FALSE
)

res_cox
#>   Exposure Outcome Case N Control N Person-years      Class Crude HR
#> 1      age outcome    164        63     190.3409 Continuous    1.019
#>   Crude 95% CI Crude P value Model A HR Model A 95% CI Model A P value
#> 1 1.001, 1.038         0.040      1.017   0.999, 1.036           0.061
#>   Model B HR Model B 95% CI Model B P value
#> 1      1.011   0.993, 1.029           0.246

How to read the main Cox result

This output is already paper-oriented:

Case N, Control N, and Person-years tell you how much information went into the fitted model
Class tells you whether the exposure was treated as continuous or categorical
each model block gives HR, 95% CI, and P value

For a continuous exposure, the hazard ratio is interpreted per one-unit increase in the exposure. Here that means per one-year increase in age.

To get the tidy table for downstream programming, set simplify = "tidy" or set simplify = FALSE and use leo_cox_format():

res_cox_full <- leo_cox(
  df = lung_df,
  y_out = c("outcome", "outcome_censor"),
  x_exp = "age",
  x_cov = cox_models,
  simplify = FALSE,
  verbose = FALSE
)
age_tidy <- leo_cox_format(res_cox_full, style = "tidy")
age_model_b <- dplyr::filter(age_tidy, Model == "Model B")
age_model_b
#>     Model Exposure Outcome Level Case N Control N Person-years    HR
#> 1 Model B      age outcome  <NA>    164        63     190.3409 1.011
#>         95% CI P value      Class
#> 1 0.993, 1.029   0.246 Continuous

In this teaching example, the Model B estimate is HR = 1.011, meaning that after adjustment for sex and ECOG status, the estimated hazard is about 1.1% higher per one-year increase in age. The confidence interval, however, includes 1, so the adjusted evidence for a main age-outcome association is limited in this small example.

That is exactly how a main Cox result should be interpreted:

first the direction and size of the point estimate
then the uncertainty
then the scientific caution

Use tidy output when you need programmable rows

The wide table is convenient for direct reading. The tidy table is better for row-binding, plotting, or manuscript automation.

dplyr::select(
  age_tidy,
  Model, Exposure, Outcome, HR, `95% CI`, `P value`
)
#>     Model Exposure Outcome    HR       95% CI P value
#> 1   Crude      age outcome 1.019 1.001, 1.038   0.040
#> 2 Model A      age outcome 1.017 0.999, 1.036   0.061
#> 3 Model B      age outcome 1.011 0.993, 1.029   0.246

A useful rule is:

use simplify = "wide" (the default) or simplify = "tidy" when you only need the result table
use simplify = FALSE when you need fitted model objects or leo_cox_format()

You can also set p_fmt = "scientific" to show precise p-values instead of <0.001.

Main Cox example 2: categorical exposure

The same workflow also handles categorical exposures.

res_cox_cat <- leo_cox(
  df = lung_df,
  y_out = c("outcome", "outcome_censor"),
  x_exp = "ecog_group",
  x_cov = list("Crude" = NULL, "Model A" = c("sex")),
  x_exp_type = "categorical",
  verbose = FALSE
)

res_cox_cat
#>      Exposure Outcome Case N Control N Person-years                  Class
#> 1 ECOG0 (Ref) outcome     37        26   60.6926762 Categorical (4 levels)
#> 2       ECOG1 outcome     82        31   97.2813142 Categorical (4 levels)
#> 3       ECOG2 outcome     44         6   32.0438056 Categorical (4 levels)
#> 4       ECOG3 outcome      1         0    0.3230664 Categorical (4 levels)
#>   Crude HR  Crude 95% CI Crude P value Model A HR Model A 95% CI
#> 1    1.000  1.000, 1.000         1.000      1.000   1.000, 1.000
#> 2    1.446  0.980, 2.134         0.063      1.519   1.028, 2.246
#> 3    2.500  1.610, 3.883        <0.001      2.579   1.660, 4.007
#> 4    9.097 1.218, 67.937         0.031      7.756  1.037, 58.039
#>   Model A P value
#> 1           1.000
#> 2           0.036
#> 3          <0.001
#> 4           0.046

This table is read differently from the age model:

ECOG0 is the reference level
each other row compares one ECOG level with ECOG0
the interpretation is therefore level-versus-reference rather than per-unit increase

ecog_tidy <- leo_cox(
  df = lung_df,
  y_out = c("outcome", "outcome_censor"),
  x_exp = "ecog_group",
  x_cov = list("Crude" = NULL, "Model A" = c("sex")),
  x_exp_type = "categorical",
  simplify = "tidy",
  verbose = FALSE
)
dplyr::select(
  ecog_tidy,
  Model, Exposure, Level, HR, `95% CI`, `P value`
)
#>     Model    Exposure Level    HR        95% CI P value
#> 1   Crude ECOG0 (Ref) ECOG0 1.000  1.000, 1.000   1.000
#> 2   Crude       ECOG1 ECOG1 1.446  0.980, 2.134   0.063
#> 3   Crude       ECOG2 ECOG2 2.500  1.610, 3.883  <0.001
#> 4   Crude       ECOG3 ECOG3 9.097 1.218, 67.937   0.031
#> 5 Model A ECOG0 (Ref) ECOG0 1.000  1.000, 1.000   1.000
#> 6 Model A       ECOG1 ECOG1 1.519  1.028, 2.246   0.036
#> 7 Model A       ECOG2 ECOG2 2.579  1.660, 4.007  <0.001
#> 8 Model A       ECOG3 ECOG3 7.756 1.037, 58.039   0.046

In this example, the adjusted Model A hazard ratios suggest that worse performance status is associated with a higher incident hazard:

ECOG1 versus ECOG0: HR = 1.519
ECOG2 versus ECOG0: HR = 2.579
ECOG3 versus ECOG0: HR = 7.756

The last row also reminds us why raw output must be read critically: ECOG3 has only one event in this small teaching data set, so its confidence interval is extremely wide. Sparse cells are a scientific warning, not just a cosmetic table issue.

How to interpret common Step 1 result patterns

For the main Cox model, the usual interpretation logic is:

if the HR is clearly away from 1 and the confidence interval does not cross 1, the main association is both directionally clear and statistically supported
if the HR is away from 1 but the confidence interval crosses 1, the data suggest a possible association, but the estimate is still imprecise
if the point estimate is close to 1, the effect size may be small; however, the strength of statistical evidence still depends on the precision of the confidence interval, the sample size, and the scientific context
for categorical exposures, always read both the effect size and the cell sizes; sparse strata can generate unstable large HRs

Step 1 is therefore not only about “running Cox”. It is about learning how to separate signal, uncertainty, and sparse-data artifacts.

How to write Step 1 results in a paper

For a continuous exposure, a concise Results-style sentence could be:

In the fully adjusted model, each one-year increase in age was associated with an estimated 1.1% higher incident hazard (HR = 1.011, 95% CI 0.993 to 1.029), although the confidence interval included the null.

For a categorical exposure, a corresponding sentence might be:

Compared with ECOG0, worse performance status was associated with a higher incident hazard, with the strongest contrast observed for ECOG2 (HR = 2.579, 95% CI 1.660 to 4.007) after adjustment for sex.

The scientific style is the same in both cases:

identify the exposure contrast
report HR, 95% CI, and model level
describe the estimate conservatively

What to take away from Step 1

After this step, you should already know:

how to structure a survival-ready cohort data frame
how to inspect event burden and follow-up before fitting
how to choose between continuous and categorical exposure coding
how to define an adjustment hierarchy before fitting
how to interpret a main Cox table in manuscript language

Step 2 then asks a different question:

does the exposure effect differ across groups, and on which scale?

That belongs to subgroup and interaction analysis, not to the main Cox result itself.