Accelerated failure time model in XGBoost

Last updated on Oct 17, 2020 time to event, XGBoost

Question 1: What is the AFT model?

Let $T$ be the failure time (the response variable) and $x$ be the corresponding covariate vector. If no observation is censored, we could simply regress $T$ on $x$.

Under censoring, we instead observe a bivariate vector $(Y_{i}, δ_{i})$, where $Y_{i} = \min(T_{i}, C_{i})$ and $δ_{i} = 1$ if $Y_{i} = T_{i}$ and 0 otherwise. $C_{i}$ is the censoring time; it is assumed to be independent of $T_{i}$ conditional on the covariates of individual $i$.

Suppose $T_{i} = \exp(β^{'} x_{i} + e_{i})$, where $e_{i}, i = 1, \dots, n$ are independent and identically distributed random variables.

The exponential function may be replaced by another strictly increasing function.
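
The setup above can be sketched with a small simulation. This is only an illustration under stated assumptions: the normal noise for $e_{i}$, the exponential censoring times, and the helper name `simulate_aft` are all hypothetical choices, not part of the model definition.

```python
import math
import random

def simulate_aft(n, beta, sigma=0.5, censor_rate=0.2, seed=0):
    """Draw n observations from T = exp(beta'x + e) with right-censoring.

    Assumptions made for illustration only: e ~ Normal(0, sigma^2) and
    C ~ Exponential(censor_rate), drawn independently of T given x.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]       # covariate vector
        e = rng.gauss(0.0, sigma)                     # i.i.d. noise term
        t = math.exp(sum(b * xi for b, xi in zip(beta, x)) + e)  # failure time T_i
        c = rng.expovariate(censor_rate)              # censoring time C_i
        y = min(t, c)                                 # observed time Y_i
        delta = 1 if t <= c else 0                    # 1 = event observed, 0 = censored
        # t and c are returned only for inspection; in real data they are not both observed
        rows.append((x, y, delta, t, c))
    return rows

rows = simulate_aft(1000, beta=[0.5, -0.3])
n_events = sum(delta for _, _, delta, _, _ in rows)
print(f"{n_events} events, {len(rows) - n_events} right-censored observations")
```

Only `(x, y, delta)` would be available to a model in practice; the latent `t` and `c` are kept here to make the censoring mechanism visible.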

Question 2: What are the advantages of the AFT model?

The AFT model can predict the failure risk directly, whereas Cox-PH does not yield a directly usable predicted risk; it instead gives the relative importance of features.
AFT supports all three censoring types (right, left, and interval).
AFT can provide a better fit when the proportional hazards assumption does not hold.

Question 3: How does it work in XGBoost?

Likelihood function taking account of three censoring types.

XGBoost optimizes a twice-differentiable convex loss function $l (y_{i}, {\hat{y}}_{i})$ in its second-order method of gradient boosting.

Now let’s define a loss function $ℓ_{AFT}$ for the AFT model. Let $D = \{(x_{i}, y_{i})\}_{i = 1}^{n}$ denote the training data, and let $Y_{1}, \dots, Y_{n}$ be i.i.d. random variables that follow the distribution of $Y$. The likelihood of $D$ is the product of the probability densities $f_{Y}$ evaluated at the individual data points.

$L (D) = P [Y_{1} = y_{1}, \dots, Y_{n} = y_{n}] = \prod_{i = 1}^{n} P [Y_{i} = y_{i}] = \prod_{i = 1}^{n} f_{Y} (y_{i})$

As usual, we move to the log scale, so the goal becomes maximizing the log-likelihood.

$\ln L (D) = \sum_{i = 1}^{n} \ln P [Y_{i} = y_{i}] = \sum_{i = 1}^{n} \ln f_{Y} (y_{i})$

Under censoring, we do not know $y_{i}$ exactly for some individuals; we only know an interval that contains it. We therefore revise the likelihood so that each censored observation contributes the probability of its interval:

$\ln L(D) = \underbrace{\sum \ln P[Y_{i} = y_{i}]}_{\text{uncensored label}} + \underbrace{\sum \ln P[\underline{y}_{i} \leq Y_{i} \leq \overline{y}_{i}]}_{\text{censored label with } y_{i} \in [\underline{y}_{i}, \overline{y}_{i}]} = \underbrace{\sum \ln f_{Y}(y_{i})}_{\text{uncensored label}} + \underbrace{\sum \ln \left(F_{Y}(\overline{y}_{i}) - F_{Y}(\underline{y}_{i})\right)}_{\text{censored label with } y_{i} \in [\underline{y}_{i}, \overline{y}_{i}]}$

where $\underline{y}_{i}$ and $\overline{y}_{i}$ are the lower and upper bounds for $y_{i}$, respectively, and $F_{Y}$ is the cumulative distribution function (CDF). When $\overline{y}_{i} = +\infty$, the observation is right-censored; when $\underline{y}_{i} = 0$, it is left-censored; when both bounds are finite and distinct, it is interval-censored.
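
The censored log-likelihood can be sketched in a few lines of Python. This is a minimal illustration, not XGBoost's implementation: the function name `censored_log_likelihood`, the convention that equal bounds mean "uncensored", and the exponential distribution used for $Y$ are all assumptions made for the example.

```python
import math

def censored_log_likelihood(intervals, pdf, cdf):
    """Sum the log-likelihood over (lower, upper) label bounds.

    Convention assumed here: lower == upper means the label is uncensored,
    upper == inf means right-censored, lower == 0 means left-censored.
    """
    total = 0.0
    for lower, upper in intervals:
        if lower == upper:
            total += math.log(pdf(lower))             # ln f_Y(y_i)
        else:
            upper_cdf = 1.0 if math.isinf(upper) else cdf(upper)
            total += math.log(upper_cdf - cdf(lower)) # ln(F_Y(upper) - F_Y(lower))
    return total

# Example distribution chosen only for illustration: Exponential(rate=1)
def exp_pdf(y):
    return math.exp(-y)

def exp_cdf(y):
    return 1.0 - math.exp(-y)

data = [
    (1.0, 1.0),       # uncensored
    (1.0, math.inf),  # right-censored
    (0.0, 1.0),       # left-censored
    (1.0, 2.0),       # interval-censored
]
ll = censored_log_likelihood(data, exp_pdf, exp_cdf)
print(ll)
```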

Likelihood function incorporating the AFT model

AFT model:

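$\ln y_{i} = \hat{y}_{i} + σ z_{i}$, where $\hat{y}_{i} = T(x_{i})$ is the model's prediction for individual $i$ (in XGBoost, the output of the tree ensemble) and $σ$ is a scale parameter. Note that $y_{i}$ is the time to event, and its randomness comes from $z_{i}$, a realization of the random variable $Z$, which follows a known probability distribution.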

Suppose $Y = g (Z)$ and $g (\cdot)$ is a monotone increasing function. The pdf and cdf of $Y$ can be expressed in terms of pdf and cdf of $Z$ :

$f_{Y}(y) = f_{Z}(g^{-1}(y)) \cdot \frac{d}{dy} g^{-1}(y), \qquad F_{Y}(y) = F_{Z}(g^{-1}(y))$
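
As a worked step, assume $g$ is the transformation implied by the AFT model above, $g(z) = \exp(\hat{y} + σ z)$ with $σ > 0$. Its inverse and the derivative of the inverse are then:

$g^{-1}(y) = \frac{\ln y - \hat{y}}{σ}, \qquad \frac{d}{dy} g^{-1}(y) = \frac{1}{σ y}$

Substituting these into the two identities gives

$f_{Y}(y) = f_{Z}\left(\frac{\ln y - \hat{y}}{σ}\right) \cdot \frac{1}{σ y}, \qquad F_{Y}(y) = F_{Z}\left(\frac{\ln y - \hat{y}}{σ}\right)$

which is where the factor $\frac{1}{σ y}$ in the uncensored term of the loss comes from.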

Therefore, the loss function of AFT model will be:

$ℓ_{AFT}(y, \hat{y}) = \begin{cases} -\ln\left[f_{Z}(s(y)) \cdot \frac{1}{σ y}\right] & \text{if } y \text{ is not censored} \\ -\ln\left[F_{Z}(s(\overline{y})) - F_{Z}(s(\underline{y}))\right] & \text{if } y \text{ is censored with } y \in [\underline{y}, \overline{y}] \end{cases}$

where $s (y) = (\ln y - \hat{y}) / σ$
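
The loss can be sketched directly from the two cases. The example below assumes $Z$ is standard normal, which is one possible choice of distribution; the function names are hypothetical, and this is not XGBoost's own code.

```python
import math

def norm_pdf(z):
    """Standard normal density, used here as f_Z (an assumed choice of Z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF, used here as F_Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def aft_loss(y_lower, y_upper, y_hat, sigma=1.0):
    """AFT loss for one label given as bounds [y_lower, y_upper].

    Equal bounds mean the label is uncensored (illustrative convention).
    """
    def cdf_at(y):
        if y <= 0.0:
            return 0.0            # s(y) -> -inf, so F_Z(s(y)) -> 0
        if math.isinf(y):
            return 1.0            # s(y) -> +inf, so F_Z(s(y)) -> 1
        return norm_cdf((math.log(y) - y_hat) / sigma)

    if y_lower == y_upper:
        s = (math.log(y_lower) - y_hat) / sigma
        return -math.log(norm_pdf(s) / (sigma * y_lower))
    return -math.log(cdf_at(y_upper) - cdf_at(y_lower))

print(aft_loss(1.0, 1.0, y_hat=0.0))        # uncensored
print(aft_loss(1.0, math.inf, y_hat=0.0))   # right-censored
print(aft_loss(0.0, 1.0, y_hat=0.0))        # left-censored
```

With $y = 1$, $\hat{y} = 0$ and $σ = 1$ we have $s(y) = 0$, so the uncensored loss reduces to $\frac{1}{2} \ln(2π)$ and both one-sided censored losses reduce to $\ln 2$.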

Gradient and hessian of the AFT loss

The gradient boosting algorithm in XGBoost uses the gradient and hessian of the loss function, which are first and second partial derivatives of $ℓ$ with respect to $\hat{y}$ . The gradient and hessian of the AFT loss function are as follows:

$\left.\frac{\partial ℓ_{AFT}}{\partial \hat{y}}\right|_{y, \hat{y}} = \begin{cases} \frac{f_{Z}^{'}(s(y))}{σ f_{Z}(s(y))} & \text{if } y \text{ is not censored} \\ \frac{f_{Z}(s(\overline{y})) - f_{Z}(s(\underline{y}))}{σ [F_{Z}(s(\overline{y})) - F_{Z}(s(\underline{y}))]} & \text{if } y \text{ is censored with } y \in [\underline{y}, \overline{y}] \end{cases}$

$\left.\frac{\partial^{2} ℓ_{AFT}}{\partial \hat{y}^{2}}\right|_{y, \hat{y}} = \begin{cases} - \frac{f_{Z}(s(y)) f_{Z}^{''}(s(y)) - f_{Z}^{'}(s(y))^{2}}{σ^{2} f_{Z}(s(y))^{2}} & \text{if } y \text{ is not censored} \\ \frac{- [F_{Z}(s(\overline{y})) - F_{Z}(s(\underline{y}))] [f_{Z}^{'}(s(\overline{y})) - f_{Z}^{'}(s(\underline{y}))] + [f_{Z}(s(\overline{y})) - f_{Z}(s(\underline{y}))]^{2}}{σ^{2} [F_{Z}(s(\overline{y})) - F_{Z}(s(\underline{y}))]^{2}} & \text{if } y \text{ is censored with } y \in [\underline{y}, \overline{y}] \end{cases}$
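
One way to sanity-check these formulas is to compare them with finite differences of the loss. The sketch below assumes $Z$ is standard normal, so that $f_{Z}^{'}(z) = -z f_{Z}(z)$ and $f_{Z}^{''}(z) = (z^{2} - 1) f_{Z}(z)$; all function names are hypothetical and this is not XGBoost's implementation.

```python
import math

def pdf(z):
    """Standard normal density f_Z; evaluates to 0.0 at z = +/-inf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cdf(z):
    """Standard normal CDF F_Z; math.erf handles +/-inf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dpdf(z):
    """f_Z'(z) = -z f_Z(z), taken as 0 in the infinite limit."""
    return 0.0 if math.isinf(z) else -z * pdf(z)

def d2pdf(z):
    """f_Z''(z) = (z^2 - 1) f_Z(z), taken as 0 in the infinite limit."""
    return 0.0 if math.isinf(z) else (z * z - 1.0) * pdf(z)

def s(y, y_hat, sigma):
    """s(y) = (ln y - y_hat) / sigma, with limits at y = 0 and y = inf."""
    if y <= 0.0:
        return -math.inf
    if math.isinf(y):
        return math.inf
    return (math.log(y) - y_hat) / sigma

def aft_loss(lo, hi, y_hat, sigma):
    """AFT loss; lo == hi is treated as an uncensored label."""
    if lo == hi:
        return -math.log(pdf(s(lo, y_hat, sigma)) / (sigma * lo))
    return -math.log(cdf(s(hi, y_hat, sigma)) - cdf(s(lo, y_hat, sigma)))

def aft_grad_hess(lo, hi, y_hat, sigma):
    """Gradient and hessian of the AFT loss with respect to y_hat."""
    if lo == hi:
        z = s(lo, y_hat, sigma)
        f, df, d2f = pdf(z), dpdf(z), d2pdf(z)
        grad = df / (sigma * f)
        hess = -(f * d2f - df ** 2) / (sigma ** 2 * f ** 2)
        return grad, hess
    zu, zl = s(hi, y_hat, sigma), s(lo, y_hat, sigma)
    F = cdf(zu) - cdf(zl)      # difference of CDF values
    f = pdf(zu) - pdf(zl)      # difference of density values
    df = dpdf(zu) - dpdf(zl)   # difference of density derivatives
    grad = f / (sigma * F)
    hess = (-F * df + f ** 2) / (sigma ** 2 * F ** 2)
    return grad, hess

def finite_difference(lo, hi, y_hat, sigma, h=1e-4):
    """Central-difference approximations of the gradient and hessian."""
    up = aft_loss(lo, hi, y_hat + h, sigma)
    mid = aft_loss(lo, hi, y_hat, sigma)
    down = aft_loss(lo, hi, y_hat - h, sigma)
    return (up - down) / (2.0 * h), (up - 2.0 * mid + down) / (h * h)

cases = [(2.0, 2.0), (2.0, math.inf), (0.0, 2.0), (1.0, 3.0)]
for lo, hi in cases:
    print((lo, hi), aft_grad_hess(lo, hi, 0.5, 1.2), finite_difference(lo, hi, 0.5, 1.2))
```

For a normal $Z$, the uncensored case simplifies to a gradient of $-s(y)/σ$ and a constant hessian of $1/σ^{2}$, which makes a convenient spot check.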

Cite1: Survival regression with accelerated failure time model in XGBoost

Cite2: XGBoost Documentation
