Bank Account Survival Analysis

Published

September 1, 2023

Ever wondered how long your savings will last? Your bank wants to know that too.

Survival Analysis

In my past life as an ecologist, I learned about survival models, but it turns out they show up in all sorts of domains (economics, engineering, and sociology, to name a few). Wikipedia puts it well:

Survival analysis attempts to answer certain questions, such as what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?

Now that I work in finance, I’ve learned that Asset & Liability Management (ALM) teams at banks are interested in estimating the average life of various types of bank accounts. A good average life model lets ALM folks predict future patterns in deposits, a hot topic in the current economic environment.

Average Life of Bank Accounts

Account Balances

Let’s start with some fake data.

set.seed(123)
library(tidyverse)
library(knitr) # for `kable()`

n_accounts <- 500

floor_zero <- function(x) pmax(x, 0)

balance_by_month <- expand_grid(
  account_id = as.character(seq_len(n_accounts)), 
  date = ymd("2008-01-01") + months(0:11)
) |> 
  mutate(month = month(date, label = TRUE, abbr = FALSE)) |> 
  group_by(account_id) |> 
  mutate(
    # One random monthly change per row; the running total is the balance
    balance = rnorm(n(), mean = 0, sd = 1000) |> 
      cumsum() |> 
      floor_zero()
  )

balance_by_month |> 
  head() |> 
  kable()
account_id date month balance
1 2008-01-01 January 0.0000
1 2008-02-01 February 0.0000
1 2008-03-01 March 768.0552
1 2008-04-01 April 838.5636
1 2008-05-01 May 967.8513
1 2008-06-01 June 2682.9163

So we’ve got monthly balances for one year from 500 accounts. Let’s plot the data!

ggplot(balance_by_month, aes(x = date, y = balance, group = account_id)) +
  geom_line(alpha = 0.2)

We’ve got some random walks floored at zero. We’re assuming that balances can’t be negative. It looks like some folks are doing really well but others not so much. How do we translate these balances to survival models?
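To make the floored random walk concrete, here is a minimal sketch for a single account. The monthly changes in `steps` are made-up numbers for illustration, not values from the simulated data.

```r
# Hypothetical monthly changes for one account (illustrative values only)
steps <- c(500, -800, 400, 300, -200)

# The running total can dip below zero...
running_total <- cumsum(steps)

# ...but the balance we keep is clipped at zero
balance <- pmax(running_total, 0)

# The floor only clips the running total, so a balance that hits zero
# can rise above zero again in a later month
hit_zero <- which(balance == 0)[1]
recovers <- any(balance[-seq_len(hit_zero)] > 0)
```

In this toy walk the balance hits zero in the second month and is positive again in the third, so `recovers` is `TRUE`.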

Defining the Event

Survival analysis is all about examining time-to-event phenomena. Back in ecology land, the event was death, but when does a bank account die? Earlier we floored balances at zero. Some banks charge overdraft fees, so a customer who reaches zero is likely to be charged one. We’ll just assume that reaching zero means that a bank account is done for. First, let’s find the first month in which each account hits zero. That’s when the “death” event occurs.

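first_zero_month <- balance_by_month |> 
  filter(balance == 0) |> 
  # Group explicitly rather than relying on the grouping left over from above
  group_by(account_id) |> 
  summarise(first_zero_month = min(.data$month))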

ggplot(first_zero_month, aes(x = first_zero_month)) + 
  geom_bar()

It looks like a cascade of death events. Most occur in the first month, with fewer and fewer as the year goes on. But we’ve got a problem: the floor only clips the running total, so an account that hits zero can climb back above zero in a later month. In this analysis, we don’t want bank accounts to suddenly come back alive! We need an event flag that stays set once it has been set.

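event_by_month <- balance_by_month |> 
  left_join(first_zero_month, by = "account_id") |> 
  mutate(
    # 1 from the first zero-balance month onward, 0 before it;
    # NA for accounts that never hit zero (first_zero_month is NA)
    event = if_else(month >= first_zero_month, 1, 0),
    month = as.integer(month)
  ) |> 
  select(account_id, month, event)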

event_by_month |> 
  head() |> 
  kable()
account_id month event
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
1 6 1

Now our event column stays put once the account balance hits zero.
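Two things are worth sketching here. First, `cummax()` is another way to get a flag that never switches back off. Second, a more conventional input for a Kaplan-Meier fit is one row per account, with accounts that never hit zero kept as right-censored (status 0 at their last observed month) rather than left as `NA`. The `toy` data and the `status` and `time` column names below are assumptions for illustration, not part of the analysis above.

```r
library(dplyr)

# Toy balances for three hypothetical accounts over four months
toy <- tibble(
  account_id = rep(c("a", "b", "c"), each = 4),
  month      = rep(1:4, times = 3),
  balance    = c(0, 200, 0, 50,       # "a" hits zero in month 1, then recovers
                 300, 0, 150, 400,    # "b" hits zero in month 2
                 100, 250, 300, 500)  # "c" never hits zero
)

# cummax() keeps the flag at 1 from the first zero-balance month onward
toy_flagged <- toy |>
  group_by(account_id) |>
  mutate(event = cummax(as.integer(balance == 0))) |>
  ungroup()

# One row per account: the month of the first zero, or the last observed
# month with status = 0 (right-censored) when the account never hits zero
account_level <- toy_flagged |>
  group_by(account_id) |>
  summarise(
    status = max(event),
    time   = if (status == 1) min(month[event == 1]) else max(month)
  )
```

Here account "a" gets time 1 and "b" gets time 2, both with status 1, while "c" is censored at month 4 with status 0. A table shaped like `account_level` can be passed to `Surv(time, status)` directly.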

Fitting Our First Survival Model

Survival models ship with R: the survival package is one of the recommended packages included in a standard installation. Ain’t that great? Let’s fit a model.

library(survival)

fit_customer <- survfit(Surv(month, event) ~ 1, data = event_by_month)

plot(fit_customer, xlab = "Time", ylab = "Survival Probability", main = "Survival Curve for Customers")

Base R plots are just the best sometimes. Now let’s calculate the average life of an account by summing the Kaplan-Meier survival probabilities over time.

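survival_prob <- fit_customer$surv
times <- fit_customer$time
# Width of each time step, measuring the first one from time 0
delta_time <- c(times[1], diff(times))
# Survival probability at each time point, weighted by the step width
account_avg_life <- sum(survival_prob * delta_time)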

There you have it! By this estimate, an account lasts about 6 months on average.
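One caveat about the sum above: it weights each time step by the survival probability at the end of that step. The Kaplan-Meier curve is a step function that holds its previous value until the next time point, so the area under the curve uses the probability at the start of each step instead, beginning at 1. Here is a minimal sketch comparing the two on made-up account-level data. The `toy` values are assumptions for illustration, not the simulated accounts.

```r
library(survival)

# Hypothetical account-level data: month of "death" or last observed month,
# with status = 0 marking an account still alive when observation ended
toy <- data.frame(
  time   = c(1, 2, 3, 4),
  status = c(1, 1, 0, 1)
)

fit <- survfit(Surv(time, status) ~ 1, data = toy)

# Interval widths, counting the first interval from time 0
widths <- diff(c(0, fit$time))

# Height of the step curve over each interval: the survival probability
# just before each time point, which is 1 before the first one
height_left  <- c(1, head(fit$surv, -1))
height_right <- fit$surv

area_left  <- sum(height_left * widths)   # area under the step curve
area_right <- sum(height_right * widths)  # right-endpoint sum, as in the post
```

For this toy data the area under the curve is 2.75 months, while the right-endpoint sum gives 1.75 months, so the choice of endpoint can matter quite a bit when the time steps are as coarse as a month.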