In the simpler case of a homogeneous ETAS model, where \(\mu (t)=\mu _0\), the only unknowns are \(\{\mu _0, K, \alpha , c, p\}\). They can be estimated by the maximum likelihood estimation (MLE) method, maximizing the log-likelihood function \(\log {L}\) (Ogata 1998)

$$\begin{aligned} \log {L} = \sum _{\{i:S<t_i<T\}} \log {\lambda (t_i)} -\int _S^T{\lambda (t)}\hbox {d}t \end{aligned}$$

(6)

where [*S*, *T*] is the time domain containing the observations.
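As a minimal illustration of Eq. (6), the following sketch evaluates the conditional intensity of a homogeneous ETAS model and the resulting log-likelihood; the catalogue, parameter values, reference magnitude `M0` and the midpoint Riemann sum for the integral are all illustrative choices, not values from this study.

```python
import numpy as np

def etas_intensity(t, events, mags, mu0, K, alpha, c, p, M0=3.0):
    # lambda(t) = mu0 + sum over past events of the Omori-Utsu aftershock kernel
    past = events < t
    trig = K * np.exp(alpha * (mags[past] - M0)) / (t - events[past] + c) ** p
    return mu0 + trig.sum()

def etas_loglik(events, mags, theta, S, T, n_grid=5000):
    # Eq. (6): sum_i log lambda(t_i) - int_S^T lambda(t) dt
    mu0, K, alpha, c, p = theta
    log_lam = sum(np.log(etas_intensity(ti, events, mags, mu0, K, alpha, c, p))
                  for ti in events)
    dt = (T - S) / n_grid                      # midpoint Riemann sum for the integral
    grid = S + dt * (np.arange(n_grid) + 0.5)
    integral = sum(etas_intensity(t, events, mags, mu0, K, alpha, c, p)
                   for t in grid) * dt
    return log_lam - integral

# illustrative catalogue (times and magnitudes) and parameter values
events = np.array([1.0, 2.5, 2.7, 5.0, 9.0])
mags = np.array([4.0, 3.5, 3.2, 4.5, 3.1])
ll = etas_loglik(events, mags, (0.2, 0.05, 1.0, 0.01, 1.1), 0.0, 10.0)
```

In a real application the sum over the grid would be replaced by the closed-form integral of the Omori-Utsu kernel, but the numerical version keeps the sketch short.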

However, the simple maximum likelihood method does not provide good estimates in the case of a non-stationary ETAS model: the MLE estimates would be such that the background rate overfits the data. To avoid this, a roughness penalty has to be applied to constrain the wiggliness of the estimated \(\mu (t)\). Thus, the model unknowns \(\Phi\) and \(\Theta\) are estimated by the penalized maximum likelihood estimation (penalized MLE) method. The estimates \(\hat{\Phi }\) and \(\hat{\Theta }\) are those that maximize the penalized log-likelihood objective (Kumazawa and Ogata 2014)

$$\begin{aligned} R(\Phi , \Theta , \tau ) = \log {L(\Phi ,\Theta )}-\tau \times Q(\Phi ) \end{aligned}$$

(7)

where \(\tau\) is the regularization or smoothing parameter and \(Q(\Phi )\) is the roughness penalty function. The roughness penalty generally takes the form of the integrated squared *m*th-order derivative of the desired function

$$\begin{aligned} Q(\Phi ) = \int _S ^T[{\mu ^{(m)}(t)}]^2\hbox {d}t \end{aligned}$$

(8)

When \(\mu (t)\) is expressed in terms of a B-spline basis, the penalty function can be conveniently expressed as

$$\begin{aligned} Q(\Phi ) = \Phi ^{\prime }P\Phi \end{aligned}$$

(9)

where the penalty matrix *P* is a symmetric matrix with elements

$$\begin{aligned} P_{ij} = \int _S ^T {B_i^{(m)}(t,d,\kappa _t)B_j^{(m)}(t,d,\kappa _t)}\hbox {d}t \end{aligned}$$

(10)
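The penalty matrix of Eq. (10) can be computed exactly, since each product \(B_i^{(m)}B_j^{(m)}\) is polynomial between knots. The sketch below does this with Gauss-Legendre quadrature on each knot interval; the knot placement, degree and derivative order are illustrative, and the function names are ours, not from the paper.

```python
import numpy as np
from scipy.interpolate import BSpline

def penalty_matrix(knots, d, m):
    # P_ij = int B_i^(m)(t) B_j^(m)(t) dt, Eq. (10).
    # Between knots the integrand is a polynomial of degree 2(d - m),
    # so Gauss-Legendre with d - m + 1 nodes per interval is exact.
    M = len(knots) - d - 1                         # number of basis functions
    splines = []
    for i in range(M):
        c = np.zeros(M); c[i] = 1.0
        splines.append(BSpline(knots, c, d).derivative(m))
    xg, wg = np.polynomial.legendre.leggauss(max(d - m + 1, 1))
    P = np.zeros((M, M))
    breaks = np.unique(knots)
    for a, b in zip(breaks[:-1], breaks[1:]):
        t = 0.5 * (b - a) * xg + 0.5 * (a + b)     # map nodes to [a, b]
        w = 0.5 * (b - a) * wg
        V = np.array([s(t) for s in splines])      # derivative values, (M, nodes)
        P += (V * w) @ V.T
    return P

d, m = 3, 2                                        # cubic splines, curvature penalty
inner = np.linspace(0.0, 10.0, 8)                  # illustrative knot placement
knots = np.r_[[0.0] * d, inner, [10.0] * d]        # clamped (repeated end) knots
P = penalty_matrix(knots, d, m)                    # here M = 10
```

Note that, as discussed below, *P* is rank deficient by *m*: for the curvature penalty its null space contains the constant and linear functions.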

The smoothing parameter \(\tau\) in Eq. (7) controls the relative contribution of the goodness-of-fit criterion (here, the log-likelihood) and the roughness penalty function in determining the estimated parameter values. Large \(\tau\) values in a penalized MLE lead to over-smoothed estimates of \(\mu (t)\), while a small \(\tau\) results in a \(\mu (t)\) which is under-smoothed. It is therefore important to employ a \(\tau\) that provides optimal smoothing.

### Choosing optimal smoothness parameter

Consider the penalized log-likelihood objective function given in Eq. (7). The penalty function is used to regularize the inversion and the smoothness parameter \(\tau\) plays the role of a regularization parameter. The purpose of the penalty function is to provide additional constraints to the inversion to impart stability.

As the penalty function depends only on \(\Phi\), it provides constraints for the parameters in \(\Phi\) alone. Depending on the choice of the order *m* of the roughness penalty, the penalty matrix *P* can be rank deficient. For a ridge penalty (\(m=0\)), the penalty matrix is full rank (\(r= M\)); for a penalty on first-order derivatives (\(m=1\)), it has rank \(M-1\), and so on. In general, for a penalty on the *m*th-order derivative, the penalty matrix is rank deficient by *m* and has rank \(r=M-m\). This rank deficiency implies that the penalty matrix cannot simultaneously constrain all the parameters \(\phi _i\) in \(\Phi\); only \(r = M-m\) of them are constrained. In brief, of the \(M+4\) unknown parameters (\(\Phi\) of dimension \(M\) and \(\Theta\) of dimension 4), there are \(M-m\) constrained parameters, say \(\Phi _f=\{\phi _{i}, i=1,2,{\ldots },M-m\}\), and \(4+m\) unconstrained parameters \(\Theta \cup \Phi _d\), where \(\Phi _{d}=\{\phi _{i}, i=M-m+1,{\ldots },M\}\). Apart from these, the optimal smoothing parameter \(\tau\) also has to be estimated. Kumazawa et al. (2016) assumed \(\Theta\) to be known a priori and were hence left with only the constrained parameters \(\Phi _f\) and the other unknowns \(\eta = \{\Phi _d, \tau \}\). They employed the penalized MLE step to estimate \(\Phi _f\) and Type-II MLE to estimate \(\eta\).

#### Type-II maximum likelihood estimation

Consider the expression for the penalized log-likelihood in Eq. (7). From a Bayesian perspective, applying a roughness penalty to the log-likelihood function is equivalent to placing a prior on the variables. To see this, let us exponentiate both sides of Eq. (7)

$$\begin{aligned} e^{R(\Phi , \Theta , \tau )} = e^{\log {L(\Phi ,\Theta )}}\times e^{-\tau \times Q(\Phi )}. \end{aligned}$$

(11)

In the above equation, \(e^{-\tau \times Q(\Phi )}\) corresponds to the prior, \(e^{\log {L(\Phi ,\Theta )}}\) is the likelihood and \(e^{R(\Phi , \Theta , \tau )}\) is proportional to the posterior. Note that estimating the parameters that maximize the penalized log-likelihood objective is the same as finding the mode of the above posterior. Thus, penalized maximum likelihood estimation is equivalent to maximum a posteriori (MAP) estimation.

The prior in expression (11) is improper because of the rank deficiency of the penalty matrix discussed above. However, the penalty matrix has rank \(M-m\), and thus the prior is proper on the \(M-m\) parameters \(\Phi _f\). We can therefore normalize this part of the prior as \(e^{-\tau \times Q(\Phi _f)}/\int {e^{-\tau \times Q(\Phi _f)}}\hbox {d}\Phi _f\) to obtain a proper probability density function. Using this, the posterior \(T(\Phi _f, \eta )\) can be written as

$$\begin{aligned} T(\Phi _f, \eta ) = L(\Phi ,\Theta )\times \frac{ e^{-Q'(\Phi _f, \tau )}}{\int {e^{-Q'(\Phi _f, \tau )}}\hbox {d}\Phi _f} \end{aligned}$$

(12)

where \(Q'(\Phi _f,\tau )= \tau \times Q(\Phi _f)\) is the roughness penalty scaled by the smoothing parameter \(\tau\).

Note that there is no prior on the parameters \(\eta\). They are treated as hyperparameters and are estimated by maximizing the posterior that is marginalized over \(\Phi _{f}\) as

$$\begin{aligned} {\Lambda (\eta )} = {\int {T(\Phi _f, \eta )}\hbox {d}\Phi _f} \end{aligned}$$

(13)

This procedure of estimating hyperparameters by maximizing the marginalized likelihood is called the Type-II maximum likelihood procedure (Kumazawa and Ogata 2014; Ogata 2011); it is also known as empirical Bayes (Bishop 2006). The Akaike Bayesian Information Criterion (ABIC) is equal to \(-\,2 \times \max \log {\Lambda (\eta )} + 2\times \dim (\eta )\). Thus, estimation using Type-II MLE is equivalent to choosing the model that minimizes the ABIC.

Computing the marginalized posterior involves integration over \(\Phi _f=\{\phi _{i}, i=1,2,{\ldots },M-m\}\), which is high-dimensional. This integration is difficult to compute in practice, so the Laplace approximation is used to approximate the posterior by a Gaussian distribution and thereby simplify the integration. Using the Laplace approximation, the logarithm of the marginal likelihood can be written as (Ogata 2011)

$$\begin{aligned} \log {\Lambda (\eta )} = R\left( \hat{\Phi }_f|\eta \right) - \frac{1}{2} \log {\det {H_{R}\left( \hat{\Phi }_f|\eta \right) }} + \frac{1}{2} \log {\det {H_{Q'}\left( \hat{\Phi }_f|\eta \right) }} \end{aligned}$$

(14)

where \(H_{R}(\hat{\Phi }_f|\eta )\) and \(H_{Q'}(\hat{\Phi }_f|\eta )\) denote the Hessians of the penalized log-likelihood and the roughness penalty, respectively, evaluated at the \(\hat{\Phi }_f\) corresponding to the peak of the penalized log-likelihood function. These \(\hat{\Phi }_f\) are generally unknown; in fact, estimating them is our aim. However, both \(\Phi _f\) and \(\eta\) can be estimated by an iterative procedure consisting of two steps: (a) given \(\eta\), \(\Phi _f\) is estimated by penalized MLE, and (b) using this estimate \(\hat{\Phi }_f\), \(\eta\) is estimated by Type-II MLE, maximizing the approximate marginal likelihood objective in Eq. (14). \(\Phi _f\) and \(\eta\) are re-estimated in each iteration until convergence. This iterative procedure was adopted by Kumazawa and Ogata (2014) to estimate \(\Phi _f\) and \(\eta = \{\Phi _d, \tau \}\) given a priori known \(\Theta\). Even when \(\Theta\) is unknown, this iterative procedure can be used to estimate \(\Phi _f\) (step (a)) and a redefined \(\eta = \{\Phi _d, \Theta , \tau \}\) (step (b)), as was done by Ogata (2011) in the context of the spatially inhomogeneous ETAS model. In practice, however, the results obtained with such an approach remain unsatisfactory, because the strongly dependent parameters \(\Phi\) and \(\Theta\) are estimated in two different steps. We therefore propose to use the L-curve procedure of Frasso and Eilers (2015) to choose the optimal \(\tau\) and then use penalized MLE to simultaneously estimate all model unknowns \(\Theta\) and \(\Phi\).
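The alternation between steps (a) and (b) can be illustrated on a deliberately simple toy problem: a single Gaussian mean \(\phi\) with a ridge penalty \(\tau \phi ^2\), for which the Laplace approximation of Eq. (14) is exact because the objective is quadratic. All names, data and values here are illustrative; the actual estimation operates on the full spline coefficient vector.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=200)        # toy data: unknown mean, unit variance
n = y.size

def pen_loglik(phi, tau):
    # R(phi; tau): Gaussian log-likelihood (up to a constant) minus tau * phi^2
    return -0.5 * np.sum((y - phi) ** 2) - tau * phi ** 2

def step_a(tau):
    # step (a): penalized MLE for phi given tau (closed form for this toy model)
    return y.sum() / (n + 2.0 * tau)

def log_marginal(tau, phi_hat):
    # step (b) objective, mirroring Eq. (14):
    # R(phi_hat) - 0.5 log det H_R + 0.5 log det H_Q' (both Hessians are scalars)
    return (pen_loglik(phi_hat, tau)
            - 0.5 * np.log(n + 2.0 * tau)
            + 0.5 * np.log(2.0 * tau))

tau = 1.0
for _ in range(50):                        # alternate (a) and (b) until convergence
    phi_hat = step_a(tau)
    res = minimize_scalar(lambda t: -log_marginal(t, phi_hat),
                          bounds=(1e-8, 1e4), method="bounded")
    converged = abs(res.x - tau) < 1e-6
    tau = res.x
    if converged:
        break
phi_hat = step_a(tau)                      # final penalized MLE at the selected tau
```

For this conjugate toy the scheme converges in a few iterations; the point of the sketch is only the structure of the alternation, not the model.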

#### L-curve method

It is well known that a small smoothing parameter \(\tau\) results in a \(\mu (t)\) which is wiggly (under-smoothed), while a large \(\tau\) gives an estimate of \(\mu (t)\) that is over-smoothed. A good strategy, therefore, is to perform penalized MLE over a range of \(\tau\), from small to large values, and then pick the optimal \(\tau\). When the roughness penalty values corresponding to the estimates of \(\Phi\) and \(\Theta\) at each \(\tau\) are plotted against the respective negative log-likelihood values, an approximately L-shaped curve is obtained. The vertical part of the curve is associated with \(\tau\) values that give under-smoothed solutions and the horizontal part with over-smoothed solutions. Hence, the \(\tau\) value corresponding to the corner of the L-curve can be chosen as the optimal smoothness parameter \(\hat{\tau }\) (Frasso and Eilers 2015). The estimates from penalized MLE corresponding to this \(\hat{\tau }\) are the optimal estimates \(\hat{\Phi }\) and \(\hat{\Theta }\). This procedure is similar to the L-curve method for choosing the optimal regularization parameter in ill-posed inverse problems (Hansen 1999; Sen and Stoffa 2013).
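A sketch of corner selection by maximum curvature, in the spirit of Hansen (1999), is given below. To keep it self-contained, a small ill-posed linear problem with a ridge penalty stands in for the penalized MLE fits; the problem setup, the \(\tau\) grid and the curvature heuristic are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
# toy ill-posed linear problem standing in for the penalized fit
n, k = 60, 20
U, _ = np.linalg.qr(rng.normal(size=(n, k)))
V, _ = np.linalg.qr(rng.normal(size=(k, k)))
A = U @ np.diag(np.logspace(0, -6, k)) @ V.T        # rapidly decaying singular values
x_true = np.sin(np.linspace(0, np.pi, k))
b = A @ x_true + 1e-4 * rng.normal(size=n)

taus = np.logspace(-12, 2, 60)
fits, pens, sols = [], [], []
for tau in taus:
    x = np.linalg.solve(A.T @ A + tau * np.eye(k), A.T @ b)  # penalized estimate
    sols.append(x)
    fits.append(np.log(np.sum((A @ x - b) ** 2)))   # misfit (role of -log L)
    pens.append(np.log(np.sum(x ** 2)))             # roughness penalty value

# corner of the L-curve = point of maximum curvature of the (misfit, penalty) curve
fx, fy = np.array(fits), np.array(pens)
dx, dy = np.gradient(fx), np.gradient(fy)
ddx, ddy = np.gradient(dx), np.gradient(dy)
curv = (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5
i = 2 + int(np.argmax(curv[2:-2]))                  # ignore the end points
tau_hat, x_hat = taus[i], sols[i]
```

The corner solution recovers `x_true` far better than the under-smoothed estimate at the smallest \(\tau\), which is dominated by amplified noise.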

### Adaptive roughness penalty function

The above-described combination of penalized MLE and the L-curve method may work well to describe earthquake occurrences in most cases. However, when the background rate function has significantly non-uniform roughness over its time domain, the global smoothness parameter used in the proposed method can be insufficient to model the earthquake sequence effectively. Quantile knots can accommodate such variable smoothness to some extent, but this is not always sufficient. To obtain better results in such cases, an adaptive penalty function is needed that allows local variations in roughness. In this study, such an adaptive penalty function is obtained by expressing the smoothness parameter as another spline function (Baladandayuthapani et al. 2005) defined over \(M_\tau\ (\ll M)\) B-spline functions as

$$\begin{aligned} \tau (t) = \sum _{i=1} ^{M_\tau } {\tau _i}B_i(t,\kappa _\tau ,d_\tau ) \end{aligned}$$

(15)

where \(\tau _i\), \(d_\tau\) and \(\kappa _\tau\) are the spline coefficients, degree and sub-knots of the smoothness parameter function \(\tau (t)\). A small subset of the background rate knots (\(\kappa _t\)) is chosen as the sub-knots (\(\kappa _\tau\)) for \(\tau\). The penalized log-likelihood function in Eq. (7) thus becomes

$$\begin{aligned} R(\Phi , \Theta , \tau ) = \log {L(\Phi ,\Theta )}-\int _S ^T\tau (t)[{\mu ^{(m)}(t)}]^2\hbox {d}t \end{aligned}$$

(16)

where \(\tau (t)\) is given by Eq. (15).
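The adaptive penalty integral in Eq. (16) can be sketched as below; knot placements, degrees and coefficient values are illustrative. A useful sanity check, used here, is that equal coefficients \(\tau _i\) make \(\tau (t)\) constant (by the partition of unity of clamped B-splines), so the adaptive penalty collapses to that constant times the global penalty of Eq. (8).

```python
import numpy as np
from scipy.interpolate import BSpline

S, T, d, m = 0.0, 10.0, 3, 2
kt = np.r_[[S] * d, np.linspace(S, T, 12), [T] * d]  # clamped knots for mu(t)
M = len(kt) - d - 1
rng = np.random.default_rng(2)
phi = rng.uniform(0.1, 1.0, M)                       # illustrative coefficients
mu_mm = BSpline(kt, phi, d).derivative(m)            # mu^(m)(t)

d_tau = 2
k_tau = np.r_[[S] * d_tau, np.linspace(S, T, 4), [T] * d_tau]  # coarser sub-knots
M_tau = len(k_tau) - d_tau - 1                       # M_tau << M

def adaptive_penalty(tau_coef, n_grid=20000):
    # int_S^T tau(t) [mu^(m)(t)]^2 dt  (Eq. 16), midpoint Riemann sum
    tau_t = BSpline(k_tau, tau_coef, d_tau)          # tau(t) as in Eq. (15)
    dt = (T - S) / n_grid
    t = S + dt * (np.arange(n_grid) + 0.5)
    return np.sum(tau_t(t) * mu_mm(t) ** 2) * dt

pen_adaptive = adaptive_penalty(np.full(M_tau, 0.7)) # constant tau(t) = 0.7
dt = (T - S) / 20000
t = S + dt * (np.arange(20000) + 0.5)
pen_global = np.sum(mu_mm(t) ** 2) * dt              # global penalty of Eq. (8)
```

With non-constant coefficients `tau_coef`, the same function lets the penalty weight vary over the time domain, which is exactly what the adaptive scheme exploits.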

Since there is now more than one smoothness parameter (the spline coefficients \(\tau _i\)), the L-curve methodology cannot be applied. However, the iterative procedure involving penalized MLE and Type-II MLE (Kumazawa and Ogata 2014) described previously can be used to estimate the spline coefficients \(\{\tau _i, i=1,2,{\ldots },M_\tau \}\) by including all of them in the hyperparameter vector \(\eta\). Note that this method involving Type-II MLE gives reliable results only when \(\Phi\) is estimated given known \(\Theta\). Since the model parameters are not always known beforehand, estimates are first obtained using the penalized MLE and L-curve approach described above. As we shall see in the section on synthetic tests, these are reasonably close to the true values. Thus, for estimation using the adaptive penalty approach, we proceed as follows: (a) estimate \(\hat{\Theta }_L\), \(\hat{\Phi }_L\) and \(\hat{\tau }_L\) using the penalized MLE and L-curve approach; (b) set \(\Theta =\hat{\Theta }_L\); (c) with \(\Phi = \hat{\Phi }_L\) and \(\tau _i = \hat{\tau }_L\) as initial values, estimate \(\{\hat{\tau }_i\}\) and \(\hat{\Phi }_A\) using the iterative procedure involving Type-II MLE; and (d) on convergence of the iterative procedure, use the newly estimated \(\hat{\tau }(t)\) and perform a final penalized MLE step to re-estimate both \(\Phi\) and \(\Theta\). In the case of a rapidly varying background rate function, as demonstrated in the following section, this adaptive procedure provides estimates of the background rate that are much better than those obtained using a global smoothness parameter.