
Comparison between the Hamiltonian Monte Carlo method and the Metropolis–Hastings method for coseismic fault model estimation


Rapid estimation of the source fault and quantitative assessment of the uncertainty of the estimated model can elucidate the occurrence mechanism of earthquakes and inform disaster damage mitigation. Bayesian statistical methods, which evaluate the posterior distribution of the unknowns using the Markov chain Monte Carlo (MCMC) method, are important for uncertainty assessment. The Metropolis–Hastings method, especially the Random walk Metropolis–Hastings (RWMH), has many applications, including coseismic fault estimation. However, RWMH exhibits a trade-off between the transition distance and the acceptance ratio of parameter transition candidates and requires a long mixing time, particularly in high-dimensional problems. This necessitates a more efficient Bayesian method. In this study, we developed a fault estimation algorithm using the Hamiltonian Monte Carlo (HMC) method, which is considered more efficient than other MCMC methods but whose applicability to coseismic fault estimation had not been sufficiently validated; this is its first application to that problem. HMC samples more efficiently by using the gradient information of the posterior distribution. We applied our algorithm to the 2016 Kumamoto earthquake (MJMA 7.3), and its sampling converged in \(2 \times 10^{4}\) samples, including \(1 \times 10^{3}\) burn-in samples. The estimated models satisfactorily accounted for the input data; the variance reduction was approximately 88%, and the estimated fault parameters and event magnitude were consistent with those reported in previous studies. HMC obtained similar results with a chain only 2% as long as that of RWMH. Moreover, the power spectral density (PSD) of each model parameter's Markov chain showed that HMC exhibited a low correlation between successive samples and a long transition distance between samples. These results indicate that HMC has an advantage over RWMH in terms of chain length, and it is expected to provide more efficient estimation for high-dimensional problems that require a long mixing time and for problems using nonlinear Green’s functions, which have a large computational cost.



To minimize earthquake and tsunami damage, the details of the coseismic fault must be estimated rapidly. Surface displacement data supplied by the global navigation satellite system (GNSS) can provide a stable estimation without underestimating the magnitude, whereas seismometer measurements can saturate when a megathrust earthquake occurs (Ohta et al. 2012). GNSS data are used in the real-time GEONET analysis system for rapid deformation monitoring (REGARD) (Kawamoto et al. 2017), which was jointly developed by the Geospatial Information Authority of Japan and Tohoku University. This system analyzes the GNSS carrier phase in real time to detect seismic events and to estimate the fault model automatically and rapidly. Fault models have been estimated automatically for events such as the 2016 Kumamoto earthquake (MJMA 7.3) (Kawamoto et al. 2016), the 2019 Yamagata–Oki earthquake (MJMA 6.7), and the 2021 Fukushima–Oki earthquake (MJMA 7.3). These rapid inferences by REGARD include a slip distribution model on the plate interface in a subduction zone and a single rectangular fault estimation to address non-interplate earthquakes, such as inland and intra-slab earthquakes. The fault models estimated by REGARD have also been adopted as part of the real-time tsunami inundation and damage forecast system operated by the Cabinet Office, Government of Japan (Musa et al. 2018; Ohta et al. 2018). To predict tsunamis and other natural disasters, it is important to understand the uncertainty of the estimated fault models accurately.

For uncertainty evaluation, a Bayesian statistics interpretation is often used to solve the inverse problem. This method is used to quantitatively assess the uncertainty of the prediction using the posterior probability density function (PDF) of model parameters while explicitly integrating prior information. Amey et al. (2018) applied the Bayesian inversion method to the inference of the slip distribution in the 2014 Napa Valley earthquake with self-similarity of the fault slip (Mai and Beroza 2002) as prior information. In addition, many previous studies have used Bayesian inversion (Sambridge and Mosegaard 2002; Fukuda and Johnson 2008; Dettmer et al. 2014; Ohno et al. 2022).

For non-interplate earthquakes such as the 2008 Iwate–Miyagi Nairiku earthquake (MJMA 7.2) or the 2019 Yamagata–Oki earthquake (MJMA 6.7), fault geometries are complex or unknown, and for many earthquakes it is difficult to assume the fault geometry in advance (Ohta et al. 2008; Ohno et al. 2021). The assumption of fault geometry required in the slip distribution model can make inferences ambiguous and affect the inferred result (Fukahata and Wright 2008; Fukuda and Johnson 2010; Duputel et al. 2014; Agata et al. 2021; Dutta et al. 2021).

Previous studies have satisfied the two requirements of rapid estimation and quantitative uncertainty assessment. For example, Minson et al. (2014) developed a real-time estimation method using Bayesian linear regression that includes the selection of the fault geometry, with an assessment of its uncertainty, and the estimation of the slip distribution on that geometry. Ohno et al. (2021) also developed a Bayesian fault estimation method for real-time use that obtains the uncertainty of the parameters of a single rectangular fault by directly sampling the PDF using the Markov chain Monte Carlo (MCMC) method. Their method incorporates parallel tempering (Swendsen and Wang 1986; Geyer 1991) and an algorithm that automatically determines hyperparameters for real-time and automatic fault estimation.

An MCMC sampling method typically used in the context of Bayesian fault estimation is the Random walk Metropolis–Hastings (RWMH) method (Metropolis et al. 1953; Hastings 1970). RWMH forms a Markov chain by accepting or rejecting a randomly sampled transition candidate according to the ratio of the posterior distribution between the candidate and the current parameter set. Because of its simplicity, this algorithm has been applied in many studies in various fields, including fault estimation. However, RWMH exhibits a disadvantageous trade-off between the acceptance ratio and the transition distance of parameters. The acceptance ratio of RWMH sampling tends to decrease as the transition distance in the parameter space increases, especially for the sampling after the burn-in. Although the transition distance is often adjusted to give an acceptance ratio of approximately 30 to 50%, this low acceptance makes the inference inefficient. Furthermore, in higher dimensional problems, the relevant volume shrinks relative to the total model space volume (the curse of dimensionality); thus, the acceptance ratio is low and the transition distance is shortened further. Because of this trade-off, the RWMH method requires a relatively long chain and mixing time.

Considering this problem, we focus on the Hamiltonian Monte Carlo (HMC; Duane et al. 1987) method instead of the RWMH method. The HMC method requires a relatively short Markov chain for the sampling to converge because it can make long-distance transitions while maintaining a high acceptance ratio by utilizing the gradient information of the posterior PDF. Fichtner and Simute (2018) applied HMC to the weakly nonlinear problem of source point determination and developed an efficient and accurate algorithm. That study also proposed application to higher dimensional problems to capitalize on the efficiency of HMC. HMC is frequently used in seismic tomography studies (Fichtner et al. 2019; Gebraad et al. 2020; Muir and Tkalčić 2020). However, HMC has not previously been applied to fault estimation problems, and its applicability needs to be verified.

In this study, as an application test, we developed an algorithm for a single rectangular fault estimation using HMC. To evaluate its performance, the algorithm was applied to the 2016 Kumamoto earthquake. Furthermore, to examine the accuracy and efficiency of the HMC method estimation, the results were compared with those of the conventional RWMH method. We mainly discuss the applicability of HMC to the estimation of earthquake source fault models and a total sample size required for our estimation, but only have a brief discussion of the computational time for real time usage.


Bayes' theorem

Bayesian inversion is based on Bayes’ theorem (Tarantola 2005; Gelman et al. 2021), which is represented by Eq. (1) for a model parameter vector \({\varvec{\theta}}\) and a data vector \({\mathbf{d}}\) used in the estimation:

$$\begin{array}{*{20}c} {f\left( {{\varvec{\theta}}{|}{\mathbf{d}}} \right) = \frac{{f\left( {\varvec{\theta}} \right)f\left( {{\mathbf{d}}{|}{\varvec{\theta}}} \right)}}{{f\left( {\mathbf{d}} \right)}} } \\ \end{array}$$

where \(f({\varvec{\theta}}|{\mathbf{d}})\) is the posterior PDF, \(f\left( {\varvec{\theta}} \right)\) is the prior PDF, \(f({\mathbf{d}}|{\varvec{\theta}})\) is the likelihood function, and \(f\left( {\mathbf{d}} \right),\) which does not depend on the model parameters, is the normalization constant. According to Eq. (1), the posterior PDF is proportional to the product of the prior PDF and the likelihood function.

Sampling method

Hamiltonian Monte Carlo method

We developed a method for single rectangular fault estimation using the HMC method (Duane et al. 1987; Neal 2011) and evaluated its ability in this study. First, in the HMC method, an auxiliary parameter vector \({\varvec{p}}\) with the same dimensions as the model parameter vector \({\varvec{\theta}},\) is introduced. Then, the Hamiltonian \(H\) is expressed by Eqs. (2) and (3) in the phase space at position \({\varvec{\theta}}\) and momentum \({\varvec{p}}\):

$$\begin{array}{*{20}c} {H\left( {{\varvec{\theta}}, {\varvec{p}}} \right) = h\left( {\varvec{\theta}} \right) + \frac{1}{2}{\varvec{p}}^{{\text{T}}}{\varvec{p}} } \\ \end{array}$$
$$\begin{array}{*{20}c} {h\left( {\varvec{\theta}} \right) = - \log \left( {f\left( {{\varvec{\theta}}{|}{\mathbf{d}}} \right)} \right) } \\ \end{array}$$

where \(h\left( {\varvec{\theta}} \right)\) is the potential energy as a function of position \({\varvec{\theta}}\), and the Hamiltonian \(H\) is calculated using the set of \({\varvec{\theta}}\) and \({\varvec{p}}\). Subsequently, the transition in the phase space is performed while satisfying the conservation of energy, keeping the \(H\) value, with some error arising from numerical integration. In the HMC method, the leapfrog method is used for the simulation because it ensures volume-preservation, time reversibility, and detailed balance. The standard procedure for a single step of the leapfrog method is shown in Eqs. (4)–(6):

$$\begin{array}{*{20}c} {p_{i} \left( {t + \frac{1}{2}} \right) = p_{i} \left( t \right) - \frac{{\text{e}}}{2}\frac{{\partial h\left( {{\varvec{\theta}}\left( t \right)} \right)}}{{\partial \theta_{i} }} } \\ \end{array}$$
$$\begin{array}{*{20}c} {\theta_{i} \left( {t + 1} \right) = \theta_{i} \left( t \right) + {\text{e}}p_{i} \left( {t + \frac{1}{2}} \right) } \\ \end{array}$$
$$\begin{array}{*{20}c} {p_{i} \left( {t + 1} \right) = p_{i} \left( {t + \frac{1}{2}} \right) - \frac{\mathrm{e}}{2}\frac{{\partial h\left( {{\varvec{\theta}}\left( {t + 1} \right)} \right)}}{{\partial \theta_{i} }} } \\ \end{array}$$

where \(\theta_{i}\) and \(p_{i}\) are the \(i\)th unknown and auxiliary parameters, respectively, \({\text{e}}\) is the step size, and \(t\) is the current step number. Equations (4)–(6) are iterated from \(t = 1\) to \(\text{L}\), where \(\text{L}\) is the total number of leapfrog steps. We used automatic differentiation in the Python/PyTorch module to calculate the derivatives in Eqs. (4) and (6). Note that the gradient calculation could be made more efficient by differentiating Eq. (3) analytically. Furthermore, \(\text{e}\) and \(\text{L}\) are hyperparameters, and their settings can considerably affect the estimation efficiency for a general inverse problem. Subsequently, the numerical error accumulated in the final parameters \(\theta_{{\text{L}}}\) and \(p_{{\text{L}}}\) is evaluated using the acceptance ratio \(p_{{{\text{accept}}}} :\)

$$\begin{array}{*{20}c} {p_{{{\text{accept}}}} = \min \left( {1,r} \right)} \\ \end{array}$$
$$\begin{array}{*{20}c} {r = \exp \left( {H\left( {{\varvec{\theta}}^{\left( \tau \right)} ,{\varvec{p}}^{\left( \tau \right)} } \right) - H\left( {{\varvec{\theta}}^{{\left( {\tau^{\prime}} \right)}} ,{\varvec{p}}^{{\left( {\tau^{\prime}} \right)}} } \right)} \right) } \\ \end{array}$$

where \(\left( {{\varvec{\theta}}^{\left( \tau \right)} ,{\varvec{p}}^{\left( \tau \right)} } \right)\) represents the current parameters (\(t = 1\) in the leapfrog), \(\tau\) is the current sample number, and \(\left( {{\varvec{\theta}}^{{\left( {\tau^{\prime}} \right)}} ,{\varvec{p}}^{{\left( {\tau^{\prime}} \right)}} } \right)\) denotes the candidate parameters (\(t =\text{L}\)). Because the Hamiltonian is theoretically preserved, only a small numerical error affects Eq. (8); therefore, a higher probability of acceptance than in RWMH is realized.
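As a minimal sketch of Eqs. (2)–(8), the following Python code integrates a leapfrog trajectory and evaluates the acceptance ratio. It assumes a one-dimensional illustrative potential \(h(\theta) = \theta^{2}\) with an analytical gradient in place of the fault model and automatic differentiation; the function names and the step settings are hypothetical and are not taken from the authors' implementation.

```python
import math


def h(theta):
    # Illustrative potential energy (assumption): h(theta) = theta**2
    return theta ** 2


def grad_h(theta):
    # Analytical gradient of the illustrative potential
    return 2.0 * theta


def hamiltonian(theta, p):
    # Eq. (2): potential energy plus kinetic energy
    return h(theta) + 0.5 * p * p


def leapfrog(theta, p, eps, n_steps):
    # Iterate Eqs. (4)-(6) n_steps times with step size eps
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad_h(theta)   # Eq. (4): half step in momentum
        theta = theta + eps * p             # Eq. (5): full step in position
        p = p - 0.5 * eps * grad_h(theta)   # Eq. (6): half step in momentum
    return theta, p


def acceptance_probability(theta0, p0, theta1, p1):
    # Eqs. (7)-(8): min(1, exp(H_current - H_candidate))
    r = math.exp(hamiltonian(theta0, p0) - hamiltonian(theta1, p1))
    return min(1.0, r)


# Hypothetical start point and hyperparameters
theta0, p0 = 1.0, 0.5
theta1, p1 = leapfrog(theta0, p0, eps=0.01, n_steps=100)
p_accept = acceptance_probability(theta0, p0, theta1, p1)
```

Because the leapfrog scheme nearly conserves \(H\), the acceptance ratio stays close to 1 even though the position moves a long way; reversing the sign of the final momentum and integrating again returns to the start point, which illustrates the time reversibility mentioned above.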

The HMC method enables sampling that transitions a long distance in parameter space with a high acceptance ratio through the above procedure. This procedure is conducted as follows:

(i) An initial parameter \(\left( {{\varvec{\theta}}^{\left( 1 \right)} } \right)\), total sample size \(\left( \text{T} \right)\), and the hyperparameters \(\text{e}\) and \(\text{L}\) are set.

(ii) Random numbers are generated from a standard normal distribution and set to \({\varvec{p}}^{\left( \tau \right)}\).

(iii) Transition from the current point \(\left( {{\varvec{\theta}}^{\left( \tau \right)} ,{\varvec{p}}^{\left( \tau \right)} } \right)\) to the candidate point \(\left( {{\varvec{\theta}}^{{\left( {\tau^{\prime}} \right)}} ,{\varvec{p}}^{{\left( {\tau^{\prime}} \right)}} } \right)\) is performed using the leapfrog method and hyperparameters \(\text{e}\) and \(\text{L}\) (Eqs. 4–6).

(iv) The candidate point is accepted or rejected using the acceptance ratio \(p_{{{\text{accept}}}} .\)

(v) Steps (ii)–(iv) are iterated from \(\tau = 1\) to \(\text{T}\).
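The steps above can be sketched in Python as follows. This is an illustrative sampler under stated assumptions, not the authors' code: the target is the one-dimensional posterior \(f(\theta|\text{d}) \propto \exp(-\theta^{2})\), the gradient is analytical, and the function name `hmc`, the seed, and the values of the step size and number of leapfrog steps are hypothetical choices made so that the example runs quickly.

```python
import math
import random


def h(theta):
    # Potential energy for the assumed target f(theta|d) ∝ exp(-theta**2)
    return theta ** 2


def grad_h(theta):
    return 2.0 * theta


def hmc(theta_init, n_samples, eps, n_leapfrog, seed=0):
    rng = random.Random(seed)
    theta = theta_init                        # step (i): initial parameter
    chain, n_accept = [], 0
    for _ in range(n_samples):                # step (v): iterate tau = 1..T
        p = rng.gauss(0.0, 1.0)               # step (ii): draw momentum
        theta_new, p_new = theta, p
        for _ in range(n_leapfrog):           # step (iii): leapfrog, Eqs. (4)-(6)
            p_new -= 0.5 * eps * grad_h(theta_new)
            theta_new += eps * p_new
            p_new -= 0.5 * eps * grad_h(theta_new)
        # step (iv): accept with probability min(1, exp(H_current - H_candidate))
        d_h = (h(theta) + 0.5 * p * p) - (h(theta_new) + 0.5 * p_new * p_new)
        if rng.random() < math.exp(min(0.0, d_h)):
            theta = theta_new
            n_accept += 1
        chain.append(theta)
    return chain, n_accept / n_samples


chain, accept_rate = hmc(theta_init=3.0, n_samples=5000, eps=0.1, n_leapfrog=10)
kept = chain[500:]                            # discard an assumed burn-in
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / (len(kept) - 1)
```

For this target the exact posterior is normal with mean 0 and variance 0.5, so the sample mean and variance of the retained chain give a quick sanity check of the sampler.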

Figure 1 shows an example of exploration using the HMC method, in which the hyperparameters are set to \(\left( {\text{e}, \text{L}} \right) = \left( {10^{ - 3} ,10} \right)\).

Fig. 1

Example of sampling using the Hamiltonian Monte Carlo (HMC) method. This figure shows the exploration of the mean parameter of a simple normal distribution from synthetic Gaussian data \(\left( {f\left( {\theta {\text{|d}}} \right) \propto {\text{exp}}\left( { - \theta^{2} } \right)} \right)\). The contours show lines of constant Hamiltonian and their values. Triangles, circles, and squares show the start points, transition points, and end points of each leapfrog transition, respectively. The blue triangle represents the initial point of the sampling. In addition, the orange line shows the jump to the next momentum, which is generated from a standard normal distribution after acceptance of the candidate point.

No-U-Turn sampler

The HMC method requires the hyperparameters \(\text{e}\) and \(\text{L}\), which have a significant effect on the sampling efficiency. Therefore, they must be optimized to use this method efficiently. If \(\text{L}\) is too small, the transition distance between samples may be too short and more iterations are needed. However, if \(\text{L}\) is too large, the leapfrog generates “U-Turn” trajectories, so that the final transition distance can become shorter than the distance reached midway along the trajectory. To address these problems, especially to avoid the “U-Turn” transition, we used the No-U-Turn Sampler (NUTS; Hoffman and Gelman 2014), an extension of the HMC method.

Although the usual HMC uses a constant \(\text{L}\) throughout the sampling, NUTS is designed so that a constant \(\text{L}\) is not needed. Instead of stopping after the \(\text{L}\)th integration step, the leapfrog transition is terminated when a “U-Turn” is detected. A “U-Turn” in phase space is defined as a shortening of the distance \(Q\) between the rightmost point, \({\varvec{\theta}}^{ + }\), and the leftmost point, \({\varvec{\theta}}^{ - }\), of the leapfrog transition, and it is detected using the time derivative of \(Q\):

$$\begin{array}{*{20}c} {Q = \frac{1}{2}\left( {{\varvec{\theta}}^{ + } - {\varvec{\theta}}^{ - } } \right)^{{\text{T}}} \left( {{\varvec{\theta}}^{ + } - {\varvec{\theta}}^{ - } } \right) } \\ \end{array}$$
$$\begin{array}{*{20}c} {\frac{{{\text{d}}Q}}{{{\text{d}}\tau }} = \left( {{\varvec{\theta}}^{ + } - {\varvec{\theta}}^{ - } } \right)^{{\text{T}}} {\varvec{p}}^{ + } < 0 \;{\text{or}}\; \left( {{\varvec{\theta}}^{ + } - {\varvec{\theta}}^{ - } } \right)^{{\text{T}}} {\varvec{p}}^{ - } < 0 } \\ \end{array}$$

where the current sample number \({\tau }\) represents the time in the phase space, and \({\varvec{p}}^{ + }\) and \({\varvec{p}}^{ - }\) are the momentum at \({\varvec{\theta}}^{ + }\) and \({\varvec{\theta}}^{ - }\), respectively.
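The termination condition of Eq. (10) can be illustrated with a short Python sketch. The function name `is_u_turn` and the two-dimensional example states are hypothetical; the check simply evaluates the two inner products in Eq. (10) for the end points of a trajectory.

```python
def is_u_turn(theta_plus, theta_minus, p_plus, p_minus):
    # Eq. (10): the trajectory is turning back if either end point's momentum
    # has a negative projection on the vector joining the two end points.
    diff = [a - b for a, b in zip(theta_plus, theta_minus)]
    dot_plus = sum(d * p for d, p in zip(diff, p_plus))
    dot_minus = sum(d * p for d, p in zip(diff, p_minus))
    return dot_plus < 0 or dot_minus < 0


# Hypothetical end points: both momenta still point along (theta+ - theta-)
still_extending = is_u_turn([1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0])

# Hypothetical end points: the forward end has started to move back
turning_back = is_u_turn([1.0, 0.0], [0.0, 0.0], [-1.0, 0.0], [1.0, 0.0])
```

In a full NUTS implementation this check is applied repeatedly while the trajectory is doubled; the sketch shows only the criterion itself.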

The above method enables a flexible escape from the U-Turn, but stopping the leapfrog arbitrarily breaks the time reversibility. The NUTS algorithm overcomes this issue by introducing a procedure similar to slice sampling, which is one of the MCMC methods. This is done by exploring from the current point \({\varvec{\theta}}^{\left( \tau \right)}\) while building a binary tree in a random time direction (a sign of \(\text{L}\)), selecting the candidate state set \(C\) based on the condition below, and eventually selecting the subsequent state from \(C\) uniformly:

$$\begin{array}{*{20}c} {C = \left\{ {{\varvec{\theta}}^{\prime} \;|\; {\varvec{\theta}}^{\prime} \in B,\; \exp \left( { - H\left( {{\varvec{\theta}}^{\prime},{\varvec{p}}^{\prime}} \right)} \right) > u} \right\} } \\ \end{array}$$

where \(B\) is the set of all visited states and \(u\) is the so-called slice generated as a uniform random number between 0 and the current posterior probability \(\exp \left( { - H\left( {{\varvec{\theta}}^{\left( \tau \right)} ,{\varvec{p}}^{\left( \tau \right)} } \right)} \right)\).

Comparison with Metropolis–Hastings

To verify the accuracy and efficiency of our method, we compared the estimation by the HMC method with that by the Random walk Metropolis–Hastings (RWMH) method (Metropolis et al. 1953; Hastings 1970), which is often used to solve geophysical problems (Fukuda and Johnson 2008; Ito et al. 2012; Dettmer et al. 2014; Minson et al. 2014; Ohno et al. 2022). The basic part of the RWMH implementation we used is detailed in Ohno et al. (2021). Although Ohno et al. (2021) introduced an automatic determination method for the likelihood variance and parallel tempering, we used the same variance value as in HMC and estimated using only a single Markov chain for a single parameter to compare both methods strictly. Ohno et al. (2021) also developed a method to automatically adjust the variances of the parameter perturbations. Here, however, we adjusted them manually so that the chain moves quickly away from the initial parameters without parallel tempering.
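For reference, a generic RWMH sampler can be sketched in a few lines of Python. This is not the implementation of Ohno et al. (2021); it assumes a one-dimensional target \(\exp(-\theta^{2})\) and a Gaussian perturbation whose standard deviation `step_sd` is a hypothetical name for the manually tuned perturbation size.

```python
import math
import random


def log_post(theta):
    # Assumed unnormalized log posterior: log f(theta|d) = -theta**2
    return -theta ** 2


def rwmh(theta_init, n_samples, step_sd, seed=0):
    rng = random.Random(seed)
    theta, lp = theta_init, log_post(theta_init)
    chain, n_accept = [], 0
    for _ in range(n_samples):
        cand = theta + rng.gauss(0.0, step_sd)     # random-walk candidate
        lp_cand = log_post(cand)
        # Accept with probability min(1, f(candidate) / f(current))
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            theta, lp = cand, lp_cand
            n_accept += 1
        chain.append(theta)
    return chain, n_accept / n_samples


# Short proposals are accepted often but move little; long proposals move far
# but are rejected often (the trade-off discussed in the text).
_, acc_small = rwmh(0.0, 5000, step_sd=0.1)
_, acc_large = rwmh(0.0, 5000, step_sd=5.0)
```

Comparing `acc_small` and `acc_large` shows the trade-off between transition distance and acceptance ratio that motivates the use of HMC.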

Setting of a single rectangular fault estimation

We estimated the parameter vector \({\varvec{\theta}}\) of the single rectangular fault model of Okada (1992) from permanent displacement data \({\mathbf{d}}\) based on real-time GNSS using the HMC and NUTS algorithms. The parameter vector \({\varvec{\theta}}\) comprises the length and angle parameters of a rectangular fault in a uniform elastic half-space, that is, the latitude and longitude of the fault plane, top depth, strike, dip, rake, fault length, fault width, and slip amount. In this section, we describe the settings of the estimation, such as the prior distributions and the likelihood function.

Prior distribution

We used normal distributions as the priors of latitude and longitude because external information, such as that from the Earthquake Early Warning system of the Japan Meteorological Agency, can be used, and uniform distributions as non-informative priors for the other parameters. The standard deviation of 2° for the latitude and longitude priors is rather large, but it was chosen roughly to impose only a weak restriction. In addition, we set a uniform prior for the stress drop \(\Delta {\text{sd}}\), which is known to fall within a certain range. The upper and lower bounds of this prior are based on Ohno et al. (2021), who refer to the fault scaling law. The stress drop is calculated using the following equation:

$$\begin{array}{*{20}c} {\Delta {\text{sd}} = \frac{2c\mu S}{{\sqrt {{FL} \cdot {FW}} }}} \\ \end{array}$$

where \(c\) and \(\mu\) are a geometric factor and the rigidity, respectively. We assumed these parameters to be \(0.5\) and \(30 \;\left[ {{\text{GPa}}} \right]\), respectively. \(FL\), \(FW\), and \(S\) denote the length, width, and slip amount of the rectangular fault, respectively. We also established a uniform distribution for the ratio of fault width to fault length because the length is usually longer than the width. All prior distributions are shown in Table 1.
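As a worked example of Eq. (12), the short Python sketch below evaluates the stress drop with \(c = 0.5\) and \(\mu = 30\) GPa. The fault dimensions and slip used here (30 km, 15 km, and 3 m) are hypothetical values chosen only to illustrate the arithmetic, and SI units are assumed throughout.

```python
import math


def stress_drop(slip_m, length_m, width_m, c=0.5, mu_pa=30e9):
    # Eq. (12): stress drop = 2 * c * mu * S / sqrt(FL * FW), returned in Pa
    return 2.0 * c * mu_pa * slip_m / math.sqrt(length_m * width_m)


# Hypothetical fault: FL = 30 km, FW = 15 km, S = 3 m
sd_pa = stress_drop(slip_m=3.0, length_m=30e3, width_m=15e3)
sd_mpa = sd_pa / 1e6   # about 4.24 MPa for these values
```

A value computed in this way can then be compared with the bounds of the uniform stress-drop prior to decide whether a candidate parameter set has non-zero prior probability.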

Table 1 Prior parameter for fault parameters and certain calculated parameters

Likelihood function

The likelihood function is expressed by Eqs. (13) and (14):

$$\begin{array}{*{20}c} {f\left( {{\mathbf{d}}{|}{\varvec{\theta}}} \right) = \frac{1}{{\sqrt {2\pi \sigma_{{\text{h}}}^{2} }^{\text{2N}} }}\exp \left( {\frac{ - 1}{{2\sigma_{{\text{h}}}^{2} }}{\varvec{r}}_{{\text{h}}}^{{\text{T}}} {\varvec{r}}_{{\text{h}}} } \right) \times \frac{1}{{\sqrt {2\pi \sigma_{{\text{v}}}^{2} }^{\text{N}} }}\exp \left( {\frac{ - 1}{{2\sigma_{{\text{v}}}^{2} }}{\varvec{r}}_{{\text{v}}}^{{\text{T}}} {\varvec{r}}_{{\text{v}}} } \right) } \\ \end{array}$$
$$\begin{array}{*{20}c} {{\varvec{r}}_{i} = \widehat{{{\varvec{d}}_{i} }}\left( {\varvec{\theta}} \right) - {\mathbf{d}}_{i} } \\ \end{array}$$

where \(\text{N}\) is the number of stations, and \(\widehat{{{\varvec{d}}_{i} }}\left( {\varvec{\theta}} \right)\) is the surface displacement of model \({\varvec{\theta}}\) calculated by the method of Okada (1992). Subscripts \({\text{h}}\) and \({\text{v}}\) represent the components of horizontal and vertical displacements, respectively, and we assumed that they were independent. Equation (13) assumes that the residuals, including the model and observational error, follow the normal distribution with mean 0 and variance \(\sigma_{{\text{h}}}^{2}\) and \(\sigma_{{\text{v}}}^{2}\).
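The logarithm of Eq. (13), which is the quantity that enters the potential energy in Eq. (3), can be sketched in Python as below. The forward model of Okada (1992) is not reproduced; the predicted displacements are passed in as plain lists, and the function name and the example numbers are hypothetical.

```python
import math


def log_likelihood(pred_h, obs_h, pred_v, obs_v, sigma_h, sigma_v):
    # Logarithm of Eq. (13): independent Gaussian residuals for the
    # horizontal (2N values) and vertical (N values) components.
    def gauss_term(pred, obs, sigma):
        n = len(obs)
        rss = sum((p - o) ** 2 for p, o in zip(pred, obs))   # r^T r, Eq. (14)
        return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - rss / (2.0 * sigma ** 2)

    return gauss_term(pred_h, obs_h, sigma_h) + gauss_term(pred_v, obs_v, sigma_v)


# Hypothetical single-station example in metres (east, north | up)
ll = log_likelihood(
    pred_h=[0.10, -0.05], obs_h=[0.12, -0.04],
    pred_v=[0.02], obs_v=[0.00],
    sigma_h=0.02, sigma_v=0.02,
)
```

Working with the log-likelihood avoids numerical underflow when many stations are used, since the product of many small densities becomes a sum.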

Change in variables

HMC and NUTS use the gradient of the posterior PDF to move the parameters. We observed that the gradients with respect to latitude and longitude were notably large compared with those of the other seven parameters, resulting in very low exploration efficiency. A potential method for correcting the gradients is to adjust the hyperparameter \(\text{e}\), which is the coefficient of the gradient; however, \(\text{e}\) is common to all parameters, so it cannot compensate for the differences among the individual gradients. Therefore, we applied a change of variables to the seven parameters to change their scales. In addition, sampling can be disturbed by rejections that occur when a candidate leaves the parameter's domain and enters a region of zero probability, so we adopted log and logit translations as the change of variables to extend the sampling range. The log translation (Eq. 15) was applied to the depth \(D\), fault length \(FL\), fault width \(FW\), and slip amount \(S\), and the logit translation (Eq. 16) was applied to the strike \(\emptyset\), dip \(\delta\), and rake angle \(\lambda\).

Log translation from \(x \in \left( {a,\infty } \right)\) to \(x^{\prime} \in \left( { - \infty ,\infty } \right)\):

$$\begin{array}{*{20}c} {x^{\prime} = \log \left( {x - a} \right)} \\ \end{array}$$

Logit translation from \(x \in \left( {a,b} \right)\) to \(x^{\prime} \in \left( { - \infty ,\infty } \right)\):

$$\begin{array}{*{20}c} {x^{\prime} = {\text{logit}}\left( x \right) = \log \left( {\frac{x - a}{{b - x}}} \right)} \\ \end{array}$$

For example, using Eqs. (15) and (16), \(D \in \left( {0,\infty } \right) \left[ {{\text{km}}} \right]\) is translated to \(D^{\prime} = \log \left( D \right)\) and \(\emptyset \in \left( {0,360} \right) \left[^\circ \right]\) to \(\emptyset^{\prime} = \log \left\{ {\emptyset /\left( {360 - \emptyset } \right)} \right\}\). These changes of the sampling parameters implicitly assume that the values on the domain boundaries are excluded, improve the efficiency of exploration, and practically expand the exploration range to the whole real line. To correct the PDF for the change of variables, we estimated the following PDF, which includes the Jacobian determinant \(\left| {\mathbf{J}} \right|\) (Tarantola 2005; Gelman et al. 2021):

$$\begin{array}{*{20}c} {f\left( {{\varvec{\theta}}{|}{\mathbf{d}}} \right) = \frac{1}{Z} \cdot f\left( {\varvec{\theta}} \right) \cdot f\left( {{\mathbf{d}}{|}{\varvec{\theta}}} \right) \cdot \left| {\mathbf{J}} \right| } \\ \end{array}$$
$$\begin{array}{*{20}c} {\left| {\mathbf{J}} \right| = e^{{D^{\prime}}} e^{{FL^{\prime} }} e^{{FW^{\prime} }} e^{{S^{\prime}}} \frac{{360e^{{\emptyset^{\prime}}} }}{{\left( {1 + e^{{\emptyset^{\prime}}} } \right)^{2} }}\frac{{90e^{{\delta^{\prime}}} }}{{\left( {1 + e^{{\delta^{\prime}}} } \right)^{2} }}\frac{{360e^{{\lambda^{\prime}}} }}{{\left( {1 + e^{{\lambda^{\prime}}} } \right)^{2} }}} \\ \end{array}$$

where \(Z\) is a normalizing constant that is independent of \({\varvec{\theta}}\). Note that we set the parameter domains as \(\delta \in \left( {0,90} \right) \left[^\circ \right]\), \(\lambda \in \left( { - 180,180} \right) \left[^\circ \right]\), \( FL\in \left( {0,\infty } \right) \left[ {{\text{km}}} \right]\), \(FW \in \left( {0,\infty } \right) \left[ {{\text{km}}} \right]\), and \(S \in \left( {0,\infty } \right) \left[ {\text{m}} \right]\), respectively. The depth parameter is also multiplied by a constant before the log change of variables to adapt the estimation to the 2016 Kumamoto earthquake discussed later. In this study, we chose \(10^{6}\) as a sufficiently large value, which means that depth is treated on the millimeter scale, because the event occurred at a shallow depth in the crust and surface deformation was actually observed. This linear transformation of depth only multiplies the Jacobian (Eq. 18) by \(10^{ - 6}\). These log and logit translations of the posterior distribution are naturally reflected in the gradient computation detailed in the “Hamiltonian Monte Carlo method” section.
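The change of variables in Eqs. (15) and (16) and the corresponding factors of the Jacobian in Eq. (18) can be sketched in Python as follows. The function names are hypothetical, and the depth scaling by \(10^{6}\) is not included; each Jacobian factor is returned in logarithmic form so that it can be added to the log posterior.

```python
import math


def log_forward(x, a=0.0):
    # Eq. (15): x in (a, inf) -> x' on the real line
    return math.log(x - a)


def log_inverse(xp, a=0.0):
    return a + math.exp(xp)


def logit_forward(x, a, b):
    # Eq. (16): x in (a, b) -> x' on the real line
    return math.log((x - a) / (b - x))


def logit_inverse(xp, a, b):
    return a + (b - a) / (1.0 + math.exp(-xp))


def log_jacobian_log(xp):
    # dx/dx' = exp(x'), so log|dx/dx'| = x'
    return xp


def log_jacobian_logit(xp, a, b):
    # dx/dx' = (b - a) * exp(x') / (1 + exp(x'))**2
    return math.log(b - a) + xp - 2.0 * math.log1p(math.exp(xp))


# Example: strike in (0, 360) degrees and depth in (0, inf) km
strike_t = logit_forward(225.0, 0.0, 360.0)
depth_t = log_forward(5.0)
log_jac = log_jacobian_logit(strike_t, 0.0, 360.0) + log_jacobian_log(depth_t)
```

For the strike, the factor evaluates to \(360e^{\emptyset^{\prime}}/(1 + e^{\emptyset^{\prime}})^{2}\), which matches the corresponding term in Eq. (18); the dip and rake terms follow with \(b - a = 90\) and \(360\), respectively.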

Convergence test

In this study, the convergence of each Markov chain was evaluated using two criteria. First, we confirmed the smoothness of the histograms for each model parameter. Although this criterion is subjective, it is required for convenience in later discussions that use a measure of central tendency such as the mode. Second, we used the convergence diagnostic of Gelman (1996) for objective judgment. This method determines convergence by comparing the sample variance of the full chain with that of its divided sub-chains. We applied this test to all model parameters individually and confirmed their convergence.

The Markov chain of length \(\text{T}\), excluding burn-in, is divided into \(\text{K}\) equal sub-chains, each of length \(\text{T}/\text{K}\). Then the following two indices of the chain’s variance are calculated, where \(\overline{\theta }\) and \(\overline{\theta }_{k} \left( {k = 1,2,...,\text{K}} \right)\) are the mean values of a parameter in the full chain and the \(k\)th divided chain, respectively:

$$\begin{array}{*{20}c} {{\text{VAR}}_{{\text{B}}} = \frac{\text{T}}{\text{K} - 1}\mathop \sum \limits_{k = 1}^{\text{K}} \left( {\overline{\theta }_{k} - \overline{\theta }} \right)^{2} } \\ \end{array}$$
$$\begin{array}{*{20}c} {{\text{VAR}}_{{\text{W}}} = \frac{1}{{\text{K}\left( {\text{T} - 1} \right)}}\mathop \sum \limits_{k = 1}^{\text{K}} \mathop \sum \limits_{t = 1}^{\text{T}} \left( {\theta_{k}^{\left( t \right)} - \overline{\theta }_{k} } \right)^{2} } \\ \end{array}$$

Equations (19) and (20) roughly represent the variance between the chains and the average variance of each chain, respectively. Then, the index of a convergence decision using the two indices above is expressed in the following equation:

$$\begin{array}{*{20}c} {R = \sqrt {\frac{\text{T} - 1}{\text{T}} + \frac{1}{\text{T}}\frac{{{\text{VAR}}_{{\text{B}}} }}{{{\text{VAR}}_{{\text{W}}} }}} } \\ \end{array}$$

The \(R\) value is close to 1 for a converged chain; because unlimited sampling must yield a converged Markov chain, Eq. (21) explicitly shows that \(R\) converges to 1 as \( \text{T}\to \infty\). In this study, the convergence threshold is \(R < 1.1\), following Gelman (1996).
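The convergence index can be sketched in Python as below. One assumption is made explicit here: the chain-length symbol in Eqs. (19)–(21) is read as the length of each divided sub-chain (called `n` in the code), which is the usual convention for this diagnostic. The function name `split_r` and the synthetic chains are hypothetical.

```python
import math
import random


def split_r(chain, n_splits):
    # Divide the post-burn-in chain into n_splits equal sub-chains
    n = len(chain) // n_splits                 # assumed sub-chain length
    subs = [chain[k * n:(k + 1) * n] for k in range(n_splits)]
    means = [sum(s) / n for s in subs]
    grand = sum(means) / n_splits
    # Eq. (19): between-chain variance
    var_b = n / (n_splits - 1) * sum((m - grand) ** 2 for m in means)
    # Eq. (20): mean within-chain variance
    var_w = sum((x - m) ** 2 for s, m in zip(subs, means) for x in s) / (n_splits * (n - 1))
    # Eq. (21)
    return math.sqrt((n - 1) / n + var_b / (n * var_w))


rng = random.Random(1)
# Stationary synthetic chain: R should be close to 1
stationary = [rng.gauss(0.0, 1.0) for _ in range(4000)]
# Drifting synthetic chain (not converged): R should exceed the 1.1 threshold
drifting = [0.01 * i + rng.gauss(0.0, 1.0) for i in range(4000)]

r_stationary = split_r(stationary, n_splits=4)
r_drifting = split_r(drifting, n_splits=4)
```

Applying the function to each model parameter's chain separately mirrors the per-parameter test described above.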

Data and the setting of hyperparameters

We applied our method to the 2016 Kumamoto earthquake (MJMA 7.3), which occurred at 16:25:05 (UTC) in the Kyushu region, southwestern Japan. Three-component ground displacement data (east–west, north–south, and up–down) were obtained in real time using the REGARD system (Kawamoto et al. 2016), and we used the data at 60 s after the mainshock from 200 stations near the source (Fig. 2).

Fig. 2

Location of GNSS stations. The main map corresponds to the red rectangular area in the inset map. The red triangles represent the 200 GNSS stations used for the estimation.

Table 2 shows the hyperparameter settings for both the HMC and RWMH estimations. The total number of leapfrog steps, \(\text{L}\), was determined for each sample by the NUTS algorithm explained above. In contrast, the leapfrog step size \(\text{e}\) was determined manually: it was adjusted a priori by repeatedly halving or doubling it according to how rapidly the VR rose at the start of the chain. Another hyperparameter that remains to be optimized is the mass matrix, which is the covariance matrix of the normal distribution for the momentum \({\varvec{p}}\) (that is, a normal distribution with arbitrary covariance is used as the proposal distribution of \({\varvec{p}}\)) and is related to the mixing time of the sampling. Typically, the mass matrix is optimized adaptively during the burn-in chain, which can improve the estimation efficiency. However, in this study, we did not adjust this hyperparameter in order to keep the algorithm simple, and we leave it for future work. The length of the entire chain was determined using the convergence test explained above, and the length of the burn-in chain was determined by evaluating the trace plots of the parameters, especially the variance reduction (VR) calculated by the following equation:

$${\text{VR}} = 100 \times \left( {1 - \frac{{{\varvec{r}}^{{\text{T}}} {\varvec{r}}}}{{{\mathbf{d}}^{{\text{T}}} {\mathbf{d}}}}} \right)$$

where \({\mathbf{d}}\) and \({\varvec{r}}\) denote the vector of all the displacement data and the vector of residuals between the data and the model displacement, respectively (\({\mathbf{d}} = \left( {{\mathbf{d}}_{{\text{h}}}^{{\text{T}}} ,{\mathbf{d}}_{{\text{v}}}^{{\text{T}}} } \right)^{{\text{T}}}\) and \({\varvec{r}} = \left( {{\varvec{r}}_{{\text{h}}}^{{\text{T}}} ,{\varvec{r}}_{{\text{v}}}^{{\text{T}}} } \right)^{{\text{T}}}\)).
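The VR defined above is straightforward to compute; the following Python sketch assumes the observed and predicted displacements are given as flat lists containing the horizontal and vertical components together. The function name and the three-value example are hypothetical.

```python
def variance_reduction(obs, pred):
    # VR = 100 * (1 - r^T r / d^T d), with r = pred - obs
    rss = sum((p - o) ** 2 for p, o in zip(pred, obs))
    dss = sum(o * o for o in obs)
    return 100.0 * (1.0 - rss / dss)


# Hypothetical data: one of three components is misfit by one unit
vr = variance_reduction(obs=[1.0, 2.0, 2.0], pred=[1.0, 2.0, 1.0])
```

A perfect fit gives VR = 100, while a model that predicts zero displacement everywhere gives VR = 0, so the trace of VR along the chain indicates when the sampler has left the poorly fitting initial region.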

Table 2 Setting of hyperparameters

Additional file 1: Figures S1 and S2 show the results of the convergence test and the entire trace plot of each parameter, respectively. Although these figures indicate convergence at 3000 samples, we continued the HMC sampling to 20,000 samples so that the parameters satisfied both criteria. The burn-in samples are often determined using a trace plot of VR or posterior probability (Amey et al. 2018); we likewise removed the first 5% of the chain. Although the horizontal observation error of GNSS is generally smaller than the vertical error, we assumed the same standard deviation in the likelihood for both components, \(\sigma_{{\text{h}}} = \sigma_{{\text{v}}} = 2\;{\text{cm}}\), to simplify the inversion.

Table 2 also shows the hyperparameters for RWMH sampling. The lengths of the entire chain and the burn-in chain were determined in the same way as for HMC and were a factor of 50 longer than those of HMC (Additional file 1: Figs. S1, S2).


Expression of posterior probability density functions

For geophysical interpretation, posterior PDFs of the fault parameters on their original scale are required. However, the sampling directly yields only the PDFs of the rescaled variables because of the change in variables (detailed in “Change in variables” section). Moreover, the changes using the log and logit functions, whose Jacobian is not the identity matrix, complicate the restoration of the original scale. Therefore, we developed a dedicated plotting method for the log and logit transformations. We first made the histograms on the changed parameter scale and then inversely transformed the upper and lower boundary values of each class while maintaining the class frequency. We then plotted the normalized histograms on the original parameter scale using these boundary values. The change in appearance following the use of our expression method is shown in Additional file 1: Fig. S3.
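The plotting procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the bounded form of the logit transform, the parameter bounds, and the synthetic chains are all hypothetical. Only the class boundaries are transformed back, and the class frequencies are left unchanged.

```python
import numpy as np

def histogram_on_original_scale(z_samples, inverse, bins=50):
    """Bin on the transformed scale, then map only the bin edges back."""
    counts, z_edges = np.histogram(z_samples, bins=bins)
    x_edges = inverse(z_edges)        # class boundaries on the original scale
    freq = counts / counts.sum()      # class frequency, kept unchanged
    return freq, x_edges

def inverse_logit(z, lower, upper):
    """Map a logit-scaled value back to the interval (lower, upper)."""
    return lower + (upper - lower) / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical chain sampled as log(depth); exp restores the original scale.
z_log = rng.normal(1.0, 0.5, size=5000)
freq_depth, depth_edges = histogram_on_original_scale(z_log, np.exp)

# Hypothetical chain sampled as a logit-scaled angle bounded by 0 and 90 degrees.
z_logit = rng.normal(0.0, 1.0, size=5000)
freq_dip, dip_edges = histogram_on_original_scale(
    z_logit, lambda z: inverse_logit(z, 0.0, 90.0))
```

The resulting bins have unequal widths on the original scale. The frequencies and edges can be passed to a step plot such as matplotlib's `Axes.stairs(freq, edges)`.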

Results of the application for the Kumamoto earthquake

Figure 3 shows the posterior PDFs of each unknown parameter (latitude, longitude, depth, strike, dip, rake angle, fault length, fault width, and slip amount) and certain seismic parameters [moment magnitude (Mw), stress drop, and VR] computed for each sample inferred by HMC and RWMH.

Fig. 3
figure 3

Posterior probability density functions (PDFs) inferred by Hamiltonian Monte Carlo (HMC) and Random walk Metropolis–Hastings (RWMH) estimation. The HMC and RWMH results are shown by blue and red lines, respectively. The y-axis represents the normalized frequency of the Markov chain Monte Carlo (MCMC) samples. The inset values give statistics (the mean, median, mode, and length of the 95% confidence interval of the PDFs, from top to bottom), shown in blue or red accordingly

The HMC results in Fig. 3 indicate a shallow right-lateral fault dipping toward the northwest, and the displacement predictions calculated from each sample are consistent with the data (VR ~ 88%). The distribution of Mw has a peak at approximately 7.0, a reasonable estimate that is consistent with previous studies (Asano and Iwata 2016; Kawamoto et al. 2016; Yarai et al. 2016; Tanaka et al. 2019).
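For reference, Mw can be computed for each sample from the fault length, width, and slip amount. The sketch below uses the standard relation Mw = (log10 M0 − 9.1)/1.5 with the seismic moment M0 in N m. The rigidity of 30 GPa and the function name are assumptions for illustration; this section of the article does not state the value or the exact procedure used.

```python
import numpy as np

RIGIDITY_PA = 30e9  # assumed rigidity (30 GPa); not taken from the article

def moment_magnitude(length_km, width_km, slip_m, rigidity=RIGIDITY_PA):
    """Mw from rectangular fault dimensions and uniform slip."""
    area_m2 = (np.asarray(length_km) * 1e3) * (np.asarray(width_km) * 1e3)
    m0 = rigidity * area_m2 * np.asarray(slip_m)   # seismic moment in N m
    return (np.log10(m0) - 9.1) / 1.5

# Hypothetical samples of (length, width, slip); works elementwise on chains.
length = np.array([20.0, 22.0])
width = np.array([10.0, 11.0])
slip = np.array([4.0, 3.5])
mw = moment_magnitude(length, width, slip)
```

Because the function operates elementwise, applying it to the full chains of length, width, and slip yields a chain of Mw values whose histogram gives the posterior PDF of Mw.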

Figure 3 shows that the PDFs obtained using HMC agree with those obtained using RWMH in both shape and representative values (mean, median, and mode), even though the two methods perform the estimation independently. This means that HMC can be applied to the single rectangular fault estimation. However, there are a few differences in the PDF of the strike, for which the RWMH result has multiple peaks. These differences persisted even when we ran the RWMH and HMC sampling with chains 10 times longer than those of the main result (Additional file 1: Fig. S4). The different PDF shapes may be attributed to the difference between the two sampling methodologies and to the implicit assumption of a positive depth in HMC sampling. However, the strike posterior is distributed over an extremely narrow range (\(\sim 4^\circ\)), which can have only a small effect on the surface deformation model. Therefore, we concluded that the distributions were in rough agreement, and we will validate this in the future by applying both methods to other events. There are also a few differences in the PDFs of the depth. The outstanding peak near 0 in the HMC depth PDF must be induced by the log change of variables, even if the depth is expressed in millimeters. The log change adjusts the gradient of the posterior over almost the entire domain, but the trend of the gradient changes for \(x < 1\); as a result, the transition distance of the sampling is shortened and values of \(x\) less than \(1\) are sampled often. This issue derives purely from the parametrization, so we would like to adopt a more effective one in the future.

For a comparison of surface deformation, Fig. 4a shows the rectangular fault of the mode model (the so-called maximum a posteriori model) solved by HMC, together with the displacement predicted at each station by this model and the displacement data we used. Figure 4a also shows the mode model solved by RWMH. As already mentioned, the HMC mode model adequately explains the displacement data and is consistent with the RWMH model. Figure 4b, drawn using the method proposed by Ohno et al. (2021), shows the spatial frequency of the fault rectangles. The color scale reflects the overlaps among the rectangles of the HMC samples on the map. The HMC models likely contain only a small estimation uncertainty because the colored area is confined to a significantly narrow region.

Fig. 4
figure 4

Map view results. a Each rectangle denotes a mode model; the red and blue ones filled with slanted lines are inferred using RWMH and HMC, respectively. In this figure, gray lines show known active faults; black and white arrows show the horizontal displacements of the data and the mode model, respectively. The residuals between the vertical displacements of the data and the model are also shown in a color scale. b Rectangular fault frequency drawn using each HMC sample. The color scale represents the frequency: warm colors denote regions of high frequency, and uncolored regions indicate that no fault was estimated there


Comparison of initial exploration

Figure 5 shows the Markov chains of the first 1000 samples in the HMC and RWMH methods. The HMC exploration produces estimates approximately the same as the final results, with a VR of 88%, within 40 samples. In contrast, the RWMH trace plots suggest that the method is still seeking a better solution even at 1000 samples, and approximately 20,000 samples were needed to improve VR to 88% (Additional file 1: Fig. S2). Although we compared the two methods in only a single case, these results are important evidence of the efficient convergence of HMC, which required less than 1% of the chain length of RWMH for model exploration. To exploit this efficiency, the method should be applied to high-dimensional problems, such as slip distribution estimation, or to inversions using a nonlinear Green’s function, which incur a high computational cost for the forward calculation.

Fig. 5
figure 5

Trace plots of the first 1000 samples. The chains of nine model parameters (latitude, longitude, depth, strike, dip, rake angle, fault length, fault width, and slip amount) and three calculated parameters (Mw, stress drop, and VR) are shown in each figure. The blue and red solid lines indicate the traces produced by HMC and RWMH, respectively. Note that the black broken lines indicate the mode value of full chains produced by HMC

The exploration behavior seen in the trace plots also indicates the high efficiency of HMC. The RWMH chains of the rake angle and slip amount initially move in the direction opposite to the mode value of the whole sampling, whereas the HMC chains move directly in the plausible direction. This shows that the HMC method is superior to the RWMH method because its transitions use the gradients of the posterior PDF. The chains of latitude and strike show that, even if HMC overshoots the mode value, it quickly returns to it. This is an interesting exploration tendency that shows the ability of the HMC method to explore a wide range of parameters in a single transition and to aim immediately at the high posterior probability region (a minimum of the negative log posterior). These features cannot be realized by the RWMH method, whose transitions add only a small perturbation and whose acceptance ratio is generally low.
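The contrast between gradient-guided and random walk transitions can be illustrated with a toy example. The Python sketch below is not the fault estimation algorithm of this study: it assumes a five-dimensional standard normal target, an identity mass matrix, and a fixed number of leapfrog steps instead of NUTS, and all function names, step sizes, and starting values are hypothetical.

```python
import numpy as np

def log_post(x):
    # Toy target (assumption): standard normal log density up to a constant.
    return -0.5 * float(x @ x)

def grad_log_post(x):
    return -x

def rwmh_step(x, rng, scale=0.1):
    """One random walk Metropolis-Hastings transition."""
    cand = x + scale * rng.standard_normal(x.size)
    if np.log(rng.uniform()) < log_post(cand) - log_post(x):
        return cand
    return x

def hmc_step(x, rng, eps=0.2, n_leapfrog=10):
    """One HMC transition with a fixed-length leapfrog trajectory."""
    p0 = rng.standard_normal(x.size)           # momentum, identity mass matrix
    q = x.copy()
    p = p0 + 0.5 * eps * grad_log_post(q)      # initial half step for momentum
    for i in range(n_leapfrog):
        q = q + eps * p                        # full step for position
        if i < n_leapfrog - 1:
            p = p + eps * grad_log_post(q)     # full step for momentum
    p = p + 0.5 * eps * grad_log_post(q)       # final half step for momentum
    h_old = -log_post(x) + 0.5 * float(p0 @ p0)
    h_new = -log_post(q) + 0.5 * float(p @ p)
    if np.log(rng.uniform()) < h_old - h_new:  # Metropolis acceptance test
        return q
    return x

# Start both chains far from the mode (the origin) and run 20 transitions.
x_rw = x_hmc = np.full(5, 10.0)
rng_rw, rng_hmc = np.random.default_rng(0), np.random.default_rng(1)
for _ in range(20):
    x_rw = rwmh_step(x_rw, rng_rw)
    x_hmc = hmc_step(x_hmc, rng_hmc)

dist_rw = float(np.linalg.norm(x_rw))    # distance of RWMH chain from the mode
dist_hmc = float(np.linalg.norm(x_hmc))  # distance of HMC chain from the mode
print(f"RWMH: {dist_rw:.2f}, HMC: {dist_hmc:.2f}")
```

In this toy setting the HMC chain reaches the neighborhood of the mode within a few transitions, whereas the RWMH chain with small perturbations is still far away after the same number of transitions. Each HMC transition, however, costs several gradient evaluations.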

Figure 6 shows the marginal PDFs for pairs of the parameters fault length, fault width, and slip amount. Figures using all parameters are shown in Additional file 1: Figs. S5–S8. Figure 6a (HMC) and Fig. 6b (RWMH) display the marginal PDFs including the burn-in samples, indicating that the HMC method reaches the high probability area effectively along a straight, stepwise trace. Compared with RWMH, HMC traces only the plausible area: its white area, which was never explored, is larger, and its blue area, where the samples have a low posterior probability, is smaller. Figure 6c (HMC) and Fig. 6d (RWMH) show that the HMC method generates correlations among parameters similar to those of RWMH; we therefore confirmed that HMC can capture the correlations more efficiently.

Fig. 6
figure 6

Two-dimensional (2D) marginal posterior PDFs of the HMC and RWMH sampling. Each figure indicates the correlation among some of the fault parameters (fault length, fault width, and slip amount). The upper figures show the marginal PDFs obtained by HMC (a) and RWMH (b), including the burn-in exploration, whose starting point is indicated by the red star. The lower figures show the marginal PDFs obtained by HMC (c) and RWMH (d), excluding the burn-in samples

However, HMC and other MCMC methods have difficulty guaranteeing exploration of the global minimum. One method for exploring the parameter space more broadly, reliably, and efficiently is parallel tempering (Dosso et al. 2012; Sambridge 2013; Dettmer et al. 2014; Hallo and Gallovič 2020; Ohno et al. 2021, 2022). In future work, we will introduce this extension into HMC to realize a more efficient and stable fault estimation.

Comparison regarding the autocorrelation of Markov chains

The RWMH method explores the parameter space by slightly perturbing the model parameters. In contrast, HMC moves from place to place and visits plausible samples even if they are far away. Given these sampling characteristics, we analyzed the autocorrelation of the Markov chains generated by both methods.

Figure 7 shows the power spectral densities (PSDs) of the Markov chains, regarded as time series, and their inclinations (the “a” value in the figure) obtained by log approximation. In this figure, the RWMH results clearly show a PSD that decreases with frequency \(f \left[ {/{\text{count}}} \right]\), indicating a random walk tendency; the inclination is roughly \(f^{ - 2}\). The inclination in the zone over 1000 samples appears to be flat, similar to white noise. In contrast, the HMC results show an inclination of less than 0.5 and a white-noise-like tendency for most parameters. Only the depth parameter shows a larger inclination than the other HMC results, nearly 1.5, although it is still slightly smaller than that of the RWMH method. This high autocorrelation is considered to be caused by exploration near the boundary of the parameter domain. Hallo and Gallovič (2020) addressed similar problems using a unique prior PDF that is reflected at the boundary of the parameter domain.

Fig. 7
figure 7

Power spectral density (PSD) of the Markov chains of the nine model parameters. In each figure, red and blue show the results of RWMH and HMC, respectively. The light-colored line shows the PSD, and the heavy-colored line shows the moving average of the PSD (the average over 25 points on each side). The result of the log approximation is shown by the straight line, and its estimated inclination is given at the bottom left of each figure (upper blue: HMC, lower red: RWMH)
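The PSD inclination analysis can be sketched as follows. This Python example assumes a plain periodogram and a least-squares line fitted on log-log axes, and it defines the inclination "a" such that PSD is proportional to \(f^{-a}\). The function name, the frequency cutoff `f_max`, and the synthetic chains are assumptions for illustration, not the procedure used for Fig. 7.

```python
import numpy as np

def psd_slope(chain, f_max=0.1):
    """Estimate a in PSD ~ f^(-a) from a periodogram fitted on log-log axes."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = x.size
    psd = np.abs(np.fft.rfft(x)) ** 2 / n      # periodogram
    freq = np.fft.rfftfreq(n, d=1.0)           # frequency in cycles per sample
    keep = (freq > 0) & (freq <= f_max)        # drop f = 0 and the highest band
    slope, _ = np.polyfit(np.log10(freq[keep]), np.log10(psd[keep]), 1)
    return -slope

rng = np.random.default_rng(0)
white = rng.standard_normal(2 ** 14)               # uncorrelated chain
walk = np.cumsum(rng.standard_normal(2 ** 14))     # random walk chain

a_white = psd_slope(white)
a_walk = psd_slope(walk)
print(f"white noise: a = {a_white:.2f}, random walk: a = {a_walk:.2f}")
```

An uncorrelated chain gives an inclination near 0, and a pure random walk gives an inclination near 2, which are the two limiting behaviors against which the HMC and RWMH chains are compared in the text.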

For MCMC sampling, a higher autocorrelation degrades the sampling efficiency and makes it harder to obtain proper posterior PDFs from finite-length chains (Geyer 2011). In RWMH sampling, neighboring samples are often highly correlated. The thinning method, which discards all but every \(k\)th sampled value, can address this problem (Link and Eaton 2012), but it does not guarantee improved estimation efficiency (Geyer 2011; Link and Eaton 2012). In addition, even if we applied this method to RWMH, we would have to discard several thousand samples for each retained sample because the PSD analysis shows a high autocorrelation in the zone over 1000 samples. Therefore, it is conceivable that the HMC method can efficiently achieve sampling with a lower autocorrelation without wasting samples.
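The cost of thinning can be illustrated with a synthetic, highly correlated chain. The Python sketch below assumes a first-order autoregressive chain as a stand-in for a strongly autocorrelated Markov chain; the coefficient `phi`, the thinning interval `k`, and the function name are hypothetical.

```python
import numpy as np

def autocorr(chain, lag):
    """Sample autocorrelation of a chain at the given positive lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    return float(x[:-lag] @ x[lag:] / (x @ x))

rng = np.random.default_rng(2)
n, phi = 50_000, 0.99                # long chain with strong correlation
noise = rng.standard_normal(n)
chain = np.empty(n)
chain[0] = 0.0
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + noise[t]

k = 500                              # keep every k-th sample
thinned = chain[::k]

rho_full = autocorr(chain, 1)        # close to phi
rho_thin = autocorr(thinned, 1)      # close to zero, but few samples remain
print(len(chain), len(thinned), round(rho_full, 3), round(rho_thin, 3))
```

Thinning removes the correlation between neighboring retained samples, but only 100 of the 50,000 samples remain, which is the waste of samples referred to above.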

More optimized utilization of the HMC

In terms of the number of samples, the HMC method can estimate posterior PDFs more efficiently than the RWMH method in the Bayesian analysis of coseismic fault estimation. However, HMC can be less efficient in terms of calculation time, especially when applied to a problem in which the computational cost of the forward calculation for one sample is low, as in this study. Additional file 1: Figure S9 shows the breakdown of the calculation time in the HMC algorithm. It shows that the backward calculation is more expensive than the forward calculation in this problem, which is the inversion of the nine fault parameters. Therefore, HMC is better suited to problems that require copious resources for the forward computation. We should also pay attention to the gradient calculation. The automatic differentiation we adopted to realize the complicated gradient calculation of Eq. (3) requires the forward calculation to be executed beforehand. Additional file 1: Figure S9 shows that around 66% of the forward calculations are executed for the purpose of automatic differentiation. Thus, refining the gradient calculation method or using analytic derivatives should improve the efficiency of the algorithm drastically.

To reduce the calculation time, a method for optimizing the hyperparameter \(\text{e}\), which represents the leapfrog step size, should also be developed. For example, Stan (Hoffman and Gelman 2014) employs the dual averaging method developed by Nesterov (2009), a stochastic gradient descent method, which is expected to be applicable to our method as well.

Besides the hyperparameters, the calculation time of the estimation depends on the number of model parameters. It is difficult for the RWMH method to sample efficiently in a parameter space that comprises many parameters because the space to be explored is vast (Roberts and Rosenthal 1998; Ringer et al. 2021). However, as reported by Fichtner and Simute (2018), the HMC method can contain the calculation time by using the derivatives of the posterior PDF, which help ensure transitions to the high posterior probability region. Therefore, HMC can be superior and play an important role in solving high-dimensional problems, such as slip distribution estimation.


We developed a fault estimation algorithm using the HMC method; this was the first attempt to use the HMC method for fault estimation. The developed method performs efficiently once the scale of each parameter is adjusted by a change in variables using the log and logit functions. We then applied the developed method to the 2016 Kumamoto earthquake (MJMA 7.3) and assessed its feasibility for the single rectangular fault estimation and its exploration efficiency by comparison with the RWMH method.

We found that the HMC method estimates an informative model that reproduces the observation data and substantially corresponds to the fault model estimated by the RWMH method, although the two algorithms perform the estimation independently. This indicates that the HMC method can be used for the single rectangular fault estimation. The length of the Markov chain required for HMC sampling was approximately 2% of that of the RWMH method, and less than 1% of the samples were needed to reach the high VR fault models (VR ~ 88%). We also clarified that the HMC method achieves a low autocorrelation and long transitions from sample to sample, as shown by the PSD analysis of the Markov chains of each parameter. These results are attributed to the theoretical characteristics of the HMC method, which allows for an efficient search following the gradient of the posterior distribution. This efficient HMC sampling appears to be valid for high-dimensional problems, such as slip distribution estimation, or for problems using a nonlinear Green’s function, which incur high computational costs for one sample.

In contrast, the HMC method requires a longer calculation time for one sample than the RWMH method. This indicates that the RWMH method is more suitable than the HMC method for the estimation of a single rectangular fault, which is a low-dimensional problem. Therefore, the choice between the RWMH and HMC methods depends on the cost of the forward calculation and the number of unknown parameters in the problem at hand.

Availability of data and materials

All 30-s GEONET raw data for post-processing can be obtained via the GSI website. Other data used and/or analyzed in this study are available from the corresponding author upon reasonable request.



Abbreviations

HMC: Hamiltonian Monte Carlo

RWMH: Random walk Metropolis–Hastings

PDF: Probability density function

MCMC: Markov chain Monte Carlo

REGARD: REal-time GEONET analysis system for rapid deformation monitoring

GNSS: Global Navigation Satellite System

VR: Variance reduction

PSD: Power spectral density



This paper benefited from careful reviews by two anonymous reviewers. This work was supported by the JST FOREST Program (Grant Number: JPMJFR202P, Japan). This study was also supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan, under its “The Second Earthquake and Volcano Hazards Observation and Research” Program (Earthquake and Volcano Hazard Reduction Research) and the Next Generation High-Performance Computing Infrastructures and Applications R&D Program by MEXT. This study was supported by the Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research (KAKENHI; Grant: 21H05001). This work was also supported by the Research Project for Disaster Prevention on the great Earthquakes along the Nankai trough by MEXT.


Author information



TY, KO and YO designed the study, developed the programs, and analyzed the data. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Taisuke Yamada.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Supplementary Information

Additional file 1:

Figure S1. Results of the convergence test. Figure S2. Trace plots of entire chains. Figure S3. Example of the expression of posterior PDF for variable-changed parameters. Figure S4. Result when the chain length is 10 times that of the main result. Figure S5. Two-dimensional (2D) marginal posterior PDFs of the HMC sampling, including the burn-in phase. Figure S6. 2D marginal posterior PDFs of the RWMH sampling, including the burn-in phase. Figure S7. 2D marginal posterior PDFs of the HMC sampling, excluding the burn-in phase. Figure S8. 2D marginal posterior PDFs of the RWMH sampling, excluding the burn-in phase. Figure S9. Pie chart of the calculation time spent by HMC.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Yamada, T., Ohno, K. & Ohta, Y. Comparison between the Hamiltonian Monte Carlo method and the Metropolis–Hastings method for coseismic fault model estimation. Earth Planets Space 74, 86 (2022).
