Stepwise partitioning algorithm
Conventionally, the slip distribution model is estimated with a hyperparameter governing the intensity of the slip smoothing constraint, determined with an index such as ABIC (Akaike's Bayesian Information Criterion, Akaike 1980; e.g., Fukahata 2009). The error matrix obtained in this case is essential for interpreting the uncertainty of the slip amount at each subfault. These methods are well suited to obtaining a single most likely solution and its error. In other words, with these approaches it is difficult to obtain multiple models that can explain the observed data, and the treatment of the errors remains a challenge. To use the obtained results as disaster information, it is important to present several possible models, with errors, that all satisfy the observed data. Recently, many geodetic studies have adopted the Bayesian approach because of its advantages in quantifying uncertainty (Ito et al. 2012; Dettmer et al. 2014; Minson et al. 2014a, b; Jiang and Simons 2016; Ohno and Ohta 2018; Ohno et al. 2021). Owing to its sampling process, MCMC can generate a large number of model samples and quantify the uncertainty as a probability density function (PDF), which is the ensemble of those samples. In this study, we regarded the diversity of samples as individual slip distribution models that can explain the observation data, and used it to evaluate the uncertainty of tsunami inundation caused by uncertainty in the estimated coseismic slip distribution. Furthermore, to obtain diverse samples efficiently in real time, we tried an approach to regularization that is essentially similar to a smoothing constraint but implemented differently.
We adopted simulated data based on the assumed Mw 8.75 1707 Hoei earthquake (Furumura et al. 2011; Todoriki et al. 2013; Inoue et al. 2016; Kawamoto et al. 2017). The assumed slip reached from Suruga Bay to the westernmost part of Shikoku (Fig. 1). Let \({\varvec{\theta}}\) be a model parameter vector that contains the slip amounts on subfaults along the plate boundary in the Nankai Trough, and \({\varvec{d}}\) be a permanent displacement data vector with three components (two horizontal and one vertical) based on real-time GNSS observations. We used the same method as Ohno et al. (2021) for MCMC sampling. Bayesian estimation of the unknown model parameters conditional on the observations is based on Bayes' theorem:
$$p\left( {{\varvec{\theta}}\,|\,{\varvec{d}}} \right) = \frac{p\left( {{\varvec{d}}\,|\,{\varvec{\theta}}} \right)\, p\left( {\varvec{\theta}} \right)}{p\left( {\varvec{d}} \right)}$$
(1)
where \(p\left({\varvec{d}}\right)\) is a PDF of the observations, \(p\left({\varvec{\theta}}\right)\) is a prior PDF of the model parameters, \(p\left({\varvec{d}}\,|\,{\varvec{\theta}}\right)\) is the likelihood function, and \(p\left({\varvec{\theta}}\,|\,{\varvec{d}}\right)\) is the posterior PDF of the model parameters. Note that \(p\left({\varvec{d}}\right)\) is constant, because the observations are fixed values; hence, the posterior PDF is proportional to the product of the prior PDF and the likelihood function. The likelihood function measures the degree of fit between the observed data \({\varvec{d}}\) and the calculated data \(\widehat{{\varvec{d}}}\left({\varvec{\theta}}\right)\). The residuals are given by \({\varvec{r}}\left({\varvec{\theta}}\right)=\widehat{{\varvec{d}}}\left({\varvec{\theta}}\right)-{\varvec{d}}\). When the number of observation stations is \(N\), the dimension of \({\varvec{d}}\) is \(3N\). Assuming that the estimation error follows a normal distribution, the likelihood function is defined as follows:
$$p\left( {{\varvec{d}}\,|\,{\varvec{\theta}}} \right) = \mathop \prod \limits_{i = E, N, U} \mathop \prod \limits_{j = 1}^{N} \frac{1}{{\sqrt {2\pi \sigma_{ij}^{2} } }}\exp \left[ { - \frac{1}{{2\sigma_{ij}^{2} }}r_{ij}^{2} } \right]$$
(2)
where \({\sigma }_{ij}\) \((i=E\) (East), \(N\) (North), \(U\) (Up–Down); \(j\) is the station index) is an event-dependent hyperparameter that includes modeling and observation errors. This study assumed the hyperparameters for the horizontal and vertical components as follows:
$$\begin{array}{*{20}c} \begin{aligned} \sigma_{Ej} & = \sigma_{Nj} = \max \left( {0.1\sqrt {d_{Ej}^{2} + d_{Nj}^{2} } , \sqrt {e_{Ej}^{2} + e_{Nj}^{2} } } \right), \\ \sigma_{Uj} & = \max \left( {0.1d_{Uj} , e_{Uj} } \right) \\ \end{aligned} \\ \end{array}$$
(3)
where \({e}_{ij}\) is the steady noise of real-time GNSS observations. We used \(0.1{d}_{ij}\) as a proxy for the modeling and observation errors and set \({\sigma }_{ij}\) on each component at each station to whichever of these two error sources is larger. By letting \({\sigma }_{ij}\) depend on the displacement while keeping the real-time GNSS steady noise as a lower bound, we can prevent the model from overfitting the data. We used the Metropolis–Hastings method as the MCMC sampler (hereafter, M–H method; Metropolis et al. 1953; Hastings 1970). We adopted parallel tempering with eight parallel chains to improve the search efficiency (Geyer 1991; Jasra et al. 2007). For simplicity of the real-time analysis, we fixed the rake angles to 90°; as such, \({\varvec{\theta}}\) contains only slip amounts, the number of elements of which equals the number of subfaults. In addition, we assumed the slip amounts to be non-negative, which is equivalent to assuming that the prior PDF is a uniform distribution \(\mathrm{U}\left(0, \infty \right)\) for \({\varvec{\theta}}\). Incorporating the non-negative slip constraint into the inversion as a search-range setting is one of the advantages of the MCMC sampling process over conventional methods. There were 2951 subfaults on the plate boundary. We adopted the analytical solution of Okada (1992) as the Green's function for rectangular subfaults, whose widths are approximately 8 km.
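The error scaling of Eq. (3) and the likelihood of Eq. (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and array shapes are our assumptions, and we take the absolute value of the vertical displacement so that \({\sigma }_{Uj}\) stays positive for subsidence.

```python
import numpy as np

def hyperparameters(d, e):
    """Per-station error scales following Eq. (3).

    d, e : arrays of shape (N, 3) holding the E, N, U components of the
    observed displacements and the steady GNSS noise, respectively.
    """
    horiz_d = np.hypot(d[:, 0], d[:, 1])       # horizontal displacement
    horiz_e = np.hypot(e[:, 0], e[:, 1])       # horizontal steady noise
    sigma_h = np.maximum(0.1 * horiz_d, horiz_e)   # shared by E and N
    sigma_u = np.maximum(0.1 * np.abs(d[:, 2]), e[:, 2])  # abs: our assumption
    return np.column_stack([sigma_h, sigma_h, sigma_u])

def log_likelihood(theta, d, G, sigma):
    """Gaussian log-likelihood of Eq. (2); G maps slip to displacements,
    flattened in (station, component) order."""
    r = (G @ theta).reshape(-1, 3) - d         # residuals r(theta)
    return -0.5 * np.sum((r / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))
```

With the steady noise as a floor, the 10% displacement term only takes over once displacements grow large enough, which is what prevents overfitting of large signals.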
In the M–H method, we propose new transition candidates by applying random perturbations \(\Delta{\varvec{\theta}}\) drawn from a uniform distribution between \(-{\Delta{\varvec{\theta}}}_{\mathrm{max}}\) and \(+{\Delta{\varvec{\theta}}}_{\mathrm{max}}\). However, applying MCMC to such a problem with many unknown parameters takes a long time to converge. Therefore, in addition to parallel tempering, we improved the search efficiency using a multi-stage approach with stepwise partitioning of perturbation groups. Figure 2 shows an overview of the "stepwise partitioning algorithm". The Markov chain was divided into four stages (numbers of steps in each chain): stage 1 (\(3\times {10}^{6}\) steps); stage 2 (\(3\times {10}^{6}\) steps); stage 3 (\(3\times {10}^{6}\) steps); and stage 4 (\(3\times {10}^{6}\) steps). In each stage, we set spatial perturbation groups in advance, because grouping \(\Delta{\varvec{\theta}}\) around subfaults promoted convergence. Thus, we applied the same perturbation \(\Delta{\varvec{\theta}}\) to the same-colored subfaults shown in Fig. 2; we refer to these predetermined areas sharing the same \(\Delta{\varvec{\theta}}\) as "perturbation groups". The number of perturbation groups is 80, 185, 388, and 1451 for stages 1, 2, 3, and 4, respectively. Increasing the number of perturbation groups as the stages progress allows us to estimate features from rough to detailed. Using such a grouping is equivalent to reducing the number of unknown parameters. Note that we did not change the shapes of the 2951 rectangular subfaults; only the perturbation groups were changed. Thus, the analytical displacements in each step were calculated for the 2951 background subfaults using each subfault's fixed Green's function.
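A single M–H step with grouped perturbations can be sketched as follows. This is a simplified illustration under our own naming assumptions; the actual sampler additionally runs eight tempered chains in parallel.

```python
import numpy as np

def mh_step(theta, log_posterior, groups, dtheta_max, rng):
    """One Metropolis-Hastings step with grouped perturbations.

    groups[k] gives the perturbation-group id of subfault k, so every
    subfault in a group receives the same uniform perturbation.
    """
    n_groups = int(groups.max()) + 1
    delta = rng.uniform(-dtheta_max, dtheta_max, size=n_groups)
    candidate = theta + delta[groups]      # same delta within each group
    if np.any(candidate < 0.0):            # prior U(0, inf): zero density
        return theta, False
    log_ratio = log_posterior(candidate) - log_posterior(theta)
    if np.log(rng.uniform()) < log_ratio:  # symmetric proposal: M-H ratio
        return candidate, True
    return theta, False
```

Because `delta` has only as many entries as there are groups, a coarse stage explores a low-dimensional space even though the 2951 background subfaults (and their Green's functions) never change.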
As shown in Fig. 2, in the transition between stages, we used the median of the posterior PDFs of the previous stage as the initial value for the next stage, and used the 95% confidence interval (CI) as the amount of perturbation \({\Delta{\varvec{\theta}}}_{\mathrm{max}}\) for the next stage. By inheriting the uncertainty as the transition amount, we searched a broader range of slip amounts where the uncertainty was larger, within a limited number of samples. In the initial 10% of samples of each stage, we adjusted \({\Delta{\varvec{\theta}}}_{\mathrm{max}}\) by a common scaling factor so that the acceptance rate fell within 20–40% (e.g., Roberts and Rosenthal 2001); the samples generated during this adjustment were discarded as burn-in when generating the PDFs. In addition to the model parameters, we also calculated the Variance Reduction (VR) to evaluate the estimated model:
$$VR = 100\left( {1 - \frac{{{\varvec{r}}^{{\text{T}}} {\varvec{r}}}}{{{\varvec{d}}^{{\text{T}}} {\varvec{d}}}}} \right).$$
(4)
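Equation (4) is a one-line computation on the flattened data and residual vectors; a sketch with illustrative names:

```python
import numpy as np

def variance_reduction(d_obs, d_calc):
    """VR of Eq. (4): 100 for a perfect fit, decreasing as residuals grow."""
    r = d_calc - d_obs                    # residual vector r(theta)
    return 100.0 * (1.0 - (r @ r) / (d_obs @ d_obs))
```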
In the later stages, the partitioning may be too fine for the sensitivity of the data and the earthquake magnitude. We calculated the AIC (Akaike's Information Criterion, Akaike 1973) using the following equation to determine the optimum stage:
$$AIC = - 2\log \left( L \right) + 2M,$$
(5)
where \(L\) indicates the maximum likelihood, and \(M\) indicates the number of unknown parameters, which is the number of perturbation groups in each stage.
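Selecting the optimum stage by Eq. (5) then reduces to comparing four numbers. In the sketch below, the group counts are those of the four stages, while the maximum log-likelihood values are invented purely for illustration:

```python
import numpy as np

def aic(max_log_likelihood, n_groups):
    """AIC of Eq. (5); M is the number of perturbation groups in the stage."""
    return -2.0 * max_log_likelihood + 2.0 * n_groups

def select_stage(max_log_likelihoods, group_counts):
    """Index (0-based) of the stage that minimises AIC."""
    scores = [aic(ll, m) for ll, m in zip(max_log_likelihoods, group_counts)]
    return int(np.argmin(scores))
```

The penalty term `2 * n_groups` is what stops ever-finer partitioning from winning by fit alone.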
The approach presented above is similar to the trans-dimensional approach (e.g., Dettmer et al. 2014), but our algorithm was designed to change the parameterization in only one direction (from coarse to fine) to simplify the problem for real-time purposes. Moreover, our procedure is predefined by a set of perturbation groups with different spatial resolutions. An optimum stage is then selected based on the AIC; this corresponds to a regularization process similar to smoothing constraints, with a discrete search (i.e., four stages) over the regularization parameter. We designed this approach to automatically generate various models with a spatial resolution appropriate for the magnitude of the target earthquake, without smoothing constraints between adjacent subfaults. The subsequent tsunami calculation process ("Real-time tsunami inundation risk map" section) utilizes samples from one selected stage rather than from multiple stages. Within each stage, where the spatial resolution is constant, we benefit from sampling without smoothing constraints in the form of a diversity of samples.
The computation time of this approach is within 30 min for each stage (\(3\times {10}^{6}\) steps in each chain) using an SX-Aurora TSUBASA Type 20B processor with eight parallel chains in the case of 642 GNSS stations. In this study, we gave the highest priority to obtaining a sufficient number of samples to evaluate the differences between the stages. For real-time utilization, a convergence diagnostic (e.g., Gelman 1996) should be introduced so that the analysis moves to the next stage automatically. According to the results estimated in this paper, the chains in each stage appear to converge within about 10% of the preset number of steps. Based on this, if the convergence diagnostics and the AIC-based stage decision are designed properly, it should be possible to complete the calculation up to stage 2 in approximately 6 min, a computation time that may allow real-time use in the future.
Real-time tsunami inundation risk map
In general, because tsunami inundation calculation is a nonlinear problem, it is difficult to evaluate the uncertainty of tsunami inundation based on a single coseismic fault slip model and a variation index, such as the standard deviation of slip amounts. Therefore, to evaluate this uncertainty, we need to prepare multiple coseismic fault slip models that explain the data well. However, it is not realistic to calculate tsunami inundation for all MCMC samples, even if calculation speed were to improve in the future. Therefore, our algorithm aims to efficiently classify the MCMC samples and integrate multiple tsunami inundation scenarios to present the tsunami inundation risk probabilistically. Here, assuming that on the order of \({10}^{2}\) calculations will be possible within several minutes in the future, we added "real-time" to the name of the flow in this study.
Figure 3 shows a flow chart of the realtime tsunami inundation risk map process.

i. Obtain sufficient MCMC samples to evaluate the uncertainty of the slip distribution model ("Stepwise partitioning algorithm" section), and extract samples whose VR is equivalent to that of a representative value based on the posterior PDF.

ii. Classify these samples into K small clusters using the k-means method with \({\varvec{\theta}}\) as the feature value.

iii. Generate K representative slip distributions (using the median value of each subfault within a cluster), and use them as inputs to calculate K individual tsunami inundations.

iv. Count the number of inundations in each grid cell of the map.
The resulting map shows the possible tsunami arrival rate in each computation grid based on the uncertainty of the fault slip estimated from real-time GNSS data.
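Steps ii and iii can be sketched with a minimal k-means over the sample matrix. This is an illustration only (in practice a library implementation would be used, and K is chosen as described below, not hard-coded):

```python
import numpy as np

def kmeans(samples, k, n_iter=50, rng=None):
    """Minimal Lloyd-style k-means: cluster MCMC slip samples (rows)."""
    if rng is None:
        rng = np.random.default_rng(0)
    centers = samples[rng.choice(len(samples), k, replace=False)]
    for _ in range(n_iter):
        # assign each sample to the nearest center in slip space
        dists = ((samples[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):        # keep old center if cluster empties
                centers[j] = samples[labels == j].mean(0)
    return labels

def representative_slips(samples, labels, k):
    """Median slip on each subfault within every cluster (step iii)."""
    return np.stack([np.median(samples[labels == j], axis=0)
                     for j in range(k)])
```

Each row of `representative_slips` is one input slip distribution for a tsunami inundation calculation, so only K calculations are needed instead of one per MCMC sample.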
The main feature of our algorithm is the clustering of MCMC samples, whereas, in general, the posterior PDFs are utilized directly. As discussed, the nonlinearity of the tsunami inundation calculation is a problem, so we need a sufficient set of fault models to evaluate the tsunami risk. However, samples obtained with the M–H method are not necessarily independent, because the M–H method proposes each sample by perturbing the previous one. We adopted the k-means method (Steinhaus 1956) to classify the samples into a predetermined number of clusters, with the number of clusters K determined by considering the computational cost of tsunami inundation, the availability of denominators for the probability display, and the maintenance of the diversity of the MCMC samples. The feature value used for clustering is the slip vector \({\varvec{\theta}}\), so that the clustering phase does not depend on the specific target region of the tsunami calculation. Samples with the same VR were used to ensure fairness in terms of reproducibility of the observed data.
Here, the target of the tsunami inundation calculation was a 1563 km² area between Tosa City and Aki City in Kochi Prefecture (red box in Additional file 1: Figure S1). Additional file 1: Table S1 lists the conditions of the tsunami inundation calculation. The grid size in the target area was 30 × 30 m, and the number of grids was 1,736,388 (2,082 grids along the east–west direction and 834 grids along the north–south direction). Tsunami propagation and inundation were calculated using the TUNAMI code (Tohoku University's numerical analysis model for investigating tsunami), which numerically solves the nonlinear shallow water equations using the staggered leap-frog finite difference method (Imamura 1995). The tsunami simulation was carried out from the time of the earthquake to 6 h later. Rupture propagation was not considered, and the slip was assumed to be instantaneous. We calculated the seafloor displacement using the analytical solution of Okada (1992), and the uplift of the seawater due to the horizontal movement of the seafloor was considered (Tanioka and Satake 1996). Coastal structures, such as breakwaters, were input as line data every 30 m and were assumed not to be breached by the earthquake or tsunami; buildings were not considered. We used the maximum inundation depth up to 6 h after the earthquake at each calculation grid as the inundation depth.
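Once the K maximum-inundation-depth grids are available, step iv reduces to a per-cell count across scenarios. In this sketch, the 1 cm wet/dry threshold is our illustrative assumption, not a value from the original flow:

```python
import numpy as np

def arrival_rate_map(depth_maps, threshold=0.01):
    """Percentage of the K scenarios in which each grid cell is inundated
    deeper than `threshold` metres (maximum depth over the 6 h run)."""
    depths = np.stack(depth_maps)          # shape (K, ny, nx)
    return 100.0 * (depths > threshold).mean(axis=0)
```

The returned grid is the tsunami arrival rate shown on the risk map: 100% means the cell is inundated in every representative scenario, 0% in none.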