In the present study, two types of CNNs were tested. First, the test using an ordinary CNN is described. A 32 × 32 image served as the input, and hypocentral parameters, including the latitude, longitude, depth, occurrence time, and magnitude, were estimated through regression. The network architecture was similar to that in previous studies (LeCun et al. 1999; Onishi and Sugiyama 2017). In the convolutional layer, a set of learnable filters was applied to the input image to extract characteristic features, while a pooling layer reduced the sensitivity to the location of those features. The fully connected layer performed the reasoning based on the output of the convolutional and pooling layers. These layers were connected sequentially to produce a CNN capable of associating an input image with a specific hypocenter parameter. TensorFlow was employed as the framework to build a LeNet-based CNN (LeCun et al. 1999).

The input image, created from the seismic wave propagation at the Earth’s surface, comprised 32 × 32 pixels. In the convolutional layer, convolutional filters were applied to the input image to generate an output image. The filters had dimensions of 3 × 3 × *n* or 7 × 7 × *n*, and the operation can be represented as follows, where nW = nH = 1 for 3 × 3 × *n* filters and nW = nH = 3 for 7 × 7 × *n* filters:

$$u_{{i,j,k}} = f\left( {b_{k} + \sum\nolimits_{{i^{\prime} = - {\text{nW}}}}^{{{\text{nW}}}} {\sum\nolimits_{{j^{\prime} = - {\text{nH}}}}^{{{\text{nH}}}} {\sum\nolimits_{{k^{\prime} = 1}}^{{{\text{nC}}}} {z_{{i + i^{\prime},j + j^{\prime},k^{\prime}}} \cdot w_{{i^{\prime},j^{\prime},k^{\prime},k}} } } } } \right),$$

(1)

where \(u_{{i,j,k}}\) and \(z_{{i,j,k}}\) are the (*i*, *j*)-th pixels of the *k*-th channel in the output and input images for the layer, respectively. The weights \(w_{{i^{\prime},j^{\prime},k^{\prime},k}}\) constitute a filter that was applied to the *k′*-th channel of the input image, \(b_{k}\) is the bias for the *k*-th output channel, and the upper limit nC of the channel sum equals *n*, the number of channels in the input layer. In this study, as only the vertical seismogram component was used, the number of channels (*n*) is 1. The activation function \(f\left( x \right)\) is the ReLU function, which is commonly used for regression analysis in neural networks and can be expressed as follows:

$$f\left( x \right) = \left\{ {\begin{array}{*{20}c} {0,} & {x \le 0} \\ {x,} & {x > 0} \\ \end{array} } \right..$$

(2)
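
Equations (1) and (2) can be illustrated with a minimal pure-Python sketch. The function names `relu` and `conv_valid`, the nested-list layout `z[i][j][k]`, and the toy input are assumptions made for illustration; this is not the TensorFlow implementation used in the study. In Eq. (1) the filter offsets run from −nW to nW; the code shifts them to start at 0, so that without zero padding each output pixel is computed only where the filter lies entirely inside the input image.

```python
def relu(x):
    """Eq. (2): pass positive values, clamp everything else to zero."""
    return x if x > 0 else 0.0


def conv_valid(z, w, b):
    """Eq. (1) without zero padding (illustrative, hypothetical helper).

    z: input image, indexed z[i][j][k']
    w: filters, indexed w[i'][j'][k'][k] with offsets shifted to start at 0
    b: biases, indexed b[k]
    """
    f_rows, f_cols = len(w), len(w[0])
    n_in, n_out = len(w[0][0]), len(b)
    out_rows = len(z) - f_rows + 1
    out_cols = len(z[0]) - f_cols + 1
    u = [[[0.0] * n_out for _ in range(out_cols)] for _ in range(out_rows)]
    for i in range(out_rows):
        for j in range(out_cols):
            for k in range(n_out):
                s = b[k]
                for di in range(f_rows):
                    for dj in range(f_cols):
                        for kp in range(n_in):
                            s += z[i + di][j + dj][kp] * w[di][dj][kp][k]
                u[i][j][k] = relu(s)
    return u


# Toy example: 4 x 4 single-channel image and one 3 x 3 averaging filter.
image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
kernel = [[[[1.0 / 9.0]] for _ in range(3)] for _ in range(3)]
feature_map = conv_valid(image, kernel, b=[-5.0])
print(len(feature_map), len(feature_map[0]))  # 2 2
print([[round(px[0], 3) for px in row] for row in feature_map])
```

The explicit loops mirror the triple sum in Eq. (1) term by term; a practical implementation would rely on the vectorized convolution provided by the framework.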

Zero padding was not utilized on the input image, and the output image size was reduced by four pixels in two directions. The pooling layer used max pooling with a 2 × 2 filter and a stride of 2, thereby downsampling the image from *h* × *h* × *n* to *h*/2 × *h*/2 × *n* through the following expression:

$$u_{{i,j,k}} = \max \left\{ {z_{{2i + i^{\prime},2j + j^{\prime},k}} \mid i^{\prime},j^{\prime} \in \{ 0,1\} } \right\}.$$

(3)
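
The pooling step in Eq. (3) can be sketched in a few lines of pure Python. The function name `max_pool_2x2` and the nested-list layout `z[i][j][k]` are assumptions for illustration, and the sketch assumes an even image size, as is the case for the sizes discussed here.

```python
def max_pool_2x2(z):
    """Eq. (3): 2 x 2 max pooling with a stride of 2 on z[i][j][k]."""
    out_rows, out_cols, n = len(z) // 2, len(z[0]) // 2, len(z[0][0])
    return [
        [
            [
                max(z[2 * i + di][2 * j + dj][k] for di in (0, 1) for dj in (0, 1))
                for k in range(n)
            ]
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


# Toy example: a 4 x 4 single-channel image is reduced to 2 x 2.
image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
pooled = max_pool_2x2(image)
print([[px[0] for px in row] for row in pooled])  # [[5.0, 7.0], [13.0, 15.0]]
```

Each output pixel keeps only the largest value in its 2 × 2 block, which is why a small shift of a feature within the block leaves the output unchanged.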

In the final fully connected layer, a 50% dropout was applied to avoid overfitting. Dropout mitigates overfitting by randomly deactivating units in a designated layer during each training step, so that the weights connected to the deactivated units are not updated in that step. Adam, a modified stochastic gradient descent method (Kingma and Ba 2015), served as the optimizer, and training proceeded with a batch size of 50 for 250 epochs. The root mean square error (RMSE) calculated for the training data was employed as the loss function.
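
As a rough illustration of the two ingredients named above, the following sketch implements dropout and the RMSE loss in pure Python. The function names, the fixed random seed, and the rescaling of the surviving activations by 1/(1 − rate) (the "inverted dropout" convention) are assumptions for illustration; the text does not state which variant was used in the study.

```python
import math
import random


def dropout(values, rate, rng, training=True):
    """Zero each value with probability `rate`; rescale survivors by 1/(1 - rate).

    The rescaling is an assumed convention, not confirmed by the source.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def rmse(predicted, target):
    """Root mean square error between two equal-length sequences."""
    squared = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(squared / len(target))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

With `training=False` the activations pass through unchanged, which reflects the usual practice of disabling dropout when the trained model is evaluated.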

Batch normalization (Ioffe and Szegedy 2015) was applied before calculating the activation function. This process standardizes the features by converting their mean and standard deviation within each selected batch to 0 and 1, respectively.

The mean and variance for a batch \(\mathscr{B} = \left\{ {\mathbf{u}_{1} , \ldots ,\mathbf{u}_{m} } \right\}\) were calculated as follows:

$$\mu _{\mathscr{B}} = \frac{1}{m} \sum \limits_{{i = 1}}^{m} \mathbf{u}_{i} ,$$

$$\sigma _{\mathscr{B}}^{2} = \frac{1}{m} \sum \limits_{{i = 1}}^{m} (\mathbf{u}_{i} - \mu _{\mathscr{B}} )^{2} .$$

(4)

The normalized data \({\mathbf{y}}_{i}\) can be obtained using the learned parameters \(\gamma \;{\text{and}}\;\beta\) from the following equations:

$$\begin{aligned} \widehat{{\mathbf{u}}}_{i} & = \frac{{{\mathbf{u}}_{i} - \mu _{\mathscr{B}} }}{{\sqrt {\sigma _{\mathscr{B}}^{2} + \varepsilon } }}, \\ {\mathbf{y}}_{i} & = \gamma \widehat{{\mathbf{u}}}_{i} + \beta , \\ \end{aligned}$$

(5)

where *ε* is a small constant added for numerical stability.
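
Equations (4) and (5) can be checked numerically with a short pure-Python sketch. The function name `batch_norm`, the default values of `gamma`, `beta`, and `eps`, and the use of scalar features (the equations apply element-wise to vectors) are assumptions for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Eqs. (4)-(5) for a batch of scalar features u_1, ..., u_m."""
    m = len(batch)
    mu = sum(batch) / m                           # batch mean, Eq. (4)
    var = sum((u - mu) ** 2 for u in batch) / m   # batch variance, Eq. (4)
    normalized = [(u - mu) / math.sqrt(var + eps) for u in batch]
    return [gamma * v + beta for v in normalized]  # scale and shift, Eq. (5)


batch = [2.0, 4.0, 6.0, 8.0]
y = batch_norm(batch)
mean_y = sum(y) / len(y)
std_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / len(y))
print(mean_y, std_y)
```

With `gamma = 1` and `beta = 0` the output has a mean of approximately 0 and a standard deviation slightly below 1 because of the stabilizing constant; the learned parameters allow the network to rescale and shift this standardized output during training.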

Bayesian optimization (Shahriari et al. 2015) was used to search for optimal hyperparameters of the proposed learning model, avoiding manual tuning by trial and error. A grid search was unsuitable because of the large number of combinations in the search space. Bayesian optimization is a technique for searching hyperparameters efficiently: it approximates the distribution of the loss as a function of the hyperparameters, assuming that this distribution is Gaussian, and uses the approximation to estimate the parameters at high resolution. Following the examination of each case, a highly accurate model was produced using few parameters. The loss function for the Bayesian optimization was the RMSE estimated for the validation data. The hyperparameter search covered the following ranges: the number of convolution layers varied between 1 and 5; the number of fully connected layers ranged between 0 and 3; the convolution filter size was between 3 and 9; the number of fully connected nodes varied between 50 and 1000; max pooling was applied to each convolution layer; the dropout rate was between 0 and 50%; and batch normalization was applied before the ReLU activation.
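
To give a sense of why an exhaustive grid search is costly, the sketch below encodes the ranges quoted above and counts the grid points. It is not Bayesian optimization itself. The dictionary keys and, in particular, the step sizes (integer filter sizes, nodes in steps of 50, dropout in steps of 10%) are assumptions for illustration, as is the simplification that one setting is shared by all layers.

```python
from math import prod

# Ranges are taken from the text; every step size below is an assumption.
search_space = {
    "conv_layers": list(range(1, 6)),          # 1 to 5
    "fc_layers": list(range(0, 4)),            # 0 to 3
    "filter_size": list(range(3, 10)),         # 3 to 9, assumed integer steps
    "fc_nodes": list(range(50, 1001, 50)),     # 50 to 1000, assumed step of 50
    "dropout": [d / 10 for d in range(0, 6)],  # 0 to 0.5, assumed step of 0.1
}

# Number of full training runs an exhaustive grid search would require.
grid_size = prod(len(values) for values in search_space.values())
print(grid_size)  # 16800
```

Even with these coarse assumed steps, a full grid would require thousands of training runs, and the count grows further if the filter size or node count is allowed to differ between layers.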