A convolutional neural network-based classification of local earthquakes and tectonic tremors in Sanriku-oki, Japan, using S-net data

Low-frequency tremors have been widely detected in many tectonic zones, and are often located adjacent to megathrust zones, indicating that their spatiotemporal evolution provides important insights into megathrust events. The envelope correlation method (ECM) is commonly used to detect tremors. However, the ECM also detects regular earthquakes, which requires the separation of these two signals after the initial detection. In addition, signals of tremors are weak, so classifying tremors from noises is also an essential problem. We develop a convolutional neural network (CNN)-based method using a single S-net station located off Sanriku region, Northeast Japan, to classify local earthquakes, tremors, and noise. Along the Japan Trench, especially in a region focused in this study, local earthquakes and tremors occurred in coexistence within a small region, so detection, location, and discrimination of these events are the key to understand the relationship between slow and regular earthquakes. The spectrograms of the three-component velocity waveforms that were recorded during 16 August 2016 to 14 August 2018 are used as the training and test datasets for the CNN. The CNN successfully classified 100%, 96%, and 98% of the earthquakes, tremors, and noise, respectively. We also showed a successful application of our method to continuous waveform data including a tremor to explore the feasibility of the proposed method in classifying tremors and noise in continuous streaming data. The output probabilities for the true classifications decrease with increasing epicentral distance and/or decreasing event magnitude. This highlights the need to train the CNN using tremors proximal to the seismic stations for detecting tremors using multiple stations.


Introduction
Slow earthquakes, which are characterized by a longer duration than regular earthquakes of the same magnitude, have been widely detected in many tectonic zones (e.g., Obara and Kato 2016). There have been increasing observational reports that slow earthquake activity has sometimes preceded megathrust earthquakes (Kato et al. 2012;Ito et al. 2013;Graham et al. 2014;Ruiz et al. 2014;Radiguet et al. 2016;Socquet et al. 2017;Voss et al. 2018). This activity has been detected adjacent to the coseismic source regions of some megathrusts, suggesting that slow earthquakes may potentially provide the necessary stress loading to trigger these megathrust ruptures. Therefore, the detection of slow earthquakes is one of the key criteria for advancing our understanding of tectonic processes in subduction zones.
Tectonic tremors are slow earthquakes with dominant frequencies of ~ 2-10 Hz. Tectonic tremors are often detected via the envelope correlation method (ECM) (e.g., Obara 2002) using inland and/or ocean-bottom seismometers. The ECM utilizes the similarity of the envelopes of the observed waveforms among different stations because it is usually difficult to identify clear Open Access *Correspondence: hidenobutkhs@gmail.com 1 Graduate School of Science, Tohoku University, 6-3, Aramaki-aza-aoba, Aoba-ku, Sendai 980-8578, Japan Full list of author information is available at the end of the article P-and S-wave onsets in the tremor signal. While the ECM is a powerful tool for tremor detection, it is also effective in detecting regular earthquake signals. Therefore, a visual inspection of the seismic waveforms is necessary to identify the tremor signals, which are then used to create tremor catalogs.
A machine learning approach is one way to reduce the manual cost of waveform inspections, especially for large seismic data volumes (e.g., Kong et al. 2019). Nakano et al. (2019) developed a supervised convolutional neural network (CNN)-based method to classify local earthquakes, tremors, and noise using the spectrograms obtained by the DONET ocean seismometers deployed in the Nankai subduction zone, Southwest Japan, as input images for the CNN, and achieved an event classification accuracy of 99.5%. The known effectiveness of CNNs in extracting features from input images (e.g., Krizhevsky et al. 2012;Szegedy et al. 2015;Goodfellow et al. 2016) allowed Nakano et al. (2019) to tailor a CNN for tremor detection because tectonic tremors have characteristic dominant frequencies of 2-10 Hz.
The S-net seismic network, a seafloor observation network that covers a wide area of the Japan Trench subduction zone, has recently been established in Northeast Japan ( Fig. 1a) (NIED 2019). Short-period velocity seismometers have been deployed at intervals of a few tens of kilometers in this network. Nishikawa et al. (2019) and Tanaka et al. (2019) have investigated local tremor activity using the S-net observations. While these previous studies detected tremors based on the ECM, we apply a CNN-based approach that was developed by Nakano et al. (2019) to the S-net velocity records. However, the S-net seismometers have a 15-Hz characteristic period, and are therefore less sensitive to the dominant frequencies of tremors (2-10 Hz) compared with the DONET seismometers (Nakano et al. 2019). Therefore, we modify the structures and retrain the parameters in the CNN developed by Nakano et al. (2019) for application to the S-net data. We attempt to classify the known local earthquakes, tremors, and noise via the CNN using a single station to investigate how our CNN-based method works, with the ultimate goal of eventually developing a comprehensive tremor-detection approach. We note that it is important to discriminate local earthquakes and tremors along the Japan Trench subduction zone, due to the coexistence of regular and slow earthquakes within a small region in spite of their different source process (Ide 2008). In contrast to regular earthquake signals, tremor signals are weak, and tremors could be easily mislabeled as noise in the condition of far-source station. Therefore, using a near-source station is a key to detect and classify tremors accurately.

Data
This study utilizes a CNN for classifying the spectrograms of local earthquakes, tectonic tremors, and noise, which are hereafter labeled EQ, T, and N, respectively, that were observed by S-net. We used the three-component velocity waveforms that were recorded at a single station, N.S4N21, at a sampling frequency of 100 Hz Fig. 1 a Map of the study area. The black crosses indicate S-net stations. The red and blue symbols are the tremor and earthquake epicenters determined by Nishikawa et al. (2019) and JMA, respectively. b Example local earthquake and tremor observations at station N.S4N21. The locations of the example events, Eq #1 and Tremor #1, are shown in a. The upper panel shows a 180-s seismogram record of each example event, with its corresponding spectrogram (normalized from 0 to 1) shown in the lower panel during the period from 16 August 2016 to 14 August 2018 . This station was selected based on its location above the plate boundary where tremors frequently occur (Fig. 1a). Training datasets with labeled EQ, T, and N events are necessary since CNNs take a supervised approach. We therefore used the Japan Meteorological Agency (JMA) unified catalog and the tremor catalog created by Nishikawa et al. (2019) for EQ and T event identification, respectively. We selected the events whose epicenters were close to station N.S4N21, which include those located within the area indicated by the rectangle in Fig. 1a. We selected reference time of each image based on the second and minute of the origin time for EQ and T, respectively. Here we used the minute for tremor instead of the second to extract whole excited parts of tremor signals because origin time of cataloged tremors do not necessarily start before excitation of signals unlike regular earthquakes. The training dataset of N was constructed by visually inspecting seismograms every 10 min for 1 day in each month (i.e., the 20th day of each month in this study) when seismograms do not contain transient signal of tremor or regular earthquakes.
For creating all images of EQ, T, and N, we selected 117.76-s time windows that started from each reference time defined above. Either the spectrograms or the power spectrum density (PSD) were then calculated for the waveforms using twenty 20.48-s moving time windows in the 2-10 Hz frequency range with a lag of 5.12 s. We corrected PSD based on the amplitude characteristic in the frequency domain by using a damping constant of 0.707 and the 15 Hz natural frequency. Each spectrogram was normalized to ensure that the input image pixels were in the 0-1 range. Each spectrogram was composed of 20 × 165 pixels (time domain × frequency domain). Figure 1b shows examples of earthquake and tremor waveforms with their corresponding spectrograms. The average of duration of T events in the catalog is about 44 s as shown by Nishikawa et al. (2019), and the window contains whole excited parts of tremor signals. We divided the data period into two datasets for the CNN analysis, with the 16 August 2016-2 December 2017 waveforms used for training and the 3 December 2017-14 August 2018 waveforms used for validation. Because images with low signal-to-noise ratios potentially cause mis-labeling in the training dataset, we selected qualified images in each category (EQ, T, and N) from the training dataset as follows. We first calculated the difference between the maximum and minimum log 10 (PSD) values for each spectrogram. We then selected the images that possessed higher absolute values than the third quartile values for the local earthquakes and tremors, and images with lower absolute values than the first quartile values for noise. Hereafter, we call these events selected events, and events that do not satisfy the abovementioned criteria non-selected events. We confirmed that without this selection step, our CNN resulted in a poor classification accuracy. After this data selection step (Table 1), we obtained 210, 531, and 468 training events that were labeled EQ, T, and N, respectively. The numbers of labeled EQ, T, and N events are 91, 208, and 118, respectively, for the validation dataset (Table 1). Each event is composed of three-component PSD images. Finally, we normalized log 10 (PSD) for each event image over the 0-1 range (see Fig. 1b for an example) to extract the features in the frequency domain.

Methods
CNNs (Fukushima 1980;LeCun et al. 1989) are a de facto standard for extracting non-linear features by learning a large amount of digital filters in combination with fully connected neural networks. CNNs employ these learnable filters to provide a function that maps an input image into an output vector. The inputs for our CNN are two-dimensional (2D) images of the three-component spectrograms, and the output is a three-dimensional vector containing the EQ, T, and N probabilities for the input data.
CNNs consist of a series of layers that sequentially process the input data as follows (Fig. 2): (a) convolve the input images using a set of learnable filters; (b) downsample the convolution outputs; and (c) apply a non-linear transformation known as the activation function to the downsampled outputs. The first step (a) is referred to as the convolution layer. Let {w t : t = 1, . . . , T } be an input waveform consisting of T samples, and let {h i : i = 1, . . . , F } be a learnable filter of length F. The convolution with this filter can be represented as follows: Here, each filter {h i : i = 1, . . . , F } is optimized during the training step (i.e., a learnable filter), and specific

Table 1 Matrix for the training and validation data of EQ-T-N events
We used 210, 531, and 468 images of EQ, T, and N events, respectively. We also used 91, 208, and 118 validation images with actual (true) EQ, T, and N labels, respectively Actual label features are identified in the original input images. The second step (b) is known as the pooling layer. The pooling layer drastically reduces the dimensionality (i.e., the number of parameters and computations in the network) to avoid overfitting due to excess parameters. This step also introduces both a local translation invariance and robustness to local perturbations (for details, see section 9.3 in Goodfellow et al. (2016) and references therein). We adopt max pooling, which is a pooling operation that calculates the maximum value for each point of each waveform that is convolved with a learnable filter. The max pooling operation with size P , on a waveform {w t : t = 1, . . . , T } returns: where we use zero padding ( w T +i = 0 ) for i = 1, . . . , P to ensure that the output dimension of the pooling operation is the same as its input dimension. Max pooling highlights the sharp contrasts in the convolved waveform. The third step (c) is the activation layer, which enables the network to capture complex features by making the output non-linear. We adopt a rectified linear unit (ReLU; x → max{x, 0} ) (Fukushima 1980) as the activation function, which is commonly used in CNN-based seismological research (e.g., Perol et al. 2018;Ross et al. 2018).
We specified the CNN hyperparameters in this study as follows. Twenty-five learnable filters were employed in the first and second convolution layers, with each possessing filter lengths of 10.24 s and 0.29 Hz in the time and frequency domains, respectively. These parameters were chosen through the performance comparison with the other hyperparameter values: we tested the number of learnable filters from 5 to 25, the length of filters in time domain from 10.24 s to 51.2 s, and the length of filters in frequency domain from 0.1 to 0.49 Hz. The max pooling lengths after the first and second convolution (2) max w t+m : m = 0, . . . , P − 1 : t = 1, . . . , T , layers were 25.6 s in the time domain. Max pooling was not conducted in the frequency domain following a previous study (Nakano et al. 2019). The number of units in the fully connected layer was 10. The batch size during the training step was 18. Training-learnable filters were applied by minimizing the cross-entropy loss via stochastic gradient descent optimization with momentum; a learning rate of 0.005 and momentum of 0.9 were employed during the training step. Cross-entropy loss is a function that maps a pair of vectors, the true label vector (y 2 , y 1 , y 0 ) (where y 2 = 1 for EQ and 0 otherwise; y 1 = 1 for T and 0 otherwise; and y 0 = 1 for N and 0 otherwise) and output probability vector (p 2 , p 1 , p 0 ) (with earthquake probability p 2 , tremor probability p 1 , and noise probability p 0 ) into −y 2 logp 2 − y 1 logp 1 − y 0 logp 0 . Stochastic gradient descent optimization (Robbins and Monro 1951) with momentum (Qian 1999) uses a linear combination of the gradient multiplied by the learning rate and the previous update multiplied by the momentum as the next update. An l 2 penalty with a regularization parameter of 1.0 was added to the cross-entropy loss (Ng 2004).

Results
We begin the analysis by applying our CNN-based method to the validation dataset. Table 1 provides the confusion matrix for the EQ-T-N classification. We determine the predicted labels for the input data via the largest output probabilities, whereby our method labels the input data based on the region in Fig. 3a, where the output probability of a given event is located. Our method successfully identified all of the EQ events (100%), and almost all of the T and N events (96% and 98%, respectively). Ternary plots of the output probabilities for each event type in the validation dataset are shown in Fig. 3bd, with the output probabilities for each event type being concentrated mainly in the corner corresponding to the actual label of the event. Note that if we do not use the Ternary plots for the EQ-T-N classification. Each cross mark represents the output probability for a given event. a Ternary plot highlighting the EQ-T-N classification regions. The blue, red, and gray regions correspond to the predicted EQ, T, and N labels, respectively. b Ternary plot for events with actual EQ labels. c Ternary plot for events with actual T labels. d Ternary plot for events with actual N labels. Events #1-#10 denote events for which the predicted labels are different from their actual labels (Table 1 and Additional file 1: Fig. S1). eX-component spectrograms and seismograms of the events with different actual and predicted labels training and validation datasets selected based on signalto-noise ratio, probabilities for each event type in the validation dataset drop to 98.6%, 80.9%, and 94.2% for EQ, T, N, respectively, which suggests selection procedure is essential to robust classification. We then confirm that all of the misclassified events were truly misclassified with a different label (Events #1-#10). This misclassification can be explained by the spectrogram features of these events, with the X spectrogram components of Events #1, #8, and #9 shown in Fig. 3e. The spectrograms of the other misclassified events are summarized in Additional file 1: Fig. S1. Our CNN labeled Event #1 as EQ event, which is labeled as a T event by Nishikawa et al. (2019). A comparison of the Event #1 spectrogram with the typical EQ and T spectrograms in Fig. 1b highlights that the Event #1 spectrogram has a peak in the relatively high-frequency range and possesses a similar spectrogram signal to that of an EQ event. The Event #1 waveforms possess a P-wave onset at station N.S4N21 and S-wave onsets at neighboring stations, thereby implying that Event #1 is an EQ event (Additional file 1: Fig. S2a). Our CNN labeled Event #8 as N, whereas the event #8 was detected as T event (Nishikawa et al. 2019). As shown in Fig. 3e, the Event #8 spectrogram does not contain signals with characteristic dominant frequencies in the 2-10 Hz range, but rather has a spectral peak at around 2-3 Hz. A possible reason for this misclassification is a large geometrical spreading due to the long epicentral distance, about 60 km. The tremor waveforms with gradual rise time were observed at the nearest station, N.S4N24, from the source (Additional file 1: Fig. S2b). Our CNN labeled Event #9 as a T event, whereas we labeled the event as N by visual inspection. The Event #9 spectrogram displays a signal with dominant frequencies in the 2-10 Hz range in the analyzed time window, which caused our CNN to give this event a T classification.
Next, we investigated the relationships between the probabilities of the CNN-based classification on the true event labels and the epicentral distances, and the probabilities and M w . Figure 4 shows the M w -epicentral distance plot for the local earthquakes and tremors, where their output probabilities are color-coded. The moment magnitudes of EQ are within the range of 0.7 to 6.4, while those of T are within the range of − 1.33 to 0.35. Remark that as denoted in Fig. 2, our CNN-based classification yields the probabilities of (EQ, T, N) for each event, so, for each event, we picked a probability of the true event label from its output probabilities to draw Fig. 4. The presented images were selected via the procedure outlined in the "Data" section, and yielded high (> 0.5) output probabilities of the true event labels, as confirmed in the "Results" section. However, the number of local earthquakes with low probabilities (< 0.5) appears to increase in the non-selected dataset for events with larger epicentral distances and smaller magnitudes (blue histogram in Fig. 4a). The number of tremors with low probabilities (< 0.5) also appears to increase in the nonselected dataset with larger epicentral distances (blue histogram in Fig. 4b). We do not investigate the relationship between M w of tremors and the CNN probabilities of them in detail because the omission of the amplification factor in Nishikawa et al. (2019) from the tremor M w estimations may have resulted in large uncertainties.
We used the signal-to-noise ratio as a threshold for selecting the training data. The training dataset includes many events with shorter epicentral distances since these events generally possess higher signal-to-noise ratios. Therefore, there was a higher probability of classifying the events in the validation dataset that possessed shorter epicentral distances into their true event labels (Fig. 4).

Discussion
In this section, we discuss three topics: the application to continuous data, the impact of noise on classification results, and the evaluation of our methods compared with other classifier methods.
We first discuss the application of our method to continuous data. We applied our method to 5-min continuous seismograms containing a tremor listed in Nishikawa et al. (2019). We successively created images with 5.12 s lag-time, and applied our trained CNN to each image to obtain the probabilities for EQ, T, and N. Figure 5 shows continuous spectrograms and probability for EQ, T, and N, respectively. From the beginning of observation at 13:48:00 (JST) to 13:48:50, we cannot see any clear signal in each image with about 2-min time-window. The probability of N is dominant in this period. At the beginning of the increase of probability of T around 13:48:50, the corresponding image begins to contain the excitation of tremor below 5 Hz. The high probability of T (> 0.9) continues for 1.5 min when the images include tremor signals. At the end of the high probability of T around 13:50:30, the corresponding image contains the tremor signal at the edge of the time-window. Tremor catalog in Nishikawa et al. (2019) determined origin time of tremor (13:50:37.8) based on the S-wave arrival time as maximum stacked envelopes after correcting the sourceto-station travel times. This origin time is included in the period when the probability of T is high. These results show that the CNN-based methods successfully identified tremors and noise just around observed tremor, suggesting the possibility of applying CNN-based methods in continuous dataset in the future (e.g., simultaneous occurrences, noise we cannot explain by oceanic or cultural noise). Takahashi et al. Earth, Planets and Space (2021) . 4 a M w -epicentral distance plot for the local earthquakes, which are color-coded by their output probabilities, in the selected (training) and non-selected datasets. The black line shows the training dataset for the local earthquakes. b M w -epicentral distance plot for the tectonic tremors, which are color-coded by their output probabilities, in the selected and non-selected datasets. The black, red, and blue histograms indicate the training dataset, high-probability (≥ 0.5) events, and low-probability (< 0.5) events, respectively We next discuss impact of noise on classification results. We used input noise dataset for about 1 year, and as a result we successfully identified signals from noise with very high probabilities. However, as suggested by Takagi et al. (2020), the noise level shows changes of PSDs on the order of 2 above 1 Hz for 3 years along the Japan Trench at each S-net station. To be a more reliable classification, we should consider various noise patterns during observation periods. To do so, we need to appropriately select training and validation noise dataset to create homogeneous dataset by investigating temporal changes of noise characteristics in detail.
We finally discuss the evaluation of our methods compared with other classifier methods. Single station detection and classification methods of tectonic tremors have been previously proposed. Brudzinski and Allen (2007), Kao et al. (2007), and Sit et al. (2012) mainly focused on the dominant frequency of tremor (1-5 Hz) and successfully detected tremors in Cascadia using seismic records observed at a single station. These methods were only applied to hour-long tremor episodes. However, along the Japan Trench, tremors are detected with the duration of dozens of seconds, and sometimes coexist with regular earthquakes. Therefore, we should focus on not only the dominant frequency band of tremor, but also time-length used by classification. Our CNN-based method can classify seismograms at least about 120-s segments, suggesting the high temporal resolution of classification. Recently, Liu et al. (2019) successfully distinguished tremors from local earthquakes and noise with high accuracy of 86.6% to 98.9% from three stations using a machine learning approach, k-nearest neighbors classifier in Taiwan. In the methods, 27 seismic features were calculated. Then they suggested efficient features for a better classification between tremor and noise, that is, maximum amplitude, number of peaks, and energy of the 2 to 8 Hz-filtered signals and number of peaks in the curve showing the temporal evolution of the discrete Fourier transforms median. We can use these features directly as the input of CNN methods, which may lead to better classification.

Conclusion
We developed a CNN to classify local earthquakes, tremors, and noise that were observed by S-net in Northeast Japan. The CNN successfully classified 100%, 96%, and 98% of the local earthquakes, tectonic tremors, and noise, respectively. All of the misclassified events were thoroughly investigated to validate that our CNN-based approach yielded explainable classifications. We also showed an example of successful classification of continuous waveforms including tremor. The classification abilities for local earthquakes and tectonic tremors degraded as the epicentral distance increased and/or event magnitude decreased. The utilization of multiple stations appears to be a promising approach to circumvent this degradation, as the subsequent analysis would include a broad spatial distribution of earthquakes and tremors with various epicentral distances.