### The shift-invariant sparse coding model

In the traditional model of sparse representation (Mallat and Zhang 1993), a signal or image is represented as a linear combination of atoms from a redundant dictionary. As a consequence, similar or identical feature structures appearing at different locations in a time series require multiple atoms to represent. In addition, the redundant dictionary in traditional sparse representation is predefined manually, and such a predefined dictionary is not flexible enough for complex signals (Jafari and Plumbley 2011; Chen 2017).

Shift-invariant sparse coding (SISC) is a data-driven machine learning algorithm. It employs convolution as a shift operator to achieve shift invariance, representing the signal as a convolution of the dictionary with the coefficients. This allows a feature atom to be translated, flipped, and scaled anywhere in the time series, so a single atom can conveniently represent multiple occurrences of a feature at different locations. Moreover, the redundant dictionary in SISC is learned adaptively from the raw time series. In other words, SISC learns the underlying structure of the signal autonomously and is effective for all kinds of morphological features, which makes it highly suitable for processing complex time series.

For the discrete signal set \(Y = \left[ {{\mathbf{y}}_{1} ,{\mathbf{y}}_{2} , \ldots ,{\mathbf{y}}_{K} } \right]^{T}\), shift-invariant sparse coding expresses **y**_{k} as the sum of convolutions of the atoms **d**_{m} with the sparse coding coefficients **s**_{m,k}:

$${\mathbf{y}}_{k} = \sum\limits_{m = 1}^{M} {{\mathbf{d}}_{m} * {\mathbf{s}}_{m,k} } + {\boldsymbol{\varepsilon}} ,$$

(1)

where \({\mathbf{y}}_{k} = \left[ {y_{1} ,y_{2} , \ldots ,y_{N} } \right]^{T}\) is a time-series segment with *N* sampling points. \(D = \left[ {{\mathbf{d}}_{1} ,{\mathbf{d}}_{2} , \ldots ,{\mathbf{d}}_{M} } \right] \in R^{Q \times M}\) is the so-called redundant (or over-complete) dictionary, in which **d**_{m} is an atom and the number of atoms *M* is larger than *N*; in most cases *M* is much larger than *N*. In other words, the signal **y**_{k} can be represented with the dictionary **D** in many ways, which is why **D** is called a redundant dictionary. \({\boldsymbol{\varepsilon}}\) stands for Gaussian white noise, and "*" denotes the operation of convolution. \({\mathbf{s}}_{m,k} \in R^{P}\) is sparse (most of its elements are zero). *Q* < *N*, *P* < *N*, and *Q* + *P* − 1 = *N*.
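The dimensions in Eq. (1) can be checked with a minimal numeric sketch; the sizes N, Q, M below are our own toy choices, not values from the paper. A segment **y**_{k} of length N is the sum of full convolutions of short atoms **d**_{m} (length Q) with sparse coefficient vectors **s**_{m,k} (length P), where Q + P − 1 = N:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q, M = 64, 16, 4
P = N - Q + 1                                    # full convolution: Q + P - 1 = N

D = rng.standard_normal((M, Q))
D /= np.linalg.norm(D, axis=1, keepdims=True)    # unit-norm atoms, ||d_m||_2 = 1

S = np.zeros((M, P))                             # sparse codes: few nonzeros per atom
for m in range(M):
    S[m, rng.choice(P, size=2, replace=False)] = rng.standard_normal(2)

# Noise-free part of Eq. (1): y_k = sum_m d_m * s_{m,k}
y_k = sum(np.convolve(D[m], S[m]) for m in range(M))
```

Each nonzero entry of `S[m]` places a copy of atom `D[m]` at that shift, which is how one atom represents the same feature at several locations.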

As shown in Eq. (1), both the atoms **d**_{m} and the coefficients **s**_{m,k} of the SISC model are unknown. If **d**_{m} and **s**_{m,k} are sought simultaneously, the optimization problem is non-convex and a stable solution is difficult to obtain. Many scholars have addressed this by turning it into a pair of convex optimization problems in which the atoms **d**_{m} and the coefficients **s**_{m,k} are updated alternately (Smith and Lewicki 2006; Plumbley et al. 2006; Aharon et al. 2006): when **d**_{m} is known, **s**_{m,k} can be obtained by convex optimization; correspondingly, when **s**_{m,k} is held constant, **d**_{m} can be solved by convex optimization. Sparsity is the common goal of the two subproblems, and the cost function for evaluating the sparse representation of **y**_{k} is (Liu et al. 2011; Wang et al. 2015):

$$\psi (\theta ) = \mathop {\min }\limits_{{{\mathbf{d}},{\mathbf{s}}}} \sum\limits_{k = 1}^{K} {\left\| {{\mathbf{y}}_{k} - \sum\limits_{m = 1}^{M} {{\mathbf{d}}_{m} * {\mathbf{s}}_{m,k} } } \right\|}_{2}^{2} + \beta \cdot \sum\limits_{m,k} {\left\| {{\mathbf{s}}_{m,k} } \right\|_{1} } ,$$

(2)

where \(\left\| \cdot \right\|_{p}\) denotes the \(\ell_{p}\) norm, and β is a constraint parameter used to balance the reconstruction error against the sparsity. The learned atoms **d**_{m} usually need to be normalized, i.e., \(\left\| {{\mathbf{d}}_{m} } \right\|_{2}^{2} = 1.\)
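The objective in Eq. (2) can be evaluated directly; the sketch below is our own illustration (the function name, argument shapes `D` as (M, Q), `S` as (K, M, P), `Y` as (K, N), and `beta` are assumptions, not the paper's code):

```python
import numpy as np

def sisc_cost(Y, D, S, beta):
    """Reconstruction error plus l1 sparsity penalty, as in Eq. (2)."""
    cost = 0.0
    for k in range(Y.shape[0]):
        # Reconstruction of the k-th segment: sum_m d_m * s_{m,k}
        recon = sum(np.convolve(D[m], S[k, m]) for m in range(D.shape[0]))
        cost += np.sum((Y[k] - recon) ** 2)        # ||y_k - sum d_m * s_{m,k}||_2^2
    cost += beta * np.sum(np.abs(S))               # beta * sum_{m,k} ||s_{m,k}||_1
    return cost
```

A larger β drives the codes toward sparser solutions at the expense of reconstruction fidelity; when the reconstruction is exact, only the ℓ1 term remains.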

### Solving the sparse coding coefficients

Keeping the dictionary unchanged, the sparse representation coefficients can be obtained by the matching pursuit (MP) algorithm (Mallat and Zhang 1993) or the orthogonal matching pursuit (OMP) algorithm (Pati et al. 1993). OMP improves on MP by adding an orthogonalization step. We use OMP to solve for the coding coefficients because it has better convergence behavior.

Suppose **y**_{k} is the signal to be represented and **g**_{i,u} is the atom obtained by shifting the *i*th learned feature structure by *u* points; its length is the same as that of **y**_{k}, and || **g**_{i,u} || = 1. *L* denotes the iteration index, **r**^{L} the residual after the *L*th iteration, and **ψ**_{L} the selected set of atoms after the *L*th iteration. The steps of the OMP algorithm are as follows:

1. Initialization: **r**^{0} = **y**_{k}, **ψ**_{0} = ∅, *L* = 1;

2. Pursue the atom **g**_{i,u} that satisfies the following equation:

$$\left| {\left\langle {{\mathbf{r}}^{L - 1} ,{\mathbf{g}}_{i,u}^{L} } \right\rangle } \right| = \mathop {\sup }\limits_{1 \le i \le M} \left( {\mathop {\sup }\limits_{0 \le u \le P} \left| {\left\langle {{\mathbf{r}}^{L - 1} ,{\mathbf{g}}_{i,u}^{L} } \right\rangle } \right|} \right);$$

(3)

3. Update the selected set of atoms, \({\boldsymbol{\psi}}_{L} = {\boldsymbol{\psi}}_{L - 1} \cup \{ {\mathbf{g}}_{i,u}^{L} \}\);

4. Calculate the projection coefficients by the least squares method, \({\mathbf{s}}_{L} = (\Psi _{L}^{T}\Psi _{L} )^{ - 1}\Psi _{L}^{T} {\mathbf{y}}_{k}\); then compute the residual \({\mathbf{r}}^{L} = {\mathbf{y}}_{k} -\Psi _{L} {\mathbf{s}}_{L}\) and the reconstructed signal \(\hat{\mathbf{y}}_{k} =\Psi _{L} {\mathbf{s}}_{L}\);

5. Set *L* = *L* + 1 and return to step 2 if *L* is smaller than the maximum number of iterations; otherwise, output the current reconstructed signal and the corresponding residual.
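The five steps above can be sketched as follows. Here `G` collects every shifted atom **g**_{i,u} as a unit-norm column (zero-padded to length N), and the names `omp_shift` and `L_max` are our own, not from the paper:

```python
import numpy as np

def omp_shift(y, G, L_max):
    """OMP over a bank of shifted atoms (columns of G); returns (y_hat, r)."""
    r = y.copy()                                   # step 1: r^0 = y_k
    selected = []                                  # psi_0 = empty set
    for _ in range(L_max):
        j = int(np.argmax(np.abs(G.T @ r)))        # step 2: best-correlated atom
        selected.append(j)                         # step 3: psi_L = psi_{L-1} U {g}
        Psi = G[:, selected]
        # step 4: least-squares projection onto the selected atoms
        s, *_ = np.linalg.lstsq(Psi, y, rcond=None)
        r = y - Psi @ s                            # residual after this iteration
    return Psi @ s, r                              # step 5: reconstruction, residual
```

The orthogonalization lives in the `lstsq` call: unlike plain MP, all coefficients of the selected atoms are refit jointly at every iteration, so the residual stays orthogonal to the span of **ψ**_{L}.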

### Dictionary learning

Making full use of the idea of the K-SVD algorithm (Aharon et al. 2006), Wang et al. (2015) updated the atoms one by one rather than all at once. They termed the new method ISISC and showed that ISISC is superior to the gradient-based SISC (GSISC) in both accuracy and efficiency. In the dictionary-learning stage, the atoms are updated while the coding coefficients remain unchanged, so the optimization function can be simplified as:

$$\begin{aligned} \bar{\psi }\left( \theta \right) & = \mathop {\min }\limits_{\mathbf{d}} \sum\limits_{k = 1}^{K} {\left\| {{\mathbf{y}}_{k} - \sum\limits_{m = 1}^{M} {{\mathbf{d}}_{m} *{\mathbf{s}}_{m,k} } } \right\|}_{2}^{2} \\ & = \mathop {\min }\limits_{\mathbf{d}} \sum\limits_{k = 1}^{K} {\left\| {\left( {{\mathbf{y}}_{k} - \sum\limits_{m \ne i}^{M} {{\mathbf{d}}_{m} *{\mathbf{s}}_{m,k} } } \right) - {\mathbf{d}}_{i} *{\mathbf{s}}_{i,k} } \right\|}_{2}^{2} \\ & = \mathop {\min }\limits_{\mathbf{d}} \sum\limits_{k = 1}^{K} {\left\| {{\mathbf{E}}_{i,k} - {\mathbf{d}}_{i} *{\mathbf{s}}_{i,k} } \right\|}_{2}^{2} , \\ \end{aligned}$$

(4)

where **E**_{i,k} represents the recovery error of all the atoms except the *i*th with respect to the *k*th signal. The update of the *i*th atom can thus be translated into solving an equation for **d**_{i}. Since \({\mathbf{d}}_{i} * {\mathbf{s}}_{i,k} = {\mathbf{s}}_{i,k} * {\mathbf{d}}_{i}\), when only the *k*th signal is taken into consideration, the optimization of Eq. (4) is equivalent to solving the following equation (Zhu et al. 2016):

$$\left[ {\begin{array}{*{20}c} {s_{i,k}^{1} } & {} & {} & {} \\ {s_{i,k}^{2} } & {s_{i,k}^{1} } & {} & {} \\ \vdots & {s_{i,k}^{2} } & \ddots & {} \\ {s_{i,k}^{P} } & \vdots & \ddots & {s_{i,k}^{1} } \\ {} & {s_{i,k}^{P} } & \ddots & {s_{i,k}^{2} } \\ {} & {} & \ddots & \vdots \\ {} & {} & {} & {s_{i,k}^{P} } \\ \end{array} } \right] \cdot \left[ {\begin{array}{*{20}c} {d_{i}^{1} } \\ {d_{i}^{2} } \\ \vdots \\ {d_{i}^{Q} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {E_{i,k}^{1} } \\ {E_{i,k}^{2} } \\ \vdots \\ {E_{i,k}^{N} } \\ \end{array} } \right] .$$

(5)

Considering the matrix on the left side of Eq. (5) as a special Toeplitz matrix of the coefficients \({\mathbf{s}}_{i,k}\), the above equation can be written as \({\text{Toep}}({\mathbf{s}}_{i,k} ) \cdot {\mathbf{d}}_{i} = {\mathbf{E}}_{i,k}\). Since the coefficient vector \({\mathbf{s}}_{i,k}\) is sparse, many of the row vectors in \({\text{Toep}}({\mathbf{s}}_{i,k} )\) are zero vectors, and these have no effect on the result. Removing these zero rows from \({\text{Toep}}({\mathbf{s}}_{i,k} )\) and the corresponding rows from **E**_{i,k}, Eq. (5) can be written as \({\text{Toep}}(\tilde{\mathbf{s}}_{i,k} ) \cdot {\mathbf{d}}_{i} = \widetilde{{\mathbf{E}}}_{i,k}\). When all *K* signals are taken into consideration, the optimization function can be expressed as:

$$\left[ {\begin{array}{*{20}c} {{\text{Toep}}(\tilde{\mathbf{s}}_{i,1} )} \\ {{\text{Toep}}(\tilde{\mathbf{s}}_{i,2} )} \\ \vdots \\ {{\text{Toep}}(\tilde{\mathbf{s}}_{i,K} )} \\ \end{array} } \right] \cdot {\mathbf{d}}_{i} = \left[ {\begin{array}{*{20}c} {\tilde{\mathbf{E}}_{i,1} } \\ {\tilde{\mathbf{E}}_{i,2} } \\ \vdots \\ {\tilde{\mathbf{E}}_{i,K} } \\ \end{array} } \right].$$

(6)

Simplifying Eq. (6) to **S**∙**d**_{i} = **E**, the feature atom follows from the least squares method as \({\mathbf{d}}_{i} = ({\mathbf{S}}^{T} {\mathbf{S}})^{ - 1} ({\mathbf{S}}^{T} {\mathbf{E}})\). Since \(({\mathbf{S}}^{T} {\mathbf{S}}) \in {\text{R}}^{Q \times Q}\) and in most cases *Q* ≪ *N*, the above equation reduces to a small-scale linear system whose solution can be obtained directly by Cholesky decomposition, LU decomposition, or other methods. The atoms are updated one at a time in random order until all the feature structures, i.e., feature atoms, are obtained. The learned dictionary is obtained by normalizing all the feature atoms, i.e., \({\mathbf{d}}_{i} = {\mathbf{d}}_{i} /||{\mathbf{d}}_{i} ||_{2}\).
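The atom update behind Eqs. (5) and (6) can be sketched as follows; the function names are illustrative, and for clarity we keep the full Toeplitz blocks rather than pruning their zero rows (the pruning only reduces cost, not the least-squares solution):

```python
import numpy as np

def conv_matrix(s, Q):
    """N x Q matrix T with T @ d == np.convolve(s, d), N = len(s) + Q - 1."""
    P = len(s)
    T = np.zeros((P + Q - 1, Q))
    for q in range(Q):
        T[q:q + P, q] = s          # each column is s shifted down by q samples
    return T

def update_atom(S_i, E_i, Q):
    """S_i: coefficient vectors s_{i,k}; E_i: error vectors E_{i,k}."""
    S_stack = np.vstack([conv_matrix(s, Q) for s in S_i])   # left side of Eq. (6)
    E_stack = np.concatenate(E_i)                           # stacked errors
    # Least-squares solution d_i = (S^T S)^{-1} S^T E
    d_i, *_ = np.linalg.lstsq(S_stack, E_stack, rcond=None)
    return d_i / np.linalg.norm(d_i)                        # unit-norm atom
```

Because the normal-equations matrix is only Q × Q, each atom update stays cheap even when the number of segments K is large.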

### The flow diagram of data processing

Here, we give the flow diagram of data processing for MT data:

Input: raw MT data **Y**.

Initialization: assign random initial values to the dictionary **D** and the sparse representation coefficients **S**; set *z* = 0.

Repeat the following:

{

*z* = *z* + 1;

Solve the sparse coding coefficients;

Update the dictionary;

}

Until *z* reaches the maximum number of iterations.

Output: the learned dictionary **D**, sparse coding coefficients **S** and the reconstructed signal \(\bar{Y}.\)

The reconstructed signals are components with abnormally large amplitudes or obvious regularity; according to the characteristics of natural magnetotelluric signals, such components are noise. Therefore, the de-noised MT data can be obtained by subtracting the reconstructed signal from the original signal.
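The overall flow can be sketched as a single driver function; `solve_coefficients`, `update_dictionary`, and `reconstruct` are placeholders standing in for the OMP and Toeplitz least-squares stages described above, not the paper's actual implementation:

```python
import numpy as np

def denoise_mt(Y, D0, z_max, solve_coefficients, update_dictionary, reconstruct):
    """Alternate coding and dictionary updates, then subtract the model."""
    D = D0.copy()
    S = None
    for _ in range(z_max):                 # z = 1 .. z_max
        S = solve_coefficients(Y, D)       # "Solve the sparse coding coefficients"
        D = update_dictionary(Y, S, D)     # "Update the dictionary"
    Y_rec = reconstruct(D, S)              # reconstructed (noise) components
    return Y - Y_rec                       # de-noised MT data
```

The final subtraction reflects the design choice above: SISC models the large-amplitude, regular components, so the residual is taken as the natural MT signal.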