
Operational solar flare prediction model using Deep Flare Net


We developed an operational solar flare prediction model using deep neural networks, named Deep Flare Net (DeFN). DeFN can issue probabilistic forecasts of solar flares in two categories (≥ M-class and < M-class events, or ≥ C-class and < C-class events) occurring in the 24 h after an observation, as well as the maximum class of flares expected in that period. DeFN is set to run every 6 h and has been in operation since January 2019. The input database of solar observation images taken by the Solar Dynamics Observatory (SDO) is downloaded from the data archive operated by the Joint Science Operations Center (JSOC) of Stanford University. Active regions are automatically detected from magnetograms, and 79 features are extracted from each region nearly in real time using multiwavelength observation data. Flare labels are attached to the feature database, which is then standardized and input into DeFN for prediction. DeFN was pretrained using the datasets obtained from 2010 to 2015. The model was evaluated with the true skill statistic (TSS) and achieved predictions with TSS = 0.80 for ≥ M-class flares and TSS = 0.63 for ≥ C-class flares. For comparison, we evaluated the operational forecast results from January 2019 to June 2020. We found that operational DeFN forecasts achieved TSS = 0.70 (0.83) for ≥ C-class flares with a probability threshold of 50 (40)%, although there were very few M-class flares during this period and we should continue monitoring the results over a longer time. Here, we adopted a chronological split to divide the database into two parts for training and testing. The chronological split appears suitable for evaluating operational models. Furthermore, we proposed the use of time-series cross-validation. This procedure achieved TSS = 0.70 for ≥ M-class flares and 0.59 for ≥ C-class flares using the datasets obtained from 2010 to 2017.
Finally, we discuss the standard evaluation methods for operational forecasting models, such as the preparation of observation, training, and testing datasets, and the selection of verification metrics.


The mechanism of solar flares is a long-standing puzzle. Solar flares emit X-rays, highly energetic particles, and coronal mass ejections (CMEs) into interplanetary space in the heliosphere, making flares one of the origins of space weather phenomena (e.g., Schwenn et al. 2005; Fletcher 2011; Liu et al. 2014; Möstl 2015). The prediction of flares is essential for reducing the damage to technological infrastructure on Earth. Solar flares are triggered by newly emerging magnetic flux or magnetohydrodynamic instability, releasing excess magnetic energy stored in the solar atmosphere (e.g., Aulanier et al. 2010; Shibata and Magara 2011; Cheung and Isobe 2014; Wang et al. 2017; Toriumi and Wang 2019). Such phenomena are monitored by the Solar Dynamics Observatory (SDO; Pesnell et al. 2012) and the Geostationary Operational Environmental Satellite (GOES), and the observation data are used for the prediction of flares.

Currently, flare prediction is tackled by the following four approaches: (i) empirical human forecasting (Crown 2012; Devos et al. 2014; Kubo et al. 2017; Murray et al. 2017), (ii) statistical prediction methods (Lee et al. 2012; Bloomfield et al. 2012; McCloskey et al. 2016; Leka et al. 2018), (iii) machine learning methods (e.g., Bobra and Couvidat 2015; Muranushi et al. 2015; Nishizuka et al. 2017, and references therein), and (iv) numerical simulations based on physics equations (e.g., Kusano et al. 2012, 2020; Inoue et al. 2018; Korsós et al. 2020). Some of the models have been made available for community use at the Community Coordinated Modeling Center (CCMC) of NASA (e.g., Gallagher et al. 2002; Shih and Kowalsky 2003; Colak and Qahwaji 2008, 2009; Krista and Gallagher 2009; Steward et al. 2011; Falconer et al. 2011, 2012). Demonstrating the robust performance of each model is useful, and in benchmark workshops, prediction models have been evaluated for comparison, including methods that incorporate machine learning algorithms as part of their systems (Barnes et al. 2016; Leka 2019; Park 2020).

Recently, the application of supervised machine learning methods, especially deep neural networks (DNNs), to solar flare prediction has been a hot topic, and successful research applications have been reported (Huang et al. 2018; Nishizuka et al. 2018; Park et al. 2018; Chen et al. 2019; Domijan et al. 2019; Liu et al. 2019; Zheng et al. 2019; Bhattacharjee et al. 2020; Jiao et al. 2020; Li et al. 2020; Panos and Kleint 2020; Yi et al. 2020). However, there is insufficient discussion on how to make these methods available for real-time operations in space weather forecasting offices, including methods for the validation and verification of the models. Currently, new physical and geometrical (topological) features are being applied to flare prediction using machine learning (e.g., Wang et al. 2020a; Deshmukh et al. 2020), and it has been noted that training sets may be sensitive to which period in the solar cycle they are drawn from (Wang et al. 2020b).

It has been one year since we started operating our flare prediction model using DNNs, which we named Deep Flare Net (DeFN; Nishizuka et al. 2018). Here, we evaluate the prediction results during real-time operations at the NICT space weather forecasting office in Tokyo, Japan. In this paper, we introduce the operational version of DeFN in the “Flare forecasting tool in real-time operation” and “Operation forecasts using DeFN” sections. We show the prediction results and propose the use of time-series cross-validation (CV) to evaluate operational models in the “Forecast results and evaluation” section. We summarize our results and discuss the selection of a suitable evaluation method for models used in operational settings in the “Summary and discussion” section.

Flare forecasting tool in real-time operation

Procedures of operational DeFN

DeFN is designed to predict solar flares occurring in the 24 h following the observation of magnetogram images, in two categories: (≥ M-class and < M-class) or (≥ C-class and < C-class). In the operational DeFN forecasting system, observation images are automatically downloaded, active regions (ARs) are detected, and 79 physics-based features are extracted for each region. Each feature is standardized by its mean value and standard deviation and is input into the DNN model, DeFN. The output is the flare occurrence probability for each of the two categories. Finally, the maximum class of flares occurring in the following 24 h is forecast by taking the maximum probability of the forecasts.
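The standardization step above can be sketched as follows. This is a minimal illustration with two toy features rather than the actual 79-feature DeFN pipeline; the key point is that the mean and standard deviation computed from the training set must be reused at operation time:

```python
import numpy as np

def standardize(features, mean=None, std=None):
    """Standardize each feature column; if no statistics are given,
    compute them from the input (training-time behavior)."""
    if mean is None:
        mean = features.mean(axis=0)
    if std is None:
        std = features.std(axis=0)
    return (features - mean) / std, mean, std

# Training time: compute the statistics from the training set
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
train_std, mu, sigma = standardize(train)

# Operation time: reuse the stored statistics for new AR features
new_ar = np.array([[3.0, 30.0]])
new_std, _, _ = standardize(new_ar, mean=mu, std=sigma)
```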

Operational DeFN was redesigned for automated real-time forecasting with operational redundancy. All the programs, written in IDL and Python, are driven by cron scripts at the scheduled forecast issuance times. There are a few differences from the original DeFN used for research, as explained in the next subsection. A generalized flowchart of operational DeFN is shown in Fig. 1.

Fig. 1

Flowchart of operational DeFN. It is executed using the NRT data and the pretrained model. The best-performing model among the models trained using the 2010–2015 datasets is chosen as the pretrained model

NRT observation data

The first difference between the development and operational versions of DeFN is the use of near real-time (NRT) observation data. We use line-of-sight magnetograms and vector magnetograms taken by the Helioseismic and Magnetic Imager (HMI; Scherrer 2012; Schou 2012; Hoeksema et al. 2014) on board SDO, ultraviolet (UV) and extreme ultraviolet (EUV) images obtained by the Atmospheric Imaging Assembly (AIA; Lemen 2012) through the 1600 Å and 131 Å filters, and the full-disk integrated X-ray emission over the range of 1–8 Å observed by GOES. For visualization, we also use white light images taken by HMI and EUV images obtained by AIA through the 304 Å and 193 Å filters. The time cadence of the vector magnetograms is 12 min, that of the line-of-sight magnetograms is 45 s, those of the 1600 Å and 131 Å filters are both 12 s, and that of GOES is less than 1 min.

The SDO data products are provided by the Joint Science Operations Center (JSOC) of Stanford University. The HMI NRT data are generally processed and available for transfer within 90 min after observation (Leka et al. 2018). This is why DeFN was designed to download the observation dataset from 1 h earlier. If the observation data are unavailable because of processing or transfer delays, the download target is moved back in time, from 1 to 5 h earlier, in the operational DeFN system. When no data can be found beyond 5 h earlier, the data are considered missing. The 5 h limit was determined by trial and error. Forecasting takes 20–40 min per prediction; thus, it is reasonable to forecast as often as once per hour. The 1 h cadence is comparable to the time scale of the evolution of the magnetic field configuration in active regions due to flux emergence or changes before and after a flare. However, DeFN started operating in the minimum phase of solar activity, so we began forecasting with a 6 h cadence instead of a 1 h cadence.
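The step-back logic for delayed NRT data can be sketched as follows; `is_available` is a hypothetical callback standing in for a query to the JSOC archive:

```python
from datetime import datetime, timedelta

MAX_LOOKBACK_HOURS = 5  # beyond this, the data are considered missing

def find_available_obs_time(issue_time, is_available):
    """Step back from 1 h to 5 h before the forecast issue time and
    return the newest observation time available; None means missing."""
    for lag_h in range(1, MAX_LOOKBACK_HOURS + 1):
        t = issue_time - timedelta(hours=lag_h)
        if is_available(t):
            return t
    return None  # data missing: the forecast is skipped

# Example: only data older than 3 h has finished JSOC processing
issue = datetime(2019, 1, 10, 3, 0)
ready = lambda t: t <= issue - timedelta(hours=3)
obs_time = find_available_obs_time(issue, ready)
```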

The NRT vector magnetograms taken by HMI/SDO are used for operational forecasts, whereas the calibrated HMI ‘definitive’ series of vector magnetograms is used for scientific research. The NRT vector magnetograms are accessed from the data series ‘hmi.bharp_720s_nrt’ with the segments ‘field’, ‘inclination’, ‘azimuth’, and ‘ambig’. These segments contain the field strength, inclination angle, azimuth angle, and the disambiguation of the photospheric magnetic field, respectively. Additionally, the NRT line-of-sight magnetograms are downloaded from the data series ‘hmi.M_720s_nrt’, and the NRT white light images from the ‘hmi.Ic_noLimbDark_720s_nrt’ (jsoc2) series. The NRT data of the AIA 131 Å, 193 Å, 304 Å, and 1600 Å filters are retrieved from the ‘aia.lev1_nrt2’ (jsoc2) series.
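As an illustration, a record-set string for the NRT vector magnetogram series can be assembled as below. The bracket/brace layout (series[record][time]{segments}) is an assumption based on common JSOC record-set conventions; in practice, a client such as the `drms` Python package would submit the query, and the exact syntax should be checked against the JSOC documentation:

```python
def jsoc_recordset(series, time_tai, segments):
    """Build a JSOC-style record-set string, e.g.
    series[][time]{seg1,seg2,...}; an empty [] leaves the record
    (HARP number) unconstrained."""
    return "{}[][{}]{{{}}}".format(series, time_tai, ",".join(segments))

query = jsoc_recordset(
    "hmi.bharp_720s_nrt",
    "2019.01.10_03:00:00_TAI",
    ["field", "inclination", "azimuth", "ambig"],
)
```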

Note that the HMI NRT vector magnetogram does not cover the full disk, in contrast to the HMI definitive series data. HMI Active Region Patches (HARP) are automatically detected in the pipeline of HMI data processing (Bobra et al. 2014), and the HMI NRT vector magnetogram is limited to the HARP areas plus a buffer, on which we overlaid the active region frames detected by DeFN and extracted 79 physics-based features (Fig. 2; see the detection algorithms and extracted features in Nishizuka et al. (2017, 2018) and details of the HMI NRT vector magnetogram in Leka et al. 2018). Furthermore, the correlation between the HMI NRT data and the definitive data has not been fully characterized statistically. A future task is to determine how the difference between the HMI NRT and definitive series data affects the forecasting results. The same comments apply to the AIA NRT and definitive series data.

Fig. 2

Full solar disk vector magnetograms taken by HMI/SDO on 6 September 2017. a HMI definitive series data for the full disk and the areas detected by DeFN (red frames). b HMI NRT data showing the HARP areas in gray scale and the areas detected by DeFN. Note that the HARP and DeFN areas overlap each other. If DeFN areas fall outside the HARP areas, the data are set to zero

Implementation of operational DeFN

Operational DeFN runs autonomously every 6 h by default, forecasting at 03:00, 09:00, 15:00, and 21:00 UT. The forecasting time of 03:00 UT was set to precede the daily forecasting meeting of NICT at 14:30 JST. The weights of the multi-layer perceptrons of DeFN were trained with the 2010–2014 observation datasets, and we selected representative hyperparameters using the 2015 observation dataset.

For the classification problem, parameters are optimized to minimize the cross-entropy loss function. However, because the flare occurrence ratio is imbalanced, we adopted a loss function that normalizes the prior distributions. It is the weighted cross entropy:

$$\begin{aligned} J_{{\text{WCE}}} = - \sum _{n=1}^N \sum _{k=1}^K w_k y_{nk}^* \log p(y_{nk}). \end{aligned}$$

Here, \(p(y_{nk}^*)\) is the initial probability of the correct labels \(y_{nk}^*\), i.e., 1 or 0, whereas \(p(y_{nk})\) is the estimated probability. The components of \(y_{nk}^*\) are 1 or 0; thus, \(p(y_{nk}^*) = y_{nk}^*\). \(w_k\) is the weight of each class and is the inverse of the class occurrence ratio, i.e., [1, 50] for ≥ M-class flares and [1, 4] for ≥ C-class flares. Parameters are stochastically optimized by adaptive moment estimation (Adam; Kingma and Ba 2014) with learning rate = 0.001, \(\beta _1\) = 0.9, and \(\beta _2\) = 0.999. The batch size was set to 150 (for details, see Nishizuka et al. 2018).
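The weighted cross entropy \(J_{\text{WCE}}\) can be sketched in numpy as follows, using the [1, 50] weights of the ≥ M-class model and hypothetical predicted probabilities:

```python
import numpy as np

def weighted_cross_entropy(y_true, y_prob, w):
    """J_WCE = -sum_n sum_k w_k * y*_nk * log p(y_nk), where y_true is
    one-hot with shape (N, K), y_prob holds the predicted probabilities
    with shape (N, K), and w is the per-class weight vector (the inverse
    of the class-occurrence ratio)."""
    return -np.sum(w * y_true * np.log(y_prob))

# Two samples: a non-flare labeled (1, 0) and a >=M-class flare labeled (0, 1)
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])
w = np.array([1.0, 50.0])  # inverse of the ~1:50 base rate

loss = weighted_cross_entropy(y_true, y_prob, w)
# The rare positive class dominates the loss through its weight of 50
```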

Theoretically, positive and negative events, i.e., whether a ≥ M-class flare occurs or not, are predicted in the following manner, using an equation common in machine learning:

$$\begin{aligned} {{\hat{y}}} = \mathop {{\text{argmax}}}\limits _{k} p(y_k). \end{aligned}$$

Here, \({{\hat{y}}}\) is the prediction result, and the threshold is usually fixed. For example, in the case of two-class classification, events with a probability greater than 50% are forecast. When using the model as a probabilistic prediction model, we also tried smaller threshold values for operational safety, although these have no obvious theoretical justification.
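In the two-class case, the argmax rule of Eq. (2) reduces to a 50% threshold on the positive-class probability, so the operational threshold adjustment can be sketched as:

```python
def forecast_flare(p_flare, threshold=0.5):
    """Issue a positive forecast when the flare probability reaches the
    threshold; with threshold=0.5 and two classes, this is equivalent
    to argmax_k p(y_k)."""
    return p_flare >= threshold

p = 0.45
default_call = forecast_flare(p)         # no flare forecast at 50%
cautious_call = forecast_flare(p, 0.40)  # flare forecast at the safer 40%
```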

Note that the loss function weights cannot be selected arbitrarily. The positive-to-negative event ratios of ≥ M-class to < M-class and ≥ C-class to < C-class flares, called the occurrence frequency or the climatological base rate, were approximately 1:50 and 1:4, respectively, during 2010–2015. Only when the cross entropy is weighted by the inverse of the positive-to-negative event ratio does outputting predictions by Eq. (2) become theoretically valid. Therefore, we used the base rate as the weight of the cross entropy.

The DNN model of the operational DeFN was developed as in Nishizuka et al. (2018). Because the full HMI and AIA datasets obtained from 2010 to 2015 were too large to save and analyze, the cadence was reduced to 1 h, although, in general, a larger amount of data is useful for better predictions. We divided the feature dataset into two parts with a chronological split: the dataset obtained from 2010 to 2014 for training and the 2015 dataset for validation. A key point of this paper is to contrast how well the DeFN model predicts solar flares in real-time operations versus in research using time-series CV methods (a shuffle-and-split CV is insufficient). We then show that the gap between the prediction accuracies in operations and in research with a time-series CV is small (see “Time-series CV” section).

The time-series CV is stricter than a K-fold CV on data split by active region. It is true that a K-fold CV on data split by active region can also prevent data from a single active region from being used in both training and testing (e.g., Bobra and Couvidat 2015). However, such a CV allows the training set to contain future samples from different active regions, which may affect the prediction results when there is long-term variation of solar activity. Moreover, the number of active regions that produced X-class and M-class flares is small, so a K-fold CV on data split by active region may be biased and unbalanced.

Indeed, operational solar flare prediction is carried out under a very strict condition, in which no future data are available. Our aim is not to reject a K-fold CV on data split by active region, but to discuss more appropriate CV schemes for operational settings.

The model was evaluated with a skill score, the true skill statistic (TSS; Hanssen and Kuipers 1965), which is a metric of discrimination performance. The model succeeded in predicting flares with TSS = 0.80 for ≥ M-class and TSS = 0.63 for ≥ C-class flares (Table 1). Note that the data for 2016–2018 were not used, because there were fewer flares in this period than in the period between 2010 and 2015.

Table 1 Contingency tables of DeFN using 2010–2015 datasets

Flare labels were attached to the 2010–2015 feature database for supervised learning. From the flare event list, we collected all the flare samples that occurred on the disk. We visually checked the locations of the flares, compared them with NOAA numbers, and, when there were two or more active regions, identified the corresponding active region in our database. We then attached flare labels to predict the maximum class of flares occurring in the following 24 h. If a ≥ M-class flare is observed within 24 h after an observation, the data are given the label (0, 1)\(_{\text{M}}\); otherwise, they are given the label (1, 0)\(_{\text{M}}\). When two M-class flares occur within 24 h, the period with the label (0, 1)\(_{\text{M}}\) is extended. Similarly, the labels (0, 1)\(_{\text{C}}\) and (1, 0)\(_{\text{C}}\) are attached separately for the C-class prediction model. The training was executed using these labels. In real-time operation, by contrast, the true labels of flares are unknown, so the NRT feature database is given dummy labels (1, 0), which are not used in the predictions. The model could be updated by retraining it using the latest datasets if the prediction accuracy decreased; however, the pretrained operational model is currently fixed and has not been changed.
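The 24 h labeling rule can be sketched as follows. The flare times here are hypothetical, but the rule matches the text: an observation receives the label (0, 1) whenever a flare of the target class occurs within 24 h after it, which naturally extends the labeled period when flares overlap:

```python
from datetime import datetime, timedelta

def attach_label(obs_time, flare_times, window_hours=24):
    """Return (0, 1) if any target-class flare occurs within
    window_hours after obs_time, else (1, 0)."""
    end = obs_time + timedelta(hours=window_hours)
    if any(obs_time < t <= end for t in flare_times):
        return (0, 1)
    return (1, 0)

m_flares = [datetime(2014, 10, 22, 14, 2)]  # hypothetical >=M-class flare time

label_pos = attach_label(datetime(2014, 10, 22, 0, 0), m_flares)
label_neg = attach_label(datetime(2014, 10, 23, 18, 0), m_flares)
```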

Operation forecasts using DeFN

Graphical output

The graphical output is automatically generated and shown on a website (Fig. 3). The website was designed to be easy to understand for professional space weather forecasters, who are often not research scientists. Prediction results for both the full-disk and region-by-region images are shown on the website, and the risk level is indicated by one of three marks: “Danger flares”, “Warning”, and “Quiet”. Images are updated every 6 h as new data are downloaded. Details of the DeFN website are described below:

  • Solar full-disk images and detected ARs:

    Images obtained by multiwavelength observations, such as magnetograms and white light, 131 Å, 193 Å, 304 Å, and 1600 Å images taken by SDO, are shown along with ARs detected by DeFN, where the threshold is set to 140 G in the line-of-sight magnetograms taken by HMI/SDO [see details of the detection method in Nishizuka et al. (2017)].

  • Probabilistic forecasts at each AR:

    Probabilistic forecasts of flare occurrence at each AR are shown for ≥ M-class and < M-class flares or ≥ C-class and < C-class flares by bar graphs, in analogy with probabilistic forecasts of precipitation. Note that this forecast probability does not indicate the real observation frequency, because the prior distributions are normalized to peak at 50% by the weighted cross entropy, whose weights are the inverse of the flare occurrence ratio (Nishizuka et al. 2020). Thus, operational DeFN is optimized for forecasting with the default probability threshold of 50%. That is, operational DeFN forecasts a flare if the real occurrence probability, as forecast by the non-weighted cross-entropy loss function, is greater than the climatological event rate, and it does not forecast a flare if the real occurrence probability is less than the climatological event rate (see also Nishizuka et al. 2020). Therefore, the normalized forecast probability of ≥ M-class flares sometimes becomes larger than that of ≥ C-class flares.

  • Full-disk probability forecasts and alert marks:

    The full-disk flare occurrence probability of ≥ M-class flares, \(P_{{\text{FD}}}\), is calculated using:

    $$\begin{aligned} P_{{\text{FD}}} = 1.0 - \prod _{{\text{AR}}_i \in S} (1.0 - P_i), \end{aligned}$$

    where \(S\) is the set of ARs on the disk, \(i\) indexes the ARs, and \(P_i\) is the probabilistic forecast for each \({\text{AR}}_i \in S\) (Leka et al. 2018). The risk level is indicated by a mark based on \(P_{{\text{FD}}}\) and is divided into three categories: “Danger flares” (\(P_{{\text{FD}}}\) ≥ 80%), “Warning” (\(P_{{\text{FD}}}\) ≥ 50%), and “Quiet” (\(P_{{\text{FD}}}\) < 50%). This is analogous to weather forecasting, e.g., sunny, cloudy, and rainy.

  • List of comments and remarks:

    Forecasted probabilities (percentages), comments, and remarks are summarized in a list.
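The full-disk probability \(P_{{\text{FD}}}\) and the three alert marks described above can be sketched as follows (the per-AR probabilities are hypothetical):

```python
def full_disk_probability(ar_probs):
    """P_FD = 1 - prod_i (1 - P_i): the probability that at least one
    AR on the disk flares, treating the ARs as independent."""
    p_none = 1.0
    for p_i in ar_probs:
        p_none *= (1.0 - p_i)
    return 1.0 - p_none

def risk_mark(p_fd):
    """Map P_FD to the three alert marks used on the website."""
    if p_fd >= 0.80:
        return "Danger flares"
    if p_fd >= 0.50:
        return "Warning"
    return "Quiet"

p_fd = full_disk_probability([0.6, 0.3])  # 1 - 0.4 * 0.7 = 0.72
mark = risk_mark(p_fd)
```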

Fig. 3

Website of operational DeFN showing a full-disk solar image and detected ARs, in addition to the full-disk forecasts, region-by-region forecasts, an alert mark, and a comment list

Details of operations and redundancy

Operational DeFN has provided stable forecasts since January 2019. In this subsection, we explain the redundancy and operational details of operational DeFN.

  • Forecast outages: A forecast gives, for each pair of categories, the category with the maximum flare probability for the 24 h following the forecast issue time. A forecast is normally issued every 6 h. If problems occur when downloading or processing data, the forecast is skipped and recorded as a forecast failure.

  • Data outages of SDO/HMI, AIA: When no applicable NRT data are available for a forecast because of delays in HMI/SDO data processing, the NRT data to download are moved back in time, from 1 to 5 h earlier. In such a case, the forecast target window shifts from 24 h to 25–29 h ahead, though operational DeFN is not retrained. If no data can be found beyond 5 h earlier, the “no data” value is assigned and the forecast is skipped.

  • No sunspots or ARs with strong magnetic field on disk: If there are no sunspots or ARs detected with the threshold of 140 G on the disk image of the line-of-sight magnetogram, feature extraction is skipped, a forecast of “no flare” with a probability forecast of 1% is issued, and the “no sunspot” value is assigned.

  • Forecasts at ARs near/across the limb: DeFN is currently not applicable to limb events. If an AR is detected across the limb, it is excluded from the forecast targets.

  • Flares not assigned to an active region: Active regions detected by operational DeFN do not completely coincide with the active regions registered by NOAA. There are cases where flares occur in decaying or emerging active regions that are not detected by DeFN with the 140 G threshold. This occurs most often for C-class and lower intensity flares, for example, the C2.0 flare in NOAA 12741 on 15 May 2019. Such flares are missed in real-time forecasts but included in the evaluations.

  • Retraining: DeFN can be retrained on demand, and a newly trained model can be used for forecasting. Currently, the pretrained model is fixed and has not been changed so far.

  • Alternative data input after SDO era (option): Since DeFN is designed to detect ARs and extract features by itself, it can be revised and trained to include other space- and ground-based observation data in DeFN, even when SDO data are no longer available.

Forecast results and evaluation

Operational benchmark

The purpose of machine learning techniques is to maximize performance on unseen data, which is called generalization performance. Because generalization performance is hard to measure, it is usually approximated by test-set performance, where there is no overlap between the training and test sets.

On the other hand, as the community continues to use a fixed test set, the performance of newly proposed models will appear, on the surface, to improve year by year. In reality, generalization performance does not constantly improve; rather, more and more models become effective only for that test set, partially because models with lower performance than the state of the art are not reported. It thus becomes essentially indistinguishable whether an improvement reflects better generalization performance or merely a method that is effective only for the fixed test set.

The above facts are well known in the machine learning community, and evaluation conditions are mainly divided into two types: basic and strict. Under the strict evaluation conditions, only an independent evaluation body evaluates each model, using the test set only once, and the test set is not published to the model builders (see, e.g., Ardila et al. 2019). Solar flare prediction is inherently a prediction of future solar activity from a present observation dataset, and the only data available to researchers are past data. This is consistent with the strict evaluation condition in the machine learning community.

In this section, we evaluate our operational forecasting results, which we call the “operational benchmark” in this paper. In the machine learning community, a benchmark using a fixed test set is used only for basic benchmark tests. The basic approach is simple but is known to be insufficient, because no one can guarantee that the test set is used only once. In strict machine learning benchmarks, evaluation with a completely unseen test set is required: only the organizer can see this test set, which cannot be seen by individual researchers. This is because, if researchers use the test set many times, they implicitly tend to select models that are effective only for that fixed test set.

We think that the evaluation methods for operational solar flare prediction models are not limited to evaluations using a fixed test set; however, this paper does not reject performance evaluation with a fixed test set. The purpose of this paper is to show that operational evaluation is important. From a fairness perspective, the strict benchmarking approach takes precedence over the basic approach. Our operational evaluation follows the strict benchmarking approach: we did not retrain our model after the deployment of our system.

Forecast results and evaluation

We evaluated the forecast results from January 2019 to June 2020, during which we operated operational DeFN in real time. During this period, 24 C-class flares and one M-class flare were observed. The M-class flare was observed on 6 May 2019 as M1.0; it was originally reported as C9.9 and later corrected to M1.0. The forecast results are shown in Table 2. Each contingency table shows the prediction results for ≥ M-class and ≥ C-class flares. Operational DeFN was originally trained with a probability threshold of 50% for the classification, but in operations, users can change it according to their purposes. In Table 2, we show three cases for ≥ M-class and ≥ C-class predictions using different probability thresholds, namely 50%, 45%, and 40%, for reference.

Table 2 Contingency tables of DeFN forecasts in operation from January 2019 to June 2020

Each skill score can be computed from the entries of a contingency table, but not vice versa; this is a well-known fact. No matter how many skill scores are shown, they carry no more information than one contingency table. The relative operating characteristic (ROC) curve and the reliability diagram, which are shown in Leka et al. (2019), can also be reproduced from the contingency table if they relate to a deterministic forecast (as in this paper). The ROC curve is a curve or straight line formed by points on the probability of false detection (POFD)–probability of detection (POD) plane. The ROC curve for a deterministic forecast is made by connecting three points: (0, 0), (POFD, POD) for the deterministic forecast, and (1, 1) (see, e.g., Richardson 2000; Jolliffe and Stephenson 2012). For reference, the skill scores used in Leka et al. (2019) include the accuracy, Hanssen & Kuipers skill score/Peirce skill score/true skill statistic (TSS/PSS), Appleman skill score (ApSS), equitable threat score (ETS), Brier skill score, mean-square-error skill score (MSESS), Gini coefficient, and frequency bias (FB).
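The deterministic-forecast metrics used in this paper can all be derived from a single 2 × 2 contingency table, as sketched below with hypothetical counts (not the actual Table 2 or Table 3 entries):

```python
def verification_metrics(tp, fp, fn, tn):
    """Accuracy, TSS (= POD - POFD), FAR, and HSS from a 2x2 contingency
    table of (hits, false alarms, misses, correct rejections)."""
    n = tp + fp + fn + tn
    pod = tp / (tp + fn)   # probability of detection
    pofd = fp / (fp + tn)  # probability of false detection
    accuracy = (tp + tn) / n
    tss = pod - pofd
    far = fp / (tp + fp)   # false alarm ratio
    # HSS: skill relative to the chance-expected number of correct forecasts
    expected = ((tp + fn) * (tp + fp) + (tn + fn) * (tn + fp)) / n
    hss = (tp + tn - expected) / (n - expected)
    return {"accuracy": accuracy, "TSS": tss, "FAR": far, "HSS": hss}

m = verification_metrics(tp=18, fp=20, fn=6, tn=1956)  # hypothetical counts
```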

According to Table 2, flare occurrence was very rare and imbalanced in the solar minimum phase, and most of the forecasts are true negatives. When we decrease the probability threshold, the number of forecast events increases. We evaluated our results with the four verification metrics in Table 3: accuracy, TSS, false alarm ratio (FAR), and Heidke skill score (HSS) (Murphy 1993; Barnes et al. 2009; Kubo et al. 2017). They show that operational DeFN optimized for ≥ C-class flare prediction achieved an accuracy of 0.99 and a TSS of 0.70 with a probability threshold of 50%, whereas these were 0.98 and 0.83 with a probability threshold of 40%. DeFN optimized for ≥ M-class flare prediction achieved an accuracy of 0.99, but the TSS was only 0.24, because only a single M1.0 flare occurred. Operational DeFN did not predict this flare, because it was at the boundary of the two categories of ≥ M-class and < M-class flares. This happens frequently in real operations and is a weakness of binary classification systems used in operational settings.

Table 3 Evaluations of operational forecast results by DeFN from January 2019 to June 2020 with four verification metrics

The trends of the contingency tables (Table 2) are similar to those evaluated in the model development phase. However, there are two differences. First, the data used were the NRT data, whereas the definitive series was used for development; in this case, however, the difference between them was negligible. Second, the evaluation methods are different: the operational DeFN was evaluated on the actual data from 2019 to 2020, whereas the development model was trained with the 2010–2014 datasets and tested with the 2015 dataset. It appears that the chronological split provides more suitable evaluation results for operations than the common methods, namely, shuffle-and-split CV and K-fold CV.

Time-series CV

Here, we propose the use of time-series CV for the evaluation of operational forecasting models. In previous papers on flare prediction, we used hold-out CV, in which a chronologically split subset of the data was reserved for validation and testing, rather than the naïve K-fold CV. This is because care is needed when splitting time-series data to prevent data leakage (Nishizuka et al. 2018). To accurately evaluate prediction models in an operational setting, the training data must not include events that occur chronologically after the events used for testing.

The time-series CV is illustrated in Fig. 4. In this procedure, there is a series of testing datasets, each consisting of a set of observations and used to estimate the prediction error. The corresponding training dataset consists of observations that occurred prior to the observations forming the testing dataset and is used for parameter tuning. Thus, the model is never tested on data that pre-date the training set. Furthermore, the training dataset is divided into training and validation datasets. The model prediction accuracy is calculated by averaging over the testing datasets. This procedure is called rolling forecasting origin-based CV (Tashman 2000). In this paper, we call it time-series CV; it provides an almost unbiased estimate of the true error (Varma and Simon 2006).

Fig. 4

Explanation of time-series CV. The series of training, validation, and test sets, where the blue observations form the training sets, the green observations form the validation sets, and the red observations form the test sets

Note that the time-series CV has the following advantages: (i) it is the standard validation scheme in time-series prediction; (ii) a single chronological split does not always guarantee low generalization error (Bishop 2006), i.e., the trained model is not guaranteed to work for an unseen test set. To mitigate this, the time-series CV applies multiple chronological splits. The ability to correctly predict new examples that differ from those used for training is known as generalization performance (Bishop 2006). Therefore, the time-series CV is more generic and appropriate.
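The rolling-origin procedure can be sketched on yearly datasets as follows. The exact year boundaries are our reading of the procedure used for Table 4 (five test years, with the validation year always lying between the training years and the strictly future test year):

```python
def time_series_cv_splits(years, min_train_years=2):
    """Rolling forecasting origin: every test year strictly post-dates
    all training and validation years."""
    splits = []
    for i in range(min_train_years, len(years) - 1):
        train = tuple(years[:i])   # e.g. 2010..2011 for the first split
        valid = (years[i],)        # the year right after training
        test = (years[i + 1],)     # the following, strictly future year
        splits.append((train, valid, test))
    return splits

splits = time_series_cv_splits(list(range(2010, 2018)))
# Five splits; the last one trains on 2010-2015, validates on 2016,
# and tests on 2017
```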

The evaluation results obtained by time-series CV using the 2010–2017 datasets are summarized in Table 4. The datasets were chronologically split to form the training, validation, and testing datasets. TSS is largest with the 2010–2014 datasets for training, the 2015 datasets for validation, and the 2016 datasets for testing. This is probably because a reliable forecast cannot be obtained from the small training datasets covering only 2010 to 2012. By averaging over the five testing datasets, we found that TSS is 0.70 for ≥ M-class flares and 0.59 for ≥ C-class flares. This procedure will become even more suitable as the observation dataset covers a longer time period.

Table 4 Evaluation of DeFN forecasts using the time-series CV

Summary and discussion

We developed an operational flare prediction model using DNNs, based on a research version of the DeFN model. It provides probabilistic forecasts of flares in two categories occurring in the next 24 h from observations: ≥ M-class versus <M-class flares, or ≥ C-class versus <C-class flares. DeFN has been used continuously for operational forecasting since January 2019, and we evaluated its performance using the forecasts and actual flare occurrences between January 2019 and June 2020. We found that operational DeFN achieved an accuracy of 0.99 and a TSS of 0.70 for ≥ C-class flare predictions, whereas for ≥ M-class flare predictions the accuracy was 0.99 but the TSS was only 0.24, using a probability threshold of 50%. With a probability threshold of 40%, the accuracy and TSS were 0.98 and 0.83 for ≥ C-class flares, and 0.98 and 0.48 for ≥ M-class flares.

Operational DeFN has the advantages of a large TSS, good discrimination performance, and a low probability of missing observed flares. This makes it useful for operations that require that no flares be missed, such as human activities in space and critical satellite operations. On the other hand, it tends to over-forecast, which increases the false alarm ratio (FAR). Because the number of true negatives is very large in an imbalanced problem such as solar flare prediction, TSS is less sensitive to false positives than to false negatives. Currently, the prior distributions of ≥ M-class and <M-class flares are renormalized to increase TSS at a threshold probability of 50%, but this results in an increased FAR.
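The role of the probability threshold can be illustrated with a small sketch: lowering it converts more probabilistic forecasts into alarms, trading missed detections for false alarms. The probabilities and helper name below are made up for illustration, not DeFN output:

```python
def deterministic_forecast(probabilities, threshold=0.5):
    """Convert probabilistic flare forecasts into binary alarms."""
    return [p >= threshold for p in probabilities]

# Hypothetical forecast probabilities for five active regions
probs = [0.05, 0.42, 0.55, 0.91, 0.38]
print(deterministic_forecast(probs, threshold=0.5))  # [False, False, True, True, False]
print(deterministic_forecast(probs, threshold=0.4))  # [False, True, True, True, False]
```

At the 40% threshold, the region with probability 0.42 becomes an alarm: a possible hit if it flares, a false alarm if it does not.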

When we compared the evaluation results, we observed no significant difference between the pretrained and operational results. This means that, at least during January 2019–June 2020, the difference between the NRT and definitive science data series did not greatly affect the forecasts. The TSS of 0.63 obtained with the pretrained ≥ C-class model was maintained and even increased to 0.70 (0.83) in operational forecasts with a probability threshold of 50 (40)%. This suggests that the chronological split is more suitable for the training and validation of an operational model than shuffle-and-split CV.

Here, we discuss how to train and evaluate machine learning models for operational forecasting. For an exact comparison, it is desirable for all participants to use the same datasets. If this is not possible, three points require attention.

  1. (i) Observation database: The ratio of positive to negative events should not be artificially changed, and datasets should not be selected artificially. The data should reflect the climatological event rate and be kept natural, because some metrics, especially HSS, are affected by controlling the positive-to-negative event ratio, which results in a difference from operational evaluations. For operational evaluations, it is also desirable to include ARs near the limb, although they are excluded in most papers because magnetogram values near the limb are unreliable owing to the projection effect. Currently, machine learning models do not consider limb flares, but they will need to be considered in the near future, using GOES X-ray statistics as in human forecasting, or magnetograms reconstructed from STEREO EUV images (Kim et al. 2019).

  2. (ii) Datasets for training and testing: We recommend that a chronological split or time-series CV be used for the training and evaluation of operational models. Although K-fold CV with random shuffling is common in solar flare prediction, it is problematic for time-series data whose variation between neighboring samples is very small, e.g., the slow time evolution of the magnetic field. If two neighboring, very similar samples are assigned to the training and testing sets, respectively, the model becomes biased toward overpredicting flares. It is true that K-fold CV on data split by active region can also prevent data from a single active region from appearing in both the training and testing sets. However, such a split still allows the training set to contain samples that post-date test samples from other active regions. Therefore, from the viewpoint of generalization performance, time-series CV is stricter and more suitable for operational evaluation.

  3. (iii) Selection of metrics: The ranking of models is easily affected by the choice of metric. Depending on their purpose, users should select their preferred model by examining the contingency tables and skill scores of each model. With the understanding that each skill score evaluates only one aspect of performance, verification methods should be discussed in the space weather community (see also Pagano et al. 2019; Cinto et al. 2020).
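The leakage risk described in point (ii) can be demonstrated with a toy check over sample indices ordered in time. The index-based splits and helper names below are a hypothetical illustration, not the DeFN pipeline:

```python
import random

def chronological_split(n_samples, test_fraction=0.2):
    """Split sample indices so that every training sample precedes the test set."""
    cut = int(n_samples * (1 - test_fraction))
    return list(range(cut)), list(range(cut, n_samples))

def shuffled_split(n_samples, test_fraction=0.2, seed=0):
    """Naive shuffled split: temporally adjacent samples can land on both sides."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(n_samples * (1 - test_fraction))
    return sorted(idx[:cut]), sorted(idx[cut:])

def leaks_future(train_idx, test_idx):
    """True if any training sample is chronologically later than a test sample."""
    return max(train_idx) > min(test_idx)

train, test = chronological_split(100)
print(leaks_future(train, test))  # False: all training samples precede the test set

train, test = shuffled_split(100)
print(leaks_future(train, test))  # True for this seed: the training set contains future samples
```

When neighboring, nearly identical samples straddle the two sets, as in the shuffled case, the evaluation overstates operational performance.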

In this paper, we showed the contingency tables of our prediction results. No matter how many skill scores are presented, they carry no more information than the contingency table itself. We evaluated our prediction results as a deterministic forecasting model. The ROC curve and the reliability diagram, shown in Barnes et al. (2016) and Leka et al. (2019), can also be reproduced from the contingency tables when they are related to deterministic forecasts.
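The skill scores discussed in this paper can all be derived from a 2×2 contingency table. A minimal sketch follows; the counts are illustrative for an imbalanced flare dataset, not the paper's results:

```python
def skill_scores(tp, fn, fp, tn):
    """Deterministic-forecast metrics from a 2x2 contingency table.

    tp: predicted flare, flare observed; fn: missed flare;
    fp: false alarm; tn: correct rejection.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # True skill statistic: hit rate minus false alarm rate
    tss = tp / (tp + fn) - fp / (fp + tn)
    # False alarm ratio: fraction of positive forecasts that did not verify
    far = fp / (tp + fp)
    return {"accuracy": accuracy, "TSS": tss, "FAR": far}

# Illustrative counts: many true negatives, as in flare prediction
print(skill_scores(tp=80, fn=20, fp=50, tn=850))
```

Note how the large true-negative count inflates accuracy while leaving TSS nearly insensitive to the false alarms that drive FAR, which is the trade-off discussed above.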

We demonstrated the performance of a machine learning model in an operational flare forecasting scenario. The same methods and discussion of prediction with machine learning algorithms can be applied to other space weather forecasting models in the magnetosphere and ionosphere. Our future aim is to extend our model to the prediction of CMEs and their social impacts on Earth by extending our database to include geoeffective phenomena and technological infrastructure.

Availability of data and materials

The code is available online; in the README file, we explain the architecture and the selected hyperparameters. The feature database of DeFN is available at the World Data Center of NICT. The SDO data are available from the SDO data center and JSOC. The GOES data are available online.


  1. Ardila D, Kiraly A, Bharadwaj S, Choi B, Reicher JJ, Peng L, Tse D, Etemadi M, Ye W, Corrado G, Naidich DP, Shetty S (2019) End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 25:954–961.

  2. Aulanier G, Török T, Démoulin P, DeLuca EE (2010) Formation of torus-unstable flux ropes and electric currents in erupting sigmoids. Astrophys J 708:314–333.

  3. Barnes LR, Schults DM, Gruntfest EC, Hayden MH, Benight CC (2009) Corrigendum: false alarm rate or false alarm ratio? Weather Forecast 24:1452–1454.

  4. Barnes G, Leka KD, Schrijver CJ, Colak T et al (2016) A comparison of flare forecasting methods. I. Results from the “All-Clear’’ Workshop. Astrophys J 829:89.

  5. Bhattacharjee S, Alshehhi R, Dhuri DB, Hanasoge SM (2020) Supervised convolutional neural networks for classification of flaring and nonflaring active regions using line-of-sight magnetograms. Astrophys J 898:98.

  6. Bishop CM (2006) Pattern recognition and machine learning. In: Jordan M, Kleinberg J, Schölkopf M (eds) Information science and statistics. Springer, New York, p 738

  7. Bloomfield DS, Higgins PA, McAteer RTJ, Gallagher PT (2012) Toward reliable benchmarking of solar flare forecasting methods. Astrophys J Lett 747:L41.

  8. Bobra MG, Couvidat S (2015) Solar flare prediction using SDO/HMI vector magnetic field data with a machine-learning algorithm. Astrophys J 798:135.

  9. Bobra MG, Sun X, Hoeksema JT, Turmon M, Liu Y, Hayashi K, Barnes G, Leka KD (2014) The helioseismic and magnetic imager (HMI) vector magnetic field pipeline: SHARPs—space-weather HMI Active region Patches. Solar Phys 289(9):3549–3578.

  10. Chen Y, Manchester WB, Hero AO, Toth G, DuFumier B, Zhou T et al (2019) Identifying solar flare precursors using time series of SDO/HMI images and SHARP parameters. Space Weather 17:1404–1426.

  11. Cheung MCM, Isobe H (2014) Flux emergence (Theory). Living Rev Solar Phys 11:3.

  12. Cinto T, Gradvohl A, Coelho GP, da Silva AEA (2020) A framework for designing and evaluating solar flare forecasting systems. Monthly Notices R Astron Soc 495:3332–3349.

  13. Colak T, Qahwaji R (2008) Automated McIntosh-based classification of sunspot groups using MDI images. Solar Phys 248:277–296.

  14. Colak T, Qahwaji R (2009) Automated solar activity prediction: a hybrid computer platform using machine learning and solar imaging for automated prediction of solar flares. Space Weather 7(6):S06001.

  15. Crown MD (2012) Validation of the NOAA Space Weather Prediction Center’s solar flare forecasting look-up table and forecaster-issued probabilities. Space Weather 10:S06006.

  16. Deshmukh V, Berger T, Bradley E, Meiss JD (2020) Leveraging the mathematics of shape for solar magnetic eruption prediction. J Space Weather Space Clim 10:13.

  17. Devos A, Verbeeck C, Robbrecht E (2014) Verification of space weather forecasting at the Regional Warning Center in Belgium. J Space Weather Space Clim 4:A29.

  18. Domijan K, Bloomfield DS, Pitié F (2019) Solar flare forecasting from magnetic feature properties generated by the Solar Monitor Active Region Tracker. Solar Phys 294:6.

  19. Falconer D, Barghouty AF, Khazanov I, Moore R (2011) A tool for empirical forecasting of major flares, coronal mass ejections, and solar particle events from a proxy of active-region free magnetic energy. Space Weather 9:S04003.

  20. Falconer DA, Moore RL, Barghouty AF, Khazanov I (2012) Prior flaring as a complement to free magnetic energy for forecasting solar eruptions. Astropphys J 757:32.

  21. Fletcher L et al (2011) An observational overview of solar flares. Space Sci Rev 159:19–106.

  22. Gallagher PT, Moon Y-J, Wang H (2002) Active-region monitoring and flare forecasting I. Data processing and first results. Solar Phys 209:171–183.

  23. Hanssen AW, Kuipers WJA (1965) On the relationship between the frequency of rain and various meteorological parameters (with reference to the problem of objective forecasting). Mededelingen en Verhandelingen 81, Royal Netherlands Meteorological Institute (65 pp)

  24. Hoeksema JT, Liu Y, Hayashi K, Sun X, Schou J, Couvidat S, Norton A, Bobra M, Centeno R, Leka KD, Barnes G, Turmon M (2014) The Helioseismic and Magnetic Imager (HMI) vector magnetic field pipeline: overview and performance. Solar Phys 289:3483–3530.

  25. Huang X, Wang H, Xu L, Liu J, Li R, Dai X (2018) Deep learning based solar flare forecasting model. I. Results for line-of-sight magnetograms. Astrophys J 856:7.

  26. Inoue S, Kusano K, Büchner J, Skála J (2018) Formation and dynamics of a solar eruptive flux tube. Nat Commun 9:174.

  27. Jiao Z, Sun H, Wang X, Manchester W, Gombosi T, Hero A, Chen Y (2020) Solar flare intensity prediction with machine learning models. Space Weather 18:e2020SW002440.

  28. Jolliffe IT, Stephenson DB (2012) Forecast verification: a practitioner’s guide in atmospheric science, 2nd edn. John Wiley & Sons Ltd., Hoboken.

  29. Kim T, Park E, Lee H, Moon Y-, Bae S-, Lim D, Jang S, Kim L, Cho I-H, Choi M, Cho K-S (2019) Solar farside magnetograms from deep learning analysis of STEREO/EUVI data. Nat Astron 3:397–400.

  30. Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. In: International conference on learning representations (ICLR) 2015. arXiv preprint arXiv:1412.6980

  31. Korsós MB, Georgoulis MK, Gyenge N, Bisoi SK, Yu S, Poedts S, Nelson CJ, Liu J, Yan Y, Erdélyi R (2020) Solar flare prediction using magnetic field diagnostics above the photosphere. Astrophys J 896:119.

  32. Krista LD, Gallagher PT (2009) Automated coronal hole detection using local intensity thresholding techniques. Solar Phys 256:87–100.

  33. Kubo Y, Den M, Ishii M (2017) Verification of operational solar flare forecast: case of Regional Warning Center Japan. J Space Weather Space Clim 7:A20.

  34. Kusano K, Bamba Y, Yamamoto TT, Iida Y, Toriumi S, Asai A (2012) Magnetic field structures triggering solar flares and coronal mass ejections. Astrophys J 760:31.

  35. Kusano K, Iju T, Bamba Y, Inoue S (2020) A physics-based method that can predict imminent large solar flares. Science 369(6503):587–591.

  36. Lee K, Moon Y-J, Lee J-Y, Lee K-S, Na H (2012) Solar flare occurrence rate and probability in terms of the sunspot classification supplemented with sunspot area and its changes. Solar Phys 281:639–650.

  37. Leka KD, Barnes G, Wagner E (2018) The NWRA classification infrastructure: description and extension to the Discriminant Analysis Flare Forecasting System (DAFFS). J Space Weather Space Clim 8:A25.

  38. Leka KD et al (2019) A comparison of flare forecasting methods. II. Benchmarks, metrics, and performance results for operational solar flare forecasting systems. Astrophys J Suppl Ser 243:36.

  39. Lemen J et al (2012) The atmospheric imaging assembly (AIA) on the solar dynamics observatory (SDO). Solar Phys 275:17–40.

  40. Li X, Zheng Y, Wang X, Wang L (2020) Predicting solar flares using a novel deep convolutional neural network. Astrophys J 891:10.

  41. Liu YD, Luhmann JG, Kajdič P, Kilpua EKJ, Lugaz N, Nitta NV, Möstl C, Lavraud B, Bale SD, Farrugia CJ, Galvin AB (2014) Observations of an extreme storm in interplanetary space caused by successive coronal mass ejections. Nat Commun 5:3481.

  42. Liu H, Liu C, Wang JTL, Wang H (2019) Predicting solar flares using a long short-term memory network. Astrophys J 877:121.

  43. McCloskey AE, Gallagher PT, Bloomfield DS (2016) Flaring rates and the evolution of sunspot group McIntosh classifications. Solar Phys 291:1711–1738.

  44. Möstl C et al (2015) Strong coronal channeling and interplanetary evolution of a solar storm up to Earth and Mars. Nat Commun 6:7135.

  45. Muranushi T, Shibayama T, Muranushi YH, Isobe H, Nemoto S, Komazaki K, Shibata K (2015) UFCORIN: a fully automated predictor of solar flares in GOES X-ray flux. Space Weather 13(11):778–796.

  46. Murphy AH (1993) What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather Forecast 8:281–293

  47. Murray SA, Bingham S, Sharpe M, Jackson DR (2017) Flare forecasting at the Met Office Space Weather Operations Centre. Space Weather 15(4):577–588.

  48. Nishizuka N, Sugiura K, Kubo Y, Den M, Watari S, Ishii M (2017) Solar flare prediction model with three machine-learning algorithms using ultraviolet brightening and vector magnetograms. Astrophys J 835(2):156.

  49. Nishizuka N, Sugiura K, Kubo Y, Den M, Ishii M (2018) Deep Flare Net (DeFN) model for solar flare prediction. Astrophys J 858(2):113.

  50. Nishizuka N, Kubo Y, Sugiura K, Den M, Ishii M (2020) Reliable probability forecast of solar flares: Deep Flare Net-Reliable (DeFN-R). Astrophys J 899:150.

  51. Pagano P, Mackay DH, Yardley SL (2019) A new space weather tool for identifying eruptive active regions. Astrophys J 886:81.

  52. Panos B, Kleint L (2020) Real-time flare prediction based on distinctions between flaring and non-flaring active region spectra. Astrophys J 891:17.

  53. Park E, Moon Y-J, Shin S, Yi K, Lim D, Lee H, Shin G (2018) Application of the deep convolutional neural network to the forecast of solar flare occurrence using full-disk solar magnetograms. Astrophys J 869:91.

  54. Park S-H et al (2020) A comparison of flare forecasting methods. IV. Evaluating consecutive-day forecasting patterns. Astrophys J 890(2):124.

  55. Pesnell WD, Thompson BJ, Chamberlin PC (2012) The solar dynamics observatory (SDO). Solar Phys 275:3–15.

  56. Richardson DS (2000) Skill and relative economic value of the ECMWF ensemble prediction system. Quart J R Meteorol Soc 126:649.

  57. Scherrer PH et al (2012) The helioseismic and magnetic imager (HMI) investigation for the solar dynamics observatory (SDO). Solar Phys 275:207–227.

  58. Schou J et al (2012) Design and ground calibration of the helioseismic and magnetic imager (HMI) instrument on the solar dynamics observatory (SDO). Solar Phys 275:229–259.

  59. Schwenn R, dal Lago A, Huttunen E, Gonzalez WD (2005) The association of coronal mass ejections with their effects near the Earth. Ann Geophys 23:1033–1059.

  60. Shibata K, Magara T (2011) Solar flares: magnetohydrodynamic processes. Living Rev Solar Phys 8:6.

  61. Shih FY, Kowalsky AJ (2003) Automatic extraction of filaments in Hα solar images. Solar Phys 218:99–122.

  62. Steward GA, Lobzin VV, Wilkinson PJ, Cairns IH, Robinson PA (2011) Automatic recognition of complex magnetic regions on the sun in GONG magnetogram images and prediction of flares: techniques for the flare warning program Flarecast. Space Weather 9:S11004.

  63. Tashman LJ (2000) Out-of-sample tests of forecasting accuracy: an analysis and review. Int J Forecasting 16(4):437–450.

  64. Toriumi S, Wang H (2019) Flare-productive active regions. Living Rev Solar Phys 16:3.

  65. Varma S, Simon R (2006) Bias in error estimation when using cross-validation for model selection. BMC Bioinform 7:91.

  66. Wang H, Liu C, Ahn K, Xu Y, Jing J, Deng N, Huang N, Liu R, Kusano K, Fleishman GD, Gary DE, Cao W (2017) High-resolution observations of flare precursors in the low solar atmosphere. Nat Astron 1:0085.

  67. Wang J, Zhang Y, Hess W, Shea A, Liu S, Meng X, Wang T (2020a) Solar flare predictive features derived from polarity inversion line masks in active regions using an unsupervised machine learning algorithm. Astrophys J 892:140.

  68. Wang X, Chen Y, Toth G, Manchester WB, Gombosi TI, Hero AO, Jiao Z, Sun H, Jin M, Liu Y (2020b) Predicting solar flares with machine learning: investigating solar cycle dependence. Astrophys J 895:3.

  69. Yi K, Moon Y-J, Shin G, Lim D (2020) Forecast of major solar x-ray flare flux profiles using novel deep learning models. Astrophys J Lett 890:L5.

  70. Zheng Y, Li X, Wang X (2019) Solar flare prediction with the hybrid deep convolutional neural network. Astrophys J 885:73.

Acknowledgements


We thank all members of JSOC of Stanford University for their support and allowing us to use the SDO NRT data. The data used here are courtesy of NASA/SDO, the HMI & AIA science teams, JSOC of Stanford University, and the GOES team.


This work was partially supported by JSPS KAKENHI Grant Number JP18H04451 and NEDO. A part of these research results was obtained within “Promotion of observation and analysis of radio wave propagation”, commissioned research of the Ministry of Internal Affairs and Communications, Japan.

Author information




NN, YK, and KS developed the model. NN analyzed the data and wrote the manuscript. MD and MI participated in discussing the results. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Naoto Nishizuka.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit




Cite this article

Nishizuka, N., Kubo, Y., Sugiura, K. et al. Operational solar flare prediction model using Deep Flare Net. Earth Planets Space 73, 64 (2021).



  • Solar flares
  • Space weather forecasting
  • Prediction
  • Operational model
  • Deep neural networks
  • Verification