Evaluation of candidate geomagnetic field models for IGRF-12

The 12th revision of the International Geomagnetic Reference Field (IGRF) was issued in December 2014 by the International Association of Geomagnetism and Aeronomy (IAGA) Division V Working Group V-MOD (http://www.ngdc.noaa.gov/IAGA/vmod/igrf.html). This revision comprises new spherical harmonic main field models for epochs 2010.0 (DGRF-2010) and 2015.0 (IGRF-2015) and predictive linear secular variation for the interval 2015.0-2020.0 (SV-2010-2015). The models were derived from weighted averages of candidate models submitted by ten international teams. Teams were led by the British Geological Survey (UK), DTU Space (Denmark), ISTerre (France), IZMIRAN (Russia), NOAA/NGDC (USA), GFZ Potsdam (Germany), NASA/GSFC (USA), IPGP (France), LPG Nantes (France), and ETH Zurich (Switzerland). Each candidate model was carefully evaluated and compared to all other models and a mean model using well-defined statistical criteria in the spectral domain and maps in the physical space. These analyses were made to pinpoint both troublesome coefficients and the geographical regions where the candidate models most significantly differ. Some models showed clear deviation from other candidate models. However, a majority of the task force members appointed by IAGA thought that the differences were not sufficient to exclude models that were well documented and based on different techniques. The task force thus voted for and applied an iterative robust estimation scheme in space. In this paper, we report on the evaluations of the candidate models and provide details of the algorithm that was used to derive the IGRF-12 product.


Introduction
The International Association of Geomagnetism and Aeronomy (IAGA) released the 12th generation of the International Geomagnetic Reference Field (IGRF) in December 2014 . The IGRF is a series of standard mathematical models describing the largescale internal part of the Earth's magnetic field between epochs 1900 A.D. and the present (see for instance Finlay 2011 for a review andFinlay et al. 2010a for the preceding generation). It is used by scientists *Correspondence: erwan.thebault@univ-nantes.fr 1 University of Nantes, Laboratoire de Planétologie et Géodynamique de Nantes, UMR 6112-CNRS, Nantes, France Full list of author information is available at the end of the article in a wide variety of studies including long-term dynamics of the Earth's core field, space weather (e.g., Bilitza and Reinisch 2008), local magnetic anomalies imprinted in the Earth's crust, or land surveying. It is also used by commercial organizations and individuals as a source of orientation information for drilling or navigation (Meyers and Minor Davis 1990) and has become of increasing interest for space science during the last decade. A task force appointed by IAGA approved the specifications of IGRF-12 and issued a call in May 2014. The call requested candidate models for the main field (MF) for the Definitive Geomagnetic Reference Field for epoch 2010 (DGRF-2010), for a provisional IGRF model for epoch (IGRF-2015 both to spherical harmonic (SH) degree 13, and for a prediction of its annual rate of change, the secular variation (SV), over the upcoming 5 years (SV-2015(SV- -2020 to SH degree 8. The term "definitive" is used because any further substantial improvement of these retrospectively determined models is unlikely. In contrast, the provisional IGRF model will eventually be replaced by a definitive model in a future revision of the IGRF when the community will have more complete knowledge concerning the Earth's magnetic field for epoch 2015.0. Ten teams answered the call and submitted candidate models. They were led by the British Geological Survey (UK), DTU Space (Denmark), ISTerre (France), IZMIRAN (Russia), NOAA/NGDC (USA), GFZ Potsdam (Germany), NASA/GSFC (USA), IPGP (France), LPG Nantes (France), and ETH Zurich (Switzerland). Seven candidate MF models for the DGRF epoch 2010.0, ten candidate MF models for the IGRF epoch 2015.0, and nine SV models for the predictive part covering epochs 2015.0-2020.0 were submitted for evaluation in October 2014 (see Table 1). The number of institutions participating in IGRF-12 was larger than for any previous generation. This reflects cooperation between scientists involved in modeling the magnetic field, the institutions archiving and disseminating the ground magnetic field data, and the national and the various space agencies who distribute well-documented magnetic satellite data from the satellite missions. Modellers made extensive use of the data from the international network of ground geomagnetic observatories either directly, or indirectly, in the form of magnetic indices monitoring the level of magnetic activities. The candidate models were also primarily built using data from the German satellite CHAMP (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010), the Danish satellite Ørsted (1999-), the Argentine-U.S.-Danish satellite SAC-C (2001SAC-C ( -2013, and, especially for the epoch close to 2015.0, the European Swarm constellation (launched in November 2013; https://earth.esa.int/web/ guest/missions/esa-operational-eo-missions/swarm). The teams adopted a variety of data selection, processing, and modeling procedures. Details concerning the techniques used to derive the individual candidate models can be found in the papers appearing in this special issue. Some teams derived their candidate models from parent models describing the magnetic field over periods longer than requested by IGRF Gillet et al. 2015;Hamilton et al. 2015;Sabaka et al. 2015). Such models involve internal temporal parameterization using splines and external field parameterization of varying complexity and differing reference frames. Such models co-estimate the various source fields through a grand inversion and perform a mathematical separation of the field into its internal and external parts. Other teams focused their effort on dedicated internal candidate models for each of the epochs requested by the call, thus using data within windows centered on epochs of interest Lesur et al. 2015;Saturnino et al. 2015;Vigneron et al. 2015). This necessarily involved less complex parameterization in space and in time and sometimes more drastic data selection and pre-processing to minimize the effects of unwanted magnetic field contributions arising from external fields. In general, all candidate models rely on some geographical and/or iterative statistical weighting schemes to down-weight measurements in particular geographic regions or those poorly fit by the model. A special difficulty arises for the predictive SV part of the magnetic field for epochs 2015-2020. Forecasting the temporal evolution of the main geomagnetic field is no trivial matter. The geodynamo is a deterministic system with chaotic dynamics involving complex interaction of fluid flow and magnetic field within Earth's core. A consistent prediction would therefore require a correct mathematical description of these physical interactions which, despite important progress during the last decade, remains a frontier scientific subject. Some teams nevertheless considered a physics-based approach applying the tools of geophysical assimilation  or setting a priori hypothesis on the core flow (e.g., Gillet et al. 2015;Hamilton et al. 2015). Other teams (e.g., Alken et al. 2015;Finlay et al. 2015;Lesur et al. 2015;Saturnino et al. 2015) relied on simple analytical extrapolation assuming that the magnetic field will evolve linearly over the next 5 years. From October to early December 2014, some members of the task force and interested parties carried out evaluations of the candidate models submitted by the different teams. In the call for IGRF-12, the internal field ("main field") was requested to degree and order 13 for epochs 2010.0 and 2015.0. Some teams interpreted this as requesting all global scale fields whose sources were internal, while other teams interpreted it as referring only to contributions from the Earth's "core" field. The notion, of course, could be clarified by IAGA but this ambiguity helps to ensures that some candidate models are independent and aids in making the standard model valid on average for a wide range of disparate scientific applications. Such variety amongst the candidate models, however, often complicated the work of the evaluators. Assessment of candidate models was primarily based on statistical criteria. Some MF and SV models showed greater consistency than others. However, good statistical agreement between models does not necessarily mean that these models are the most realistic. It can also be a consequence of them using very similar data selection or modeling techniques. For this reason, the evaluators also relied on the companion descriptions of the candidate model submitted for evaluation and finally often on their empirical experience.
The first section of this paper summarizes the statistical criteria used by the task force members and the evaluators for the testing and the inter-comparison of the candidate models to IGRF-12. The results of this analysis that served as the basis for internal discussion are then detailed. The task force chair prepared a ballot paper containing a selection of weighting options that was voted on by the task force in December 2014. The last section gives some details of adopted weighting scheme. The resulting IGRF-12 coefficients were prepared and checked before being made available to the public through the IAGA Div V, WG V-MOD webpage http://www.ngdc.noaa.gov/IAGA/ vmod/igrf.html before 1 January 2015.

Mathematical definitions and criteria used in evaluations
We present the formulae employed by the task force members and the evaluators using the notation of Finlay et al. (2010b). The IGRF is a series of mathematical models of the internal geomagnetic field B(r, θ, φ, t) and its annual rate of change (secular variation -SV). On and above the Earth's surface, assuming μ = μ 0 and no local electric current, the magnetic field B is defined in terms of a magnetic scalar potential V by B = −∇V and where in spherical polar coordinates V is approximated by the finite series with r denoting the radial distance from the center of the Earth, a = 6371.2 km being the Earth's mean reference spherical radius conventionally used in geomagnetic modelling, θ denoting geocentric co-latitude, and φ denoting east longitude. The functions P m n (cos θ) are the Schmidt quasi-normalized associated Legendre functions of degree n and order m. The Gauss coefficients g m n , h m n are functions of time t and are conventionally given in units of nano-Tesla (nT).

• Difference between models
Evaluations often involve differences between a candidate model i whose coefficients are denoted by i g m n and i h m n and another model whose coefficients are labelled j and denoted by j g m n and j h m n . The differences between the coefficients of two such models denoted by i,j g m n = i g m n − j g m n and i,j h m are sometimes analyzed coefficient by coefficient. Due to linearity, they may also be used to compute the difference between the models' magnetic field components using the gradient of Eq. 1 in the spatial domain on a reference sphere, such as the Earth's mean radius r = a or the core mantle boundary at r = 3485 km.
• Weighted mean model Comparisons between candidate models are also often made with respect to their differences from the mean model estimated from the K candidate models. The simple arithmetic mean model is defined by the coefficients This special case corresponding to all candidate models having identical unit weight can be generalized. Each model i can be allocated a certain weight i w m n that varies with the degree n and order m. The Gauss coefficients of such a weighted mean model are then In previous generations of IGRF, the evaluation process led to some estimates of the individual weights i w m n in order to calculate weighted mean (Finlay et al. 2010b, for instance). We will discuss below some of the arithmetic weighted means suggested by some evaluators in the case of IGRF-12. In the following, we also compute the median model from the K candidate models that is expected to be less influenced by a small number of models showing strong discrepancies.
• Spherical harmonic power spectrum and total root mean square vector field The mean square vector field averaged over the spherical surface of a candidate model expanded in spherical harmonics (Eq. 1) per spherical harmonic degree i,0 R n is defined by (see for example , Lowes 1966;1974) i,0 R n = (n + 1) This so-called Lowes-Mauersberger geomagnetic power spectrum is also computed for the difference between two models i and j and denoted i,j R n such that where i,j g m n and i,j h m n are defined by Eq. 2. The power spectrum of Eq. 5 can be calculated on any spherical surface of radius r where the magnetic field is source free. Comparisons are then often carried out at the Earth's mean spherical surface r = a, which corresponds to the surface where the IGRF standard model is often employed by users. Summing the power spectrum i,j R n (Eq. 6) over degrees n from 1 to the truncation degree N of the model provides the mean square vector field at the altitude r. Then, taking the square root yields the root mean square (RMS) vector difference field between the models i and j In addition to calculating the RMS difference between two models i and j, we can define and compute the mean value of the RMS difference of the ith model to the (K − 1) other candidates such that In the following, we also estimate the RMS error due to the rounding of the candidate models. For a given precision p, each Gauss coefficient has a standard deviation due to the rounding error equal to p/ √ 12 (see also Lowes 2000). The RMS R p of the truncation error then equals (n + 1)(2n + 1).
• Azimuthal power spectrum When analyzing the difference between two individual candidate models, it is sometimes informative to compute the azimuthal rather than the Lowes-Mauersberger geomagnetic power spectrum of their difference. For this, we re-organize the square of the coefficients as a function of the azimuthal ratio az = m/n, which varies from 0 for purely zonal terms to 1 for sectoral terms (see Sabaka et al. 2004 for a description of the procedure) and we define az positive for the i,j g m n and negative for the i,j h m n model coefficients. This azimuthal power spectrum is denoted i,j Ra.

• Sensitivity matrix
One can also analyze the difference between two models (Eq. 2) normalized by the Lowes-Mauersberger geomagnetic power spectrum. This so-called sensitivity matrix S(n, m) expressed in percent for each degree n and order m is written • Spherical harmonic correlation Finally, two models may have systematic differences in amplitude causing large RMS but still be linearly correlated. The correlation per degree between two models i,j ρ n is the quantity (Langel and Hinze 1998, p. 81 which is, following the standard definition of the Pearson's correlation coefficient, the covariance between two models divided by the product of their standard deviation. It gives 1 for a full positive correlation, 0 for no correlation, and −1 for a full anticorrelation degree per degree.

Analysis of DGRF-2010 candidate models
The call for the DGRF-2010 model requested models describing the large-scale internal field up to SH degree 13 with Gauss coefficients quoted to 0.01 nT precision. Table 2 lists the seven candidate models received for evaluation for epoch 2010.0. The teams are identified with capital letters and major information about the data sources used and very brief comments concerning the various modeling approaches adopted are included. The teams derived their candidate from parent models (see Table 1 for the list of references) and evaluated these parent models at epoch 2010.0.
In Table 3, we present the RMS vector field differences i,j R (Eq. 7 in nT) between the individual candidate models i and j at the Earth's reference radius r = a. The two last columns show the RMS difference between each candidate model i and the simple arithmetic mean model M (Eq. 3) and to the median model M med . The requested precision of p = 0.01 nT leads to an estimate of the error due to truncation (Eq. 9) equal to Rp = 0.13 nT. All models show RMS differences well above this value showing the effect of the varying modeling strategies adopted by the different teams. Models A, B, E, and F are in closer agreement with the arithmetic mean and median models than models C, D, and G. The RMS difference between each model of this first group is also smaller than the RMS difference between models of the second. The final row of Table 3 gives the arithmetic means of the RMS vector field differences of i,j R of model i from the other models j confirm that A, B, E, and F have the smallest i R . Model C is the most discrepant model with a mean RMS difference to all models almost 1.8 larger than the model A to all models. These general features are confirmed by Fig. 1-left which presents the Lowes-Mauersberger spectra i,M R n (defined in Eq. 6) of the difference between DGRF-2010 candidate models and the simple arithmetic mean model M. The three models C, D, and G are noticeably different from the mean. Differences between the models are largest for the dipole term. The maximum deviation is found for the dipole term for model C, and large differences are visible for n = 5 for model D and for n = 3 for model G. The azimuthal power spectra i,j Ra of the difference between all candidate models and the mean model shown in Fig. 1-right provide additional information. All models differ for the axial dipole term (az = 0), and their spectra show two wings for az ±1. The increasing difference between all models for the sectoral coefficients illustrates a problem inherent to the available satellite magnetic field measurements. Because of rapid external field variations, the vector satellite data have correlated errors along the near polar circular satellite orbits that do not allow accurate recovery of the magnetic field components perpendicular to the flight direction. In addition, models using mostly scalar measurements in some geographical regions, for instance over the Earth's polar caps, may also be subject to the "Backus effect" (Backus 1970) that causes an ambiguity also on the sectoral Gauss coefficients. We are dealing here with features that are well known to modellers and evaluators and for which an objective evaluation is challenging. However, noticeable differences occur for models C, D, and G at positive azimuthal numbers (corresponding to i g m n coefficients) whose origin is not easy to determine.
Further information is provided by the analysis of the coefficient by coefficient difference i,j g m n as defined in Eq. 2 between each candidate model and the simple arithmetic mean model (a comparison with the median showed similar general features). The comparison is made here using a logarithmic scale to better highlight the differences  at low SH degrees. Figure 2-left illustrates that differences for all models are indeed comparatively larger for the zonal terms (m = 0) and more particularly for the axial dipole coefficient g 0 1 . This well-known feature is related to the difficulty in accurately separating the large-scale main internal and external magnetospheric fields given the available data (e.g., Olsen et al. 2010, Thébault et al. 2012. Interestingly, although models A, B, E, and F have small RMS differences to the mean, they differ more in their dipole coefficients than at other degrees. Model C has the strongest deviation for its g 0 1 term compared to other candidate models but otherwise compares reasonably well with the mean model from SH degrees larger than 2. This informs us that the comparatively large RMS displayed in Table 3 for this model is mostly due to a difference on the axial dipole coefficient. Candidate models D and G further show clear deviations at larger SH degrees such as for SH degree n = 5 and order m = 0 for model D and n = 3 and order m = 0 for model G. The spherical degree correlation i,M ρ n as defined in Eq. 11 between the DGRF-2010 candidate models and the arithmetic mean model M is shown in Fig. 2-right. All models correlate well up to SH degree 10. Again, the degree correlation of candidates C, D, and G to M above SH degree 10 is slightly lower than that of A, B, E, and F which appear close to M. The most noticeable differences occur for candidate C at degree 13 but with a correlation better than 0.99 illustrating the high level of correlation between all candidate models despite these statistical differences. A final analysis is provided by the differences between the candidate models and the simple arithmetic mean model M in space. In Fig. 3, we show the radial component of these differences calculated at the Earth's reference mean radius r = a. Visual inspection reveals some additional disparities between the candidate models. The residual fields for candidate models A, B, E, and F show  Left plot shows differences i,j g m n as defined in Eq. 2 between DGRF-2010 candidate models and their arithmetic mean model M as a function of the index of the spherical harmonic coefficient (running from g 0 1 , g 1 1 , h 1 1 , g 0 2 , h 1 1 , etc. indexed 1,2,3,4,5, etc). The vertical line locates the zonal coefficient (m = 0) for each SH degree n. Right plot shows the SH degree correlation i,M ρ n (Eq. 11) of DGRF-2010 candidate models with the arithmetic mean model M clear but weak dipolar signatures. The residuals for model A, B, E, and F are mostly positive in the northern hemisphere and negative in the southern hemisphere. The sign in the polar regions is different between the group of candidate models comprising C, D, and G and the group of candidate models comprising A, B, E, and F. The candidate model C involves the most striking deviations from M, locally as large as 20 nT. This large difference implies that the simple arithmetic mean model is probably biased towards model C for the dipole terms. This also illustrates one of the limitations of testing the candidate models against the simple arithmetic mean model M. Apart from these general large-scale dipolar structures, the residual maps show that the differences are not confined to any particular geographical location except for models D and G. For model D, we see a positive signature following the magnetic equator. We recall that model D is the only model that does not rely on data selection in time while other candidate models select measurements according to the local time to mitigate the leakage of the diurnal ionospheric magnetic field. Therefore, candidate model D may be contaminated by small scales with structures characteristic of the equatorial electrojet (EEJ) ionospheric field at noon local time. Model G has a large-scale structure that arises from the difference at SH degree 3 and order 0. Candidate model G is the only model among the candidates that seeks to estimate and then to explicitly remove the ionospheric night-time primary and induced contribution (Sabaka et al. 2015) that mostly contribute to zonal terms of SH degree 1 and 3. Nearly two-thirds of the variance between models G and M can be explained by the removal or inclusion of the night-time ionospheric induced field, respectively. Models D and G therefore rely on two different interpretations of how to best construct the IGRF for its disparate community of users.

Analysis of IGRF candidate models for epoch 2015
We carried out a similar analysis for the candidate models submitted to IGRF for epoch 2015.0 (Table 4). The call requested models to 0.1 nT precision, and nine candidate models were submitted. Further details about the candidate models are provided in the papers submitted to this special issue. The derivation of the main magnetic field candidate models for epoch 2015.0 was more challenging than for epoch 2010.0. Magnetic measurements to derive IGRF magnetic field models were not available close to epoch 2015.0 so that all candidate models rely on extrapolation schemes. Moreover, the launch of the European Swarm satellite mission in November 2013 was followed by a phase of data calibration and validation. The satellite magnetic field measurements were promptly made available by the European Space Agency to the IGRF members with useful documentation, but this left a relatively short amount of time for the individual teams to become familiar with this new dataset. Six out of the nine candidate models are dedicated models relying on data covering less than a year (D, E, F, H, I, and J) and three models (A, B, and C) relied on a continuous representation of the magnetic field over about a decade using measurements from observatories and previous satellite missions. The RMS differences between the candidates (   Table 2 for details) and the mean model M plotted at Earth's reference radius in Mollweide projection. Note that the color bar is saturated  less than 10 nT but models D and J show significantly larger deviations. Compared with the RMS to the mean model, the RMS to the median model are reduced for candidates A, B, E, F, and H suggesting a better agreement between these five models than between C, D, I, and J. Figure 4-left indeed confirms that models D and J show noticeable differences over all SH degrees and are statistically significantly different from the mean model. The model I compares better to the mean model for the SH degree 1-3 but the differences then increase with the degree and finally differs from the M model as much as models D and J at larger SH degrees. Model C shows a significant deviation for SH degrees 1-3 but then a better agreement to the mean model at larger SH degrees. A, B, E, F, and H models have comparable mean square difference to the mean model at all SH degrees except for H which deviates significantly for SH degrees 4-6. Models D, I, and J exhibit a saw-tooth behavior that could indicate the presence of aliasing or at least illustrate difficulties with noise being mapped into some model coefficients at  all SH degrees. The azimuthal power spectrum ( Fig. 4right) shows general differences that are larger in the zonal and the sectoral harmonics for all candidate models with some exceptions for models D and J. The differences for candidate models A, B, and F are clearly lower than for other candidate models at low SH orders. Candidate model A has, however, comparatively larger differences for the sectoral than for the zonal terms. On the contrary, the model H is in better agreement with the mean model at larger SH orders but agrees less well at low SH orders. The absolute coefficient by coefficient differences between the candidates and the mean model ( Fig. 5-left) shows that candidate model C is the most different for the axial dipole term. The largest difference in magnitude is for the coefficient h 1 1 of the candidate model J. Model H is close to the mean model for the zonal terms for SH degrees 2 and 3 but then increases for zonal terms for SH degrees 4-6, which explains the rise in the power spectrum of its difference to the mean model for these degrees (Fig. 4-left). All candidate models have differences decreasing in magnitude with the SH degree with the exception of model D which shows largest differences at all SH degrees and low orders. Figure 5-right indeed illustrates that model D has the smallest correlation to the Fig. 5 Left plot shows differences i,j g m n as defined in Eq. 2 between IGRF-2015 candidate models and their arithmetic mean model M as a function of the index of the spherical harmonic coefficient (running from g 0 1 , g 1 1 , h 1 1 , g 0 2 , h 1 1 etc indexed 1,2,3,4,5, etc). The vertical line locates the zonal coefficient (m = 0) for each SH degree n. Right plot shows the SH degree correlation i,M ρ n (Eq. 11) of IGRF-2015 candidate models with the arithmetic mean model M mean model at large degrees n with a most significant difference from SH degree 11, thus corresponding to small spatial scales. Models B, E, and F show the best correlation through all SH degrees.
The Fig. 6 compares the sensitivity matrix S(n, m) as defined by Eq. 10 of the difference between candidate model B and the mean model M (left) and also between the candidate model J and the mean model M (right). It shows that the mismatch between the coefficients of the candidate model B and the M is about 18 %. For model J, the mismatch is twice as large and about 40 % on average. All other candidate models show differences for each coefficients between these two extreme cases.
In the residual maps shown in Fig. 7 at the Earth's reference surface, we see that models B and F have the smallest deviation to the mean model with no specific spatial structures. Models E and I show large-scale residuals as anticipated by the Fig. 5-left indicating that most differences to the mean model are for coefficients of low SH degree. All other residual maps have different characteristics. Model A possesses low latitude anomalies linked to its anomalous sectoral harmonics (Fig. 4-right). For models C, E, and H, large-scale residual structures are noticeable with model C showing again a strong global opposite sign compared to other models. Models D, H, I, and J show also strong differences in the polar areas and to a lesser extent along the dip equator. Since these models were built upon Swarm measurements over less than a year, some of the polar regions may not yet have been sufficiently surveyed to separate the internal and external fields (especially for those models relying on night time data selections). The residual structures for candidate model D are prominent at low latitudes and a contamination of external ionospheric field is likely. Model J is strikingly different from all other candidate models. The residual map is peculiar and not easily explained from the description of the model parameterization or the data selection.

Analysis of IGRF-12 SV-2010-2015 candidate models
The final evaluation was for candidates submitted for the IGRF-12 average predictive SV for the interval 2015-2020. In this section, the Gauss coefficients g m n and h m n refer to the predicted annual average rate of change in the coefficients between epochs 2015.0 and 2020.0 and are now given in units of nT/yr. The call requested candidate models to 0.1 nT/yr precision and up to SH degree 8. Nine teams submitted SV-2015-2020 candidate models. Seven of these teams also submitted candidate models to DGRF-2010 or to IGRF-2015 (A, B, C, D, E, F, and I). Candidate models G and H were submitted by the same lead institutions (respectively NASA and IPGP) but prepared by independent teams in collaboration with different partners. The colors used for these two candidate models in the following line plots are thus different from the ones used in previous sections in order to avoid confusion. Important details about the team models submitted for SV are collected for reference in Table 6. Five of these models rely on mathematical extrapolation of parent models (B, D, E, F, and I). Two of them (models G and H) rely on geodynamo simulation based on a model of core dynamics and assimilation of magnetic field models in SH in order to obtain a SV forecast for the upcoming 5 years. The last two candidate models rely on statistical analyses of the main magnetic field variation over the past centuries and on  Table 4 for details) and the mean model M plotted at Earth's reference radius in Mollweide projection. Note that the color bar is saturated  core flow hypotheses to forecast the field (candidate models A and C). This is the first time that the call for IGRF candidate models received so many contributions relying on physical assumptions concerning the flow in the Earth's core.
The simple mathematical extrapolations are often poor predictors as was illustrated by Finlay et al. (2010b, their figure thirteen). For this reason, models B, D, E, F, and I propose candidate models to the SV centered on an epoch close to the available data that does not exceed 2015.0. Forecasting the field to more distant epochs is numerically easier with physically based models although they also have a limited horizon of predictability (for instance Lhuillier et al. 2011). The teams who favored the physically based approach therefore submitted candidates averaged over the upcoming 5 years and centered them on epoch 2017.5. These two families of candidate models, hereafter referred to as the mathematical and physical models therefore follow distinct philosophies. The mathematical models aim to better predict the main field changes for the next 1-2 years simply assuming that its rate of change will be statistically identical to the one observed in recent epochs. On the contrary, physical models hope to predict the field better on average over the full 5-year interval.  Table 7 shows that there is a much wider scatter of differences between the candidate models for SV-2015.0-2020 than between those for the main field at epochs 2010.0 and 2015.0. This reflects the difference between the two strategies adopted by the teams and the intrinsic difficulty in forecasting the change of the magnetic field. The mean RMS of the difference between the candidate and the simple arithmetic mean models is between 4.2 nT/yr for model B and 11.8 nT/yr for model G at the Earth's reference radius r = a. Mathematical models B, E, and F show less scatter about the mean and median models but there is no systematically better agreement with the other mathematical models. Models A and H, for instance, have lower RMS to the mean than models D and I, both belonging to the family of mathematical models (note that model D is not extrapolated but estimated in 2014.7; i.e, in September 2014). Therefore, mathematical and physically based models do not show complete internal consistency within their group, although the physical models in general show larger RMS between themselves than the mathematical models.
The power spectrum of the difference to the simple arithmetic mean model M (Fig. 8-left) shows that the largest difference for models (C, D, and G) are for the SH degrees 1-3. In particular, the model G is the most notably different model. Models B, E, and F have similar differences over all SH degrees. The difference to the mean model for other models do not have clear characteristics except maybe for model A, showing a large difference for SH degree 5, and for model I, comparing well to the mean model up to SH degree 4 but then showing stronger differences. Figure 8-right shows that the coefficients h m n of models C and G are particularly different from the mean model coefficients. The candidate model A shows differences that distribute over all SH degrees and orders. For the other models, we recognize the now familiar central and side increased differences in the azimuthal spectrum corresponding to difficulties in recovering the zonal and the sectoral terms, because of internal/external field separation and noise correlated along the satellite orbits. Table 7 suggests that models D and G show the largest differences. This is confirmed by Fig. 9-left showing the absolute value of the difference to the mean model coefficient by coefficient. The largest absolute differences of more than 5 and 3.5 nT/yr are seen on the coefficients h 1 1 and h 1 2 of model G. Note that model C also has strong anomalous values for these coefficients. Model D is dissimilar for the zonal term as anticipated by Fig. 8-right, in particular on coefficient g 0 2 . In spite of these differences, we learn from Fig. 9-right that candidate model A is the least correlated on average to the mean model with in particular a significant loss of correlation to the mean model M from SH degree 4. There appears to be no obvious way to choose between the candidates and to ascertain which are valid as the conclusions differ depending on the selected criterion. Therefore, it is instructive to look at the difference maps (Fig. 10). They confirm that few models are similar to each other. The best agreement is for models B, E, and F. This is to be expected because these three candidates rely on similar mathematical modeling procedures. Some mathematical models relying on Swarm magnetic field measurements only (D and I) show increasing differences in the polar regions possibly coming from the fact that the time window covered by the available data is too small to constrain accurately the SV up to SH degree 8. Models C and G exhibit two strong patches of opposite sign whose origin is unclear. For physically Fig. 8 Lowes-Mauersberger spectra i,M R n from Eq. 5 of the difference between the SV-2015-2020 candidate models and the simple arithmetic mean model M at the Earth's mean radius r = a. The right plot shows the azimuthal power spectrum i,M Ra of this difference Fig. 9 Left plot shows differences i,j g m n as defined in Eq. 2 between SV-2015-2020 candidate models and their arithmetic mean model M as a function of the index of the spherical harmonic coefficient (running from g 0 1 , g 1 1 , h 1 1 , g 0 2 , h 1 1 etc indexed 1,2,3,4,5, etc). The vertical line locates the zonal coefficient (m = 0) for each SH degree n. Right plot shows the SH degree correlation i,M ρ n (Eq. 11) of SV-2015-2020 candidate models with the arithmetic mean model M based models (A, C, G, and H), we expect some differences at small spatial scales as each of them is given as an average over the 5-year interval. This construction tends to generate candidate models with smoother small scales compared to mathematical models providing candidates as snapshots at a given epoch. Interestingly, the residual map for the SV candidate model A when compared to the mean model is similar in shape to the residual map that was observed between the IGRF-2015 candidate model A and the mean model (Fig. 7). This suggests that most of the observed difference to the arithmetic mean model for epoch 2015.0 was caused by the SV model used to forecast the main field from earlier epoch to epoch 2015.0. Similar correlations between the differences of IGRF-2015 and SV-2015-2020 candidate models to the arithmetic mean models are found for candidate models C, E, and I. As was the case for DGRF-2010 and IGRF-2015, there are clearly features in some models which are not in consensus with others, though it is impossible to ascertain which are valid.

Weighting scheme applied to derive the IGRF-12 models
The analyses described above were used by the task force members in order to inform the construction of the IGRF-12 models out of the K candidate models i x = i g m n , i h m n submitted by the different teams for DGRF-2010, IGRF-2015, and SV-2015-2020. The mathematical problem is the following. We wish to average over the i x so as to obtain the most accurate estimate of x 0 , the "true" field. Assuming that the i x contains only random errors and given no other information about their relative random errors, the standard least squares (LS) is the best linear estimator. It assumes that each of the K models are samples from the same probability density function. Minimizing the functional where x 0 is the estimate of the true value x 0 , is the arithmetic mean in which each i x is given equal weight. The variance σ 2 , the mean-square error of the i x about x 0 , can then be estimated. However, when some of the i x's have different, possibly non-normal, errors other methods need to be considered. For the previous generation of IGRF Finlay et al. 2010a, fixed weights were assigned to each candidate model based on information collected from evaluation criteria similar to those above, as described in Finlay et al. (2010b). The problem then became to define weights i ω to be included when minimizing In the past, the weights have been allocated via a procedure whereby one first identified those models showing the smallest scatter about the arithmetic mean of the candidate models. Models showing the largest variance about the mean were either rejected or down-weighted, and the arithmetic mean model was recalculated from the remaining weighted models until some kind of convergence. The procedure is similar to an iteratively reweighted least squares (IRLS) approach in that the weights depend on the residuals, the residuals on the estimated arithmetic mean coefficients, and finally the estimated coefficients upon  Table 6 for details) and the mean model M plotted at Earth's reference radius in Mollweide projection. Note that the color bar is saturated the weights. In practice, this manual procedure permitted the identification of a group of models having much smaller variance than the others and a simplification was to give this group unit weight and most of the others zero weight. Following a similar approach this time, some task force members again proposed fixed weights for the coefficients and discussed more specifically those provided in Table 8 for DGRF-2010, for IGRF-2015, for the predictive SV-2015-2020. We reproduce the proposed weights here for the purpose of discussion and for comparisons.
For some of the candidate models submitted for the IGRF-12, systematic deviation to the weighted arithmetic mean could be understood in the light of the model descriptions as coming from the scientific choices made during their construction. As a result, a majority of the task force thought that the internal discrepancies between different groups of models were not sufficient to reject any of the models.
The problem of estimating an unbiased weighted mean from K candidate models was posed during the evaluation procedure. As a result, an approach relying on the IRLS using robust weights was devised in an attempt to accommodate contributions from all candidate models in the final IGRF-12 models. One general difficulty that cannot be overcome in optimization is to choose a priori a realistic probability density function for the error distribution and an estimation of its variance. From the analyses above, the IGRF-12 task force concluded that the candidate models at epoch 2010.0, 2015.0 and their predictive parts for 2015.0-2020.0 have in general common structures in the spectral or/and physical domains so that most, but not all, of them may reasonably be considered as samples of the true Earth's magnetic main field. In this light, it was suggested that for simplicity, one should assume a common global error distribution that would follow the normal law in its central region but that a small number of rather different models cause the distribution to be longer tailed. Following a vote by the IGRF-12 task force, and after some debate, the weights entering the IRLS calculation were allocated by a hypothesis on the error distribution that is Table 8 Fixed weights proposed (but not finally allocated) during evaluation procedure to each candidate model for DGRF-2010, IGRF-2015, and SV-2015-2020 based on RMS analyses (see Tables 3, 5, and 7). The symbol "-" indicates that no candidate model was available where is the departure from the best estimation of the mean model. The constant c has to be chosen as a compromise between a Laplace distribution (obtained when c = 0) and a Gaussian distribution (obtained when c −→ ∞). Following a suggestion described by Finlay et al. (2010b), the applicability of the IRLS using Huber weights on the candidate model coefficients, thus treating the set of K values for each coefficient g m n (or h m n ) independently, was considered. In this case, the problem involves minimizing for each degree n and order m This provided more significant weights to the coefficients of each candidate models in agreement with residuals observed between the mean and the candidate models in both the spectral and physical domains. However, as already mentioned by Finlay et al. (2010b), this form of IRLS treats each spherical harmonic coefficient g m n (or h m n ) as independent and thus neglects possible correlation between Gauss coefficients of a single candidate model. It was argued that an application of the Huber weighting in space would be more appropriate since the IGRF is mainly used for mapping purposes.
The field B i,p was calculated for each candidate model i x at each points p of a uniform grid over a sphere of radius r = a. These synthetic values i B p were then treated as were the x i in Eq. 13 and the weights i ω k , p were estimated numerically by IRLS, where k = 1, 2, 3 is the index corresponding to each of the three components of the magnetic field. The problem was to minimize the cost function with i k,p = i B k,p − B k,p whose solution in algebraic form is defined at iteration it + 1 by where A is the SH design matrix (with denoting its transpose) computed on the geographic coordinates of the nodes of the regular grid and B the vector of magnetic field components on all nodes computed from the K models. We therefore had K values of the magnetic field on each of the nodes of the regular grid. The number of uniformly distributed nodes was set to P = 5000, a number that guarantees a nearly perfect numerical orthogonality of the SH up to degree and order 13, which is the maximum current truncation of the series of SH (Eq. 1) of the IGRF-12 models. At each iteration it, the matrix of weights for position p on the reference sphere, component k of the magnetic field of the model k was updated following Huber with the normalized absolute value of the error i k,p = i k,p /σ it and σ it the standard deviation estimated in a robust way at each iteration it using its approximate relationship with the mean absolute deviation (MAD) The tuning factor c was set equal to 1.345. The Huber weights were computed in space for all K values of each of the vector components estimated using all candidate models on the P grid points uniformly distributed on the sphere. This analysis was carried out for the candidate models for DGRF-2010, for IGRF-2015, and for SV-2015-2020. Figures 11, 12 and 13 show the weights that were allocated by the algorithm to each of the models on the radial component of the magnetic field computed at the Earth's mean radius. For epoch 2010.0, it shows how the robust weighting scheme down-weights parts of the models C, D, and G. Some of the strongly down-weighted features correlate well with the spatial differences between the candidate models and the arithmetic mean model (Fig. 4). Similar conclusions are reached for IGRF-2015. Models C, D, and J are the most strongly down-weighted in space and for other models, Huber weights automatically give lower weights in polar areas. Models B and E receive weight almost equal to 1 everywhere and are the most similar to the mean Huber weighted model. For SV-2015-2020, most of the discrepancies in space identified in Fig. 10 are down-weighted and spatial features common to all models receive equal weight. The Huber weights for the horizontal magnetic field components are equally consistent.

Discussion and conclusion
In the previous sections, we have described some of the statistical tests carried out by the IGRF-12 task force in order to evaluate candidate models for DGRF-2010, IGRF-2015, and SV-2015-2020. Evaluation results clearly illustrate that some models agree better amongst themselves than others. Investigating whether a specific model is flawed, however, is no trivial matter, and selfconsistency between some models was not always thought sufficient to exclude candidate models that are well documented and based on solid but different scientific choices. Most of the differences between models result from different choices of data selection, the removal of (or correction for) the disturbing fields of other sources, choice of analytical method and weighting, and physical hypotheses on the nature of the sources. We faced the situation where in general there was little uncertainty about the parameterization of the candidate models. ESA's Swarm satellite mission promises further insights concerning the leakage and contamination from different source fields; in particular regarding models describing the Earth's internal main field. Models of the various magnetic field sources will be derived throughout the Swarm mission's lifetime by the Swarm Satellite Constellation Application and Research Facility (SCARF) using both comprehensive and dedicated sequential approaches . The different source fields derived from the comprehensive description of the Earth's magnetic field  will then be compared to dedicated models for the main , magnetospheric (Hamilton 2013), ionospheric , and lithospheric ) fields using a common global dataset. The results of these intercomparisons will help the community to better identify those structures in the spectral and physical domains that are the most robust .
For the previous generation of the IGRF model (Finlay et al. 2010b), the weighting approach for the DGRF and IGRF models involved giving most weight to the group of models that show smallest scatter about an appropriate mean, such as the simple arithmetic mean, and allocating zero weight to the others. For this generation of IGRF, applying the same philosophy would have led to the rejection of more than half of the candidate models, including all of the physically based candidate models to the predictive SV-2015-2020 (see the possible set of weights summarized Table 8 that were discussed by the task force). However, we know from past experience that the secular variation is not constant in time and can change rapidly on a time-scale of perhaps only 1 or 2 years as a result of rapid (e.g., Olsen et al. 2006;Lesur et al. 2008) or strong acceleration (e.g., Chulliat et al. 2010) thus making extrapolation short-sighted and a poor predictor at the end of the 5-year interval. Rejecting all magnetic field models incorporating recent advances in modeling capability, including predictive SV or for the internal induction parts, could lead to biased solutions estimated from candidate models merely relying on similar approaches. The IGRF-12 task force voted in favor (but not unanimously) to allocate Huber weights in space and to compute the Gauss coefficients using a robust iterative weighted least squares algorithm. This allowed the inclusion of all candidate models in the calculation of IGRF-12 but to down-weight the most dissimilar aspects of certain models in space. This was thought to be a compromise solution that would both encourage modeling improvements and keep IGRF activities at the forefront of magnetic field modeling. The robust weighting scheme in space is based on a well-defined distribution of misfit and associated statistical measures in order to obtain  Table 2 for details) the final weights. For epochs 2010.0, 2015.0, and 2015-2020, we see that models that were statistically different from the simple arithmetic mean still receive full weight in many geographical regions, mostly at mid-latitudes. Conversely, none of the candidate models that compared well to the arithmetic mean model are allocated full weight for all three components. The regions where all models are in good agreement are highlighted by the Huber weighing scheme in space, thus providing an interesting indicator of where each of the candidate models agreed. These regions show where the IGRF-12 constituent models are best constrained in space by all candidate models.
The Huber weighting in space is however not a technique without drawbacks. First, all models must be treated as a whole and the procedure cannot be applied when certain coefficients of a candidate model are wrong or  Table 4 for details) Fig. 13 Huber weight in space assigned to the radial component B r of each candidate model to the SV-2015-2020 (labeled with capital letter, see Table 6 for details) cannot be easily explained from a carefully evaluated scientific compromise. In such clearly identified situations, the manually defined fixed weighting scheme on the coefficients might be more appropriate and defensible. Secondly, this is a purely statistical approach that allows little control on the weights assigned numerically to the candidate models. It is therefore important to test the output against better controlled techniques. To do this, we verified that the models computed using the fixed weights (as was discussed among the task force, see Table 8) and the Huber weighted estimates were not far apart using all of the above defined criteria. This analysis can be summarized by the estimates of the RMS of their difference, which were found to be respectively