Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition
© The Society of Geomagnetism and Earth, Planetary and Space Sciences (SGEPSS); The Seismological Society of Japan; The Volcanological Society of Japan; The Geodetic Society of Japan; The Japanese Society for Planetary Sciences; TERRAPUB 2010
Received: 16 September 2010
Accepted: 21 November 2010
Published: 3 February 2011
We adopted the GPU (graphics processing unit) to accelerate the large-scale finite-difference simulation of seismic wave propagation. The simulation can benefit from the high memory bandwidth of the GPU because it is a “memory-intensive” problem. In the single-GPU case we achieved a performance of about 56 GFlops, which was about 45-fold faster than that achieved by a single core of the host central processing unit (CPU). We confirmed that the optimized use of fast shared memory and registers was essential for performance. In the multi-GPU case with three-dimensional domain decomposition, the non-contiguous memory alignment in the ghost zones was found to make the data transfer between the GPU and the host node very slow. This problem was solved by using contiguous memory buffers for the ghost zones. We achieved a performance of about 2.2 TFlops by using 120 GPUs and 330 GB of total memory: nearly (or more than) 2200 cores of host CPUs would be required to achieve the same performance. The weak scaling was nearly proportional to the number of GPUs. We therefore conclude that GPU computing for large-scale simulation of seismic wave propagation is a promising approach, as a faster simulation is possible with reduced computational resources compared to CPUs.
Simulation of seismic wave propagation is essential in modern seismology: the effects of irregular topography, irregular internal discontinuities, and internal heterogeneity on the seismic waveforms must be precisely modeled in order to probe the Earth’s and other planets’ interiors, to study earthquake sources, and to evaluate the strong ground motions due to earthquakes. Developing methods for large-scale, high-performance simulation is important because in real applications more than one billion grid points are required (e.g., Olsen et al., 2008; Furumura, 2009).
We describe here our approach to applying GPU computing to the large-scale simulation of seismic wave propagation based on the finite-difference method (FDM). First, we discuss the implementation of the core part of the simulation for the single-GPU case. Second, we discuss the multi-GPU case, which is needed to treat large-scale problems.
2. Single-GPU Case
We use the TSUBAME-1.2 grid cluster in the Global Scientific Information and Computing Center, Tokyo Institute of Technology. The processors of the host nodes are eight dual-core AMD Opteron 880 (16 cores per node, 2.4 GHz) and the interconnect is InfiniBand (10 Gbps). The GPUs installed in the cluster are NVIDIA Tesla S1070s (1.44 GHz): each S1070 has four GPUs, and each host node is connected to two GPUs in an S1070 via PCI-Express 1.0 x8 (i.e., two GPUs per node). We use NVIDIA CUDA C for the GPU programs. For the FDM program on the host CPU, we use PGI Fortran. Single-precision arithmetic is used in all of the computations on both the GPU and CPU. The theoretical peak single-precision performances are 9.6 GFlops for a single core of the host CPU and 1036 GFlops for a single GPU in the S1070 unit: in theoretical peak computation, the GPU is 108-fold faster than a single core of the host CPU. The memory bandwidths are 5.4 GB/s (gigabyte/s) for a single host CPU and 102 GB/s for a single GPU: the GPU is 19-fold faster in bandwidth than the host CPU.
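The two ratios quoted above follow directly from the hardware figures. The short sketch below recomputes them and, as an added illustration that is not part of the original analysis, also derives the peak flops available per byte of memory traffic on each device; the variable and function names are hypothetical.

```python
# Hardware figures quoted in the text for TSUBAME-1.2.
CPU_PEAK_GFLOPS = 9.6     # single core of the host CPU (single precision)
GPU_PEAK_GFLOPS = 1036.0  # single GPU in the S1070 (single precision)
CPU_BW_GBS = 5.4          # memory bandwidth of a single host CPU
GPU_BW_GBS = 102.0        # memory bandwidth of a single GPU


def fold(gpu_value, cpu_value):
    """Return how many times larger the GPU figure is than the CPU figure."""
    return gpu_value / cpu_value


peak_fold = fold(GPU_PEAK_GFLOPS, CPU_PEAK_GFLOPS)
bandwidth_fold = fold(GPU_BW_GBS, CPU_BW_GBS)

# Illustrative extra quantity (not from the text): peak flops per byte moved.
gpu_flops_per_byte = GPU_PEAK_GFLOPS / GPU_BW_GBS
cpu_flops_per_byte = CPU_PEAK_GFLOPS / CPU_BW_GBS

print(f"peak: {peak_fold:.1f}-fold, bandwidth: {bandwidth_fold:.1f}-fold")
print(f"flops per byte: GPU {gpu_flops_per_byte:.2f}, CPU {cpu_flops_per_byte:.2f}")
```

The ratios round to 108 and 19, matching the text. The flops-per-byte figures suggest why data reuse in fast on-chip memory matters more on the GPU than on the CPU: the GPU can perform far more arithmetic per byte fetched than its global memory can feed.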
In GPU computing, all data are stored in the global memory (Fig. 1). As described above, the bandwidth of the global memory is much higher than that of CPUs (e.g., 25.6 GB/s in the case of Intel Core i7). However, a memory latency of 400–600 clock cycles still occurs in transferring the data between the global memory and the multiprocessors. Thus, we must use the fast (but small) memories in the multiprocessors, i.e., the registers and the shared memory, as software-managed cache memories to reduce the amount of data transfer from the global memory. That is, we explicitly copy the data in a small part (a block) of the FDM domain from the global memory to the shared memory and registers, and we reuse the data stored in the shared memory and registers (e.g., Aoki, 2009). Also, the transfer rate is better for grouped memory transactions using blocks of the proper size than for serialized, one-by-one memory transactions.
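The block-wise reuse described above can be emulated on the host to see how it cuts global-memory traffic. The following is a minimal sketch, not the authors' CUDA kernel: it assumes a simple 2D five-point stencil with a one-point halo, uses a Python list as a stand-in for shared memory, and counts reads from the "global" array. All function names are hypothetical.

```python
def stencil_naive(u, nx, ny):
    """Five-point stencil; every operand is read from the 'global' array u."""
    out = [0.0] * (nx * ny)
    reads = 0
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            c = j * nx + i
            out[c] = u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx] - 4.0 * u[c]
            reads += 5
    return out, reads


def stencil_tiled(u, nx, ny, block):
    """Same stencil, but each block first copies its tile plus halo into a
    small local buffer (standing in for shared memory) and reuses it."""
    out = [0.0] * (nx * ny)
    reads = 0
    for j0 in range(1, ny - 1, block):
        for i0 in range(1, nx - 1, block):
            bj = min(block, ny - 1 - j0)  # rows updated by this block
            bi = min(block, nx - 1 - i0)  # columns updated by this block
            # One global read per tile element, including the one-point halo.
            tile = [[0.0] * (bi + 2) for _ in range(bj + 2)]
            for tj in range(bj + 2):
                for ti in range(bi + 2):
                    tile[tj][ti] = u[(j0 - 1 + tj) * nx + (i0 - 1 + ti)]
                    reads += 1
            # All stencil operands now come from the local tile.
            for tj in range(1, bj + 1):
                for ti in range(1, bi + 1):
                    out[(j0 - 1 + tj) * nx + (i0 - 1 + ti)] = (
                        tile[tj][ti - 1] + tile[tj][ti + 1]
                        + tile[tj - 1][ti] + tile[tj + 1][ti]
                        - 4.0 * tile[tj][ti]
                    )
    return out, reads


nx = ny = 34  # 32 x 32 interior points
u = [float((7 * i + 3 * j) % 11) for j in range(ny) for i in range(nx)]
naive_out, naive_reads = stencil_naive(u, nx, ny)
tiled_out, tiled_reads = stencil_tiled(u, nx, ny, 16)
print(naive_reads, tiled_reads)
```

For simplicity the sketch loads the full square tile including its corners, which a cross-shaped stencil does not need; a real kernel would typically skip them. Even so, the tiled version issues far fewer global reads while producing the same result.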
In the examples above, we have fixed the blocksize to 16 × 16 (Fig. 2(b)). When we reduced the blocksize to 2 × 2, the performance markedly decreased to 3.5 GFlops for the FDM domain of 384 × 384 × 384. The reason for this is that, for small blocksize, the access to the global memory increases and the memory transfer rate decreases. Thus, the optimized use of the fast memories is essential for performance.
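A rough count of global-memory loads per updated grid point illustrates why the small block is so costly. This is a sketch under stated assumptions, not a model of the actual kernel: it assumes each block loads its square tile plus a halo of width `halo` on every side, and the halo width of 2 used below is an assumption made for illustration only.

```python
def loads_per_point(block, halo):
    """Global loads per updated point for a block x block tile that also
    loads a halo of the given width on every side."""
    tile = block + 2 * halo
    return (tile * tile) / (block * block)


HALO = 2  # assumed stencil half-width, for illustration only
for block in (2, 4, 8, 16):
    print(block, loads_per_point(block, HALO))
```

Under these assumptions a 2 × 2 block loads 9 values per updated point, whereas a 16 × 16 block loads about 1.56, so the halo overhead alone accounts for a large part of the extra global-memory traffic. The loss of grouped memory transactions for very small blocks would add to this.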
3. Multi-GPU Case
We also overlap the communication and computation by using the asynchronous GPU-host data transfer function and non-blocking MPI functions (e.g., Abdelkhalek et al., 2009; Aoki, 2010; Ogawa et al., 2010): the former is used for data transfer between the GPU and the host concurrently with the computations on the GPU, and the latter for data transfer between the nodes concurrently with the computations. Since we adopt the 3D domain decomposition, we first perform the computation for the ghost points, not only on the top and bottom but also on the sides: 16 × 16 points (Fig. 2(b)) in the outermost blocks on the sides are processed. Second, we start the computation for the remaining internal grid points and the communication procedures simultaneously. For the asynchronous data transfer between the GPU and the host, we must use page-locked host memory (Fig. 1), which is not compatible with the MPI library on TSUBAME-1.2. Thus, we further copy the data to a (usual) memory buffer (“memcpy” in Fig. 1): this results in additional time on TSUBAME-1.2.
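With 3D decomposition, the ghost points on a side face are strided in memory, which is why they are gathered into a contiguous buffer before transfer. The sketch below shows that packing step on the host for one face. It is illustrative only: the flat layout with x as the fastest index, the one-point ghost width, and all function names are assumptions, and the actual GPU, page-locked memory and MPI transfers are not modeled.

```python
def idx(i, j, k, nx, ny):
    """Flat index of point (i, j, k); x is assumed to be the fastest index."""
    return i + nx * (j + ny * k)


def pack_x_face(field, i, nx, ny, nz):
    """Gather the strided face x = i into a contiguous buffer."""
    return [field[idx(i, j, k, nx, ny)] for k in range(nz) for j in range(ny)]


def unpack_x_face(field, buf, i, nx, ny, nz):
    """Scatter a contiguous buffer back onto the strided face x = i."""
    n = 0
    for k in range(nz):
        for j in range(ny):
            field[idx(i, j, k, nx, ny)] = buf[n]
            n += 1


nx, ny, nz = 6, 4, 3
left = [float(n) for n in range(nx * ny * nz)]  # left-hand subdomain
right = [0.0] * (nx * ny * nz)                  # right-hand neighbour

# The last interior column of `left` fills the ghost column x = 0 of `right`.
send_buf = pack_x_face(left, nx - 2, nx, ny, nz)
unpack_x_face(right, send_buf, 0, nx, ny, nz)
print(len(send_buf))
```

The single contiguous buffer can then be moved in one transfer, instead of many small strided ones. Faces normal to the slowest index are already contiguous in this layout and would not need packing.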
Table 1. Part-wise processing time (s) of the multi-GPU program.
The weak scaling, defined for a fixed subdomain size, is observed as the variation in performance among the cases with the smallest number of GPUs for each domain size. It was nearly proportional to N, i.e., near ideal (Fig. 7).
In Fig. 7, we also compared the performance of the GPU and CPU programs. The performance of the CPU program scaled with the number of cores up to eight but not beyond that (partly because we used OpenMP and 1D decomposition for the CPU program). Even if complete scalability beyond eight cores were assumed, nearly 2200 CPU cores would be required to obtain the same performance as that achieved by 120 GPUs.
Based on the above results, we conclude that GPU computing for the large-scale simulation of seismic wave propagation is a promising approach: faster simulation is possible with reduced computational resources compared to CPUs.
The strong scaling, defined as the performance variation for a fixed total FDM domain size, was not proportional to the number of GPUs, N, but appeared to be proportional to $N^{2/3}$. This (non-ideal) scalability was the result of long communication and computation times for the ghost points because the number of ghost points was proportional to the surface area of the subdomain, and the surface area was proportional to $N^{-2/3}$. Indeed, we measured the time for processing the side points (including ghost points), internal points, and the communications separately and determined that the communication time was the longest (Table 1).
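The exponents above can be checked with a simple geometric model. The sketch below assumes, for illustration only, a cubic total domain split into equal cubic subdomains with the same number of parts along each axis. It then estimates how the subdomain volume and surface area scale with the number of GPUs; the function names are hypothetical.

```python
import math


def subdomain_counts(total_side, parts_per_axis):
    """Return (number of GPUs, points per subdomain, surface points per
    subdomain) for a cubic domain split equally along each axis."""
    side = total_side // parts_per_axis
    n_gpus = parts_per_axis ** 3
    return n_gpus, side ** 3, 6 * side ** 2


def scaling_exponent(n_a, value_a, n_b, value_b):
    """Exponent e such that value is proportional to N**e between two cases."""
    return math.log(value_b / value_a) / math.log(n_b / n_a)


n1, vol1, surf1 = subdomain_counts(384, 1)
n8, vol8, surf8 = subdomain_counts(384, 2)
volume_exp = scaling_exponent(n1, vol1, n8, vol8)
surface_exp = scaling_exponent(n1, surf1, n8, surf8)
print(volume_exp, surface_exp)
```

In this model the work for internal points shrinks as $N^{-1}$ while the ghost-point work shrinks only as $N^{-2/3}$. If the surface terms dominate the run time, the performance grows as $N^{2/3}$, consistent with the observed strong scaling.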
As already pointed out by Aoki (2010) and Ogawa et al. (2010), the time for copying the data between the page-locked memory and the MPI buffer (“memcpy” time) was about one-third of the total communication time. As a result, in the above cases, the “memcpy” time was longer than the computation time for internal points. Nevertheless, the overlapping method remains important because technology that eliminates the “memcpy” procedure has recently been released. A peripheral bus and an interconnect faster than those adopted by the TSUBAME-1.2 cluster are now also available.
We also found that the time for side points was long: the memory mapping procedure between the ghost points in the shared memory and the contiguous memory buffer took a long time (Table 1). As a result, single-GPU performance fell by about 47% (from 55.8 GFlops for the single-GPU program to 29.7 GFlops for the multi-GPU program) due to the processing time for ghost point mapping. Further optimization is necessary for the mapping procedure.
In real applications, not only performance but also accuracy is important. We compared the waveforms computed by multi-GPUs (GPU-waveforms) and multi-CPUs (CPU-waveforms) for a case of 512 × 512 × 256 and 1500 time steps (7.5 s). The root mean square (RMS) residuals between the CPU- and GPU-waveforms, normalized by the RMS amplitude of the corresponding CPU-waveforms, were (2–9) × $10^{-6}$. Thus, we confirmed that the two sets of waveforms were almost identical.
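The misfit measure used above can be written compactly. The following is a minimal sketch of the normalized RMS residual between a reference (CPU) waveform and a test (GPU) waveform; the function name and the synthetic sample values are hypothetical and are not data from this study.

```python
import math


def normalized_rms_residual(reference, test):
    """RMS of (test - reference) divided by the RMS amplitude of reference."""
    if len(reference) != len(test):
        raise ValueError("waveforms must have the same number of samples")
    residual_power = sum((t - r) ** 2 for r, t in zip(reference, test))
    reference_power = sum(r * r for r in reference)
    # The 1/n factors of the two mean squares cancel in the ratio.
    return math.sqrt(residual_power / reference_power)


# Synthetic example: the test waveform differs by a relative factor of 1e-6.
cpu_wave = [1.0, -1.0, 0.5, -0.5]
gpu_wave = [x * (1.0 + 1.0e-6) for x in cpu_wave]
misfit = normalized_rms_residual(cpu_wave, gpu_wave)
print(misfit)
```

A value of the order of $10^{-6}$ is comparable to the relative precision of single-precision arithmetic, which is consistent with both programs using single precision throughout.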
The course on GPU computing held at Global Scientific Information and Computing Center, Tokyo Institute of Technology was quite helpful. We are grateful to Tsugunobu Nagai for supporting this research. Comments made by two anonymous reviewers were very helpful in improving the manuscript.
- Abdelkhalek, R., H. Calandra, O. Coulaud, J. Roman, and G. Latu, Fast seismic modeling and reverse time migration on a GPU cluster, International Conference on High Performance Computing & Simulation, 36–43, 2009.
- Aoi, S., N. Nishizawa, and T. Aoki, 3-D wave propagation simulation using GPGPU, Programme and Abstracts, Seismol. Soc. Jpn., 2009 Fall Meeting, abstract A12-09, 2009.
- Aoki, T., Full-GPU CFD applications, IPSJ Mag., 50(2), 107–115, 2009.
- Aoki, T., Multi-GPU Scalabilities for Mesh-based HPC Applications, SIAM Conf. Parallel Processing for Scientific Computing (PP10), Seattle, Washington, February 26, 2010.
- Cerjan, C., D. Kosloff, R. Kosloff, and M. Reshef, A nonreflecting boundary condition for discrete acoustic and elastic wave equations, Geophysics, 50, 705–708, 1985.
- Clayton, R. and B. Engquist, Absorbing boundary conditions for acoustic and elastic wave equations, Bull. Seismol. Soc. Am., 67, 1529–1540, 1977.
- Furumura, T., Large-scale simulation of seismic wave propagation in 3D heterogeneous structure using the finite-difference method, J. Seismol. Soc. Jpn. (Zisin), 61, S83–S92, 2009.
- Graves, R. W., Simulating seismic wave propagation in 3D elastic media using staggered-grid finite differences, Bull. Seismol. Soc. Am., 86, 1091–1106, 1996.
- Komatitsch, D., D. Michéa, and G. Erlebacher, Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA, J. Parallel Distrib. Comput., 69, 451–460, 2009.
- Komatitsch, D., G. Erlebacher, D. Göddeke, and D. Michéa, High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster, J. Comp. Phys., 229, 7692–7714, 2010.
- Michea, D. and D. Komatitsch, Accelerating a three-dimensional finite-difference wave propagation code using GPU graphics cards, Geophys. J. Int., doi:10.1111/j.1365-246X.2010.04616.x, 2010.
- Micikevicius, P., 3D finite-difference computation on GPUs using CUDA, in GPGPU-2: Proc. 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp. 79–84, Washington DC, USA, 2009.
- Ogawa, S., T. Aoki, and A. Yamanaka, Multi-GPU scalability of phase-field simulation for phase transition—5 TFlop/s performance on 40 GPUs, Trans. IPSS Japan, Advanced Computing Systems, 3, 67–75, 2010.
- Okamoto, T., H. Takenaka, and T. Nakamura, Computation of Seismic Wave Propagation With GPGPU, Programme and Abstracts, Seismol. Soc. Jpn., 2009 Fall Meeting, abstract P3–22, 2009.
- Okamoto, T., H. Takenaka, and T. Nakamura, Simulation of seismic wave propagation by GPU, Proceedings of Symposium on Advanced Computing Systems and Infrastructures, 141–142, 2010.
- Olsen, K. B., S. M. Day, J. B. Minster, Y. Cui, A. Chourasia, D. Okaya, P. Maechling, and T. Jordan, TeraShake2: Spontaneous rupture simulations of Mw 7.7 Earthquakes on the Southern San Andreas Fault, Bull. Seismol. Soc. Am., 98, 1162–1185, 2008.
- Takenaka, H., T. Nakamura, T. Okamoto, and Y. Kaneda, A unified approach implementing land and ocean-bottom topographies in the staggered-grid finite-difference method for seismic wave modeling, Proc. 9th SEGJ Int. Symp., CD-ROM Paper No. 37, 2009.