SIMPIC - Simple 1D PIC prototype¶

SIMPIC is extracted from [OOPD1] source code to demonstrate [StarPU] possibilities.

The code is a simple one-dimensional electrostatic code (meaning \(\nabla\times\mathbf{B}=\partial\mathbf{B}/\partial t \approx 0\) so that \(\mathbf{E}=-\nabla\phi\)), with self and applied electric fields E directed along coordinate x. There are no variations in y or z directions. The plane parallel problem consists of single (electron) species.

Description of SIMPIC¶

Particle in cell (PIC) is a particle-mesh method. This means that the phase space distribution of particles \(f(v)\), is represented by collocation of discrete computational particles, with each computational particle representing np2c real particles. Instead of computing the interactions between each particle, the particle density (and higher moments for some PIC methods) is interpolated (weighed) to the mesh. A field equation (Poisson’s equation) is solved using this information and the computed force is interpolated back to the particle which is used to update its velocity and position. This alternating field solve and particle push is repeated for each time step. For electrostatic PIC, Poisson’s equation is solved for potential and then numerically differentiated to determine the electric field.

SIMPIC workflow scheme functions.¶

Poisson solver¶

Poisson’s equation for electrostatics is applicable when the magnetic field is not present or not time varying (magnetostatic). In this case, the electric field may be written as the gradient of a scalar potential field, \(\phi\):

\[\mathbf{E}=-\nabla\phi\]

Gauss’s law then becomes:

\[\nabla\cdot(\epsilon\mathbf{E})=-\nabla\cdot(\epsilon\nabla\phi)=\rho\]

In this equation \(\rho\) is charge density, obtained from particle date (for a single species of particles, \(\rho=qn\), where \(q\) is the charge per particle and \(n\) is the number density of the species).

For meshes with a uniform value of \(\epsilon\), the Laplacian operator may be used:

Cartesian:

\[\nabla^2\phi=\frac{\mathrm{d^2}\phi}{\mathrm{d}x^2}\]

Cylindrical:

\[\nabla^2\phi=\frac{1}{r}\frac{\mathrm{d}}{\mathrm{d}r}\left (r\frac{\mathrm{d}\phi}{\mathrm{d}r}\right ) = \frac{\mathrm{d^2}\phi}{\mathrm{d}r^2} + \frac{1}{r}\frac{\mathrm{d\phi}}{\mathrm{d}r}\]

Spherical:

\[\nabla^2\phi=\frac{1}{r^2}\frac{\mathrm{d}}{\mathrm{d}r}\left (r^2\frac{\mathrm{d}\phi}{\mathrm{d}r}\right ) = \frac{1}{r}\frac{\mathrm{d^2}}{\mathrm{d}r^2}\left (r\phi \right ) = \frac{\mathrm{d^2}\phi}{\mathrm{d}r^2} + \frac{2}{r}\frac{\mathrm{d\phi}}{\mathrm{d}r}\]

In finite difference form for a uniform mesh:

Cartesian:

\[\nabla^2\phi_i=\frac{\phi_{i-1}-2\phi_{i}+\phi_{i+1}}{\Delta x^2}+\mathcal{O}(\Delta x^2)\]

Cylindrical:

\[\nabla^2\phi_i=\frac{\phi_{i-1}-2\phi_{i}+\phi_{i+1}}{\Delta r^2} + \frac{1}{r_i}\frac{\phi_{i+1}-\phi_{i-1}}{2\Delta r}+\mathcal{O}(\Delta r^2)\]

Spherical:

\[\nabla^2\phi_i=\frac{\phi_{i-1}-2\phi_{i}+\phi_{i+1}}{\Delta r^2} + \frac{2}{r_i}\frac{\phi_{i+1}-\phi_{i-1}}{2\Delta r}+\mathcal{O}(\Delta r^2)\]

For a mesh with variable \(\epsilon\), the left hand side of Poisson’s equation is more complex:

\[\nabla\cdot(\epsilon\nabla\phi)=-\rho\]

In one (radial) dimension, the gradient operator is of the same form regardless of coordinate system (Cartesian, cylindrical, spherical):

\[-(E_x)=\nabla\phi\Rightarrow\frac{\mathrm{d}\phi}{\mathrm{d}x}=\frac{\phi_{i+1}-\phi_{i-1}}{2\Delta x}+\mathcal{O}(\Delta x^2)\]

For one-dimensional planar meshes with uniform permivity \(\epsilon_0\), as studied here, the solution to Poisson’s equation yields a tridiagonal matrix where the potential at node \(i\) may be solved for in terms of nodes \(i+1\) and \(i-1\):

\[\phi_{i-1}-2\phi_{i}+\phi_{i+1}=-\frac{\rho_i(\Delta x)^2}{\epsilon_0}\]

\(\rho_i\) is the charge density at node \(i\) and is obtained from particle data. \(\Delta x\) is the mesh spacing. Boundary conditions are needed to close the system at the mesh points. In this study, ideal conductors were used (Dirichlet boundary conditions). oopd1 deals with all three 1D coordinate systems as described above with constant and variable permitivity as well several different types of boundary conditions.

Particle Push¶

The coordinates, \(x\), and velocities, \(v\), of the particles are initially prescribed. If only the force due to the electric field is considered, the non-relativistic equations of motion for particle \(j\) are:

\[\frac{\mathrm{d}x_j}{\mathrm{d}t}=v_j\]

\[\frac{\mathrm{d}v_j}{\mathrm{d}t}=\frac{qE(x_j)}{m}\]

\(x_j\) and \(v_j\) are the \(x\)-coordinate and component of velocity, respectively. \(q\) and \(m\) are the charge and mass of particle \(j\) (respectively), and \(E(x_j)\) is the \(x\)-component of electric field at particle \(j\).

Linear interpolation is used to find the electric field at particle \(j\) from the nodal values:

\[E(x_j)=E_{i+1}\cdot\frac{x_j-x_i}{x_{i+1}-x_i}+E_i\cdot\left (1-\frac{x_j-x_i}{x_{i+1}-x_i} \right )\]

The particle is assumed to be located between node \(i\) and \(i+1\) in this equation.

A discrete scheme used in PIC to advance the equations of motion is as follows:

\[v_{new}=v_{old}+\frac{qE}{m}\cdot\Delta t\]

\[x_{new}=x_{old}+v_{new}\cdot\Delta t\]

\(\Delta t\) is the time step used. If the position and velocity components are known at the same time, this is (forward) Euler’s method (1st order accurate). If the velocity is staggered half a time step with respect to position, this is the “leap-frog” (midpoint rule) method (2nd order accurate). The latter is acoomplished by initially advancing the velocity of all particles half a time step backwards using the initial field.

Once the new position of particle \(j\) is known, its charge is weighed to the mesh using the inverse of the formula used to gather the electric field to the particle. The resulting charge density is then used in the Poisson solve.

Accuracy Requirements for PIC¶

The basic discretization parameters for (explicit) PIC are mesh spacing, time step, and the number of particles per cell. For electrostatic PIC, the mesh spacing, \(\Delta x\), and time step, \(\Delta t\), must be determined from Debye length, \(\lambda_D\), and the plasma frequency, \(\omega_P\), respectively, for accuracy and stability:

\[\Delta x < \lambda_D=\left ( \frac{\epsilon_0 T}{en} \right )^{1/2}\]

\[(\Delta t)^{-1}\gg\omega_P=\left ( \frac{e^2n}{\epsilon_0 m} \right )^{1/2}\]

In these equations, \(T\) is the plasma temperature in volts, \(n\) is the plasma density, \(\epsilon_0\) is the permitivity of free space, \(e\) is the elementary charge, and \(m\) is the mass of the lightest species.

Violation of these conditions will result in inaccurate and possibly unstable solutions. Typically, \(\Delta x\approx 0.5\lambda_D\) and \(\Delta t\approx 0.1\omega_P^{-1}\).

Building¶

MPI C++ compiler is required to build and run SIMPIC.

git clone git@bitbucket.org:kosl/elmfireng.git
cd elmfireng
git fetch && git checkout feature/simpic
cd simpic
make
make html
xdg-open build/html/index.html

Running with MPI¶

./runsimpic.sh 2> run.log
mpirun -np 4 a.out -ppc 100 -ncpp 64 -nt 200 -dtfactor 0.001 -lhsv 25000
rank 0 of 4 processors
MPI version: 3.1, MPI_Wtime precision: 1e-09
t=1.12e-09, dt=5.6e-12, ntimesteps=200, density=1e+13, wp=1.79e+08,
np2c=3.91+08 6400 particles, ng=65, nc=64, L=1, Ll=0.25, dx=0.00390625,
area=1

Timings:
Particles time: 0.111255 [5.0%]
Fields time: 0.000620347 [0.0%]
MPI time: 0.154564 [6.9%]
Diagnostics time: 1.93947 [86.9%]
Initialization time: 0.00460355 [0.2%]

Total time: 2.23218 (2.21051 [99.0%])
569640 (1.14291e+07) particles/second

Diagnostics¶

Change to diagnosticsflag = true in simpic.cpp to get diagnostic .dat files of:

density

E electric field

phi potential

vxx (phase space)

nt is time evolution of number of particles per processor

File format for each processor is:

time relative_position value

Defining DEBUG symbol in global.h produces additional prints to stderr stream.

Note

Timings are not representative when defining DEBUG and diagnosticsflag = true.

Variables¶

NT=200  # of time steps
PPC=100 # of particles per cell
CPP=64  # of cell per process
DTFACTOR=0.001 #defines a fraction of dt vs. time steps from plasma frequency;
               #must be positive
NPROC=4 # of processors
LHSV=25000 #applied voltage on left-hand side; RHS is grounded;

density = 1.E13: density of electrons (\(n\))
epsilon = 8.85E-12: permitivity of free space (\(\epsilon_0\))
nc: number of cells per processor (CPP=64)
ng: number of gridpoints is always one more than number of cells. ng=nc+1
npart: number of (super)particles per processor = number of cells * number of particles per cell
L: total length of the system
q, m: elementary charge and electron mass
xl, xr: x position of left and right bounaries for each processor
Ll: length of the system for each processor
dx: length of each cell \(dx=Ll/nc\), where nc is number of cells per processor
nl: rank of the processor on the left. Process 0 has last process on the left in a cyclic manner.
nr: rank of the processor on the right. Last process has process 0 on the right in cyclic manner.
nlp: number of lhsbuf values to be communicated to the left processor.
nrp: number of rhsbuf values to be communicated to the right processor.
np2c: number of (electron) particles per cell = density * volume of each processor covers/number of particles per processor
wp: electron plasma frequency \(\omega_p^2 = \frac{n q^2}{\epsilon_0 m}\)
dtfactor=0.001: fraction of plasma frequency time \(\tau = 1/\omega_p\) to be used as
dt: time step \(\delta t=dtfactor\cdot\tau\)
pdata[2*nproc*npart]: phase-space data array of particle x positions and v velocities alternated. Index j is x position and j+1 is velocity.
lhsbuf[2*nproc*npart], rhsbuf[2*nproc*npart]: buffers of particle positions and velocities on the left and right.
narray[ng]: density array
phiarray[ng]: potential array
Earray[ng]: Electric field array
nback[ng]: not used. intert. backup.
scale: generic scaling of differences equation that usually multiplies with \(1/\Delta x\) or similar weighting.

Main loop¶

moveparticles(dt): Computes from the electric field E new velocity for each particle and moves it to new position. If the particles moves out of processor bounds [xl, xr] then it is moved to left or right hand side buffer for communication to neighboring processors.
communicateparticles(): Each processor sends particles to the left and right. Before communition of arrays starts, sizes are exchanged.
advancefields(dt): Tridiagonal matrix of Poisson equation \(\phi_{j+1}-2\phi_j+\phi_{j-1}=p_j\) is solved with Gaussian elimination [BiLa85] by each processor and then fields are communicated to all processors. Then Laplace equation is summed to get the potential \(phi\)-array. From \(\phi\) negative gradient gives E-array.

CPU StarPU¶

Converting MPI version of SimPIC to task based CPU version follows StarPU guidelines for MPI support.

Integrating StarPU in a Make Build System¶

We begin with adding StarPU MPI header to simpic.cpp

#include <starpu_mpi.h>

and modifying Makefile for compiler and linking flags

CFLAGS += $$(pkg-config --cflags starpumpi-1.3)
LDLIBS += $$(pkg-config --libs starpumpi-1.3)

debug:
     mpic++ -g $(CFLAGS) simpic.cpp message.cpp $(LDLIBS)

Not using the StarPU MPI support¶

For a start we can concentrate on the StarPU task breakdown, leaving original MPI communication intact. StarPU data management with Distributed Shared Memory (DSM) needs to be provided in a consistent manner through the MSI protocol (Modified / Shared / Invalid).

Examining moveparticles(Scalar dt) as our first codelet the data that is send over MPI in communicateparticles() is basically received into complete pdata phase-space vector from buffers lhsbuf and rhsbuf that are practically sending smaller number of particles to the lhs and rhs. Number of particles that left computational domain [xl, xr] of each processor is counted by nlp and nrp. Particles move according to E field defined at the grid points. Each processor is using ng grid points in Earray vector.

Parameters that do not change in communication within the codelet are left nl and right nr neighboring processor numbers, domain boundary [xl, xr] pair, time step size dt, ng grid points, and factors such as charge-mass qm and cell-size dx that are pre-computed for faster computation in subroutines. All these trivial parameters that are not handled by StarPU data management can be passed to the codelet in one structure as a pointer to the structure containing all codelet arguments.

So, instead of global variables

int nl, nr;

we encapsulate them into

struct cl_args_t {
  int nl, nr;
};

and create an instance of the struct

cl_args_t cl_args;

and replace nl with cl_args.nl and nr with cl_args.nr respectively everywhere in the code in order to compile. We gradually add more parameters in step by step manner. For example we move dt into

struct cl_args_t {
  Scalar dt;
  int nl, nr;
};

In the end all the necessary parameters are put into the struct

struct cl_args_t {
  Scalar qm;
  Scalar dt;
  Scalar dx;
  Scalar xl;
  Scalar xr;
  int nl, nr;
  int nlp, nrp;
};

Besides replacing, i.e. nl to cl_args.nl, we must also add the struct as parameter, since it will not be global anymore.

Now we focus on function void moveparticles. The initial signature takes Scalar dt as argument, but now we will change it to struct cl_args_t cl_args

- void moveparticles(Scalar dt)
+ void moveparticles(struct cl_args_t *cl_args)
  {
+   struct cl_args_t *cl_args = (struct cl_args_t *) _cl_args;

The calling of this function inside mainloop(Scalar tmax, Scalar dt) is changed

- moveparticles(dt)
+ moveparticles(&cl_args)

Now, since the pointer is passed to the function, we need to change the access to the variables inside the struct

- cl_args.qm
+ cl_args->qm

Note

The refactoring of cl_args should be complete in a way that it can be removed from global scope. For a start, when using CPU only codelet, all cl_args will be read-only and will not cause locking problems and transfers from main memory. When GPU codelet is introduced then cl_args` needs to removed from global scope and pass its parameters to subroutines from the codelet.

We can now start creating StarPU task for the function movepartparticles. First we need a struct starpu_codelet that contains information on what function to be called and if there are any buffers to send.

static struct starpu_codelet cl =
  {
    .cpu_funcs = { moveparticles },
    .cpu_funcs_name = { "moveparticles" },
    .nbuffers = 3, // Number of buffers (arrays) the function needs
    .modes = {STARPU_RW, STARPU_RW, STARPU_RW} // Read/Write modes for buffers
  }

Now we can create a task

In global.h we define a global task

struct starpu_task *task;

In simpic.cpp:

task = starpu_tash_create();
task->cl = &cl;
task->handles[0] = pdata_handle;
task->handles[1] = lhsbuf_handle;
task->handles[2] = rhsbuf_handle;
task->cl_arg = &cl_args; // Struct containing constants
task->cl_arg_size = sizeof(cl_args_t);
task->synchronous = 1; // Blocking task
task->destroy = 0; // Do not destroy the task after submitting because it is a repeatable one.

Before we submit the task we have to fix the signature of moveparticles

- void moveparticles(struct cl_args_t *cl_args)
+
+ void moveparticles(void *buffers[], void * _cl_args)
  {
   int i, j;
   Scalar E, x, v;
   struct cl_args_t *cl_args = (struct cl_args_t *) _cl_args;

+  Scalar *pdata = (Scalar *) STARPU_VECTOR_GET_PTR(buffers[0]);
+  Scalar *lhsbuf = (Scalar *) STARPU_VECTOR_GET_PTR(buffers[1]);
+  Scalar *rhsbuf = (Scalar *) STARPU_VECTOR_GET_PTR(buffers[2]);

The argument signature void *buffers[], void * _cl_args is default for all StarPU CPU task functions. To get a vector buffer you call STARPU_VECTOR_GET_PTR on the input argument buffers[n], where n is is the chosen before hand when defining a StarPU task.

Now we can submit the task inside the function mainloop

- moveparticles(&cl_args);
+ starpu_task_submit(task);

GPU CUDA¶

The CUDA version of SimPIC is basically the MPI version with some CPU functions refactored to GPU CUDA functions. The code was tested only for -np=1, i.e., only for MPI execution with one process.

Compiling options and Make Build System¶

The CUDA code must reside in files with the .cu extension, otherwise the code is compiled as normal C/C++ code by the nvcc compiler. For example, for some CUDA code in simpic.cu the SimPIC code can be build with the following Makefile:

CFLAGS ?= -lmpi -lcusparse

all:
     nvcc simpic.cu message.cpp $(CFLAGS)

debug:
     nvcc simpic.cu message.cpp $(CFLAGS)

clean:
     rm -f a.out *.o *.dat

GPU particle mover + field solver¶

In this documentation we are going to explain how we moved the SIMPIC Particle Mover and Field Solver to GPU. This will be a detailed description but we will avoid physics, discretization models and some CPU functions that were already in the CPU code to avoid an extremely large document.

First of all, we will see how is the structure of the code. The following table contains all the files that are contained in the SIMPIC GPU version folder.

Files in the GPU version of particle mover and field solver.¶
File	Includes	Description
`simpic.cu`	CUDA, standard and mpi libraries	Here is located Particle Mover in GPU.
`fields.cu`	CUDA libraries, cuSPARSE libraries	Contains Field Solver algorithms. in GPU.
`global.h`	None	SIMPIC global definitions.
`diagnostics.cpp`	None	file in the mainloop of the program.
`timing.cpp`	None	MPI tools for timing
`ostring.hpp`	Standard libraries
`dat_to_gnuplot.py`	None	Python script to convert outputs.
`Makefile`	None	Compiling instructions
`runsimpic.sh`	None	Run in command line: `./runsimpic.sh`

To accelerate the code we had to move simpic.cpp and fields.cpp to GPU. That was done, at the beginning, changing the extension to a CUDA file .cu. Also, we can see that files trimatrix.cpp, trimatrix.hpp and functions.hpp were removed. Those files were interfering in the GPU parallelization so the features that we needed from them were included in the CUDA files.

In summary, the main files that we have to focus on are simpic.cu, fields.cu and global.h. The rest of the files have minor changes or were unaltered. In figure 1 we can see the execution order for all the main files and we are highlighting where are those functions located and how are they executed (in CPU or GPU).

SIMPIC execution order in the main files. Shows GPU and CPU functions.¶

In this section we are going to explain the main changes in the files global.h, fields.cu and simpic.cu and how CPU calculations were moved to GPU. The main idea of parallelizing in GPU is that each GPU thread will perform an operation. In our case, for particle mover, each thread is going to compute operations for one particle and, for field solver, each thread will calculate values for one grid point. In this process, it is important to track the number of threads and blocks to be used as well as all the data copies from CPU to GPU and vice versa. Also, we have to initialise and take into account many handles, status and pointers to GPU features.

In the following subsections we will look at the definitions and aspects to perform a correct GPU parallelization of the original CPU code. For a complete understanding, we will start with the general definitions, after so we will see how the field solver GPU parallelization was made and finally the main program and particle mover GPU paralellization.

Initial changes (`global.h`)¶

Some initial definitions are required for the development of this code. This definitions are implemented in global.h. First of all, we need to define the data types that we are going to use. CUDA does not understand Scalar data types, so we move the main arrays to doubles:

//Particle mover arrays
double *pdata;
double *pdata_gpu;
bool *check_alive;
int npart;

// field arrays
double *Earray;
double *phiarray;
double *narray;
double *nback;
double *Earray_gpu;
double *narray_gpu;
double *d_a,*d_b,*d_c,*d_r;
int *nc_gpu;

//GPU values
int Nthreads_max;
int Nblocks_particles;
int Nblocks_fields;

cudaDeviceProp prop;

It is important to note that we have defined twice pdata (particle array with position and velocity for each particle), Earray (array of electric field values) and narray (density array). This is because we need a different data allocation in CPU and GPU. Initially, we should need a copy of phiarray in GPU but our version of the field solver does not need so (we will see why). Also we define the a CUDA structure cudaDeviceProp which will give us the information about our GPU device.

Also, we need a very important function called atomicAdd. This function allows every GPU thread to read a value at some address of global or shared memory, add it a value and write the result back to that address. So many threads can add a value to a certain pointer. This functions works for float data types in every GPU structure but it only computes double operations for CUDA architectures greater or equal than 600. So we have to implement an atomicAdd function for doubles that operates in every CUDA architecture. We do this with:

__device__ double atomicAdd_double(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);

    return __longlong_as_double(old);
}

With all this previous definitions we can move on the main structure of the code.

Field solver (`fields.cu`)¶

The field solver is aimed to calculate the electric field over the simulated space. The main difference with the particle mover is that we need a discretization of the spatial domain in a finite number of points, called grid points, where the electrostatic equations will be solved.

We will write the GPU version in the file fields.cu which has the extension .cu from CUDA. The original field solver script was written in fields.cpp.

First of all, we import the needed modules in the script:

#include <iostream>
#include <cuda.h>
#include<cuda_runtime.h>
#include<cusparse_v2.h>

We are including the basic CUDA libraries. cuda.h defines the public host functions and types for the CUDA driver API, cuda_runtime.h defines the public host functions and types for the CUDA runtime API as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions, click here for more info, and cusparse_v2.h is a CUDA library which contains algebraic routines for sparse matrices which will be useful at some point of the script, you can find more information in cuSPARSE.

Now we define a status and a handle for the cuSPARSE library:

cusparseStatus_t status;
cusparseHandle_t handle=0;

cusparseStatus_t is a data type that represents the status returned by the cuSPARSE library and cusparseHandle_t is a pointer to an opaque cuSPARSE context and will be initialized later before calling any function from this library. More about this data types can be found in the cuSPARSE documentation.

After so, we have to initialise or allocate the arrays:

/*Array allocation in CPU and GPU*/
void allocate_arrays(int dealloc=0)
{
  if(dealloc)
    {
      /*Delete CPU arrays*/
      delete [] narray;
      delete [] phiarray;
      delete [] Earray;

      /*Free GPU memory*/
      cudaFree(Earray_gpu);
      cudaFree(narray_gpu);
      cudaFree(nc_gpu);
      cudaFree(d_a);
      cudaFree(d_b);
      cudaFree(d_c);
      return;
    }

  /*Allocate GPU matrix arrays*/
  cudaMalloc((void**)&Earray_gpu, ng*sizeof(double));
  cudaMalloc((void**)&narray_gpu, ng*sizeof(double));
  cudaMalloc((void**)&nc_gpu, sizeof(int));
  cudaMalloc((void **)&d_a, ng*sizeof(double));
  cudaMalloc((void **)&d_b, ng*sizeof(double));
  cudaMalloc((void **)&d_c, ng*sizeof(double));

  /*Create CPU arrays*/
  narray = new double[ng];
  phiarray = new double[ng];
  Earray = new double[ng];
}

This function creates the CPU arrays and allocates memory in GPU for the arrays or values with the already defined data types. The function cudaMalloc does this job, which takes to arguments, the first one is a pointer to the allocated memory and the second one is the requested allocation size in bytes. We also have provided this function with a deallocation feature using cudaFree to liberate data from GPU at the end of the run.

We need to define the matrix coefficients for the given discrete scheme. This will be done in the first CUDA kernel:

/*Sets the matrix coefficients in each thread*/
__global__ void setcoeffs(Scalar scale, double *a,double *b,double *c, int *nc_gpu)
{
  /*the id of the thread*/
  int tid=blockIdx.x*blockDim.x+threadIdx.x;

  /*Each thread writes three values depending on its tid.
   a is lower diagonal, b diagonal, c upper diagonal*/
  if(tid == 0 )
  {
      a[tid] = 0.0;
      b[tid] = 1.0;
      c[tid] = 0.0;
    }
  else if (tid == *nc_gpu)
  {
      a[tid] = 0.0;
      b[tid] = 1.0;
      c[tid] = 0.0;
  }
  else if( (tid>0)&&(tid<*nc_gpu))
  {
      a[tid] = scale;
      b[tid] = -2.*scale;
      c[tid] = scale;
  }
}

This simple function will help us to explain how a CUDA kernel works. First of all, we see __global__ preceding the definition of the function. This prefix defines the function as a CUDA kernel which we will call from the host (CPU) side using special semantics (we will see how). Kernel functions are only executable in the device (GPU). We can also use the prefix __device__ to declare a device function which can only be called by other device or global functions but never from host.

GPUs have many processors and each one has many threads. In CUDA we say that threads are distributed in blocks. Many threads in many blocks will execute this kernel, so this function calculates the unique id of each thread tid and will fill the arrays values (associated with the tridiagonal matrix) according to this id. So, for example, if we have only 1 block and 1024 threads for 1000 grid points, the thread 567 will modify the a, b and c values at the position 567 while the thread 1001 will do nothing.

With all this functions, we will call the initialization of the fields from simpic.cu this is done by the following function

/*Initialization of all variables and arrays*/
void init_fields()
{
  /*Array allocation*/
  allocate_arrays();

  /*Setting initial values*/
  cudaMemcpy(nc_gpu, &nc, sizeof(int), cudaMemcpyHostToDevice);
  cudaMemset(Earray_gpu, 0, ng*sizeof(double));

  /*Definition of GPU threads and blocks for the correct
   behaviour of the field solver. Nthreads_fields may vary
   with hardware*/
  Nblocks_fields = ng/Nthreads_max +1;
  /*Set matrix coefficients*/
  setcoeffs<<<Nblocks_fields,Nthreads_max>>>(-epsilon/(q*dx*dx), d_a, d_b, d_c, nc_gpu);

  /*Create a cuSPARSE handle. Needed to call sparse functions*/
  status=cusparseCreate(&handle);
}

In this function host function we allocate the arrays in CPU and GPU, then we copy the value of the number of cells to GPU using cudaMemcpy and set the electric field values to 0 with cudaMemset. After so we define the number of blocks which will execute the global functions in the device. The number of threads is limited to a value (usually 1024) and depends on the hardware, this value will be initialised in simpic.cu.

After initialising and declaring values, we have the first summon of a function into GPU. We invoke the setcoeffs with special semantics, we declare the blocks and threads from GPU which will execute the function with <<<blocks, threads>>> and after so we pass the function arguments between parenthesis as usual.

Finally, we create the status with cusparseCreate() which initializes the cuSPARSE library and creates a handle on the cuSPARSE context.

We also move the sum of the Laplace solution and the gradient calculation to GPU with the global functions:

/*Sums the Laplace equation for rhs and lhs voltages*/
__global__ void sumLaplace_GPU(double *pphh, Scalar dx, Scalar rv, Scalar lv, Scalar xl, Scalar L, int ng)
{
  int tid=blockIdx.x*blockDim.x+threadIdx.x;
  Scalar frac, xlocal;

  if( (tid >= 0) && ( tid < ng)){
    xlocal = xl + tid*dx;
    frac = xlocal/L;
    pphh[tid] += (frac*rv + (1. - frac)*lv) ;
}

/*Computes the gradient in GPU*/
__global__ void gradient_GPU(double *grad, double *arr, int n, Scalar scale)
{
  int tid=blockIdx.x*blockDim.x+threadIdx.x;

  /*Intermediate points calculation*/
  if( (tid>0) && (tid < n))
  {
      grad[tid] = scale*(arr[tid+1] - arr[tid-1]);
  }

  /*Fix up end points*/
  if(tid==0)
  {
    grad[0] = 2*scale*(arr[1] - arr[0]);
  }
  if(tid==n)
  {
    grad[n] = 2*scale*(arr[n] - arr[n-1]);
  }
}

At the end, we have the main function which will calculate the evolution of the potential and field at a given time. This is a host function that will call the already defined global functions to be executed in the device.

/*Updates phi and E values in dt*/
void advancefields(Scalar ddt)
{
  starttime(FIELDS);

  /*Correct density array for computation*/
  starttime(MEMORY_GPU);
  cudaMemset(&(narray_gpu[0]), 0, sizeof(double));
  cudaMemset(&(narray_gpu[nc]), 0, sizeof(double));
  endtime(MEMORY_GPU);

  starttime(PERFORMANCE);
  /*Trimatrix solver
   *Function from cuSPARSE for trimatrix solution*
   *Overwrites n_array with phi values*/
  status=cusparseDgtsv(handle,nc,1,d_a,d_b,d_c,narray_gpu,nc);
  if (status != CUSPARSE_STATUS_SUCCESS)
  {
     std::cout << status << std::endl;
  }
  endtime(PERFORMANCE);

  /*Laplace correction and gradient calculation with values
   * that are already in GPU*/
  sumLaplace_GPU<<<Nblocks_fields,Nthreads_max>>>(narray_gpu, dx, rhsV(t), lhsV(t), xl, L, ng);
  gradient_GPU<<<Nblocks_fields,Nthreads_max>>>( Earray_gpu,  narray_gpu,  nc,  -0.5/dx);

  /*Copy data back to CPU*/
  if(diag_flag){
    starttime(MEMORY_GPU);
    cudaMemcpy(phiarray, narray_gpu, ng*sizeof(double),cudaMemcpyDeviceToHost);
    cudaMemcpy(Earray, Earray_gpu,ng*sizeof(double),cudaMemcpyDeviceToHost);
    endtime(MEMORY_GPU);
  }

  endtime(FIELDS);
}

At this point, we understand almost everything happening in this function. This is a host function that will set values to GPU, invoke global functions in the GPU and copy back data from it. The only new tool here is cusparseDgtsv() which is a cuSPARSE function for solving double tridiagonal matrices in GPU. With this function, we do not need to search and implement an algorithm for this calculation in GPU, because cuSPARSE already provides it. There are many types of function for this kind of purposes in this library, so you need to perform any basic algebra operation, it is always a good idea to check in the cuSPARSE library documentation for it.

CUDA Particle Mover¶

In the MPI version the particles subjected to the electric field are moved by the moveparticles function, which is replaced by moveparticlesGPU in the CUDA version.

We created a boolean array check-alive to flag particles dead when they go out of bound and alive otherwise. This has been defined in the global.h file as:

bool *check_alive;

This array needs to be allocated and initialized only inside the GPU, as it is not required by any of the CPU functions. This is done in the allocate-particles function in the simpic.cu file.

CUDA_ERROR(cudaMalloc((void**)&check_alive,(nproc+1)*npart*sizeof(bool)));
CUDA_ERROR(cudaMemset(check_alive,1,(nproc+1)*npart*sizeof(bool)));

All the particles are initially set to 1(alive). We now use this flag as an indicator of whether the particle positions/velocities need to be updated. This simplifies our algorithm for out of bound particles. If the particles goes beyond the boundaries, then the particle can simply be flagged as dead.

The kernel in CUDA is a special function defined by the __global__ prefix type. It executes blocks of threads in parallel, e.g., the kernel moveparticlesCUDA calculates the velocity pdata[index+1] and new position pdata[index] of every particle in parallel by executing every thread j which amounts to the number of particles npart in the simulation. It also calculates the array narray. Since this array is read and written to by many threads in parallel, a so called “race condition” occurs. To avoid this condition CUDA atomic functions such as atomicAdd_double as defined before are used. These functions are blocking in nature and hence can have an effect on the performance. The CUDA kernel is executed only for particles that are alive.

if(j<npart && alive[j]==true)  //kernel executed only if particle is alive
   {
     int index = 2*j;
     E = particleweight(pdata[index],Earray,xl,dx);
     pdata[index+1] += qm*E;
     pdata[index] += pdata[index+1]*dt;

     if(pdata[index]>xl && pdata[index]<xr)
     {
       weightparticleCUDA(pdata[index],qscale,narray,xl,dx);
     }
     else
     {
      alive[j]=false;  //out of bound particles flagged dead
    }

Particles are created in the GPU and transferred to CPU only when diagnostics are called. This can be seen in the allocate-particles function in simpic.cu file.

   CUDA_ERROR(cudaMalloc((void**)&pdata_gpu,(nproc+1)*2*npart*sizeof(double)));
   ...
   for(int i=0, j=0; i < npart; i++, j+=2)
  {
    pdata[j] = xl + ((i + 1)*(xr-xl))/Scalar(npart + 1);
    pdata[j+1] = 0.;
  }

CUDA_ERROR(cudaMemcpy(pdata_gpu, pdata, (nproc+1)*2*npart*sizeof(double), cudaMemcpyHostToDevice));

The allocation of GPU memory is done only once and the initialization is done in the CPU before being transferred to the GPU only once before the iteration main loop. The GPU works only on the particle array present in the device. The CPU does not have a copy of the updated velocities/positions, except when diagnostics is called.

Now only the field and density arrays need to be transferred each time, because these are updated by the CPU field solver each time. This can be seen in the moveparticlesGPU function in simpic.cu file.

//CPU-GPU transfers, only E array
          CUDA_ERROR(cudaMemcpy(Earray_gpu, Earray, ng*sizeof(double),cudaMemcpyHostToDevice));
       CUDA_ERROR(cudaMemset(narray_gpu,0,ng*sizeof(double)));

       int threadsPerBlock = 256;
       int blocksPerGrid = (npart + threadsPerBlock - 1)/threadsPerBlock;
       moveparticlesCUDA<<<blocksPerGrid,threadsPerBlock>>>(pdata_gpu,check_alive,Earray_gpu,narray_gpu,xl,xr,dx,dt,qm,qscale,npart);
     //Only density array transferred GPU-CPU
       CUDA_ERROR(cudaMemcpy(narray,narray_gpu,ng*sizeof(double),cudaMemcpyDeviceToHost));

The field/density arrays are generally 2-3 magnitudes smaller than the particle arrays and hence, this data transfer would take negligible time.

However, the CPU requires this data to perform diagnostics and file I/O operations. To minimize this as much as possible I have set a diagnostics flag in the mainloop function in simpic.cu file.

int diagnostics_counter = 0;

This counter can be set to a non-zero value to ask the CPU to perform diagnostics.

if(diagnosticsflag)
{
    CUDA_ERROR(cudaMemcpy(pdata, pdata_gpu, (nproc+1)*2*npart*sizeof(double),cudaMemcpyDeviceToHost));
    char msg[] = "";
    starttime(DIAGNOSTICS);
    print_diagnostics(msg);
    endtime(DIAGNOSTICS);
}

You can set the counter to a non-zero value every certain number of time step iterations. And only then will the particle data be transferred to the CPU.

Some helper CUDA functions (the equivalent of the CPU helper functions) were defined (__device__ prefix type) which are used by the kernel. A few variables allocated statically (__device__ prefix type) were also defined and are also used by the kernel.

/*Extrapolates values from particles to grid points.
 * Used to calculate density at the grid points*/
__device__ void weightparticleCUDA(double x, double weight, double *array, double xl, double dx)
{
  int n;
  float frac, xx;

  xx = (x-xl)/dx;
  n = int(xx);
  frac = xx-n;

  atomicAdd_double(&(array[n]),weight*(1.0 - frac));
  atomicAdd_double(&(array[n+1]),weight*frac);
}

/*Interpolates values for particle between two grid points.
 * Used to compute E field in particle position*/
__host__ __device__ double particleweight(double x, double *array, double xl, double dx, int nc)
{
  int n;
  double frac, xx;

  xx = (x-xl)/dx;
  n = int(xx);
  frac = xx-n;

  return frac*array[n+1] + (1.0 - frac)*array[n];
}

Main Loop (`simpic.cu`)¶

Now we are going to approach the main part of the code. This is located in simpic.cu as it was in the CPU file simpic.cpp. As always, we start including the libraries and files we need:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <stdarg.h>


#include "global.h"
#include <mpi.h>
#include "timing.cpp"
#include "diagnostics.cpp"
#include "fields.cu"

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

We have included some standard libraries, the rest of the files needed for the simuation and CUDA libraries. The only new CUDA library (which was not in fields.cu) is device_launch_parameters.h which provides us with some functions to get the information of the threads such as threadIdx.

Finally, we have the main SIMPIC functions:

void mainloop(Scalar tmax, Scalar dt)
{
  tt = 0.;
  int nsteps = 0;
  while(tt < tmax)
    {
      #ifdef DEBUG
      fprintf(stderr, "Completed %g %: t=%g, npart=%d \n",
           float(nsteps)/(float(ntimesteps))*100.,tt, npart);
      #endif

      nsteps+=1;
      nparttot += npart;
      diag_flag = ((nsteps % dg_steps == 0)|| (nsteps == ntimesteps-1));
      /*Recalculate particle positions, velocity and density*/
      moveparticlesGPU(dt);

      /*With new density recalculate E field*/
      advancefields(dt);

      /*Save diagnostics each dg_steps*/
      if( diag_flag )
      {
      starttime(DIAGNOSTICS);
      char msg[] = "";
      print_diagnostics(msg);
      endtime(DIAGNOSTICS);
      }

      tt += dt;
    }
}

int main(int argc, char **argv)
{
  /*Get CUDA info*/
  CUDA_ERROR(cudaGetDeviceProperties(&prop,0));
  printf("Found GPU '%s' with %g Gb of global memory, max %d threads per block, and %d multiprocessors\n", prop.name, prop.totalGlobalMem/(1024.0*1024.0), prop.maxThreadsPerBlock, prop.multiProcessorCount);
  Nthreads_max = prop.maxThreadsPerBlock;  //Max threads to be used
  CUDA_ERROR(cudaSetDevice(0));     //Set the CUDA device

  //Init MPI
  MPI_Init(&argc, &argv);
  tstart = MPI_Wtime();

  starttime(INIT);
  parsecmdline(argc, argv);     //Get input data

  init();   //Initialise

  #ifdef DEBUG
    printdata();
    std::cout<<"Max GPU threads per block "<<Nthreads_max;
    std::cout<<", using "<< Nblocks_particles<<" blocks for particles and ";
    std::cout<<Nblocks_fields<< " blocks for fields \n";
    std::cout<<"At least 1 thread is per particle is needed and 1 thread per grid point \n";
  #endif
  endtime(INIT);

  mainloop(t, dt);

  tend = MPI_Wtime();

  printtimes(stdout);

  Quit();

  return 0;
}

mainloop works exactly the same as in the CPU version and its easier to read due to the removal of MPI parallelization. At last, we have the main function, it is also a very straight forward function. We have to comment that we call a very useful CUDA function called cudaGetDeviceProperties, this functions returns an structure cudaDeviceProp that contains relevant information about the avalaible GPU such as the maximum threads per block and total global memory. We also have to call the function cudaSetDevice to set the device on which the active host thread should execute the device code.

And that was the way to paralellize and understand the CUDA implementation. We can see that it is not a hard work but it takes a lot of time debugging and depuration of the code to make it able and efficient to go.

StarPU notes¶

Building on VIZ¶

Before you can run the StarPU version of the code, you need to install the StarPU directory in your working directory as mentioned before.

wget https://files.inria.fr/starpu/starpu-1.3.4/starpu-1.3.4.tar.gz
tar xvf starpu-1.3.4.tar.gz
cd starpu-1.3.4
mkdir build
cd build
../configure --prefix=$HOME/starpu
make
make install

Building on taito-gpu cluster¶

wget http://starpu.gforge.inria.fr/testing/starpu-1.3/starpu-nightly-latest.tar.gz
tar xvf starpu-nightly-latest.tar.gz
cd starpu-1.3.1.99/
mkdir build
cd build
../configure --prefix=$HOME/starpu --without-hwloc

Building on Bison GPU node¶

wget http://starpu.gforge.inria.fr/testing/starpu-1.3/starpu-nightly-latest.tar.gz

# Other releases are available on http://starpu.gforge.inria.fr/files/
# Stable release can be used instead:
# wget http://starpu.gforge.inria.fr/files/starpu-1.3.3/starpu-1.3.3.tar.gz

tar xvf starpu-nightly-latest.tar.gz
mkdir starpu-1.3.3.99/build

cd starpu-1.3.3.99/build
module load cuda-10.1.243-gcc-8.3.0-qpdenfa
../configure --prefix=${HOME}/starpu --enable-openmp --enable-mpi-check
make -j 8
make install

Building on D.A.V.I.D.E. cluster¶

On [DAVIDE] cluster the following build procedure is recommended. One needs to configure [StarPU] on compute node and then compile the library on the login node due to missing Mellanox Hierarchical Collectives (hcoll) and Mellanox Messaging (mxm) libraries on compute nodes.

wget http://starpu.gforge.inria.fr/testing/starpu-1.3/starpu-nightly-latest.tar.gz
tar xvf starpu-nightly-latest.tar.gz
mkdir starpu-1.3.2.99/build

cd starpu-1.3.2.99/build
module load cuda/9.2.88 gnu/6.4.0 openmpi/3.1.0--gnu--6.4.0
module load szip/2.1.1--gnu--6.4.0 hdf5/1.10.4--openmpi--3.1.0--gnu--6.4.0

export OMP_NUM_THREADS=8
srun -N1 --time=03:00:00 --ntasks-per-node=16 --gres=gpu:4 -A FUAD3_SOLDyn3G -p dvd_fua_prod --pty bash

nvidia-smi topo -m
../configure --prefix=${HOME}/starpu --enable-maxcpus=512 --enable-maxcudadev=128 --enable-maxopencldev=128 --enable-maxnodes=32 --enable-maxbuffers=32 --with-hdf5-include-dir=${HDF5_INC} --with-hdf5-lib-dir=${HDF5_LIB} --enable-openmp
exit

make -j16
make install

When configuring is done the following important configuration options are printed and ready for make and installation.

configure:

     CPUs   enabled: yes
     CUDA   enabled: yes
     OpenCL enabled: yes
     MIC    enabled: no

     Compile-time limits
     (change these with --enable-maxcpus, --enable-maxcudadev,
     --enable-maxopencldev, --enable-maxmicdev, --enable-maxnodes,
     --enable-maxbuffers)
     (Note these numbers do not represent the number of detected
     devices, but the maximum number of devices StarPU can manage)

     Maximum number of CPUs:                     512
     Maximum number of CUDA devices:             128
     Maximum number of OpenCL devices:           128
     Maximum number of MIC threads:              0
     Maximum number of MPI master-slave devices: 1
     Maximum number of memory nodes:             32
     Maximum number of task buffers:             32

     GPU-GPU transfers: yes
     Allocation cache:  yes

     Magma enabled:     no
     BLAS library:      none
     hwloc:             yes
     FxT trace enabled: no

     Documentation:     no
     Examples:          yes

     StarPU Extensions:
            StarPU MPI enabled:                            yes
            StarPU MPI(nmad) enabled:                      no
            MPI test suite:                                no
            Master-Slave MPI enabled:                      no
            FFT Support:                                   yes
            Resource Management enable:                    no
            OpenMP runtime support enabled:                yes
            Cluster support enabled:                       no
            SOCL enabled:                                  yes
            SOCL test suite:                               no
            Scheduler Hypervisor:                          no
            simgrid enabled:                               no
            ayudame enabled:                               no
            Native fortran support:                        yes
            Native MPI fortran support:                    no
            Support for multiple linear regression models: yes

Verifying StarPU installation¶

To test some examples on compute node use:

srun -N1 --ntasks-per-node=16 --time=03:00:00 --gres=gpu:4 \
  -A FUAD3_SOLDyn3G -p dvd_fua_prod --pty bash
export LD_LIBRARY_PATH=${HOME}/starpu/lib:${LD_LIBRARY_PATH}
cd ${HOME}/starpu/lib//starpu/examples/

./add_vectors
[starpu][check_bus_config_file] No performance model for the bus, calibrating...
[starpu][benchmark_all_gpu_devices] CUDA 0...
[starpu][benchmark_all_gpu_devices] CUDA 1...
[starpu][benchmark_all_gpu_devices] CUDA 2...
[starpu][benchmark_all_gpu_devices] CUDA 3...
[starpu][benchmark_all_gpu_devices] CUDA 0 -> 1...
[starpu][measure_bandwidth_between_dev_and_dev_cuda] GPU-Direct 1 -> 0
[starpu][measure_bandwidth_between_dev_and_dev_cuda] GPU-Direct 0 -> 1
[starpu][benchmark_all_gpu_devices] CUDA 0 -> 2...
[starpu][benchmark_all_gpu_devices] CUDA 0 -> 3...
[starpu][benchmark_all_gpu_devices] CUDA 1 -> 0...
[starpu][measure_bandwidth_between_dev_and_dev_cuda] GPU-Direct 0 -> 1
[starpu][measure_bandwidth_between_dev_and_dev_cuda] GPU-Direct 1 -> 0
[starpu][benchmark_all_gpu_devices] CUDA 1 -> 2...
[starpu][benchmark_all_gpu_devices] CUDA 1 -> 3...
[starpu][benchmark_all_gpu_devices] CUDA 2 -> 0...
[starpu][benchmark_all_gpu_devices] CUDA 2 -> 1...
[starpu][benchmark_all_gpu_devices] CUDA 2 -> 3...
[starpu][measure_bandwidth_between_dev_and_dev_cuda] GPU-Direct 3 -> 2
[starpu][measure_bandwidth_between_dev_and_dev_cuda] GPU-Direct 2 -> 3
[starpu][benchmark_all_gpu_devices] CUDA 3 -> 0...
[starpu][benchmark_all_gpu_devices] CUDA 3 -> 1...
[starpu][benchmark_all_gpu_devices] CUDA 3 -> 2...
[starpu][measure_bandwidth_between_dev_and_dev_cuda] GPU-Direct 2 -> 3
[starpu][measure_bandwidth_between_dev_and_dev_cuda] GPU-Direct 3 -> 2
[starpu][benchmark_all_gpu_devices] OpenCL 0...
[starpu][benchmark_all_gpu_devices] OpenCL 1...
[starpu][benchmark_all_gpu_devices] OpenCL 2...
[starpu][benchmark_all_gpu_devices] OpenCL 3...
[starpu][check_bus_config_file] ... done
[starpu][initialize_lws_policy] Warning: you are running the default
lws scheduler, which is not a very smart scheduler, while the
system has GPUs or several memory nodes. Make sure to read the
StarPU documentation about adding performance models in order to be
able to use the dmda or dmdas scheduler instead.

[lkos0000@davide44 examples]$ ./add_vectors
[starpu][initialize_lws_policy] Warning: you are running the
default lws scheduler, which is not a very smart scheduler, while
the system has GPUs or several memory nodes. Make sure to read the
StarPU documentation about adding performance models in order to be
able to use the dmda or dmdas scheduler instead.

 [lkos0000@davide44 examples]$ ./hello_world
 [starpu][initialize_lws_policy] Warning: you are running the
 default lws scheduler, which is not a very smart scheduler, while
 the system has GPUs or several memory nodes. Make sure to read the
 StarPU documentation about adding performance models in order to
 be able to use the dmda or dmdas scheduler instead.
 Hello world
 (params = {1, 2.000000} )
 Callback function got argument 0x42

Building OpenMP example¶

cd examples/openmp
export PKG_CONFIG_PATH=${HOME}/starpu/lib/pkgconfig
gcc -fopenmp -O2 -g vector_scal_omp.c -o vector_scal_omp $(pkg-config --cflags starpu-1.0) $(pkg-config --libs starpu-1.0)
export LD_LIBRARY_PATH=${HOME}/starpu/lib:${LD_LIBRARY_PATH}
./vector_scal_omp

Rerefences

BiLa85

C.K. Birdsall, A.B Langdon, Plasma Physics via Computer Simulation, Adam Hilger press, 1991, p.446

OOPD1

https://ptsg.egr.msu.edu/#Software

StarPU(1,2)

Official webpage: http://starpu.gforge.inria.fr/

Documentation: http://starpu.gforge.inria.fr/doc/html/

DAVIDE

https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2%3A+D.A.V.I.D.E.+UserGuide

GPU StarPU¶

We also have integrated StarPU in SIMPIC. StarPU is task programming library for hybrid architectures which facilitates heterogeneous computing. So our main objective was merging the CPU code with the GPU code to make an all-in-one version.

StarPU Particle Mover Task¶

The aim here is to add a GPU task for the particle mover to the CPU version of the SIMPIC code. To set up the task, we need to identify the arrays which need to be read/written by the CUDA kernel. From our previous analysis we know that the following arrays are modified: pdata, check-alive, lhsbuf, rhsbuf, narray, Earray. These arrays need to be given their own data handles in StarPU which are basically pointers to the data arrays required by StarPU data management. This is done in the global.h file as:

Also, as per requirements for StarPU data management, we initialize a structure in global.h of all the parameters that are required but do not change during the code in a structure cl-args given by:

In the global.h, there is one final initialization to be done. We need to define a StarPU task for the particle mover. This is done as below.

As you can see the StarPU task is a C++ structure with certain arguments. We need to set these arguments to associate the task with a certain function and memory. These arguments have been initialised in the simpic.cu file in the allocateparticles function.

The handles refers to the StarPU data handles which will be read/written in this task. cl-arg refers to all the non-modifiable arguments which are required for the task to be executed. We have stored this in the structure cl-args as seen before. Here, cl refers to a StarPU codelet which is defined as below

The codelet is another structure which defines the associated function, the number of array buffers modified and the mode of buffer access. .where indicates on which PU (CPU/GPU) the task should run on. cpu-funcs / cuda-funcs indicates which function will be executed. nbuffers indicates how many data arrays are being accessed by the task and modes indicates the mode of access: read only (STARPU-R), write-only (STARPU-W) or read and write (STARPU-RW). Now that we have the defined all the associated arguments of the StarPU task, we need to link the StarPU data handles with the actual memory buffers which it needs to work on. This is done in allocate-particles function as follows.

StarPU vector is a StarPU data interface of 1D arrays. We register each of the data handles to the corresponding pointer to the array in main memory. Now that we have defined the task for particle mover we need to make some changes to the CUDA function in terms of memory management. The function definition for a function that is associated to a StarPU task has a mandatory format.

Here, *buffers refers to an array of pointers with each pointer referring to each StarPU data handle. *cl-args refers to the parameters/arguments which are needed for execution of the function. The definition of the function arrays is done as below

You can notice that no explicit copying is done from the host to the device or vice-versa. Once the array pointer is linked to data that is registered in StarPU, all the required data movement is done by StarPU data management. One more CUDA specific change is that there is an additional function called in the GPU kernel starpu-cuda-get-local-stream. This is just to create a CUDA-specific stream in StarPU for better synchronization of the CUDA threads.

Once the computation is done, the STARPU data handles can be deregistered as they are no longer needed. This is done in free-particles function by

starpu_data_unregister(pdata_handle);
starpu_data_unregister(pdataout_handle);
starpu_data_unregister(lhsbuf_handle);
starpu_data_unregister(rhsbuf_handle);

The code was run giving identical results to the GPU version of SIMPIC. Now this task can be run on multiple PUs by defining data partitions and running on both STARPU-CPU, STARPU-CUDA for better utilization of computational resources.

Data Partitioning¶

Using StarPU, we can partition data into sub parts to be used for different tasks simultaneously. In our particle mover, we can partition the particle arrays and run it on different tasks simultaneously as all the particle mover calculations are independent. We should first define how many sub parts we need. This is added as a global constant in the global.h file.

#define PARTS 2

We can define this partition using StarPU-filter. This is a structure which defines, what type of data is being partitioned (vector in this case) and the number of partitions. This is done in the launch-movertasks function in simpic.cu file.

struct starpu_data_filter f =
  {
    .filter_func = starpu_vector_filter_block,
    .nchildren = PARTS
  };

Now, we can perform the partition on the specific arrays using the above filter.

starpu_data_partition(pdata_handle,&f);
starpu_data_partition(alive_handle,&f);

Once, we have data partitioned StarPU can launch multiple tasks working independently on each sub part.

Launching Multiple StarPU Tasks¶

A function called launch-movertasks is created in the simpic.cu file to do this. This is because there is some extra code to be written for this. Once the partition has been defined and data has been broken to sub part, we can launch one task per sub part.

for(i=0; i<starpu_data_get_nb_children(alive_handle); i++)
  {
    starpu_data_handle_t sub_handle1 = starpu_data_get_sub_data(pdata_handle,1,i);
    starpu_data_handle_t sub_handle2 = starpu_data_get_sub_data(alive_handle,1,i);

    task = starpu_task_create();
    task->cl = &cl;
    task->handles[0] = sub_handle1;
    task->handles[1] = sub_handle2;
    task->handles[2] = lhsbuf_handle;
    task->handles[3] = rhsbuf_handle;
    task->handles[4] = narray_handle;
    task->handles[5] = Earray_handle;
    task->cl_arg = &cl_args;
    task->cl_arg_size = sizeof(cl_args_t);
    task->synchronous = 1;
    task->destroy = 0;

    starpu_task_submit(task);
  }

Some points regarding the above code:

Here, the variable starpu-data-get-nb-children() gets the number of sub parts the data handle is broken into. Here it is given by the global variable PARTS as defined before.
We then need to create data handles for each sub part. This is defined by the variables sub-handle1, sub-handle2.
The function starpu-data-get-sub-data(...) gets the pointers for each partitioned piece of data. The second argument is the dimension of data being partitioned (here it is 1 because it is a 1D array) and third argument refers to the \(i^{th}\) piece of data.
We can now create a task using the new handles as we’ve done before. Note that here we have partitioned only the pdata, check-alive arrays. All the other arrays remain unpartitioned. Now, each task can be submitted for the corresponding piece of data.

Now, we should call this function in the mainloop.

launch_movertasks();
  starpu_task_wait_for_all();
  starpu_data_unpartition(pdata_handle,STARPU_MAIN_RAM);
  starpu_data_unpartition(alive_handle,STARPU_MAIN_RAM);

The starpu-task-wait-for-all function is used here as we need all the particle data to be computed before moving on to the next step. Also, we have unpartitioned the data immediately after launching the tasks because this data could be needed later for diagnostics. We have used the starpu-data-unpartition function for this. Hence, you can see that partitioning happens each time the particle mover is called. This could create some overhead.

StarPU field solver¶

Moving the field solver to StarPU task is similar to the particle mover. First of all, we need to add the previous definitions in the file global.h

starpu_data_handle_t pdata_handle;
starpu_data_handle_t alive_handle;
starpu_data_handle_t lhsbuf_handle;
starpu_data_handle_t rhsbuf_handle;
starpu_data_handle_t Earray_handle;
starpu_data_handle_t phiarray_handle;
starpu_data_handle_t narray_handle;
struct starpu_task *task;
struct starpu_task *task_fields;

We see that we have the same StarPU data handles as we had in the particle mover version. Also, we have defined one new task.

Now in fields.cu we define the new structure:

struct cl_args_f {
  Scalar qm;
  Scalar dt;
  Scalar dx;
  Scalar xl;
  Scalar xr;
  int nl, nr;
  int nlp, nrp; // counters for sending particle buffers
  int nc;
  TriMatrix* A;
  double* d_a;
  double* d_b;
  double* d_c;
} cl_args_fields;

This will contain the basic information for the field solver to run. Note that here we pass the static values that might not be changed.

We allocate, register and create the task in the function allocate_arrays from fields.cu:

 starpu_malloc((void**)&narray, ng*sizeof(double));
 starpu_malloc((void**)&phiarray, ng*sizeof(double));
 starpu_malloc((void**)&Earray,ng*sizeof(double));
 starpu_malloc((void**)&nback,ng*sizeof(double));


 starpu_vector_data_register(&narray_handle, STARPU_MAIN_RAM, (uintptr_t)narray, ng, sizeof(narray[0]));
 starpu_vector_data_register(&phiarray_handle, STARPU_MAIN_RAM, (uintptr_t)phiarray, ng, sizeof(phiarray[0]));
 starpu_vector_data_register(&Earray_handle, STARPU_MAIN_RAM, (uintptr_t)Earray, ng, sizeof(Earray[0]));

 task->handles[4]=narray_handle;
 task->handles[5]=Earray_handle;

 static struct starpu_codelet fields_cl =
{
  //.where = STARPU_CPU | STARPU_CUDA,
  //.cpu_funcs = { advancefields },
  .where = STARPU_CUDA,
  .cuda_funcs = { advancefields_GPU },
  //.cpu_funcs_name = { "advancefields" },
  .nbuffers = 3,
  .modes = {STARPU_RW, STARPU_RW, STARPU_RW}
 };

 task_fields = starpu_task_create();
 task_fields->cl = &fields_cl;
 task_fields->handles[0] = narray_handle;
 task_fields->handles[1] = phiarray_handle;
 task_fields->handles[2] = Earray_handle;
 task_fields->cl_arg = &cl_args_fields;
 task_fields->cl_arg_size = sizeof(cl_args_f);
 task_fields->synchronous = 1;
 task_fields->destroy = 0;

The task contains the needed information for the field solver, so we set every value and we also define the StarPU vectors that might be changed in the function. Also, we see that we can uncomment the cpu line and run this code in CPU instead of GPU. StarPU will do all the data transfers between tasks for us.

The field solver function to be executed is advancefield_GPU which is located in fields.cu. In this function we have:

struct cl_args_f *cl_args = (struct cl_args_f*)_cl_args;

 Scalar nlold, nrold;

 std::cout<<"runs GPU fields"<<std::endl;

 double *narray = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
 double *phiarray = (double *)STARPU_VECTOR_GET_PTR(buffers[1]);
 double *Earray = (double *)STARPU_VECTOR_GET_PTR(buffers[2]);

 /*Correct density array for computation*/

 A = cl_args->A;

 double narray_cpu[ng];
 double phiarray_cpu[ng];
 cudaMemcpy(phiarray_cpu, phiarray, ng*sizeof(double),cudaMemcpyDeviceToHost);
 cudaMemcpy(narray_cpu, narray, ng*sizeof(double),cudaMemcpyDeviceToHost);
 nlold = narray_cpu[0];
 nrold = narray_cpu[cl_args->nc];
 narray_cpu[0] = 0.;
 narray_cpu[cl_args->nc] = 0.;
 A->solve(narray_cpu, phiarray_cpu);
 narray_cpu[0] = 2.*nlold;
 narray_cpu[cl_args->nc] = 2.*nrold;
 cudaMemcpy(phiarray, phiarray_cpu, ng*sizeof(double),cudaMemcpyHostToDevice);
 cudaMemcpy(narray, narray_cpu, ng*sizeof(double),cudaMemcpyHostToDevice);


 /*Laplace correction and gradient calculation with values
  * that are already in GPU*/

 int Nthreads_max = prop.maxThreadsPerBlock;
 int Nblocks_fields = ng/Nthreads_max +1;

 sumLaplace_GPU<<<Nblocks_fields,Nthreads_max, 0, starpu_cuda_get_local_stream()>>>(phiarray, cl_args->dx, rhsV(t), lhsV(t), cl_args->xl, cl_args->xr - cl_args->xl, cl_args->nc +1);
 cudaStreamSynchronize(starpu_cuda_get_local_stream());

 gradient_GPU<<<Nblocks_fields,Nthreads_max, 0, starpu_cuda_get_local_stream()>>>( Earray,  phiarray,  cl_args->nc,  -0.5/cl_args->dx);
cl_args->xl, cl_args->xr - cl_args->xl, cl_args->nc +1);
 cudaStreamSynchronize(starpu_cuda_get_local_stream());

This function is identical as it was for the GPU field solver. Disgracefully, we have not been able to use the cuSPARSE library inside the GPU function (anyways, StarPU has support for this library). So we had to do a “manual” data transfer for solving the trimatrix in CPU, then data is transferred back to GPU and the rest of the calculations are done here.

Finally, in simpic.cu we have included in the mainloop the following code:

if(diagnosticsflag)
{
 starpu_data_acquire(narray_handle, STARPU_R);
 starpu_data_acquire(Earray_handle, STARPU_R);
 starpu_data_acquire(phiarray_handle, STARPU_R);

 char msg[] = "";
 starttime(DIAGNOSTICS);
 print_diagnostics(msg);
 endtime(DIAGNOSTICS);
 starpu_data_release(narray_handle);
 starpu_data_release(Earray_handle);
 starpu_data_release(phiarray_handle);

}

As in the full StarPU-GPU version all data is located in GPU. We need to transfer briefly the data back to CPU to save it into the output files. We acquire the data from StarPU to CPU with starpu_data_acquire and then we release it from CPU with starpu_data_release.

Profiling SIMPIC¶

GPU Version¶

We used the NVIDIA Visual Profiler to visualize the profiling done on the full CUDA version of SIMPIC.

Running on VIZ GPU Node¶

The NVIDIA visual profiler comes along with the CUDA toolkit and hence does not need to be installed separately. However, it does require the Java runtime environment to run. Once logged in to the VIZ GPU node, you can do the following steps to run the visual profiler.

gpu
module load OpenMPI/3.1.4-gcccuda-2019b
module load jre
nvvp ./runsimpic.sh

Visualization¶

In the Create New Session dialog box, you can check if your executable file (runsimpic.sh here) is fine, and you can insert some command line options or flags (optional).
You can then select options for tracing in the Profiling Options. We have selected all the default options except for the Unified Memory Profiling because we do not use that.
You can select options for visualizing the timeline in the Timeline Options tab. We have selected all the default timeline options for our case.

Visualization step 1.¶

Visualization step 2.¶

Visualization step 3.¶

The visual profiler then runs the application and generates a visual timeline of our code.

Timeline of One Time Step of our full SIMPIC GPU version.¶

Some observations from the timeline in the figure from above:

We can see very minimal data transfers between the host and device. This justifies ouroptimized data transfer algorithm.
We see that the particle mover accounts for close to 70% of the compute time whereasthe remaining compute time is taken by thecuSparsematrix solver for fields.
This shows that we have been able to incorporate the two main computationally in-tensive parts of a Particle in Cell code in GPU.

Further Profiling Analysis on NVIDIA Profiler¶

There are many more options we can use to further analyse our CUDA kernels in detail in the NVIDIA Visual Profiler. We can explore them now.

In the bottom left corner of the profiler you can see the CUDA Application Analysis tab.

We can examine the GPU usage in our application by clicking on the Examine GPU Usage button here. We can also see the underlying numbers behind these results by clicking on each of these messages. You will be able to see these properties on the lower right hand corner of the profiler.

GPU Usage button here. For our code we get the following results.¶

Available profiling results.¶

All these numbers suggest there is lot of scope for further optimization of our kernel like concurrency of memory transfers and computation or multiple kernels and using multiple streaming multiprocessors (SM). It also indicates that our data size is actually much small that we are not using the full memory bandwidth of the PCIe link between the host and device.¶

You can see a new analysis tab Performance Critical Kernels on bottom left corner now. We can examine the performance of the individual CUDA kernels here by clicking on the Perform Kernel Analysis tab. You can then select the moveparticlesCUDA kernel to examine it.

Perform kernel analyisis panel.¶

We can see that our computation and memory transfer utilization is textbf{low} when compared to the peak utilization of the our NVIDIA Tesla K80 accelerator. This means that latency dominates the runtime, indicating that to offload the particle mover to the GPU would make sense only for much larger problem size.¶

On the bottom left hand you can see a new analysis tab Compute,Bandwidth or Latency Bound. You can analyse here if the performance of our kernel is limited by the computation, memory bandwidth or latency. These are 3 factors which generally limit the performance of any CUDA kernel. We first perform latency analysis by clicking on the Perform Latency Analysis tab.

Compute, bandwidth or Latency panel.¶

We obtain this results. We can see that most of the latency issues arise from ‘Memory Throttle’ and ‘Memory Dependencies’. I believe that this arises from the atomicAdd function in the weightparticlesCUDA. Memory throttle and dependency arises when a lot of pending memory operations stall the progress of the application. This happens in the case of access to the density array elements. We can maybe use a better parallel algorithm for the computation of density arrays.¶

We can now analyse the computation part of our CUDA kernel by clicking on the Perform Compute Analysis tab in the bottom left corner. We also get an analysis of what kind of data operations are happening in our kernel here.

We can see here that memory load/store operations are higher than arithmetic operations, hence indicating a memory bound kernel. We also get an analysis of what kind of data operations are happening in our kernel.¶

We can see here that most of the arithmetic operations are double (FP64) fused multiply-add (FMA) operations, which is to be expected from our algorithm.¶

We can also analyse the usage of the different memory hierarchies in the GPU by clicking on the Perform Memory Bandwidth Analysis tab.

We can see that there is a lot of free memory in the GPU indicating that we are not using the full memory bandwidth of the GPU. This indicates that memory bandwidth is not likely to be the cause of any performance reduction.

Hence, we can conclude that the NVIDIA Visual Profiler has analysed our kernel and given us many tips on how we can further optimise the code to run on GPUs.

StarPU Version¶

We used an external tracing application FxT for tracing purposes. The resulting trace was visualised using the ViTE visualization program. The required libraries are available only on the Bison machine only.

Installing FxT Library on Bison¶

We need to install the FxT tracing library externally. You can follow these steps to install it on the Bison machine.

$ module load cuda-10.1.243-gcc-8.3.0-qpdenfa
$ module load openmpi-3.1.4-gcc-8.3.0-hxcgcwk
$ module load cmake-3.15.4-gcc-8.3.0-i7l4zh2
$ module load help2man-1.47.11-gcc-8.3.0-ggupyzs

Building fxt:

$ wget http://download.savannah.nongnu.org/releases/fkt/fxt-0.3.11.tar.gz
$ tar xvf fxt-0.3.11.tar.gz
$ cd fxt-0.3.11
$ mkdir build
$ cd build
$ ../configure --prefix=$HOME/fxt
$ make
$ make install

Now we have installed FxT, we need to integrate this into our StarPU library. For this, we need to configure StarPU with FxT. This can be done by:

$ cd starpu-1.3.4
$ mkdir build
$ cd build
$ ../configure --prefix=$HOME/starpu --with-fxt=$HOME/fxt
$ make
$ make install
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HOME/starpu/lib/pkgconfig
$ export LD_LIBRARY_PATH=$HOME/starpu/lib:$LD_LIBRARY_PATH
$ export PATH=$PATH:$HOME/starpu/bin

Installing ViTE Visualizer on Bison¶

We can now run our StarPU application with FxT library to perform tracing. Now we need to install ViTE program in Bison to visualize the trace file. This can be done by

module load gcc-8.3.0-gcc-4.8.5-eodqjge
module load cmake-3.17.1-gcc-4.8.5-w7yrxtc
module load qt-5.14.2-gcc-8.3.0-pk5i2vx
git clone https://gitlab.inria.fr/solverstack/vite.git
cd vite
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$HOME/opt/vite ..
make -j6
make install

Now we have installed the required programs for profiling our StarPU application.

Tracing and Profiling our SIMPIC StarPU version¶

Before tracing, please note that you should run the application for only one time step for the current code version. Otherwise, it will give a task dependency error. Since our application is a sequential one, this is fine and will give us the same results anyways. To run our application with tracing, you need to insert this in the command line:

$ STARPU_SCHED=dmdas ./runsimpic.sh

This command runs the application and generates a trace file which is stored in the tmp directory. The file is saved as prof-file-username-00. Now we can execute the trace by first copying it to our Home directory and then opening the visualizer. You can use the following commands:

cp \tmp\prof_file_{username}_00 $HOME
cd $HOME/starpu
$ ./bin/starpu_fxt_tool -i $HOME/prof_file_{username}_0

You will now be able to see the a fresh paje.trace file in the HOME/starpu directory.

You can now open the ViTE visualizer using:

~/opt/vite/bin/vite paje.trace

You can now see the profiling timeline for our application.

Examining the Profile¶

You can now see the profiling timeline for our application in Figure below. Some observations from the figure.

Timeline for StarPU particle mover task.¶

We can see explicitly the memory transfers from the CPU memory to the CUDA memory as the white arrows.
The green bars indicate the StarPU tasks. The first task is the GPU particle mover task and the smaller green bar later is the field solver GPU task. The red parts indicate that the PU is idle.
Note that there are no explicit memory transfer calls in our code. All the required memory transfers are transferred by teh StarPU memory management and Scheduler by itself.
The timeline looks very similar to our GPU version. This is to be expected as this is essentially the same code but the kernels are executed as StarPU tasks.

Miscellaneous¶

General CUDA-GPU guidelines¶

In this section we want to share some general guidelines and tips for moving existing codes to CUDA.

Almost every program must have:

Obviously it is needed a CUDA GPU and CUDA software installed.
Use a CUDA extension on your file .cu and compile with C++ CUDA compiler nvcc .
Including basic CUDA libraries such as cuda.h, cuda_runtime.h and device_launch_parameters.h.
cudaSetDevice is a must have for a CUDA program to run.
Correctly allocate data in GPU (cudaMalloc) before any computation is made. Also copy and set memory from CPU to GPU and viceversa when needed using cudaMemcpy and cudaMemcpy. If this is done incorrectly memory errors may or not raise. You can also execute your code with the CUDA memory checker to get more information about bad memory errors. In our case executing in command line: cuda-memcheck ./runsimpic.sh .
It is also important to deallocate data when the program finishes using cudaFree.
CUDA kernel calls are made with the spetial semantics: cudaFunction<<<NBlocks, NThreads>>>(args);

Some tips:

You can add flags to the CUDA compiler to add libraries and features such as -lmpi (compiles with MPI) and -lcusparse (compiles with the cuSPARSE libraries).
CUDA has some useful libraries that can be added to ease our work. Checking the CUDA libraries is always a good option.
Try to avoid global data in GPU kernels, in case it is not posible use functions such as atomicAdd. Shared data is the fastest way to access data in CUDA but not always the best option.
Operations in different threads should be completely independent.
for loops with indepent calculations are easy to parallelize. Each thread might do the calculation for one index.
Avoid copying data from GPU to CPU and viceversa as it slows the program.
Track timings of the program to find extra/bad calls or computations.
Move to GPU function by function and compare with the original CPU program results (obvious).