\chapter{Debug Tools}
\label{chap:Debug_Tools}
Overlap data copies with kernel execution. (Question: if the kernel can
modify the data, do we need to keep a host copy of the GPU data at all times?)
Queue independent kernels in different streams.
Don't spend too much time optimizing a kernel, unless you want to do
this for fun; the kernel may not be the bottleneck, the I/O often is.
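The copy/compute overlap above can be sketched with streams. This is a minimal CUDA sketch, not runnable as-is: \verb!process!, \verb!nchunks!, \verb!chunk!, and the buffer names are illustrative, error checking is omitted, and the host buffer must be page-locked (allocated with \verb!cudaMallocHost!) for the asynchronous copies to actually overlap:

```cuda
cudaStream_t s[2];
for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);

for (int c = 0; c < nchunks; c++) {
    cudaStream_t st = s[c % 2];      // alternate between two streams
    float *d = d_buf + c * chunk;
    float *h = h_buf + c * chunk;    // h_buf from cudaMallocHost()
    cudaMemcpyAsync(d, h, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    process<<<blocks, threads, 0, st>>>(d, chunk);  // independent kernels
    cudaMemcpyAsync(h, d, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, st);
}
for (int i = 0; i < 2; i++) cudaStreamSynchronize(s[i]);
```

While chunk $c$ computes in one stream, the copies for chunk $c+1$ proceed in the other.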
\section{CUDA Fortran}
There is no support yet for debugging code generated by the PGI Accelerator
compilers or CUDA Fortran. We can build the code using \verb!-Mcuda=emu! and
then debug it as CPU code using the \verb!pgdbg! tool.
However, some bugs cannot be detected unless the code runs in a massively
parallel environment on the GPU; examples are race conditions and other memory
(global or shared) contention issues. Also, the GPU may use different numerical
operations (such as FMA), which can yield slightly different answers.
To also disable the use of FMA, add the \verb!-Mcuda=nofma! option.
Other methods: if that does not fix the issue, create temporary arrays that
capture intermediate values in your kernel, then compare these values with the
same ones captured when running in emulation mode.
References:
\begin{enumerate}
\item \url{https://www.pgroup.com/lit/articles/insider/v3n3a3.htm}
\end{enumerate}
\section{cuda-memcheck}
\label{sec:tool_cudamemcheck}
% \subsection{Check memory accesses}
% \label{sec:check-memory-access}
When you get a segmentation fault, a good program to start with is
\verb!cuda-memcheck!, which detects memory access errors in your CUDA
application. It can be run inside cuda-gdb or as a stand-alone program.
First, the program has to be compiled using the \verb!-g -G! option pair.
\begin{verbatim}
cuda-memcheck <your prog> <arg1> <arg2>
\end{verbatim}
Useful flags:
\begin{verbatim}
--continue
try to continue running on memory access violations
\end{verbatim}
cuda-memcheck can be combined with \verb!cuda-gdb!. By default, memory access
violations are not detected when single-stepping the program in cuda-gdb
(Sect.~\ref{sec:cuda-gdb}). To detect them while debugging, enable memcheck
inside cuda-gdb:
\begin{verbatim}
module load cuda
cuda-gdb demo
(cuda-gdb) set cuda memcheck on
(cuda-gdb) run
\end{verbatim}
Reference (2010): \url{http://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/cuda-memcheck.pdf}
\section{cuda-gdb}
\label{sec:cuda-gdb}
The code needs to be compiled with the option pair \verb!-g -G!: \verb!-g!
generates debug information for the host code and \verb!-G! for the GPU code;
also remove all optimization with \verb!-O0!.
IMPORTANT: ideally, the GPU being debugged should not be the one driving the display.
\begin{itemize}
\item If you have only one GPU (compute capability below 3.5): a CUDA
application stopped at a breakpoint freezes the display, so debug in console
mode (no X server):
\begin{verbatim}
sudo service lightdm stop
\end{verbatim}
\item If you have only one GPU (compute capability 3.5 or higher): cuda-gdb
can debug CUDA applications on the same GPU that is running the desktop GUI.
To use cuda-gdb without screen interruptions, set this environment variable
(in your \verb!.bashrc! if you use bash)
\begin{verbatim}
export CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1
\end{verbatim}
or, inside cuda-gdb,
\begin{verbatim}
set cuda software_preemption on
\end{verbatim}
These options must be set before running the application.
Note: software preemption is a beta feature, available on Linux only for
devices of compute capability 3.5 and higher. You may need to log in again;
it should work with CUDA Toolkit 8.0 (check with \verb!nvcc --version!).
\item Multiple GPUs: use one for display and one for compute.
\item Remote debugging: SSH, VNC, etc.
\end{itemize}
\subsection{Debug multiple GPUs on a node}
From CUDA 5.5, we can debug with multiple GPUs on a node.
\url{http://on-demand.gputechconf.com/gtc/2013/presentations/S3045-Getting-Most-From-GPU-Accelerated-Clusters.pdf}
\subsection{MPI code}
\url{http://on-demand.gputechconf.com/gtc/2013/presentations/S3045-Getting-Most-From-GPU-Accelerated-Clusters.pdf}
For smaller applications:
\begin{verbatim}
mpirun -np 2 xterm -e cuda-gdb a.out
\end{verbatim}
CUDA 5.0 and later have the ability to attach to a running process, e.g.\ a
job launched with
\begin{verbatim}
mpirun -np 2 -host c0-2,c0-4 connectivity
\end{verbatim}
First, make the target rank wait for the debugger:
\begin{verbatim}
if (rank == 2) {
    int i = 0;
    printf("rank %d: pid %d on %s ready for attach\n", rank, getpid(), name);
    while (0 == i) {
        sleep(5);
    }
}
\end{verbatim}
When you run the app, the process you want to attach to prints its process ID
and the machine it is running on, so first ssh to that machine:
\begin{verbatim}
mpirun -np 2 -host c0-2,c0-4 connectivity
rank 2: pid 20060 on c0-4 ready for attach
>> ssh c0-4
// and attach
>> cuda-gdb --pid 20060
\end{verbatim}
\subsection{Tricks}
Tell CUDA to wait on an exception in a kernel:
\begin{verbatim}
setenv CUDA_DEVICE_WAITS_ON_EXCEPTION 1
>> mpirun ...
\end{verbatim}
\subsection{Simple code}
To debug the code, call
\begin{verbatim}
cuda-gdb -args myprogram arg1 arg2 <...more args>
\end{verbatim}
Inside cuda-gdb, use the \verb!info! command to investigate further, especially
\verb!info cuda <option>!.
To set a breakpoint, use
\begin{verbatim}
break my_kernel
\end{verbatim}
for a single kernel, or
\begin{verbatim}
set cuda break_on_launch application
\end{verbatim}
to stop at every kernel launch.
You can switch between the two coordinate systems (though you normally do not
set the hardware coordinates yourself):
\begin{enumerate}
\item hardware coordinate: device, sm, warp, lane
\item software coordinate: thread, block, kernel
\end{enumerate}
\subsection{Example: Sobel filtering }
\label{sec:exampl-sobel-filt}
\begin{verbatim}
/* Allocate device memory */
cuda_ret = cudaMalloc((void**) &in_frame_device,
                      width * height * channels * sizeof(char));
if (cuda_ret != cudaSuccess) FATAL("Unable to allocate GPU memory");
// similar for out_frame_device
\end{verbatim}
\begin{framed}
Use a macro like this at the beginning of the file all the time (one common
definition; it prints the location and message, then exits):
\begin{lstlisting}
#define FATAL(msg, ...)                                     \
    do {                                                    \
        fprintf(stderr, "[%s:%d] " msg "\n",                \
                __FILE__, __LINE__, ##__VA_ARGS__);         \
        exit(EXIT_FAILURE);                                 \
    } while (0)
\end{lstlisting}
\end{framed}
Set the kernel dimensions
\begin{lstlisting}
// grid size: enough blocks to cover the whole image (round up)
Dg.x = (width  + block_size - 1) / block_size;
Dg.y = (height + block_size - 1) / block_size;
// block size
Db.x = Db.y = block_size;
\end{lstlisting}
Call the kernels
\begin{lstlisting}
sobel<<< Dg, Db>>> (out_frame_device, in_frame_device, width, height,
channels);
cuda_ret = cudaDeviceSynchronize(); // new function
if (cuda_ret != cudaSuccess) FATAL("Unable to launch kernel");
\end{lstlisting}
Copy the output image back
\begin{lstlisting}
cuda_ret = cudaMemcpy(out_frame_host, out_frame_device, width * height
* channels, cudaMemcpyDefault);
if (cuda_ret != cudaSuccess) FATAL("Unable to copy back the result");
\end{lstlisting}
And finally release the memory
\begin{lstlisting}
free(in_frame_host); free(out_frame_host);
cudaFree(in_frame_device); cudaFree(out_frame_device);
\end{lstlisting}
KERNEL: each thread needs to produce one output pixel.
\begin{lstlisting}
__global__ void sobel(unsigned char *out, unsigned char *in,
                      unsigned int width, unsigned int height,
                      unsigned int channels)
{
  // [idx,idy] = column and row of the pixel this thread handles
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  int idy = threadIdx.y + blockIdx.y * blockDim.y;
  // skip threads outside the image or on its one-pixel border
  if (idx < 1 || idy < 1 || idx >= width - 1 || idy >= height - 1)
    return;
  int stride = channels * width; // offset of one full image row
  for (unsigned int c = 0; c < channels; c++) {
    // each pixel has 'channels' channels; jump to the right memory
    // location of channel c of pixel [idx,idy]
    int loc = channels * (idy * width + idx) + c;
    // use 3x3 mask (apply the Sobel weights here as needed)
    int sum = in[loc]        + in[loc-channels]        + in[loc+channels]
            + in[loc-stride] + in[loc-stride-channels] + in[loc-stride+channels]
            + in[loc+stride] + in[loc+stride-channels] + in[loc+stride+channels];
    out[loc] = (unsigned char)(sum / 9);
  }
}
\end{lstlisting}
\section{NVML}
\label{sec:tool_NVML}
NVML provides an API for controlling and monitoring the GPUs in a node. It can
\begin{enumerate}
\item report GPU and memory utilization,
\item report ECC error events,
\item report thermals,
\item change the compute mode (e.g. to exclusive mode),
\item enable ECC support or clear the ECC error counts,
\item change the Windows driver model.
\end{enumerate}
\section{nvidia-smi}
\label{sec:tool_nvidiasmi}
This is a console application that serves as a wrapper around NVML
(Sect.~\ref{sec:tool_NVML}).
It can display the NVIDIA driver version information
\begin{verbatim}
>> nvidia-smi
Mon Oct 29 12:30:02 2012
+------------------------------------------------------+
| NVIDIA-SMI 3.295.41 Driver Version: 295.41 |
>> cat /proc/driver/nvidia/version
\end{verbatim}
It can also be used to turn ECC memory on or off (0 = off, 1 = on):
\begin{verbatim}
>> nvidia-smi -e 0
\end{verbatim}