-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathbash-parallel-programming.html
More file actions
295 lines (273 loc) · 15.5 KB
/
bash-parallel-programming.html
File metadata and controls
295 lines (273 loc) · 15.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Anything Technical | [intro-to] Bash Parallel Programming</title>
<link rel="shortcut icon" type="image/png" href="http://don-han.com/favicon.png">
<link rel="shortcut icon" type="image/x-icon" href="http://don-han.com/favicon.ico">
<link href="http://don-han.com/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Anything Technical Full Atom Feed" />
<link href="http://don-han.com/feeds/articles.atom.xml" type="application/atom+xml" rel="alternate" title="Anything Technical Categories Atom Feed" />
<link rel="stylesheet" href="http://don-han.com/theme/css/screen.css" type="text/css" />
<link rel="stylesheet" href="http://don-han.com/theme/css/pygments.css" type="text/css" />
<link rel="stylesheet" href="http://don-han.com/theme/css/print.css" type="text/css" media="print" />
<meta name="generator" content="Pelican" />
<meta name="keywords" content="intro-to,bash,parallel programming" />
</head>
<body>
<header>
<nav>
<ul>
<li><a href="http://don-han.com">Home</a></li>
<li><a href="http://don-han.com/pages/about.html">about</a></li>
</ul>
</nav>
<div class="header_box">
<h1><a href="http://don-han.com">Anything Technical</a></h1>
</div>
</header>
<div id="wrapper">
<div id="content"> <h4 class="date">Aug 09, 2015</h4>
<article class="post">
<h2 class="title">
<a href="http://don-han.com/bash-parallel-programming.html" rel="bookmark" title="Permanent Link to "[intro-to] Bash Parallel Programming"">[intro-to] Bash Parallel Programming</a>
</h2>
<p>Parallel programming basically means writing scripts that can process multiple
tasks at the same time. This is in contrast to running scripts in a serial order which requires the
previous tasks to be completed before moving onto the next.</p>
<p>Parallelization can be useful for boosting the performance, but it requires the tasks
that are going to be parallelized to be independent of one another. Therefore, in
this tutorial, we will refer to the function <code>do-something</code> as a task that can
be safely run in parallel. For didactic purpose, we define <code>do-something</code> as follows:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3
4</pre></div></td><td class="code"><div class="highlight"><pre><span class="k">function</span> <span class="k">do</span>-something<span class="o">(){</span>
sleep <span class="nv">$1</span>
<span class="nb">echo</span> <span class="nv">$1</span>
<span class="o">}</span>
</pre></div>
</td></tr></table>
<p>There are several ways of parallelizing a bash script, and each has its own pros
and cons.</p>
<p>1.</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3</pre></div></td><td class="code"><div class="highlight"><pre><span class="k">for</span> i in <span class="k">$(</span>seq <span class="m">5</span> 1<span class="k">)</span><span class="p">;</span> <span class="k">do</span>
<span class="k">do</span>-something <span class="nv">$i</span> <span class="p">&</span>
<span class="k">done</span>
</pre></div>
</td></tr></table>
<p>This is the simplest form of parallelization in bash. <code>&</code> forks each
<code>do-something</code> task to the background every time it's called.</p>
<p>If you run this parallel script, your output should look something like this:</p>
<div class="highlight"><pre>1
2
3
4
5
</pre></div>
<p>Even though our script runs from 5 to 1, because all five tasks start at the same time due to parallelization, <code>do-something</code> task with the shortest sleeping time gets finished first.</p>
<p>Also, note the speed improvement. Our serial script runs in <code>5+4+3+2+1=15</code>
seconds while our parallel should theoretically run in 5 seconds. That's 3 times performance improvement by just attaching <code>&</code>!</p>
<p>However, there is one critical problem. Imagine, instead of looping through 5 iterations, your script has to process 5 billion different tasks, and your <code>do-something</code> does something far more intensive. Then, when you run your script, it will attempt to process all those tasks at once, which will easily overload the machine. Our next method will help us prevent that.</p>
<p>ADVATAGE: simple</p>
<p>DISADVANTAGE: can potentially overload the system</p>
<p>2.</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre> 1
2
3
4
5
6
7
8
9
10</pre></div></td><td class="code"><div class="highlight"><pre><span class="nv">wait_turn</span><span class="o">=</span>4
<span class="nv">j</span><span class="o">=</span>1
<span class="k">for</span> i in <span class="k">$(</span>seq <span class="m">10</span> 1<span class="k">)</span><span class="p">;</span> <span class="k">do</span>
<span class="k">do</span>-something <span class="nv">$i</span> <span class="p">&</span>
<span class="k">if</span> <span class="o">((</span>j++ % <span class="nv">wait_turn</span> <span class="o">==</span> 0<span class="o">))</span><span class="p">;</span> <span class="k">then</span>
<span class="nb">wait</span>
<span class="nb"> </span><span class="k">fi</span>
<span class="k">done</span>
<span class="nb">wait</span>
<span class="nb">echo</span> <span class="s2">"after loop"</span>
</pre></div>
</td></tr></table>
<p>Running this, our result should look something like this:</p>
<div class="highlight"><pre>7
8
9
10
3
4
5
6
1
2
after loop
</pre></div>
<p>Can you see the pattern?</p>
<p>Once our <code>wait</code> function is called, the shell "waits" until all background tasks
are finished. In this case, a batch of four tasks are forked into the background at a time as specified by <code>wait_turn</code>, and the script waits until all four processes are complete. </p>
<p>By doing this, we limit the number of tasks that will be processed. While this will also limit the time it takes to process the all tasks, it won't take down your machine.</p>
<p>The value of <code>wait_turn</code> has been chosen arbitrarily, so you can fiddle with it to find the right number for your own script. Usually, the value is a multiple of the number of the cores you have.</p>
<p>A keen observer might have noted the another <code>wait</code> function after the function.
To illustrate its need, consider a similar script without the after-loop <code>wait</code>:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3
4
5
6
7
8
9</pre></div></td><td class="code"><div class="highlight"><pre><span class="nv">wait_turn</span><span class="o">=</span>4
<span class="nv">j</span><span class="o">=</span>1
<span class="k">for</span> i in <span class="k">$(</span>seq <span class="m">10</span> 1<span class="k">)</span><span class="p">;</span> <span class="k">do</span>
<span class="k">do</span>-something <span class="nv">$i</span> <span class="p">&</span>
<span class="k">if</span> <span class="o">((</span>j++ % <span class="nv">wait_turn</span> <span class="o">==</span> 0<span class="o">))</span><span class="p">;</span> <span class="k">then</span>
<span class="nb">wait</span>
<span class="nb"> </span><span class="k">fi</span>
<span class="k">done</span>
<span class="nb">echo</span> <span class="s2">"after loop"</span>
</pre></div>
</td></tr></table>
<p>Before scrolling down, think what this would output.</p>
<div class="highlight"><pre>7
8
9
10
3
4
5
6
after loop
1
2
</pre></div>
<p>Notice how our "after loop" string comes before <code>do-something 1</code> and <code>do-something 2</code> is
processed. This could be critical if you have an after-loop code that needs the
for-loop to completely finish. Likewise, apply <code>wait</code> function after the for-loop for the first method if necessary.</p>
<p>Yet, even this method can be restrictive. Consider a case where the total time
of task completion varies greatly. For our example above, say every fifth tasks
(<code>do-something 5</code>, <code>do-something 10</code>), for some odd reason, takes 5 and 10 miniutes to complete instead of mere 5 or 10 seconds, respectively. If we
run our script again under this new condition, we can see that our script runs
very inefficiently because for our first batch, even though tasks 7, 8, 9 are
already completed, they would have to wait for one task, <code>do-something 10</code> to
finish! Likewise, for our second batch, the threads that were working on tasks
3, 4, and 6 are idling because they cannot move forward until <code>do-something 5</code>
completes. In other words, if the batch contains an unusually slow task, it becomes
the bottleneck and slows down the entire process, almost defeating the purpose
of parallelism.</p>
<p>ADVANTAGE: parallelization without overloading the system</p>
<p>DISADVANTAGE: bottlenecks occur and can potentially slow down the entire process</p>
<p>3.</p>
<p>Fortunately, UNIX/LINUX comes with a command called
<a href="https://en.wikipedia.org/wiki/Xargs"><code>xargs</code></a> which whips idling tasks back to
work. <code>xargs</code> is not specifically designed to be used for parallelization, but
it comes with a handy <code>-P</code> flag which defines the maximum number of parallel processes that can run at
a time. <code>xargs</code> is extremely flexible, so I will not go into too much details in
this article, but I will cover the basics:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1</pre></div></td><td class="code"><div class="highlight"><pre>seq <span class="m">10</span> <span class="m">1</span> <span class="p">|</span> xargs -n1 -P4 -I <span class="o">{}</span> bash -c <span class="s2">"sleep {}; echo {};"</span>
</pre></div>
</td></tr></table>
<p>To break it down, I give the input ranging from 10 to 1. Then, <code>xargs</code> separates
the input into indiviual pieces (as specified by -n1, meaning take at most one
argument). The big <code>-P</code> flag as mentioned is what gives the parallelization
ability. Without it, the <code>xargs</code> runs the command in a serial manner. After we
set the variable to <code>{}</code>, <code>-I</code> replaces all ocurrences of <code>{}</code> in the command
with the input chunk, which in our case are individual numbers. Finally, <code>xargs</code>
runs the bash, which, in turn, executes the string with all <code>{}</code> replaced with
input chunks.</p>
<p>The above snippet produces:</p>
<div class="highlight"><pre>7
8
9
10
5
4
3
6
1
2
</pre></div>
<p>If you take a look at the first batch, as we have seen with #2, only upto 4 tasks are processed at a time since we have limited to the four processes. However, if you take a look at the second batch, you will see that the numbers are not ordered as before. This is because <code>xargs</code> assigns tasks as soon as processes becomes available.</p>
<p>Indeed, if we give as many processes as there are inputs, all processes will be handled at the same time as we would expect.</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1</pre></div></td><td class="code"><div class="highlight"><pre>seq <span class="m">10</span> <span class="m">1</span> <span class="p">|</span> xargs -n1 -P10 -I <span class="o">{}</span> bash -c <span class="s2">"sleep {}; echo {};"</span>
</pre></div>
</td></tr></table>
<p>which rightly produces</p>
<div class="highlight"><pre>1
2
3
4
5
6
7
8
9
10
</pre></div>
<p><code>xargs</code> should be sufficient for most users under most usage cases. Other than the fact that it's a tad bit more complicated than the first two methods I introduced above, <code>xargs</code> provides an efficient, easy way of implementing a parallel script.</p>
<p>ADVANTAGE: eliminates bottleneck</p>
<p>DISADVANTAGE: none for most users. See #4 for more</p>
<p>4. </p>
<p>However, for some users, they might need something more powerful, in which case,
I introduce you <a href="http://www.gnu.org/software/parallel/">GNU Parallel</a>.
Thankfully, GNU Parallel already detailed
<a href="https://www.gnu.org/software/parallel/man.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel">reasons why you might want to switch over from <code>xargs</code></a>. </p>
<p>The <code>parallel</code> equivalent of </p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1</pre></div></td><td class="code"><div class="highlight"><pre>seq <span class="m">10</span> <span class="m">1</span> <span class="p">|</span> xargs -P4 -I <span class="o">{}</span> bash -c <span class="s2">"sleep {}; echo {};"</span>
</pre></div>
</td></tr></table>
<p>is</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1</pre></div></td><td class="code"><div class="highlight"><pre>seq <span class="m">10</span> <span class="m">1</span> <span class="p">|</span> parallel -P4 <span class="s2">"sleep {}; echo {};"</span>
</pre></div>
</td></tr></table>
<p>You can see that <code>parallel</code>syntax is a lot simpler than <code>xargs</code>.</p>
<p>ADVANTAGE: refer to <a href="https://www.gnu.org/software/parallel/man.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel">the comparison between xargs and GNU Parallel</a>; to name a few, flexibility and simplicity</p>
<p>DISADVANTAGE: external dependency</p>
<div class="clear"></div>
<div class="info">
<a href="http://don-han.com/bash-parallel-programming.html">posted at 00:00</a>
· <a href="http://don-han.com/category/articles.html" rel="tag">articles</a>
·
<a href="http://don-han.com/tag/intro-to.html" class="tags">intro-to</a>
<a href="http://don-han.com/tag/bash.html" class="tags">bash</a>
<a href="http://don-han.com/tag/parallel-programming.html" class="tags">parallel programming</a>
</div>
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'anythingtechnical';
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript" rel="nofollow">comments powered by Disqus.</a></noscript>
</article>
<div class="clear"></div>
<footer>
<p>
<a href="https://github.com/jody-frankowski/blue-penguin">Blue Penguin</a> Theme
·
Powered by <a href="http://getpelican.com">Pelican</a>
·
<a href="http://don-han.com/feeds/all.atom.xml" rel="alternate">Atom Feed</a>
</footer>
</div>
<div class="clear"></div>
</div>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-73456201-1");
pageTracker._trackPageview();
} catch(err) {}</script>
</body>
</html>