GHC2016/main.tex at master · sabasehrish/GHC2016 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
\documentclass[11pt, twocolumn]{article}
\usepackage[total={6in, 8in}, margin=0.75in]{geometry}
\usepackage{booktabs} % for nice tables
\usepackage{longtable}
\usepackage{lmodern}
\usepackage{listings}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{cite}
\usepackage{url}
\usepackage{microtype}
\usepackage[toc,page]{appendix}
%\usepackage{lineno}

\usepackage{amssymb}
\usepackage{tabularx}
%\usepackage[paper=letterpaper,top=1in,bottom=1in,left=1in,right=1in]{geometry}

\usepackage[T1]{fontenc}
\usepackage[sc]{mathpazo}
\linespread{1.02}

\setlength\parindent{0pt}
\setlength{\parskip}{7pt plus4mm minus3mm}
\newcommand{\block}{\footnotesize $\blacksquare$ \normalsize}
%\pagestyle{empty}
\newcommand{\squeezeup}{\vspace{-5.5mm}}
%\linenumbers

\bibliographystyle{plain}

\begin{document}
\title{Bringing Spark and HPC to Experimental High Energy Physics Analyses}
\author{Saba Sehrish \\ Computer Science Researcher\\ Fermi National Accelerator Laboratory \\ ssehrish@fnal.gov}
\date{}
\maketitle

\thispagestyle{empty}

\section*{Abstract}
\squeezeup
Experimental HEP is a compute- and data-intensive statistical science,
in which interactive, iterative and low-latency analysis of PBs of data is the key to extracting more science. HEP analysis consists of several steps that take from days to weeks. We describe our experience implementing selected HEP analyses in Spark as an alternative to typical models, and evaluate Spark on HPC resources.

Track: Data Science  \\
Audience: Beginner, Intermediate technical talk \\
\squeezeup
\input{intro}
\section{CMS Dark Matter Analysis}
The CMS Dark Matter analysis focuses on searching for new types of
elementary particles.
While other experiment searches consist of attempting to detect astrophysical
dark matter passing through a detector (direct searches), dark matter could
also be produced in collisions at the LHC if its
constituent particles are of an appropriate mass. The CMS collaboration is carrying out such a
search using the CMS detector at the LHC.

The CMS dark matter analysis is using composite C++ objects,
which describe the analysis products of the reconstruction of either
simulated or recorded collisions.
The data volume to be analyzed is about 200 TB, but not organized for end-user analysis.
This complex data format is converted to a simplified tabular data structure, called as n-tuples.
These n-tuples have a simple flat structure of vectors of basic types like integers and floats.
Making n-tuples is a time consuming process and it takes about a week to form n-tuples as
shown in Table~\ref{tab:1}.

Due to the large volume and significant processing time, the n-tuples are not used directly in the end-user analysis and undergo a process of skimming and slimming
(either an event is made smaller or some events are dropped).
This process reduces the data volume to 2 TB and takes about a day.
On average, this process is performed 4 times a year, but can be executed much more frequently as needed.

These reduced n-tuples are then used to analyze a specific channel of
the Dark Matter analysis use case. The outputs of each analysis are plots and tables.
An example of end-user analysis is to count signal and background events,
called ``cut-n-count''. This is achieved by processing the n-tuples
and making optimized selection cuts. In the end, the required plots
 and tables are produced to extract the physics results of the analysis.
 Table~\ref{tab:1} shows different steps, input size and processing time.
 The purpose of our study is to evaluate Spark for such analysis.
%A more sophisticated approach is using machine learning techniques like multi-variate analysis approaches to separate signal and background. These require an intermediate step executed on subsets of the BACON ntuples to train the multi-variate technique and then apply it to the BACON ntuples.

\begin{table}[h]
\begin{tabular}{ l | c | r }
\hline
  Steps & Input size & Processing time \\
  \hline
  Ntupling & 200 TB & ~ 1 week \\
  \hline
  Skimming \&  &2 TB  & ~ 1 day\\
  Slimming & & \\
  \hline
  Analysis &  << 1 TB & 5-30 minutes\\
  \hline
\end{tabular}
\caption{Input size and processing time for each step in 2015.}
  \label{tab:1}
  \end{table}

%Once:
%? Convert data and simulation into format suited for
%interactive analysis (NTuples)
%? Problem: too big for interactive analysis
%? Once or several times:
%? Skimming & Slimming: reduce number of events
%and information content
%? Analysts can explore data and simulation interactively
%? General requirements
%- Apply user queries on pre-calculated quantities
%- Add user-specified calculations ? use for selection as
%well

\squeezeup
\begin{figure}[htbp]
\begin{center}
 \includegraphics[scale=0.5]{flow}
\caption{Frequency of each step in current analysis and our approach to
use Spark to implement each step. }
\label{fig:flow}
\end{center}
\end{figure}
%\squeezeup

%\section{Approach}
%\squeezeup
%
%The input data is structured, and
%2. Skimming and Slimming using Scala at large scale and ASCII and R at small scale
%3. Plots using SparkR

%\squeezeup
\section{Our approach using Spark}
\squeezeup
Spark allows for in-memory distributed data processing, and is an attractive
approach for the cases where repeated analysis is performed on the
same data sets. A supported and tuned installation of Spark is
available at NERSC~\cite{nersc-spark}, which eliminates the need for
the user to master the installation and tuning of the complex Spark system.
In the following discussion, we describe the pros and cons of using
different features of Spark. Most of these observations will be applicable
to other scientific domains. Figure~\ref{fig:flow} shows the mapping of
different steps in the CMS dark matter analysis to the Spark framework. The conversion
step is extremely important because the structure of data defines
what APIs can be used later on in the cut-n-count analysis.

\begin{enumerate}
\item \textit{Input data format: }
Our first step is to convert the data to a form that
can be readily read in Spark DataFrames as shown in Figure~\ref{fig:flow}.
The CMS data is stored in the form of ROOT trees~\cite{root}, which is the most common
data format in HEP, which are not accessible in Spark.
The ROOT data format is hierarchical and complex and Spark DataFrames are tabular.
%Initially data is converted to either JSON or ASCII for experimentation with sparkR and PySpark.
%The nested object structure of the data made it extremely inconvenient to read
%any data using the current API both with sparkR and pySpark.
Hence, we need flattened tables to effectively use the API.
Additionally, we are working to use Avro and HDF5 data formats for this project.
The use of HDF5 will be extremely useful as it is one of the tuned and supported
parallel I/O libraries available on HPC machines.

\item \textit{Operations on multiple DataFrames: }
Currently we read data in multiple logical tables, e.g. all information regarding
a given reconstructed particle type can be found in one table. The nature of this analysis is to
combine the data from different tables
and make several histograms for different particles properties.
In our experience, if we read data from multiple DataFrames,
adding a new column to the DataFrame from another
DataFrame is not supported within Spark. Spark only supports adding new columns within the same DataFrame.
We also use inner joins to combine the DataFrames, but the overhead of the
\texttt{join} operation is extremely overwhelming and we are investigating this issue. % that we are considering to pre-process the data, and
%make amendments to data files before copying data to NERSC.

\item \textit{Spark Operations and API: }
This analysis requires skimming and slimming operation.
The skimming and slimming operation involves the use of functions; \texttt{map}, \texttt{filter} and
\texttt{reduce}.
Scala is a natural choice to implement this operation; Spark is written in Scala and it should out-perform Python on large data sets.
 Spark provides implicit parallelization of the
physicists' data processing algorithms using these functions, thus providing the possibility
of good scaling to large numbers of cores without requiring the
physicists to master complex parallel programming techniques. This
provides important ease of use for ``casual'' programmers, for whom
the goal is rapid turnaround time for analysis, who are usually not
interested in developing specialized parallel programming
skills. The availability of Python and R interface for end user analyses is also a key feature desired by the HEP community.

\item \textit{Scaling: }
We have used Java in our initial experience with Spark to perform filter and reduction
operation and understand the performance and scaling on Cori (NERSC). Using a data set of size 250 GB,
with about 70 million records and applying a simple filter operation,
the Spark implementation provided good
  scaling without requiring either tuning of the implementation or
  developer expertise in parallel algorithms.

  %However, we are working to understand the performance.

\item \textit{Orchestration: }
In Spark, data distribution and task assignment
  is abstracted from the user. There are functions available to
  perform global reduction in a distributed environment, which can be
  readily used.  Such a set-up provides ease-of-programming in a
  distributed environment. It does mean, however, that the user has to
  rely on system optimizations provided by Spark's implementation to
  improve any performance.

\item \textit{Application tuning: } All the transformations in Spark
  are lazy, with delayed calculation of results. The transformations
  applied to the base dataset are remembered by the system
  and only computed when an action is carried out on the dataset.
  This design enables Spark to run more efficiently.
%  For example, it
%  can recognize that a dataset created through \texttt{map} will be
%  used in a reduce and return only the result of the reduce to the
%  driver, rather than the larger mapped dataset.
  Due to this lazy evaluation, it is hard to isolate slow-performing tasks and report
  timing for different stages.

%\item \textit{Wrapped types: } We used the Java DataFrame API, and
%  most of the available operations for our implementation are
%  available through the Java wrapped types. A lot of time was spent in
%  boxing and unboxing.  It is well known that wrapped types perform
%  much slower than the primitive types, as explained in
%  ~\cite{effectivejava}. ``... Changing the declaration of sum from
%  Long to long reduces the run-time from 43 seconds to 6.8 seconds on
%  my machine. The lesson is clear: prefer primitives to boxed
%  primitives, and watch out for unintentional autoboxing
%  ...''~\cite{effectivejava}.  We did a test on Cori to compare the
%  performance of Float and float by running a simple Java program
%  based on the calculations of the classification algorithm, and
%  observed a performance difference of $28\%$ in using unboxed types.
%  Hence, we can achieve better performance with Spark if DataFrames
%  used primitive types.
%%[Boxed type test on Cori? 23 sec Vs 1 min 20 sec. ]
%
%
%\item \textit{Vectorized linear algebra library: } We experienced slow
%  performance for repetitive numerical computations in Spark due to
%  unavailability of high performance linear algebra library.  In the
%  MPI implementation, we used Armadillo, which is a high quality C++
%  linear algebra library.  The use of advanced libraries without data
%  conversions, and the ability to use vectorized hardware allowed MPI
%  implementation to perform exceedingly well as compared with Spark.

\end{enumerate}
\squeezeup
\textbf{Participation}: I will attend the conference.

\textbf{Bio:}
Saba Sehrish is a Computer Science Researcher
at Fermilab. She did her post doc at Northwestern University
and PhD at University of Central Florida.
Her research interests include HPC and Big data computing for HEP analysis.
She is currently working on Spark for HEP and also a team member of the LArSoft
project at Fermilab.

\textbf{Collaborators: }
Matteo Cremonesi, Oliver Gutsche, Jim Kowalkowski,
Cristina Mantilla, Marc Paterno, Jim Pivarski, Alexey Svyatkovskiy
\squeezeup
\scriptsize

%Bibliography and References Cited
\bibliography{earlycareerbib,ssio}

\end{document}