PiWi/eval.tex at master · Eshcar/PiWi · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
We  compare the \sys\/ prototype to RocksDB -- a mature industry-leading KV-store implementation  -- under a variety of scenarios. \remove{RocksDB is used as storage layer of multiple popular SQL and NoSQL databases, e.g., MyRocks~\cite{MyRocks} (incarnation of MySQL) and MongoRocks~\cite{MongoRocks} (incarnation of MongoDB). RocksDB is an LSM-tree that is highly optimized for both read and write scenarios. For example, its compaction scheduling policies are highly tuned to minimize impact on mainstream data access.} We use the most recent RocksDB release 5.17.2, available Oct 24, 2018.
It is worth noting that RocksDB's performance improved significantly during the last two years, primarily through
optimized LSM compaction algorithms~\cite{CallaghanCompaction}.

The experiment setup is described in \cref{ssec:setup}.
Performance results for \sys\ and RocksDB in
different workloads are presented in \cref{ssec:results}.
\cref{ssec:drill} analyzes \sys's sensitivity to different configuration settings and application parallelism.
%\cref{ssec:recover} evaluates its recovery mechanism.


\subsection{Setup}
\label{ssec:setup}

\paragraph{Testbed.} We employ a C++ implementation~\cite{Cpp-YCSB} of YCSB~\cite{YCSB}, the  de facto standard
benchmarking platform for KV-stores.
YCSB provides a standard set of APIs and a workload suite.
% Most modern KV-stores implement  YCSB adapter API's.
The platform decouples data access from workload generation,
thereby providing common ground for backend comparison.

A workload is defined by  (1) the ratio of get, put, and scan accesses, and (2) access pattern defined as a synthetic distribution of keys.
YCSB provides a basic set of benchmarks  inspired by real-life applications, and allows developing new ones through workload
generation APIs. A typical YCSB instance stress-tests the backend KV-store through a pool of concurrent worker threads that drive identical
workloads. %It aggregates the key performance metrics for the scenario under test.

Our hardware is a 12-core Intel Xeon 5 machine with 4TB SSD disk. Unless otherwise stated, the driver application
exercises 12 workers in all experiments. In order to guarantee a fair memory allocation to all KV-stores, we run each
experiment within a Linux container with 16GB RAM.

\paragraph{Metrics.} Our primary performance metric is \emph{throughput}
and \emph{latency percentiles}, as produced by YCSB.
In addition, we measure \emph{write amplification}, namely, bytes written to storage over bytes passed from the application.
In order to explain the results, we also explore \emph{read amplification} (in terms of bytes as well as number of system calls per application read).
%as well as the number of read \inred{ and write} system calls.
The OS performance counters are retrieved from the Linux proc filesystem.

\paragraph{Workloads.}
We scale the dataset size from 4GB to 64GB, in order to exercise multiple locality
scenarios with respect to the available 16GB of RAM. Similarly to the published RocksDB benchmarks~\cite{RocksDBPerf}, we use
10-byte keys that YCSB pads with a fixed 4-byte prefix (effectively, 14-byte keys), and 800-byte values.
The data is stored uncompressed.

We study two different key-access distributions:

\begin{enumerate}
\item {\em Zipf-simple} -- the standard YCSB Zipfian distribution over simple (non-composite) keys.
Key access frequencies are sampled from the heavy-tailed Zipf distribution,
following the description in~\cite{Gray:1994:QGB:191839.191886}, with $\theta = 0.8$.
The ranking is over a random permutation of the entire key range, so popular keys are uniformly dispersed.
% The key locations are sampled uniformly at random from the whole data range.
This workload exhibits
medium temporal locality (e.g., the most popular key's frequency is approximately $0.7\%$),
and no spatial locality.
%This is a standard YCSB workload that captures a multitude of use cases  -- e.g., a web page cache distribution by URL.

\item {\em Zipf-composite}  -- a Zipfian distribution over composite keys.
The key's $14$ most significant bits comprise the primary attribute. Both the primary attribute and
the remainder of the key are drawn from Zipf ($\theta=0.8$) distributions over their ranges.
Zipf-composite exhibits high spatial locality, representing workloads
with composite keys, such as message threads~\cite{Borthakur:2011:AHG:1989323.1989438},
social network associations~\cite{Armstrong:2013:LDB:2463676.2465296}, and analytics engines~\cite{flurry}.

\remove{
\item {\em Latest-simple} -- frequent access to recently added simple keys.
Keys are inserted in sequential  order; the read keys' distribution is skewed towards recently added ones.
Specifically, the sampled key's position wrt the most recent key is distributed Zipf. This is a
standard YCSB workload with medium spatial and temporal locality that represents, e.g., status updates and reads.
}
\end{enumerate}

The workloads exercise different mixes of puts, gets, and scans. We use standard YCSB scenarios
(A to E) that range from write-savvy ($50\%$ puts) to read-savvy ($95\%-100\%$ gets or scans) settings.
In order to stress the system even more on the write side, we introduce a new workload, named
YCSB-P, comprised of $100\%$ puts. It captures a heavy-duty data load scenario (e.g., from an
external data pipeline).

\paragraph{Evaluation methodology and configuration.} Each experiment consists of two stages. The first stage builds
the dataset, by filling an initially empty store with a sequence of KV-pairs, ordered by key. The second
phase exercises the specific scenario; all worker threads follow the same access pattern. Most experiments
perform 80 million data accesses. The experiments that run scans perform 4 to 16 million accesses, depending
on the scan size. We run 5 experiments for each data point, and present the median metric measurement.

%\paragraph{Configuration.}
We only present the performance metrics in asynchronous logging mode, since synchronous logging
is approximately 10 times slower, thereby trivializing the results of every scenario that includes puts.

All experiments in~\cref{ssec:results} use the same \sys\/ configuration, to avoid
over-tuning; \cref{ssec:drill} provides insights on parameter choices.
We allocate 8GB to munks and 4GB to the row cache,
so together they consume 12GB out of the 16GB container.
The row cache uses three hash tables.
The Bloom filters for munk-less funks are partitioned 16-way.
We set the \sys\/ chunk size limit to 10MB, and the rebalance threshold factor to $0.7$ -- i.e.,
chunks can grow up to 7MB  %(approximately 7000 KV-pairs)
before being compacted. The funk log size limit is 2MB {\inred{for munk-less chunks,
and 20MB for chunks with munks}}.

% \inred{for write intensive workloads, and 512KB for other workloads.}
We use RocksDB with the default configuration, which is also used by its public performance benchmarks~\cite{RocksDBPerf}.

\subsection{\sys\ versus \ RocksDB}
\label{ssec:results}

\begin{figure*}[tb]
\centering
\begin{subfigure}{0.33\linewidth}
\includegraphics[width=\textwidth]{figs/Workload_P_line.pdf}
\caption{YCSB-P -- put-only}
\label{fig:throughput:p}
\end{subfigure}
\begin{subfigure}{0.33\linewidth}
\includegraphics[width=\textwidth]{figs/Workload_A_line.pdf}
\caption{YCSB-A -- mixed get-put}
\label{fig:throughput:a}
\end{subfigure}
\begin{subfigure}{0.33\linewidth}
\includegraphics[width=\textwidth]{figs/Workload_C_line.pdf}
\caption{YCSB-C -- get-only}
\label{fig:throughput:c}
\end{subfigure}
%\begin{subfigure}{0.3\linewidth}
%\includegraphics[width=\textwidth]{figs/Workload_B_line.pdf}
%\caption{YCSB-B}
%\label{fig:throughput:b}
%\end{subfigure}
%\begin{subfigure}{0.3\linewidth}
%\includegraphics[width=\textwidth]{figs/Workload_D_line.pdf}
%\caption{YCSB-D}
%\label{fig:throughput:d}
%å\end{subfigure}
\begin{subfigure}{0.33\linewidth}
\includegraphics[width=\textwidth]{figs/Workload_E-_line.pdf}
\caption{YCSB-E10 -- short scans}
\label{fig:throughput:e10}
\end{subfigure}
\begin{subfigure}{0.33\linewidth}
\includegraphics[width=\textwidth]{figs/Workload_E_line.pdf}
\caption{YCSB-E100 -- medium scans}
\label{fig:throughput:e100}
\end{subfigure}
\begin{subfigure}{0.33\linewidth}
\includegraphics[width=\textwidth]{figs/Workload_E+_line.pdf}
\caption{YCSB-E1000 -- long scans}
\label{fig:throughput:e1000}
\end{subfigure}
\begin{subfigure}{\linewidth}
\centerline{
\includegraphics[width=0.9\textwidth]{figs/legend.pdf}
\vspace{-5mm}
}
\end{subfigure}
\caption{
{\sys\/ versus RocksDB throughput (Kops), under multiple workloads with Zipf distributions over composite (two-component) and
simple (non-composite) keys and scaling dataset sizes.}
}
\label{fig:throughput}
\end{figure*}

Figure~\ref{fig:throughput} presents throughput measurements under the Zipf-simple and
Zipf-composite key distributions in the put-only workload YCSB-P, the mixed YCSB-A,
the get-only YCSB-C, and the scan-heavy YCSB-E with three different scan
sizes. YCSB-B is omitted because the results are very close to those for YCSB-C.
For completeness, we also experiment with YCSB-D that exercises the
Latest workload -- frequent reads of the recently added simple keys. The results are
also close to YCSB-B, and omitted as well. {\inred{See the YCSB-B and YCSB-D
experiments in the supplementary material.}}

We now discuss the results for the different scenarios.

%YCSB-D exercises the Latest-simple  access pattern and its throughput is presented in Figure~\ref{fig:throughput:d}.

{\bf Put-only.}
In {YCSB-P} (100\% put, Figure~\ref{fig:throughput:p}),
\sys's throughput is 1.3x to 1.6x versus RocksDB's with composite keys,
and 1.1x to 1.5x with simple keys.
\sys\ benefits from spatial locality whereas RocksDB's write performance
is relatively insensitive to it, as is typical for LSM stores.

This workload's bottleneck is the reorganization of persistent data
(funk rebalances in \sys, compactions in RocksDB), which causes
write amplification and hampers performance.
%The overhead manifests in disk write rates, which \sys\/ reduces through better use of spatial locality.
For small datasets (8GB or less), \sys\/ accommodates all puts in munks, and so
funk rebalances are rare. In big datasets, funk rebalances do occur, but mostly in munk-less
chunks, which are accessed less frequently than popular ones.
Hence, in both cases, funk rebalances incur less I/O than RocksDB's compactions, which do not
distinguish between hot and cold data.

{\inred{\sys\/'s high log size limit for chunks with munks
also reduces I/O by decelerating munk flushes. This does not impact recovery time since \sys\/
does not replay its logs.}}

The write amplification in this workload is summarized in
Table~\ref{fig:writeamp}. We see that \sys\/ reduces the disk write rate dramatically,
with the largest gain observed for big datasets (e.g., $1.29$ versus $2.92$ for the 64GB
database under Zipf-composite). Thus, \sys\/ also reduces disk wear.


\begin{table}[htb]
\centerline{
{\small{
\begin{tabular}{lccccc}
\hline
 & 4GB & 8GB & 16GB & 32GB & 64GB \\
\hline
Zipf-composite: &  \multicolumn{5}{c}{}  \\
RocksDB & 1.93	& 2.15 & 2.29 & 2.54	& {\bf {2.92}}\\
\sys &  1.42	 & 1.38	& 1.27	& 1.30	& { {\bf 1.29}}\\
\hline
Zipf-simple: &  \multicolumn{5}{c}{}   \\
RocksDB & 1.94 & 2.20	& 2.69 & 3.03 &	2.77 \\
\sys & 1.41& 1.37	& 1.15	& 1.13	& 1.31 \\
\hline
\end{tabular}
}}
}
%\centering
%\hspace{0.05\linewidth}
%\begin{subfigure}{0.3\linewidth}
%\includegraphics[width=0.3\textwidth]{figs/P_write_amplification_disk_line.pdf}
%\caption{YCSB-P}
%\label{fig:writeamp:p}
%\end{subfigure}
%\hspace{0.05\linewidth}
%\begin{subfigure}{0.3\linewidth}
%\includegraphics[width=\textwidth]{figs/Workload_A_writeamplification_disk_line.pdf}
%\caption{YCSB-A}
%\label{fig:writeamp:a}
%\end{subfigure}
\caption{{\sys\/ versus RocksDB write amplification under the put-only workload (YCSB-P) and scaling dataset sizes.}}
\label{fig:writeamp}
\end{table}

{\bf Mixed put-get.} YCSB-A (50\% put, 50\% get, Figure~\ref{fig:throughput:a})
 is particularly challenging for \sys\/ because it exercises a high contention between gets and puts.
Here, the bottleneck is  disk search in gets, primarily the linear search in funk logs
that keep filling up due to concurrent puts. (In-memory searches are three orders of magnitude faster.)

\sys\/ achieves $1.3$x to $2.2$x throughput versus RocksDB with composite keys,
thanks to better exploitation of spatial locality. Figure~\ref{fig:tail_latency:co} shows that \sys\/
is also faster wrt the tail (95\%) put and get latency in this scenario. With simple keys,
\sys\/ is faster than RocksDB  for small datasets that fit in memory, and slower for big datasets.
{\inred{Figure~\ref{fig:tail_latency:si} zooms in on this setting: while \sys\/ is on par with
RocksDB wrt the put latency in this setting, it falls short wrt the get latency.}}

When the working set does not fit in RAM, disk access dominates the tail latencies.
Figure~\ref{fig:readstat:dist} depicts the distribution of gets by the storage  component
that fulfils the request, and Figure~\ref{fig:readstat:lat} presents the disk search latencies by component.
For example, for the 64GB dataset, $1.9\%$ of gets are served from logs under Zipf-composite, versus $4.0\%$ under Zipf-simple,
and the respective log search latencies are $1.7$ ms vs $4.4$ ms. This happens because in the latter, puts are more dispersed,
hence the funks are (1) cached less effectively by the OS, and (2) rebalanced less frequently so their logs grow longer.

\remove{
\begin{figure*}[tb]
\centering
\begin{subfigure}{0.36\linewidth}
\vskip .15in
\includegraphics[width=\textwidth]{figs/tail_line.pdf}
\vskip .55in
\caption{95-percentile latency, put and get}
\label{fig:tail_latency:lat}
\end{subfigure}
\begin{subfigure}{0.32\linewidth}
\includegraphics[width=\textwidth]{figs/Time_percentage_A.pdf}
\caption{Fraction of get accesses, by component}
\label{fig:tail_latency:dist}
\end{subfigure}
\begin{subfigure}{0.31\linewidth}
\includegraphics[width=\textwidth]{figs/Latency_A.pdf}
\caption{On-disk get latency, by component}
\label{fig:tail_latency:disk}
\end{subfigure}
\label{fig:tail_latency}
\caption{{\sys\/ tail latency analysis, under a mixed get-put workload (YCSB-A).}}
\end{figure*}
}

\begin{figure}[htb]
\centering
\begin{subfigure}{0.49\linewidth}
\includegraphics[width=\textwidth]{figs/tail_flurry_line.pdf}
\caption{Zipf-composite}
\label{fig:tail_latency:co}
\end{subfigure}
\begin{subfigure}{0.49\linewidth}
\includegraphics[width=\textwidth]{figs/tail_zipf_line.pdf}
\caption{Zipf-simple}
\label{fig:tail_latency:si}
\end{subfigure}
\caption{{\sys\/ versus RocksDB 95\% latency (ms), under a mixed get-put workload (YCSB-A).}}
\label{fig:tail_latency}
\end{figure}

\begin{figure}[htb]
\centering
%\hspace{0.05\linewidth}
\begin{subfigure}{0.5\linewidth}
\includegraphics[width=\textwidth]{figs/Time_percentage_A.pdf}
\vskip .1in
\caption{Fraction of get accesses}
\label{fig:readstat:dist}
\end{subfigure}
%\hspace{0.05\linewidth}
\begin{subfigure}{0.49\linewidth}
\includegraphics[width=\textwidth]{figs/Latency_A.pdf}
\caption{On-disk get access latency}
\label{fig:readstat:lat}
\end{subfigure}
\caption{{\sys\/ 95\% tail latency analysis by the serving component, under a mixed get-put workload (YCSB-A).}}
\label{fig:readstat}
\end{figure}

Naturally, the efficiency of in-memory caching is paramount. In the same example, the RAM hit rate
(munk and row cache combined) is
$90.5\%$ with composite keys versus $79\%$ without them. The row cache becomes instrumental as spatial locality drops --
 it serves $32.8\%$ of gets for Zipf-simple versus $12.4\%$ for Zipf-composite.

{\bf Scan-dominated.} In YCSB-E (5\% put, 95\% scan, Figures~\ref{fig:throughput:e10}-~\ref{fig:throughput:e1000}).
the number of items to iterate through is
sampled uniformly in the range [1,S], where S is either 10, 100, or 1000.
This workload (except with very short scans on very large datasets) is the best for \sys, since it exhibits the spatial locality the system has been designed for.
\sys\/ achieves $1.25$x to $3$x throughput versus RocksDB under Zipf-composite, and in Zipf-simple, improves over RocksDB in small datasets.
{\inred{\cref{ssec:drill} discusses how the scan performance on big datasets could be improved through dynamic adaptation of the
funk log size limit. }}

%$1.5$x to 2x under
\remove{
Note that the scan speed is  higher for the 8GB dataset versus the 4GB dataset.
Both are quite fast as they are served from memory, but in the smaller dataset there is more contention
between scans and puts, which slows down   progress. }
%This phenomenon becomes insignificant with longer scans.

{\bf Get-only.}
%YCSB-B, YCSB-C and YCSB-D are all get-dominated.
We see that in YCSB-C (100\% get, Figure~\ref{fig:throughput:c}),
\sys\/ achieves $1.6$x to $2.1$x throughput versus RocksDB with composite keys,
and up to $1.6$x   with simple ones.
%{\inred{YCSB-B (5\% put, 95\% get) and YCSB-D (5\% put, 95\% get, Latest-simple distribution,
%which achieve similar results, %Figure~\ref{fig:throughput:d})
%are omitted. }}

Since YCSB-C only exercises the read path, performance hinges on
caching efficiency. Both \sys\ and RocksDB benefit from  OS (filesystem) caching in addition to application-level caches.
Table~\ref{fig:readamp} compares the two systems' read amplification (as a proxy for cache hit ratio) with composite keys,
in terms of (1) actual disk bytes read and (2) read system calls.  Under the first,
RocksDB is slightly better in large datasets. However, it relies extensively on the OS cache -- in the 64GB dataset,
%, at the expense of the user-level block cache. In the same setting,
it performs almost 12 times as many system calls as \sys,  wasting the CPU resources on
kernel-to-user data copy. RocksDB developers explain that the block cache scaling potential is limited in their
database, due to tension between its read-path and write-path RAM resources~\cite{RocksDB-default-blockcache-issue}.
\sys, in contrast, exploits its munk cache for both reads and writes, which leads to better RAM utilization.

\begin{table}[htb]
{\small{
\begin{tabular}{lccccc}
\hline
& 4GB & 8GB & 16GB & 32GB & 64GB \\
\hline
RocksDB, bytes &  0.05 &	0.11 & 0.20 & 0.47 & 0.95\\
\sys, bytes &  0.0 &	0.0 &	0.11	& 0.80	& 1.07 \\
\hline
RocksDB, syscall & 3.84	& 4.01	& 4.14	& 4.28	& {\bf {4.37}} \\
\sys, syscall  & 0.0 & 0.0	& 0.10 & 0.24 & {\bf {0.41}} \\
\hline
\end{tabular}
}}
\caption{{\sys\/ versus RocksDB read amplification, in terms of bytes and system calls,
under a read-only workload (YCSB-C) with Zipf-composite distribution.}}
\label{fig:readamp}
\end{table}


\remove{
\begin{figure}[t]
\centering
\includegraphics[width=0.3\textwidth]{figs/Workload_D_line.pdf}
\caption{{\sys\/ versus RocksDB throughput, under the YCSB-D workload and scaling dataset sizes.}}
\label{fig:throughput:d}
\end{figure}
}

\subsection{\sys\ Insights}
\label{ssec:drill}

\paragraph{Vertical scalabiliy.}
Figure~\ref{fig:scalability} illustrates \sys's throughput scaling for the 64GB dataset under both Zipf
distributions. We exercise the YCSB-A, YCSB-C and YCSB-P scenarios, with 1 to 12 worker threads.
As expected, YCSB-C (read-only scenario) scales nearly perfectly (9.4x for composite keys, 7.7x for simple ones).
The other workloads scale slower, due to read-write and write-write contention as well as background munk and funk rebalances.
%YCSB-P scales 3.8x for both distributions at 8 threads. YCSB-A scales 6.4x for Zipf-composite and 4.6x for Zipf-simple at 12 threads.

\begin{figure}[th]
\centering
\includegraphics[width=0.35\textwidth]{figs/scalability_line.pdf}
\caption{{\sys\/ scalability with the number of threads for
the 64GB dataset and different workloads. }}
\label{fig:scalability}
\end{figure}

\remove{
\begin{table*}
\centering
%\begin{subfigure}{0.55\linewidth}
{\small{
\begin{tabular}{l|cccccc|c|ccccc|}
\cline{2-7} \cline{9-13}
  & \multicolumn{6}{c|}{Maximum log size} & & \multicolumn{5}{c|}{Bloom filter split factor}\\
%\cline{2-7} \cline{9-13}
& 128KB & 256KB & 512KB & 1MB & 2MB & 4MB & &1 & 2 & 4 & 8 & 16 \\
\cline{2-7} \cline{9-13}
Zipf-composite: & 50.2	& 55.5 & 80.7	& 134.1 & {\bf {157.5}} & 149.5 & & 134.7 & 133.5 & 140.1 & 152.5 & {\bf {157.4}}   \\
Zipf-simple:    & 27.8	& 30.9 & 36.1	& 58.3  & {\bf {68.3}}   & 68.1   & &  36.3 & 39.2   & 46.3  & 56.0  & {\bf {59.9}}\\
%\hline
%YCSB-E, Zipf-range & 29.1	& 36.4 & 36.9	& 37.3	& {\inred{30.4}} & 	37.1 \\
%YCSB-E, Zipf & 16.1 & 16.3 &	15.8	& 15.8 &	16.4 &	15.8 \\
\cline{2-7} \cline{9-13}
\end{tabular}
}}
%\caption{Throughput (Kops) vs log size}
%\label{fig:wal:sz}
%\end{subfigure}
%\hspace{0.09\linewidth}
%\begin{subfigure}{0.35\linewidth}
%{\small{
%\begin{tabular}{|l|c|c|c|c|}
%\hline
%1 & 2 & 4 & 8 & 16\\
%\hline
%134.7 & 133.5 & 140.1 & 152.5 & {\bf {157.4}} \\
% 36.3 & 39.2 & 46.3 & 56.0 & {\bf {59.9}} \\
%\hline
%\end{tabular}
%}}
%\caption{Throughput (Kops) vs Bloom filter split factor}
%\label{fig:wal:bf}
%\end{subfigure}
\caption{{\sys\/ throughput under a mixed get-put workload (YCSB-A), 64GB dataset, and different configuration parameters.}}
\label{fig:wal}
\end{table*}
}

\begin{figure}[htb]
\centering
\begin{subfigure}{0.49\linewidth}
\includegraphics[width=\textwidth]{figs/max_log_size_line.pdf}
\caption{Maximum log size,\\  YCSB-A and YCSB-E100}
\label{fig:params:log}
\end{subfigure}
%\hspace{0.05\linewidth}
\begin{subfigure}{0.49\linewidth}
\includegraphics[width=\textwidth]{figs/Bloom_filter_line.pdf}
\caption{Bloom filter split factor,\\ YCSB-A}
\label{fig:params:bf}
\end{subfigure}
\caption{{\sys\/ throughput sensitivity to configuration parameters, on the 64GB dataset under the YCSB-A (mixed put-get) and
YCSB-E100 (scan-dominant, 1 to 100 items).}}
\label{fig:params}
\end{figure}

\paragraph{Configuration parameters.}
We explore the system's  sensitivity to funk-log configuration parameters, for the most challenging 64GB dataset,
{\inred{and explain the choice of the default values.}}

Figure~\ref{fig:params:log} depicts the throughput's dependency on the log size limit {\inred{for munk-less funks,
under the YCSB-A and YCSB-E100 workloads with Zipf-composite key distribution.
The fraction of puts in YCSB-A is 50\% (versus 5\% in YCSB-E), which makes it more sensitive to the log size. }}
A low threshold (e.g., 128KB) causes frequent funk rebalances, which degrades the performance more than 3-fold.
On the other hand, too high a threshold (4MB) lets the logs grow bigger, and slows down gets. Our experiments are
tuned to use 2MB logs, which favors write-intensive workloads. {\inred{YCSB-E favors smaller logs, since the write
rate is low, and more funk rebalances can be accommodated. Its throughput could grow by up to 20\% by tuning to
use 512KB logs.}}

Figure~\ref{fig:params:bf} depicts the throughput dependency on the Bloom filter split factor, under YCSB-A.
Partitioning to 16 mini-filters gives the best result{\inred{; beyond this point the benefit levels off.
The impact of Bloom filter partitioning on the end-to-end get latency as well as the memory footprint is negligible.}}