diff --git a/Distributed_Complex_Event_Recognition-Arnau_Abela.pdf b/Distributed_Complex_Event_Recognition-Arnau_Abela.pdf deleted file mode 100644 index f665bd7..0000000 Binary files a/Distributed_Complex_Event_Recognition-Arnau_Abela.pdf and /dev/null differ diff --git a/docs/thesis/chapters/chapter_3.tex b/docs/thesis/chapters/chapter_3.tex index e2369d0..2b85d50 100644 --- a/docs/thesis/chapters/chapter_3.tex +++ b/docs/thesis/chapters/chapter_3.tex @@ -144,9 +144,9 @@ \section{Computational model}\label{sec:cea} \section{Timed Enumerable Compact Set}\label{sec:data_structure} -The data structure that lazily represents the set of partial matches in CORE is called \acrfull{tecs}. A tECS is a \emph{directed acyclic graph (DAG)} $\mathcal{E}$ with two types of nodes: \emph{union nodes} and \emph{non-union nodes}. Union nodes have two children: \code{left} and \code{right}. Non-union nodes are labelled by a stream position and are divided in \emph{output nodes} and \emph{bottom nodes}. The former have exactly one child and the latter have none. To simplify presentation in what follows, we write nodes of any kind as \textrm{n}, bottom, output and union nodes as \textrm{b, o, u}, respectively, and we denote the sets of bottom, output and union nodes by $N_{B}$, $N_{O}$ and $N_{U}$, respectively. - -For a node \textrm{n}, define its \emph{descending-paths}, denoted \code{paths(\textrm{n})}, recursively as follows: if \textrm{n} is a bottom node, then \code{paths(\textrm{n})} = 1; if \textrm{n} is an output node, then \code{paths(\textrm{n}) = paths(next(\textrm{n}))}; if \textrm{n} is a union node, \code{paths(\textrm{n}) = paths(left(\textrm{n})) + paths(right(\textrm{n}))}. Every node \textrm{n} carries paths(\textrm{n}) as an extra label; thus the descending-paths can be retrieved in constant time. The descending-paths attribute is going to be used during the enumeration phase of the distributed evaluation algorithm to balance the workload of each processing unit. 
+The data structure that lazily represents the set of partial matches in CORE is called \acrfull{tecs}. A tECS is a \emph{directed acyclic graph (DAG)} $\mathcal{E}$ with two types of nodes: \emph{union nodes} and \emph{non-union nodes}. Union nodes have two children: \code{left} and \code{right}. Non-union nodes are labelled by a stream position and are divided into \emph{output nodes} and \emph{bottom nodes}. The former have exactly one child and the latter have none. +{\color{Comment} We should integrate descending-paths here (or at least mention it)}. +To simplify presentation in what follows, we write nodes of any kind as \textrm{n}, bottom, output and union nodes as \textrm{b, o, u}, respectively, and we denote the sets of bottom, output and union nodes by $N_{B}$, $N_{O}$ and $N_{U}$, respectively. {\color{Comment} missing definition of $pos$, $next$, $right$ and $left$}. An \emph{open complex event}, denoted $(i, D)$, is a complex event $([i, j], D)$, where the closing event $j$ hasn't been reached yet. A \acrshort{tecs} represents sets of open complex events. Let $\bar{p} = n_{1}, n_{2}, \ldots, n_{k}$ be a \emph{full-path} in $\mathcal{E}$ such that $n_{k}$ is a bottom node. Then $\bar{p}$ represents the open complex event ${\llbracket \bar{p} \rrbracket}_{\mathcal{E}} = (i, D)$, where $i = pos(n_{k})$ is the label of bottom node $n_{k}$ and $D$ are the labels of the other non-union nodes in $\bar{p}$. Given a node \textrm{n} in $\mathcal{E}$, ${\llbracket \textrm{n} \rrbracket}_{\mathcal{E}}$ is the set of open complex events represented by \textrm{n} and consists of all open complex events ${\llbracket \bar{p} \rrbracket}_{\mathcal{E}}$ with $\bar{p}$ a full-path in $\mathcal{E}$ starting at \textrm{n}. @@ -240,6 +240,44 @@ \section{Auxiliary data structures}\label{sec:auxiliary_data_structure} We assume that hash table lookups and insertions take constant time, and iterating over has constant delay.
+{\color{Added} +\section{Descending-paths}\label{sec:desc-path} + +{\color{Comment} I don't think we need an exclusive section for this, but when I tried to integrate this into the description of the tECS it was too much information interrupting the flow of the description} + +For a node \textrm{n}, define its \emph{descending-paths}, denoted \code{paths(\textrm{n})}, as a binary relation $R$ between event positions and the number of paths starting from that position that reach \textrm{n}. This binary relation is defined recursively as follows: + +\[ +paths(\textrm{n}) = +\begin{cases} + \{(pos(\textrm{n}), 1)\} &\quad \textrm{n} \in N_{B} \\ + paths(next(\textrm{n})) &\quad \textrm{n} \in N_{O} \\ + paths(left(\textrm{n})) \cup paths(right(\textrm{n})) &\quad \textrm{n} \in N_{U} \\ +\end{cases} +\] + +We define the union of two binary relations $R := R_{1} \cup R_{2}$ as follows: + +\begin{align*} + R := & \{ \ (x_{1}, y_{1} + y_{2}) \ | \ (x_{1}, y_{1}) \in R_{1}, (x_{2}, y_{2}) \in R_{2}, x_{1} = x_{2} \ \} \\ + & \cup \{ \ (x_{1}, y_{1}) \ | \ (x_{1}, y_{1}) \in R_{1}, \nexists (x_{2}, y_{2}) \in R_{2} : x_{1} = x_{2} \ \} \\ + & \cup \{ \ (x_{2}, y_{2}) \ | \ (x_{2}, y_{2}) \in R_{2}, \nexists (x_{1}, y_{1}) \in R_{1} : x_{1} = x_{2} \ \} +\end{align*} + +For example, if node \textrm{n} has two paths, one starting at position $0$ and the other starting at position $2$, then $paths(\textrm{n}) = \{(0, 1), (2, 1)\}$. The descending-paths attribute is used during the enumeration phase of the distributed evaluation algorithm to balance the workload of each processing unit. + +Every node \textrm{n} carries \code{paths(\textrm{n})} as an extra attribute; thus the descending-paths can be retrieved in constant time. This attribute will contain a pointer to a hash table that encodes the descending-path binary relation efficiently. Additionally, we define the method \code{pathsc(n, $\tau$)} that counts the number of paths starting at or after position $\tau$.
This method will be useful for the enumeration phase of the evaluation algorithm (Section~\ref{sec:enumeration}). + +Note that the size of this hash table may grow linearly with respect to the length of the stream. During the evaluation algorithm (Section~\ref{sec:evaluation}), to keep the size constant, we will only preserve paths that are inside the time window. The hash table can be efficiently generated as follows: + +\begin{itemize} + \item Bottom nodes will create a new hash table with a single entry corresponding to the current position. + \item Output nodes will point to the hash table of their child node. + \item Union nodes will create a fresh hash table and initialize it with the union of \code{paths(left(\textrm{u}))} and \code{paths(right(\textrm{u}))}, restricted to the pairs $(x, y)$ such that $x \ge \tau$, where $\tau = j - \epsilon$, $j$ is the current position and $\epsilon$ is the time window. +\end{itemize} + +All three operations can be computed in time $\mathcal{O}(\epsilon)$, which is constant for a fixed time window. +} \section{Chapter summary} diff --git a/docs/thesis/chapters/chapter_5.tex b/docs/thesis/chapters/chapter_5.tex index 22b2c06..f2cd1e5 100644 --- a/docs/thesis/chapters/chapter_5.tex +++ b/docs/thesis/chapters/chapter_5.tex @@ -151,7 +151,7 @@ \section{The Enumeration procedure}\label{sec:enumeration} \SetKwProg{Procedure}{procedure}{}{} \SetKwFunction{Enumerate}{\textsc{Enumerate}} \Procedure{\Enumerate{$\mathcal{E}, n, \epsilon, j, p$}}{ - $\delta \leftarrow \lceil \text{paths(n)} \ / \ {|\mathcal{P}|} \rceil$\; + {\color{Added}$\delta \leftarrow \lceil \text{pathsc(n, }j - \epsilon) \ / \ {|\mathcal{P}|} \rceil$}\; $\sigma \leftarrow \text{index}(p) \cdot \delta$\; st $\leftarrow$ new-stack()\; $\tau \leftarrow j - \epsilon $\; @@ -168,23 +168,21 @@ \section{The Enumeration procedure}\label{sec:enumeration} $P \leftarrow P \ \cup $ {pos($n'$)}\; $n' \leftarrow $ next($n'$)\; } - \ElseIf{$n' \in N_{U}$}{ - \If{$max(right(n')) \ge \tau$}{ - \eIf{$paths(left(n')) > \sigma'$}{ - $\delta'' \leftarrow \delta' - 
max(0, paths(left(n')) - \sigma')$\; - }{ - $\delta'' \leftarrow \delta'$\; + {\color{Added} + \ElseIf{$n' \in N_{U}$}{ + \If{$max(right(n')) \ge \tau$}{\label{line:enumeration:if:1} + $\delta'' \leftarrow \delta' - max(0, pathsc(left(n'), \tau) - \sigma')$\; + \If{$\delta'' > 0$}{\label{line:enumeration:if:2} + $\sigma'' \leftarrow \sigma' - pathsc(left(n'), \tau)$\; + push(st, $\langle$right($n'$), $P$, $\sigma''$, $\delta''\rangle$)\; + } } - $\sigma'' \leftarrow max(0, \sigma' - paths(left(n')))$\; - \If{$paths(right(n')) > \sigma'' \land \delta'' > 0$}{\label{line:enumeration:if:1} - push(st, $\langle$right($n'$), $P$, $\sigma''$, $\delta''\rangle$)\; + \eIf{$pathsc(left(n'), \tau) > \sigma'$}{\label{line:enumeration:if:3} + $n' \leftarrow left(n')$\; + }{ + \textbf{break}\; } } - \eIf{$paths(left(n')) > \sigma'$}{\label{line:enumeration:if:2} - $n' \leftarrow left(n')$\; - }{ - \textbf{break}\; - } } } } @@ -219,9 +217,13 @@ \section{The Enumeration procedure}\label{sec:enumeration} \end{figure} -\aref{algo:enumeration} receives as an input a \acrshort{tecs}, a node $n$, a time-window $\epsilon$, a position $j$, and a process $p$ and traverses a fraction of $G_{\mathcal{E}}$ in a DFS-way left-to-right order. First, computes the parameters $\sigma, \delta$ corresponding to the starting and ending path to enumerate, respectively. The value of these parameters can be computed statically i.e. without message interchanging. Each iteration of the \code{while} of line~\ref{line:enumeration:while:1} traverses a new path starting from the point it branches from the previous path (except for the first iteration). For this, the stack $st$ is used to store the node and partial complex event of that branching point. Then, the \code{while} of line~\ref{line:enumeration:while:2} traverses through the nodes of the next path, following the left direction whenever a union node is reached and adding the right node to the stack whenever need. 
The \code{if}s of line~\ref{line:enumeration:if:1}~and~line~\ref{line:enumeration:if:2} make sure that enumeration starts on path $\pi_{\sigma}$ so only $paths_{\ge j - \epsilon, \sigma, \delta}$ are traversed. Moreover, by checking for every node $n'$ its value $max(n')$ before adding it to the stack, it makes sure of only going through paths in $paths_{\ge j - \epsilon}$. +{\color{Added} \aref{algo:enumeration} receives as input a \acrshort{tecs}, a node $n$, a time-window $\epsilon$, a position $j$, and a process $p$ and traverses a fraction of $G_{\mathcal{E}}$ in depth-first, left-to-right order. First, it computes the parameters $\delta$ and $\sigma$, the number of paths to enumerate and the index of the first path to enumerate, respectively, using the method \code{pathsc(n, $j - \epsilon$)}. This method can be executed in constant time. We remark that the values of $\sigma$ and $\delta$ can be computed statically and locally on each process, without exchanging messages, before the execution of the algorithm.} + +The enumeration starts from the root node $n$. Each iteration of the \code{while} of line~\ref{line:enumeration:while:1} traverses a new path starting from the point it branches from the previous path. The stack $st$ is used to store the node, the partial complex event, and the parameters $\sigma, \delta$ of that branching point. {\color{Added} The \code{while} of line~\ref{line:enumeration:while:2} traverses through the nodes of the next path, following the left direction whenever a union node is reached and adding the right node to the stack whenever needed. The \code{if} of line~\ref{line:enumeration:if:1} guarantees that every traversed path belongs to $paths_{\ge \tau}$, i.e., paths outside the time window are skipped. The \code{if}s of line~\ref{line:enumeration:if:2}~and~\ref{line:enumeration:if:3} ensure that we enumerate at most $\delta$ paths and that the enumeration starts on path $\pi_{\sigma}$, respectively. 
All three conditionals guarantee that only paths $\in paths_{\ge \tau, \sigma, \delta}$ are enumerated on process $p$.} +\begin{remark*} A simpler recursive algorithm could have been used, however, the constant-delay output might not be guaranteed because the number of backtracking steps after branching might be as long as the longest path of $G_{\mathcal{E}}$. To guarantee constant steps after branching and assure constant-delay output, \aref{algo:enumeration} uses a stack which allows to jump immediately to the next branch. We assume that storing $P$ in the stack takes constant time. We materialize this assumption by modelling $P$ as a linked list of positions, where the list is ordered by the last element added. To update $P$ with position $i$, we only need to create a node $i$ that points to the previous last element of $P$. Then, storing $P$ on the stack is just storing the pointer of the last element added. +\end{remark*} This concludes the presentation of Algorithm~\ref{algo:enumeration}. In the reminding of this section, we prove properties (1), (2) and (3). diff --git a/docs/thesis/preamble.tex b/docs/thesis/preamble.tex index 0e13e27..9fcf267 100644 --- a/docs/thesis/preamble.tex +++ b/docs/thesis/preamble.tex @@ -93,6 +93,8 @@ \usepackage{xcolor} % \definecolor, \color{codegray} \definecolor{codegray}{rgb}{0.9, 0.9, 0.9} +\colorlet{Added}{green!70!black} +\colorlet{Comment}{red!50!yellow} % \color{codegray} ... ... 
% \textcolor{red}{easily} @@ -103,8 +105,8 @@ %%%% Extras \usepackage{multirow} -\usepackage{todonotes} \usepackage{adjustbox} +\usepackage{soul} %%%% Glossaries & Acronyms diff --git a/docs/thesis/thesis.cls b/docs/thesis/thesis.cls index 99ea52f..1d176af 100644 --- a/docs/thesis/thesis.cls +++ b/docs/thesis/thesis.cls @@ -149,6 +149,7 @@ \newtheorem{definition}[theorem]{Definition} \theoremstyle{remark} \newtheorem{remark}[theorem]{Remark} +\newtheorem*{remark*}{Remark} \usepackage[centerlast,small,sc]{caption} \setlength{\captionmargin}{20pt} \newcommand{\fref}[1]{Figure~\ref{#1}}