\documentclass[12pt]{article}
\usepackage{fullpage}
\usepackage{algpseudocode}
\usepackage{hyperref}
\usepackage{url,amssymb,epsfig,color,xspace,tikz,amsmath}
\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=red,
urlcolor=red,
}
\usetikzlibrary{arrows,automata}
\urlstyle{same}
\renewcommand{\thesubsection}{Problem \arabic{subsection}}
\newcommand{\quesbox}[2]{\begin{center} \framebox[.5\textwidth]{%
\raisebox{-5mm}[0mm][#1]{\begin{minipage}[t]{.45\textwidth}%
{\normalsize\sf #2}{\phantom{ans}}\end{minipage}}} \end{center}}
\begin{document}
\begin{center}
{\Large\bf University of Waterloo}\\
\vspace{3mm}
{\Large\bf CS240 - Spring 2020}\\
\vspace{2mm}
{\Large\bf Assignment 5}\\
\vspace{3mm}
\textbf{Due Date: Wednesday August 5, 5pm}
\end{center}
The integrity of the grade you receive in this course is very
important to you and the University of Waterloo. As part of every
assessment in this course you must read and sign an Academic Integrity
Declaration before you start working on the assessment and submit it
{\bf before the deadline of August 5} along with your answers to the assignment --
i.e. {\bf read, sign and submit A05-AcInDe.txt now or as soon as
possible}. The agreement will indicate what you must do to ensure
the integrity of your grade. If you are having difficulties with the
assignment, course staff are there to help (provided it isn't last
minute).
{\bf The Academic Integrity Declaration must be signed and submitted
on time or the assessment will not be marked.}
Please refer to the course webpage for guidelines on submission.
Submit your written solutions electronically as a PDF with file name
a5Submit.pdf using MarkUs. We will also accept individual question
files named a5q1.pdf, a5q2.pdf, \dots if you wish to submit questions
as you complete them. There is a programming question in this assignment;
don't forget to submit your completed code, named {\tt DFA.cpp}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{One-sided range search in a heap [5 marks]}
Heaps are suitable for {\em one-sided range queries}. Specifically,
assume you are given a max-heap $H$ and an integer $x$, and you are
asked to return all keys in $H$ that fall in the range $[x,\infty)$,
i.e., all keys that are at least $x$.
Give an algorithm that answers such a one-sided range query in time
that depends only on the number $k$ of keys in the output. Describe
your algorithm, justify why it works, and analyze the runtime.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{A variant of 2d range search [7 marks]}
Consider the following variant of two-dimensional range reporting
queries. We keep a set of two-dimensional points $P$ in a data
structure. All points have positive coordinates. For a query range $Q
= [a, b] \times [0, c]$, we must find some point $p$ in $P \cap Q$ or
report NULL if the intersection is empty. In other words, we must find
one point $p$ such that $p.y \le c$ and $a \le p.x \le b$ (we do not
need all of them; only one!)
Describe a data structure that answers queries described above in
$O(\log(n))$ time and uses space $O(n)$, give the search algorithm
and analyze its runtime. Justify your answers.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{KMP [6+6+2+4 marks]}
\begin{enumerate}
\item Compute the failure array for the pattern $P=$\texttt{bacbac}.
\item
Show how to search for pattern $P=$\texttt{bacbac} in the text
$T=$\texttt{bacbabaababacbacbaca} using the KMP algorithm. Indicate
in a table such as Table \ref{kmp} which characters of $P$ were
compared with which characters of $T$. Follow the example given in
class: place each character of $P$ in the column of the character of
$T$ we compare it to and put brackets around the character if an
actual comparison was not performed. You may not need all space in
the table.
\item How would you modify the KMP algorithm given in class to output
all matches instead of only the first one? You should be able to
change or add just a few lines.
\item We want to find the longest prefix of a string $S$ which reads
the same forward and backward. Explain how to do this, with a
justification (hint: use the KMP failure array of the string
$S+X+{\rm reverse}(S)$, where the $+$ sign indicates concatenation,
and $X$ is a character not in $S$).
\end{enumerate}
\begin{table}
\Large{
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
b&a&c&b&a&b&a&a&b&a&b&a&c&b&a&c&b&a&c&a\\
\hline
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
&&&&&&&&&&&&&&&&&&&\\
\hline
\end{tabular}
\end{center}}
\caption{Table for KMP problem.}\label{kmp}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Huffman encoding [4 marks]}
Define frequencies $(f_i)_{1 \le i \le n}$ by $f_1=f_2=4$ and $f_i =
2^i+2^{i-3}$ for $3 \le i \le n$. Draw the corresponding Huffman tree
and give a brief justification
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{String matching with a DFA [3+2+2+6+12 marks]}
In this problem, you will implement the algorithm briefly described in
class for string matching with a Deterministic Finite Automaton (DFA)
(p.\ 16/31): we construct an automaton using a pattern $P$, then we
match in a text $T$. The algorithm for matching is simple and
efficient, but the construction of the automaton takes more work.
Let $P$ the pattern to search, of length $m$. Then the {\em states} of
the automaton are $0,\dots,m$ and the {\em transition function}
$\delta$ of the automaton is defined as follows, for a state $q$ and a
character $C$ in $\Sigma$:
$$\delta(q, C) = \ell(P[0..q-1]C),$$
where
\begin{itemize}
\item $P[0..q-1]C$ is the concatenation of $P[0..q-1]$ and $C$
\item for a string $s$, $\ell(s) \in \{0,\dots,m\}$ is {\em the length
of the longest prefix of $P$ that is also a suffix of $s$}.
\end{itemize}
This means that if we are in state $q$, and we read character $C$,
our new state is $\delta(q, C)$ as defined above.
Graphically, this corresponds to the transition
\hspace{5cm}\begin{tikzpicture}[scale=0.5,->,>=stealth',auto,node distance=4cm, thick]
\tikzstyle{every state}=[fill=none,draw=black,text=black]
\node[state,minimum size=35pt] (0) {$q$};
\node[state,minimum size=35pt] (1) [right of=0] {$\delta(q,C)$};
\draw (0) edge node {C} (1);
\end{tikzpicture}
The automaton given in class is not complete; here is the entirety of
it, for the pattern $P={\tt ababaca}$:
\begin{tikzpicture}[->,>=stealth',auto,node distance=1.5cm, minimum size=0.1cm, thick]
\tikzstyle{every state}=[fill=none,draw=black,text=black]
\node[state,minimum size=3pt] (0) {$0$};
\node[state,minimum size=3pt] (1) [right of=0] {$1$};
\node[state,minimum size=3pt] (2) [right of=1] {$2$};
\node[state,minimum size=3pt] (3) [right of=2] {$3$};
\node[state,minimum size=3pt] (4) [right of=3] {$4$};
\node[state,minimum size=3pt] (5) [right of=4] {$5$};
\node[state,minimum size=3pt] (6) [right of=5] {$6$};
\node[state,minimum size=3pt] (7) [right of=6] {$7$};
\draw (0) edge node {a} (1);
\draw (0) edge [loop above] node {b,c} (0);
\draw (1) edge [loop above] node {a} (1);
\draw (1) edge node {b} (2);
\draw (1) edge [bend left] node[below=-2pt,pos=0.2] {c} (0.south);
\draw (2) edge node {a} (3);
\draw (2) edge [bend left=45] node[pos=0.2,right=3pt] {b,c} (0.south);
\draw (3) edge [bend right=45] node[swap] {a} (1);
\draw (3) edge node {b} (4);
\draw (3) edge [bend left=65] node[pos=0.2,right=3pt] {c} (0.south);
\draw (4) edge node {a} (5);
\draw (4) edge [bend left=75] node[pos=0.2,right=3pt] {b,c} (0.south);
\draw (5) edge [bend right=55] node[swap] {a} (1);
\draw (5) edge [bend right=50] node[swap,pos=0.7,above=-2pt] {b} (4);
\draw (5) edge node {c} (6);
\draw (6) edge node {a} (7);
\draw (6) edge [bend left=75] node[pos=0.2,right=3pt] {b,c} (0.south);
\draw (7) edge [bend right=55] node[swap] {a} (1);
\draw (7) edge [bend right=55] node[swap] {b} (2);
\draw (7) edge [bend left=75] node[pos=0.2,right=3pt] {c} (0.south);
\end{tikzpicture}
For instance, suppose we take the state $q=5$ and the character $C={\tt
b}$. Then, $P[0..4]C= {\tt ababab}$ and the longest prefix
of $P={\tt ababaca}$ that is a suffix of $P[0..4]C={\tt ababab}$ is
${\tt abab}$, which has length $4$. So we set $\delta(5,{\tt b})=4$.
If we have the automaton, the string matching algorithm is simple. We
start in state $q=0$. For $i=0,\dots,n-1$, we replace $q$ by
$\delta(q,T[i])$; every time we hit $q=m$, we report $i-(m-1)$. The
proof of correctness relies on the following fact, that we do not ask
you to prove: {\em for all $i$, the state after reading $T[i]$ is
$\ell(T[0..i])$} (that is, the length of the longest prefix of $P$
that is also a suffix of the string $T[0..i]$ we have read so far).
In other words, the state tells us the best match we can have on the
right-end side of $T[0..i]$.
\begin{enumerate}
\item[1.] Taking the fact above for granted, prove that the algorithm
correctly reports the location of all matches (and only reports
matches, no false positives).
\end{enumerate}
Instead of tabulating all transitions $(q,C)$, for $q$ in
$\{0,\dots,m\}$ and $C$ in $\Sigma$, we only store $\delta(q,C)$ for
$q$ in $\{0,\dots,m\}$ and $C$ a character appearing in $P$. We will
thus write $S$ for the set of characters in $P$ (there is no
repetition in $S$) and we will let $s$ be its size.
\begin{enumerate}
\item[2.] If $C$ is a character in $\Sigma-S$, what
should be the value of $\delta(q,C)$? (Give a justification)
\item[3.] Explain what data structure you would use to store the
values $\delta(q,C)$ for $C$ in $S$, and why. Several solutions are
possible; for full credit, your justification should briefly discuss
the time for inserting or finding $\delta(q,C)$ for given $q,C$, in
terms of $s$. (For this question, the alphabet $\Sigma$ is
arbitrary; do not assume it has fixed size, like ASCII for
instance).
\item[4.] Give an algorithm that takes $P$ as input and computes and
stores the values of the transition function $\delta(q,C)$ for all
$q$ in $\{0,\dots,m\}$ and $C$ in $S$; state and justify its runtime
in terms of $m$ and $s$. For full credit, the runtime for computing
the values $\delta(q,C)$ should be $O(m^3 s)$, although faster
answers are possible.
We suggest that you first give an algorithm for computing the
function $\ell(\ )$ above, and study its runtime.
\item[5.] Implement the construction of the automaton as explained
above and the corresponding pattern matching algorithm in C++, in a
file {\tt DFA.cpp}. Here, the alphabet is the set of {\tt char}'s.
You may use one of the containers of the Standard Template Library;
refer for example to
\url{https://en.wikipedia.org/wiki/Standard_Template_Library}.
You should read the pattern from file {\tt a5pattern.txt} and the
text from file {\tt a5text.txt}. File {\tt a5pattern.txt} will
contain a single line. The program should concatenate all lines of
the text {\tt a5text.txt} into a single string and should output two
things:
\begin{itemize}
\item the transition function, as a sequence of triples
$$q~~ C~~ \delta(q,C)$$ for $q$ in $\{0,\dots,m\}$ and $C$ in $S$, one
triple per line, with values separated by a single space (do not
leave a trailing space after the last value on a line). The order
is not important.
\item all indices $i$ at which the pattern occurs at $T[i..i+m-1]$,
one number per line, in increasing order.
\end{itemize}
The triples should be printed in a file {\tt a5delta.txt} and the
indices of the matches in a file {\tt a5matches.txt}. For example,
for the automaton seen in class, the following lines should be part
of {\tt a5delta.txt}:
\begin{verbatim}
0 a 1
0 b 0
0 c 0
\end{verbatim}
\end{enumerate}
\end{document}