% FILE: main.tex Version 2.1
% AUTHOR:
% Universität Duisburg-Essen, Duisburg campus
% Research group of Prof. Dr. Günter Törner
% Verena Gondek, Andy Braune, Henning Kerstan
% Department of Mathematics
% Lotharstr. 65, 47057 Duisburg
% created within the DFG project DissOnlineTutor
% in collaboration with the
% Humboldt-Universität zu Berlin
% Electronic Publishing Group
% Joanna Rycko
% and the
% DNB - Deutsche Nationalbibliothek
\chapter{Discourse-Aware Sentiment Analysis}\label{chap:discourse}
Although message-level sentiment analysis methods do a fairly good job
at classifying the overall polarity of a message,
%% putting their best leg forward to incorporate the compositional
%% principle into that prediction,
a crucial limitation of all these systems is that they completely
overlook the structural nature of their input by either considering it
as a single whole (\eg{} bag-of-features approaches) or analyzing it
as a monotone sequence of equally important elements (\eg{} recurrent
neural methods). Unfortunately, both of these solutions violate the
hierarchical principle of language~\cite{Saussure:90,Hjelmslev:70},
which states that complex linguistic units are formed from smaller
language elements in a bottom-up fashion, \eg{} words are created by
putting together morphemes, sentences are made of several words, and
discourses are composed of multiple coherent sentences. Moreover,
apart from this inherent structural heterogeneity, even units of the
same linguistic level might play a different role and be of unequal
importance when joined syntagmatically into the higher-level whole.
For example, in words, the root morpheme typically conveys more
lexical meaning than the affixes; in sentences, the syntactic head
usually dominates its grammatical dependents; and, in discourse, one
of the sentences frequently expresses more relevant ideas than the
rest of the text.
%% At the same time, even auxiliary modifying elements might completely
%% overturn the meaning of the central part to its opposite (cf. \emph{to
%% like} vs. \emph{to dislike}; \emph{She enjoyed this song}
%% vs. \emph{She didn't enjoy this song}; \emph{Trump is a good
%% businessman} vs. \emph{Trump is a good businessman, but a terrible
%% employer}).
It was precisely the lack of discourse information that was one of the main reasons
for the misclassifications made by the systems of \citet{Severyn:15},
\citet{Baziotis:17}, and our own LBA method in
Examples~\ref{snt:cgsa:exmp:severyn-error},\ \ref{snt:cgsa:exmp:baziotis-error},
and~\ref{snt:cgsa:exmp:lba-error}. Since none of these approaches
explicitly took discourse structure into account, we decided to check
whether making the last of these solutions (the LBA classifier) aware
of discourse phenomena would improve its results. But before we
present these experiments, we first would like to make a short
digression into the theory of discourse and give an overview of the
most popular approaches to text-level analysis in the current
literature. Afterwards, in Section~\ref{sec:dasa:data}, we
will describe how we inferred discourse information for PotTS
and SB10k tweets. Then, in Section~\ref{sec:dasa:methods}, we will
summarize the current state of the art in discourse-aware sentiment
analysis (DASA) and also present our own methods, evaluating them on
the aforementioned datasets. After analyzing the effects of various
common factors (such as the impact of the underlying sentiment
classifier and the amenability of various discourse relation schemes
to different DASA approaches), we will recap the results and summarize
our findings in the last part of this chapter.
\section{Discourse Analysis}\label{sec:dasa:theory}
Since the main focus of our experiments will be on \emph{discourse
analysis}, we first need to clarify what discourse analysis actually
means and which common ways there are to represent and analyze
discourse automatically.
In a nutshell, discourse analysis is an area of research which
explores and analyzes language phenomena beyond the sentence
level~\cite{Stede:11}. Although the scope of this research can be
quite large, ranging from the use of pronouns in a sentence to the
logical composition of the whole document, in our work we will
primarily concentrate on the coherence structure of a text, \ie{} its
segmentation into \emph{elementary discourse units} (typically single
propositions) and induction of hierarchical \emph{coherence relations}
(semantic or pragmatic links) between these EDUs.
Although the idea of splitting the text into smaller meaningful pieces
and inferring semantic relationships between these parts is anything
but new, dating back to the very origins of general
linguistics~\cite{Aristotle:10} and in particular its structuralism
branch~\cite{Saussure:90}, an especially big surge of interest in this
field occurred in the 1970s with the fundamental works of
\citet{vanDijk:72} and \citet{vanDijk:83}, who introduced the notion
of local and global coherence, defining the former as a set of ``rules
and conditions for the well-formed concatenation of pairs of sentences
in a linearly ordered sequence'' and specifying the latter as
constraints on the macro-structure of the
narrative~\cite[see][]{Hoey:83}. Similar ideas were also proposed
by~\citet{Longacre:79,Longacre:96}, who considered the paragraph as a
unit of tagmemic grammar that was composed of multiple sentences
according to a predefined set of compositional principles. Almost
contemporary with these works, \citet{Winter:77} presented an
extensive study of various lexical means that could connect two
sentences and grouped these means into two major categories:
\textsc{Matching} and \textsc{Logical Sequence}, depending on whether
they introduced sentences that were giving more details on the
preceding content (\textsc{Matching}) or adding new information to the
narrative (\textsc{Logical Sequence}).
The increased interest of traditional linguistics in text-level
analysis has rapidly attracted the attention of the broader NLP
community. Among the first who stressed the importance of discourse
phenomena for automatic generation and understanding of texts was
\citet{Hobbs:79}, who argued that semantic ties between sentences were
one of the most important components for building a coherent discourse.
Similarly to \citeauthor{Winter:77}, \citeauthor{Hobbs:79} also
proposed a classification of inter-sentence relations, dividing them
into \textsc{Elaboration}, \textsc{Parallel}, and \textsc{Contrast}.
Although this taxonomy was obviously too small to accommodate all
possible semantic and pragmatic relationships that could exist between
two clauses, this division laid the foundations for many
successful approaches to automatic discourse analysis that appeared in
the following decades.
\paragraph{RST.}
One of the best-known such approaches, \emph{Rhetorical Structure
Theory} or \emph{RST}, was presented by~\citet{Mann:88}. Besides
revising \citeauthor{Hobbs:79}' inventory of discourse relations and
expanding it to 23 elements (including new items such as
\textsc{Antithesis}, \textsc{Circumstance}, \textsc{Evidence}, and
\textsc{Elaboration}), the authors also grouped all coherence links
into nucleus-satellite (hypotactic) and multinuclear (paratactic)
ones, depending on whether the arguments of these edges were of
different or equal importance to the content of the whole text. Based
on this grouping, they formally described each relation as a set of
constraints on the \emph{Nucleus} (N), the \emph{Satellite} (S), the
\emph{N+S combination}, and the \emph{effect} of the whole combination
on the reader (R), all formulated from the perspective of the writer
(W). An excerpt from the original definition of the
\textsc{Antithesis} relation is given in
Example~\ref{dasa:exmp:rst-evidence}.
\begin{example}[Definition of the \textsc{Antithesis} Relation]\label{dasa:exmp:rst-evidence}
\textbf{Relation Name:} \textsc{Antithesis}

\textbf{Constraints on N:} W has positive regard for the situation
presented in N

\textbf{Constraints on S:} None

\textbf{Constraints on the N+S Combination:} the situations presented
in N and S are in contrast (\ie{} are
\begin{inparaenum}[(a)]
\item comprehended as the same in many respects,
\item comprehended as differing in a few respects, and
\item compared with respect to one or more of these differences
\end{inparaenum}); because of an incompatibility that arises from the contrast, one
cannot have positive regard for both of the situations presented in N
and S\@; comprehending S and the incompatibility between the
situations presented in N and S increases R's positive regard for
the situation presented in N

\textbf{Effect:} R's positive regard for N is increased

\textbf{Locus of the Effect:} N
\end{example}
The authors then defined the general structure of discourse as a
projective (constituency) tree whose nodes were either elementary
discourse units or subtrees, which were connected to each other via
discourse relations.
You can see an example of such a discourse tree from the original
Rhetorical Structure Treebank~\cite{Carlson:01a} in
Figure~\ref{dasa:fig:rst-tree}.
\begin{figure*}[htb!]\label{dasa:fig:rst-tree}
\input{rst.tex}
\end{figure*}
Despite its immense popularity and practical utility~\cite[see
][]{Marcu:98,Yoshida:14,Bhatia:15,Goyal:16}, RST has often been
criticized for the rigidity of the imposed tree
structure~\cite{Wolf:05} and unclear distinction between discourse
relations~\cite{Nicholas:94,Miltsakaki:04}. As a result of this
criticism, two alternative approaches to automatic discourse analysis
were proposed in later works.
\paragraph{PDTB.}
One of these approaches, \emph{PDTB} (named so after the Penn
Discourse Treebank [\citeauthor{Prasad:04}, \citeyear{Prasad:04}]),
was developed by a research group at the University of
Pennsylvania~\cite{Miltsakaki:04,Miltsakaki:04a,Prasad:08}. Instead
of fully specifying the hierarchical structure of the whole text and
providing an all-embracing set of discourse relations, the authors of
this theory mainly focused on the grammatical and lexical means that
could connect two sentences (\emph{connectives}) and express a
semantic relationship (\emph{sense}) between these predicates.
Typical such means are coordinating or subordinating conjunctions
(\eg{} \emph{and}, \emph{because}, \emph{since}) and discourse
adverbials (\eg{} \emph{however}, \emph{otherwise}, \emph{as a
result}), which can denote a \textsc{Comparison}, a
\textsc{Contingency}, or some other sense\footnote{In particular, the
authors of PDTB distinguished four major senses
(\textsc{Comparison}, \textsc{Contingency}, \textsc{Expansion}, and
\textsc{Temporal}), and subdivided each of these categories into
further subtypes, \eg{} \textsc{Comparison} included
\textsc{Concession} and \textsc{Contrast}, whereas
\textsc{Contingency} sense was further divided into \textsc{Cause}
and \textsc{Condition}.} between two sentential arguments
(\textsc{Arg1} and \textsc{Arg2}).
%% The choice of these senses was explicitly restricted for each word:
%% for example, the set of possible senses for \emph{nonetheless}
%% included \textsc{Comparison}, \textsc{Conjunction},
%% \textsc{Contra-Expectation}, and \textsc{Contrast}.
Apart from \emph{explicitly} mentioned connectives, \citet{Prasad:04}
also allowed for situations where a connective was missing but could
be easily inferred from the text. They called such cases
\emph{implicit} discourse relations and demanded the arguments of such
structures be determined as well. Furthermore, if no implicit
connective could be inserted either, the authors of PDTB distinguished
three different possibilities:
\begin{itemize}
\item the coherence relation was either expressed by an alternative
lexical means, which made the connective redundant
(\textsc{AltLex}),
\item or it was achieved by referring to the same entities in both
arguments (\textsc{EntRel}),
\item or there was no coherence relation at all (\textsc{NoRel});
\end{itemize}
and also provided a special \textsc{Attribution} label for marking the
authors of reported speech.
Example~\ref{dasa:exmp:pdtb-analysis} shows the previous fragment of
the Rhetorical Treebank now annotated according to the PDTB scheme.
As we can see from the analysis, PDTB is indeed more flexible than
RST, as it allows its discourse units (arguments) to overlap, be
disjoint, or even be embedded into other segments. The assignment of
sense relations is also more straightforward and mainly determined by
the connectives that link the arguments. But, at the same time, the
structure of this annotation is completely flat so that we can neither
infer which of the sentences plays a more prominent role nor see the
modification scope of other supplementary statements.
\begin{example}[Example of PDTB Analysis]\label{dasa:exmp:pdtb-analysis}
\fbox{Analysts said,} \argone[1]{profit for the dozen or so big drug
makers, as a group, is estimated to have climbed between 11\% and
14\%.} \connective[1]{\textsc{implicit}$:=$in fact}
\argtwo[1]{\connective[2]{\textsc{explicit}$:=$While}
\argtwo[2]{that's not spectacular}}, \fbox{Neil Sweig, an analyst
with Prudential Bache, said} \argtwo[1]{\argone[2]{\argone[3]{that
the rate of growth will ``look especially good as compared to
other companies} \connective[3]{\textsc{explicit}:
if}\argtwo[3]{the economy turns downward}}}.''
\end{example}
\paragraph{SDRT.}
Another alternative to RST, \emph{Segmented Discourse Representation
Theory} or \emph{SDRT}, was proposed by \citet{Lascarides:01}.
Although developed from a completely different perspective (the
authors of SDRT mainly drew their inspiration from predicate logic,
dynamic semantics, and anaphora theory), this theory shares many of
its features with Rhetorical Structure Theory, as it also assumes a
graph-like structure of text and distinguishes between coordinating
and subordinating relations. However, unlike RST, Segmented Discourse
Representation Theory explicitly allows the text structure to be a
multigraph and not only a tree (\ie{} a discourse node can have multiple parents
and can also be connected via multiple links to the same vertex),
provided that it does not have crossing dependencies (\ie{} does not
violate the right-frontier constraint).
We can also notice the relatedness of the two theories by looking at
the SDRT analysis of the previous RST fragment in
Figure~\ref{dasa:fig:sdrt-graph}. Although the names of the
relations in the presented graph differ from those used in Rhetorical
Structure Theory, many of these links have the same (or at least
similar) meaning as the respective edges in the first analysis: for
example, the \textsc{Source} relation in SDRT almost completely
corresponds to the \textsc{Attribution} edge in
Example~\ref{dasa:fig:rst-tree}, and the \textsc{Contrast} link is
similar to the \textsc{Comparison} relation defined by
\citet{Carlson:01b}.
%% These discrepancies between paratactic dependencies in SDRT and
%% their hypotactic equivalents in RST account for the lion's share of
%% the differences between the two discourse representations in
%% Figures~\ref{dasa:fig:rst-tree} and \ref{dasa:fig:sdrt-graph}.
%% Another dissimilarity stems from the different scopes of the
%% commentary \texttt{While that's not spectacular} assigned by SDRT and
%% RST: while the SDRT graph suggests that this opinion primarily relates
%% to the actual statement of Neil Sweig, RST tree widens the
%% modification scope of this opinion also to the fact of making this
%% statement.
\begin{figure}[htbp]
\begin{center}
\begin{tikzpicture}[>=triangle 45,semithick]
\tikzstyle{edu}=[]; \tikzstyle{cdu}=[draw,shape=rectangle];
\node[edu] (1a) at (1,0) {$\pi_{1a}$}; \node[edu] (1b) at (1,-2)
{$\pi_{1b}$};
\node[edu] (p'') at (7,0) {$\pi''$};
\node[edu] (p') at (5.5,-2) {$\pi'$};
\node[edu] (1g) at (8.5,-2) {$\pi_{1g}$};
\node[edu] (1e) at (4,-4) {$\pi_{1e}$};
\node[edu] (1f) at (7,-4) {$\pi_{1f}$};
\node[edu] (1c) at (2,-2) {$\pi_{1c}$};
\node[edu] (1d) at (4,-2) {$\pi_{1d}$};
\draw[->] (1a) to node [auto] {Source} (1b);
\draw[->] (1a) to node [auto] {Narration} (p'');
\draw[-] (p'') to node [auto] {} (p');
\draw[-] (p'') to node [auto] {} (1g);
\draw[->] (p') to node [auto] {Precondition} (1g);
\draw[-] (p') to node [auto] {} (1e);
\draw[-] (p') to node [auto] {} (1f);
\draw[->] (1e) to node [auto] {Contrast} (1f);
\draw[->] (p'') to node [xshift=-8mm,yshift=-0.35mm] {Commentary} (1c);
\draw[->] (p'') to node [xshift=-0mm,yshift=0mm] {Source} (1d);
\end{tikzpicture}
\caption{Example of an SDRT graph}\label{dasa:fig:sdrt-graph}
\end{center}
\end{figure}
\paragraph{Final choice.}
Because it was unclear which of these approaches (RST, PDTB, or SDRT)
would be most amenable to our sentiment experiments, we made our
decision by considering the following theoretical and practical
aspects: From a theoretical perspective, we wanted to have a strictly
hierarchical discourse structure for each analyzed tweet so that we
could infer the semantic orientation of that message by recursively
accumulating polarity scores of its elementary discourse segments.
From a practical point of view, since there was no discourse parser
readily available for German, we wanted to have a maximal assortment
of such systems available for English so that we could pick one that
would be easiest to retrain on German data. Fortunately, both of
these concerns led us to the same solution---Rhetorical
Structure Theory, which was the only formalism that explicitly
guaranteed a single root for each analyzed text and also offered a
wide variety of open-source parsing
systems~\cite[\eg][]{Hernault:10,Feng:14,Ji:14,Yoshida:14,Joty:15}.
\section{Data Preparation}\label{sec:dasa:data}
To prepare the data for our experiments, we split all microblogs from
%% This figure was generated using the iPython notebook
%% `notebooks/dasa.ipynb`.
\begin{figure*}[htb]
\centering { \centering
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\linewidth]{img/dasa_potts_edu_distribution.png}
\caption{PotTS}\label{dasa:fig:data-distribution-potts}
\end{subfigure}
}
\centering
{
\centering
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\linewidth]{img/dasa_sb10k_edu_distribution.png}
\caption{SB10k}\label{dasa:fig:data-distribution-sb10k}
\end{subfigure}
}
\caption[EDU distribution in PotTS and SB10k]{Distribution
of elementary discourse units and polarity classes in the
training and development sets of PotTS and
SB10k}\label{dasa:fig:data-distribution}
\end{figure*}
the PotTS and SB10k corpora into elementary discourse units using the
ML-based discourse segmenter of \citet{Sidarenka:15}, which had been
previously trained on the Potsdam Commentary Corpus~\cite[PCC~2.0;
][]{Stede:14}. After filtering out all tweets that had only one
EDU,\footnote{Since the focus of this chapter is mainly on discourse
phenomena, we skip all messages that consist of a single discourse
segment, because their overall polarity is unaffected by the
discourse structure and can be normally determined with the standard
discourse-unaware sentiment techniques.} we obtained 4,771 messages
(12,137 segments) for PotTS and 3,763 posts (9,625 segments) for the
SB10k corpus. In the next step, we assigned polarity scores to the
segments of these microblogs with the help of our lexicon-based
attention classifier, analyzing each elementary unit in isolation,
independently of the rest of the tweet. We again used the same
70--10--20 split into training, development, and test sets as we did
in the previous chapters, considering message-level labels inferred
from the annotation of the second expert as gold standard for the
PotTS corpus and using the provided manual sentiment labels as
reference for the SB10k data.
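
For concreteness, the following minimal Python sketch outlines this
preparation pipeline. The functions \texttt{segment\_into\_edus} and
\texttt{lba\_polarity\_scores} are hypothetical stand-ins for the
actual segmenter and LBA classifier (the naive punctuation-based split
below merely serves as a placeholder):
\begin{verbatim}
import re
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tweet:
    text: str
    label: str  # gold message-level polarity

def segment_into_edus(text: str) -> List[str]:
    # placeholder for the ML-based discourse segmenter;
    # here we naively split on sentence-final punctuation
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def lba_polarity_scores(edu: str) -> Tuple[float, float, float]:
    # placeholder for the LBA classifier: (negative, neutral,
    # positive) scores of a single EDU analyzed in isolation
    return (1 / 3, 1 / 3, 1 / 3)

def prepare(tweets: List[Tweet]):
    """Keep multi-EDU tweets and score each EDU in isolation."""
    data = []
    for tw in tweets:
        edus = segment_into_edus(tw.text)
        if len(edus) < 2:   # single-EDU messages are skipped
            continue
        data.append((tw, edus,
                     [lba_polarity_scores(e) for e in edus]))
    return data
\end{verbatim}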
As we can see from the statistics in
Figure~\ref{dasa:fig:data-distribution}, most tweets that consist of
multiple EDUs typically have two or three segments, whereas messages
with more than three discourse units are extremely rare. This is
hardly surprising given that the maximum length of a microblog is
constrained to 140 characters. Nonetheless, even with this severe
length restriction, there still are a few messages that have up to 13
EDUs. Since it was somewhat surprising for us to see so many
segments in a single tweet, we decided to have a closer look at these
cases. As it turned out, such a high number of discourse units
typically resulted from spurious punctuation marks, which were
carelessly used by Twitter users and evidently confused the segmenter
(see Example~\ref{dasa:exmp:many-segments}).
\begin{example}[SB10k Tweet with 13 EDUs]\label{dasa:exmp:many-segments}
\noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
[Guinness on Wheelchairs :]$_1$ [Das .]$_2$ [Ist .]$_3$ [Verdammt
.]$_4$ [Noch .]$_5$ [Mal .]$_6$ [Einer .]$_7$ [Der .]$_8$
[Besten .]$_9$ [Werbespots .]$_{10}$ [Des .]$_{11}$ [Jahrzehnts
.]$_{12}$ [( Auch ...]$_{13}$ }\\
{\textup{[}Guinness on
Wheelchairs :\textup{]$_1$} \textup{[}This .\textup{]$_2$}
\textup{[}Is .\textup{]$_3$} \textup{[}Gosh .\textup{]$_4$}
\textup{[}Darn .\textup{]$_5$} \textup{[}It .\textup{]$_6$}
\textup{[}One .\textup{]$_7$} \textup{[}Of .\textup{]$_8$}
\textup{[}The best .\textup{]$_9$} \textup{[}Commercials
.\textup{]$_{10}$} \textup{[}Of .\textup{]$_{11}$} \textup{[}The
Decade .\textup{]$_{12}$} \textup{[}( Also ...\textup{]$_{13}$}}
\end{example}
Another noticeable trend that we can see in the data is that the
distribution of polarity classes in messages with multiple segments
largely corresponds to the frequencies of these polarities in the
complete datasets: For example, the positive semantic orientation
still dominates the PotTS corpus, whereas the neutral polarity
constitutes the vast majority of the SB10k set. At the same time,
negative microblogs again are the least represented class in both
cases and account for only 22\% of the former corpus and for 16\% of
the latter data.
To obtain RST trees for these messages, we retrained the DPLP
discourse parser of~\citet{Ji:14} on PCC, after converting all
discourse relations to the binary scheme $\{$\textsc{Contrastive},
\textsc{Non-Contrastive}$\}$ as suggested
by~\citet{Bhatia:15}.\footnote{See Table~\ref{dasa:tbl:rst-rel-sets}
for more details regarding this mapping.} In contrast to the
original DPLP implementation though, we did not use Brown
clusters~\cite{Brown:92}, because this resource was not available for
German, nor did we apply the linear projection of the features,
because the released parser code was missing this component as well.
In part due to these modifications, but mostly because of the
specifics of the German language (richer morphology, higher lexical
variety, and syntactic ambiguity) and a skewed distribution of
discourse relations, the results of the retrained model were
considerably lower than the figures reported for the English treebank,
amounting to 77.7, 51.2, and 39.6~\F{} for span, nuclearity, and
relation classification on PCC~2.0 versus the corresponding 82.08, 71.13,
and 61.63~\F{} on the RST Treebank.\footnote{Following \citet{Ji:14},
we use the span-based evaluation metric of~\citet{Marcu:00}.}
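
The relation mapping itself can be sketched in a few lines; note that
the set of contrastive relation names below is only an assumed
illustrative subset, the authoritative mapping being the one in
Table~\ref{dasa:tbl:rst-rel-sets}:
\begin{verbatim}
# Illustrative sketch of the binary relation scheme.  The set of
# contrastive relation names is an assumed subset; see the table
# referenced above for the complete mapping.
CONTRASTIVE = {"contrast", "antithesis", "concession"}

def binarize_relation(relation: str) -> str:
    if relation.lower() in CONTRASTIVE:
        return "Contrastive"
    return "Non-Contrastive"
\end{verbatim}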
\begin{figure*}[htb]
\input{twitter-rst.tex}
\end{figure*}
An example of an automatically induced RST tree is shown in
Figure~\ref{dasa:fig:twitter-rst-tree}. As we can see from this
picture, the adapted parser can correctly distinguish between
contrastive and non-contrastive relations in the analyzed tweet (even
though it only predicts the former class for two percent of all edges
on the PotTS and SB10k data [see
Figure~\ref{dasa:fig:relation-distribution}]), but apparently
struggles with the disambiguation of the nuclearity status, assigning
the highest importance in this example to the initial discourse
segment (``Mooooiiinn.'' [\emph{Hellloooo!}]), which is merely a
greeting, and weighing the second EDU (``Gegen solche N\"achte hilft
die beste Kur nicht.'' [\emph{Even the best cure won't help against
such nights.}]) less than the third one (``Aber Kaffee!''
[\emph{But coffee!}]), although traditional RST would rather consider
both units as equally relevant and join them via the multi-nuclear
\textsc{Contrast} link.
\begin{figure*}[bht]
\centering
{
\centering
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\linewidth]{img/dasa_potts_rel_distribution.png}
\caption{PotTS}\label{dasa:fig:relation-distribution-potts}
\end{subfigure}
}
\centering
{
\centering
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\linewidth]{img/dasa_sb10k_rel_distribution.png}
\caption{SB10k}\label{dasa:fig:relation-distribution-sb10k}
\end{subfigure}
}
\caption[Relation distribution in PotTS and
SB10k]{Distribution of discourse relations in the training
and development sets of PotTS and
SB10k}\label{dasa:fig:relation-distribution}
\end{figure*}
\section{Discourse-Aware Sentiment Analysis}\label{sec:dasa:methods}
% \done[inline]{\citet{Bickerstaffe:10}}
% \citet{Bickerstaffe:10} also considered the rating prediction task,
% addressing this problem with the minimum-spanning-tree (MST) SVM
% approach. In the initial step of this method, they constructed a
% strongly connected graph whose vertices were associated with the most
% representative example (determined via the average all-pairs Tanimoto
% coefficient) of each star rating and the edge weights represented the
% Tanimoto distances between those nodes. Afterwards, they determined
% the MST of this graph using the Kruskal's
% algorithm~\cite[see][pp.~567--574]{Cormen:09} and, finally,
% constructed a decision tree from this MST, replacing the MST vertices
% with binary SVM classifiers, which had to discern the respective
% rating groups. An evaluation on the four-star review corpus
% of~\citet{Pang:05} showed an improvement by up to~7\% over the
% previous state of the art, boosting it to 59.37\% average accuracy.
Now, before we use these data in our sentiment experiments, let us
first review the most prominent approaches to discourse-aware
sentiment analysis in the current literature.
As it turns out, even the very first works on opinion mining already
pointed out the importance of discourse phenomena for classification
of the overall polarity of a text. For example, in the seminal paper
of~\citet{Pang:02}, where the authors tried to predict the semantic
orientation of movie reviews, they quickly realized that it
was insufficient to rely on the mere presence or even the majority of
polarity clues in the text, because these clues could at any time be
reversed by a single counter-argument of the critic (see
Example~\ref{disc-snt:exmp-pang02}). This observation was also
confirmed by \citet{Polanyi:06}, who ranked discourse relations among
the most important factors that could significantly affect the
intensity and polarity of a sentiment. To prove this claim, they gave
several convincing examples, where a concessive statement considerably
weakened the strength of a polar opinion, and vice versa, an
elaboration notably increased its persuasiveness.
\citet{Pang:04} were also among the first who incorporated a
discourse-aware component into a document-level sentiment classifier.
For this purpose, they developed a two-stage system in which the first
predictor distinguished between subjective and objective statements by
constructing a graph of all sentences (linking each sentence to its
neighbors and also connecting it to two abstract polarity nodes) and
then partitioning this graph into two clusters (subjective and
objective) based on its minimum cut; the second classifier then
inferred the overall polarity of the text by only looking at the
sentences from the first (subjective) group. With this method,
\citeauthor{Pang:04} achieved a statistically significant improvement
(86.2\% versus 85.2\% for the Na\"{\i}ve Bayes system and 86.15\%
versus 85.45\% for SVM) over classifiers that analyzed all text
sentences at once, without any filtering.
%% (Later on, a similar approach was also proposed by
%% \citeauthor{Yessenalina:10}~[\citeyear{Yessenalina:10}], who used
%% an expectation-maximization algorithm to select a small subset of
%% the most indicative sentences and then classified the document [as
%% either positive or negative] with the help of this subset,
%% achieving 93.22\% accuracy on the aforementioned IMDB dataset.)
\begin{example}[Polarity reversal via discourse antithesis]\label{disc-snt:exmp-pang02}
\noindent\upshape This film should be brilliant. It sounds like a
great plot, the actors are first grade, and the supporting cast is
good as well, and Stallone is attempting to deliver a good
performance. However, it can't hold up.~\cite{Pang:02}
\end{example}
Although an oversimplification, the core idea that locally adjacent
sentences are likely to share the same subjective orientation
(\emph{local coherence}) dominated the subsequent DASA research
for almost a decade. For example, \citet{Riloff:03} also improved the
accuracy of their Na\"{\i}ve Bayes predictor of subjective expressions
by almost two percent after adding a set of local coherence features.
Similarly, \citet{Hu:04} could better disambiguate users' attitudes to
particular product attributes by taking the semantic orientation of
previous sentences into account.
At the same time, another line of discourse-aware sentiment research
concentrated on the joint classification of all opinions in the text,
where in addition to predicting each sentiment in isolation, the
authors also sought to maximize the ``total happiness'' (\emph{global
coherence}) of these assignments, ensuring that related subjective
statements received agreeing polarity scores. Notable works in this
direction were done by \citet{Snyder:07}, who proposed the Good Grief
algorithm for predicting users' satisfaction with different restaurant
aspects, and \citet{Somasundaran:08a,Somasundaran:08}, who introduced
the concept of \emph{opinion frames} (OF), a special data structure
for capturing the relations between opinions in discourse. Depending
on the type of these opinions (arguing~[\emph{A}] or
sentiment~[\emph{S}]), their polarity towards the target
(positive~[\emph{P}] or negative~[\emph{N}]), and semantic
relationship between these targets (alternative~[\emph{Alt}] or the
same~[\emph{same}]), the authors distinguished 32 types of possible
frames (\emph{SPSPsame}, \emph{SPSNsame}, \emph{APAPalt}, etc.),
dividing them into reinforcing and non-reinforcing ones. In later
works, \citet{Somasundaran:09a,Somasundaran:09b} also presented two
joint inference frameworks (one based on the iterative classification
and another one relying on integer linear programming) for determining
the best configuration of all frames in text, achieving 77.72\%
accuracy on frame prediction in the AMI meeting
corpus~\cite{Carletta:05}.
%% \done[inline]{\citet{Somasundaran:09a,Somasundaran:09b}}
%% In a later work, \citet{Somasundaran:09b,Somasundaran:09a} also
%% introduced a joint inference framework based on the Iterative
%% Classification Algorithm (ICA) and Integer Linear Programming (ILP)
%% for joinly predicting the best configuration of single opinions and
%% their frames. In this approach, the authors first applied a local SVM
%% classifier to compute the probabilities of polarity classes (positive,
%% negative, or neutral) of individual dialog acts and then harnessed the
%% ICA and ILP systems to determine which of the predicted opinions were
%% connected via opinion frames and whether these frames were reinforcing
%% or not. Given a perfect information about the opinion links, this
%% joint method outperformed the local classifier by more than 9
%% percentage points, reaching 77.72\% accuracy on the AMI meeting
%% corpus~\cite{Carletta:05}.
%% \done[inline]{\citet{Mao:06}}
%% \citet{Mao:06} proposed the idea of isotonic CRFs in which they
%% explicitly modeled the constraint that features which were stronger
%% associated with either polarity classes had to have higher
%% coefficients than less predictive attributes. After proving that this
%% formalism also allowed to directly model the ordinal scale of
%% sentiment scores (with lower CRF outputs indicating the negativity of
%% a sentence, and higher scores showing its positive class), the authors
%% used this approach to model the sentiment flow in a document. For
%% this purpose, they first predicted the polarity value for each
%% sentence of a document in isolation and then convolved these outputs
%% with a Gaussian kernel, getting a smoothed polarity curve for the
%% whole analyzed text at the end.
%% \done[inline]{\citet{Thomas:06}}
%% \citet{Thomas:06} enhanced an SVM-based sentiment classification
%% system for predicting speaker's attitude in political speeches with
%% information about the inter-speaker agreement, incorporating these
%% links into the global cost function. Thanks to this change, the
%% authors achieved $\approx$4\% improvement in accuracy (from 66.05 to
%% 70.81\%) over the baseline classifer which analyzed each utterance in
%% isolation.
An attempt to unite local and global coherence was made by
\citet{McDonald:07}, who tried to simultaneously predict the polarity
of a document and classify semantic orientations of its sentences.
For this purpose, the authors devised an undirected probabilistic
graphical model based on the structured linear
classifier~\cite{Collins:02}. Similarly to \citet{Pang:04}, they
connected the label nodes of each sentence to the labels of its
neighboring clauses and also linked these nodes to the overarching
vertex representing the polarity of the text. After optimizing this
model with the MIRA learning algorithm~\cite{Crammer:03},
\citeauthor{McDonald:07} achieved an accuracy of 82.2\% for
document-level classification and 62.6\% for sentence-level prediction
on a corpus of online product reviews, outperforming pure document and
sentence classifiers by up to four percent. A crucial limitation of
this system though was that its optimization required the gold labels
of sentences and documents to be known at training time, which
considerably limited its applicability to other domains with no such
data.
%% A similar approach was also suggested by~\citet{Sadamitsu:08}, who
%% attained 82.74\% accuracy on predicting the polarity of customer
%% reviews with the help of hidden conditional random fields.
Another significant drawback of all previous approaches is that they
completely ignored traditional discourse theory and, as a result,
severely oversimplified discourse structure. Among the first who
tried to overcome this omission were \citet{Voll:07}, who proposed two
discourse-aware enhancements of their lexicon-based sentiment
calculator (SO-CAL). In the first method, the authors let SO-CAL
analyze only the topmost nucleus EDU of each sentence, whereas in the
second approach, they expanded its input to all clauses that another
classifier had considered as relevant to the main topic of the
document. Unfortunately, the former solution did not work out as well
as expected, yielding 69\% accuracy on the corpus of Epinion
reviews~\cite{Taboada:06}, but the latter system could perform much
better, achieving 73\% on this two-class prediction task.
Other ways of adding discourse information to a sentiment system were
explored by \citet{Heerschop:11}, who experimented with three
different approaches:
\begin{inparaenum}[(i)]
\item increasing the polarity scores of words that appeared near the
end of the document,
\item assigning higher weights to nucleus tokens, and finally
\item learning separate scores for nuclei and satellites using a
genetic algorithm.
\end{inparaenum}
An evaluation of these methods on the movie review corpus
of~\citet{Pang:04} showed better performance of the first option
(60.8\% accuracy and 0.597 macro-\F), but the authors could
significantly improve the results of the last classifier by
adding an offset to the decision boundary of this method, which
increased both its accuracy and macro-averaged \F{} to 0.72.
Further notable contributions to RST-based sentiment analysis were
made by \citet{Zhou:11}, who used a set of heuristic rules to infer
polarity shifts of discourse units based on their nuclearity status
and outgoing relation links; \citet{Zirn:11}, who used a lexicon-based
sentiment system to predict the polarity scores of elementary
discourse units and then enforced consistency of these assignments
over the RST tree with the help of Markov logic constraints; and,
finally, \citet{Wang:13}, who determined the semantic orientation of a
document by taking a linear combination of the polarity scores of its
EDUs and multiplying these scores with automatically learned
coefficients.
%% \footnote{Similarly to the approach of~\citet{Zirn:11}, these
%% coefficients depended on the status of the segment in the RST
%% tree (whether nucleus or sattelite) and relation, which connected
%% the respective discourse node to the ancestor.} A similar system
%% was also described by \citet{Chenlo:13,Chenlo:14}, who used their
%% model to analyze user blog posts, achieving significantly better
%% results on the TREC corpus \cite{Macdonald:09} than any
%% discourse-unaware baselines.
Among the most recent advances in RST-aware sentiment research, we
should especially emphasize the work of \citet{Bhatia:15}, who
proposed two different DASA systems:
\begin{itemize}
\item discourse-depth reweighting (DDR)
\item and rhetorical recursive neural network (R2N2).
\end{itemize}
In the former approach, the authors estimated the relevance
$\lambda_i$ of each elementary discourse unit $i$ as:
\begin{equation*}
\lambda_i = \max\left(0.5, 1 - d_i/6\right),
\end{equation*}
where $d_i$ stands for the depth of the $i$-th EDU in the document's
discourse tree. Afterwards, they computed the sentiment score
$\sigma_i$ of that unit by taking the dot product of its binary
feature vector $\mathbf{w}_i$ (token unigrams) with polarity scores
$\boldsymbol{\theta}$ of these unigrams:
\begin{equation*}
\sigma_i = \boldsymbol{\theta}^{\top}\mathbf{w}_i;
\end{equation*}
and then calculated the overall semantic orientation of the
document~$\Psi$ as the sum of sentiment scores for all units,
multiplying these scores by their respective discourse-depth factors:
\begin{equation*}
\Psi = \sum_i\lambda_i\boldsymbol{\theta}^{\top}\mathbf{w}_i = \boldsymbol{\theta}^{\top}\sum_i\lambda_i\mathbf{w}_i.
\end{equation*}
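
Given precomputed EDU depths, unigram vectors, and polarity weights,
this computation can be sketched in a few lines of NumPy (a simplified
re-implementation for illustration, not the original code
of~\citeauthor{Bhatia:15}):
\begin{verbatim}
import numpy as np

def ddr_score(depths, W, theta):
    # depths: depth d_i of each EDU in the discourse tree
    # W:      binary unigram matrix, one row w_i per EDU
    # theta:  learned unigram polarity weights
    lambdas = np.maximum(0.5, 1.0 - np.asarray(depths) / 6.0)
    # Psi = sum_i lambda_i * theta^T w_i
    return float(lambdas @ (np.asarray(W) @ np.asarray(theta)))
\end{verbatim}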
In the R2N2 system, the authors largely adopted the RNN method
of~\citet{Socher:13} by recursively computing the polarity scores of
discourse units as:
\begin{equation*}
\psi_i = \tanh\left(K_n^{(r_i)} \psi_{n(i)} + K_s^{(r_i)}\psi_{s(i)} \right),
\end{equation*}
where $K_n^{(r_i)}$ and $K_s^{(r_i)}$ stand for the nucleus and
satellite coefficients associated with the rhetorical relation $r_i$,
and $\psi_{n(i)}$ and $\psi_{s(i)}$ represent sentiment scores of the
nucleus and satellite of the $i$-th vertex. This approach achieved
84.1\% two-class accuracy on the movie review corpus
of~\citet{Pang:04} and reached 85.6\% on the dataset
of~\citet{Socher:13}.
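
The recursive forward pass itself is straightforward to sketch; the
version below assumes a binarized RST tree with scalar node scores and
hypothetical coefficient dictionaries \texttt{K\_n} and \texttt{K\_s},
whereas the actual R2N2 model learns these coefficients jointly with
the lexical scores:
\begin{verbatim}
import math

def r2n2_score(node, K_n, K_s, leaf_scores):
    # node: either a leaf EDU index or a triple
    #       (relation, nucleus_subtree, satellite_subtree)
    if isinstance(node, int):          # a leaf EDU
        return leaf_scores[node]
    rel, nucleus, satellite = node
    return math.tanh(
        K_n[rel] * r2n2_score(nucleus, K_n, K_s, leaf_scores)
        + K_s[rel] * r2n2_score(satellite, K_n, K_s, leaf_scores))
\end{verbatim}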
For the sake of completeness, we should also note that there exist
discourse-aware sentiment approaches that build upon PDTB and SDRT\@.
For example, \citet{Trivedi:13} proposed a method based on latent
structural SVM~\cite{Yu:09}, where they represented each sentence as a
vector of features produced by a feature function $\mathbf{f}(y,
\mathbf{x}_i, h_i)$, in which $y\in\{-1, +1\}$ denotes the potential
polarity of the whole document, $h_i \in \{0, 1\}$ stands for the
assumed subjectivity class of sentence $i$, and $\mathbf{x}_i$
represents the surface form of that sentence; and then tried to infer
the most likely semantic orientation of the document $\hat{y}$ over
all possible assignments $\mathbf{h}$, \ie{}:
\begin{equation*}
\hat{y} =
\argmax_y\left(\max_{\mathbf{h}}\mathbf{w}^{\top}\mathbf{f}(y,
\mathbf{x}, \mathbf{h})\right).
\end{equation*}
To ensure that these assignments were still coherent, the authors
additionally extended their feature space with special
\emph{transitional} attributes, which indicated whether two adjacent
sentences were likely to share the same subjectivity given the
discourse connective between them. With the help of these features,
\citeauthor{Trivedi:13} could improve the accuracy of the
connector-unaware model on the movie review corpus of~\citet{Maas:11}
from 88.21 to 91.36\%.
The first step towards an SDRT-based sentiment approach was made by
\citet{Asher:08}, who presented an annotation scheme and a pilot
corpus of English and French texts that were analyzed according to the
SDRT theory and enriched with additional sentiment information.
Specifically, the authors asked the annotators to ascribe one of four
opinion categories (reporting, judgment, advice, or sentiment) along
with their subclasses (\eg{} inform, assert, blame, recommend) to each
discourse unit that had at least one opinionated word from a sentiment
lexicon. Afterwards, they showed that with a simple set of rules, one
could easily propagate opinions through SDRT graphs, increasing the
strengths or reversing the polarity of the sentiments, depending on
the type of the discourse relation that was linking two segments.
In general, however, PDTB- and SDRT-based sentiment systems are much
less common than RST-inspired solutions. Because of this fact and due
to the reasons described in Section~\ref{sec:dasa:theory}, we will
primarily concentrate on RST-based methods. In particular, for
the sake of comparison, we replicated the linear combination approach
of \citet{Wang:13} and also reimplemented the DDR and R2N2 systems
of~\citet{Bhatia:15}. Furthermore, to see how these techniques would
perform in comparison with much simpler baselines, we additionally
created two methods that predicted the polarity of a message by only
considering its last or topmost nucleus EDU (henceforth \textsc{Last}
and \textsc{Root}), and also estimated the results of our original LBA
classifier without any discourse-related modifications (henceforth
\textsc{No-Discourse}).
Apart from the above baselines and existing methods, we propose
several novel DASA solutions, which will be briefly described below.
\subsection{Latent CRF}
In the first of these solutions, called \emph{Latent Conditional
Random Fields} or \emph{LCRFs}, we consider the problem of
message-level sentiment analysis as an inference task over an
undirected graphical model, where the nodes of the model represent
polarity probabilities of elementary discourse units and the structure
of the graph reflects the RST dependency tree of the
message.\footnote{Drawing on the work of~\citet{Bhatia:15}, we obtain
this representation using the DEP-DT algorithm of~\citet{Hirao:13}
with a minor modification that we do not follow any satellite
branches while computing the heads of abstract RST nodes in Step 1
of this procedure~\cite[see][pp.~1516--1517]{Hirao:13}.} In
particular, we define CRF graph $\mathcal{G}=(\mathcal{V},
\mathcal{E})$ as a set of vertices $\mathcal{V}=
\mathcal{Y}\cup\mathcal{X}$, in which $\mathcal{Y}=\{y_{(i, j)}\mid
i\in\{\text{\textsc{Root}}, 1, 2, \ldots, T\}, j
\in\{\text{\textsc{Negative}, \textsc{Neutral},
\textsc{Positive}}\}\}$ represents (partially observed) random
variables (with $T$ standing for the number of EDUs in the tweet), and
$\mathcal{X}=\{x_{(i, j)}\mid i\in\{\text{\textsc{Root}}, 1, 2,
\ldots, T\}, j \in\{0, 1, 2, 3\}\}$ denotes the respective features of
these nodes (three polarity scores returned by the LBA classifier plus
an additional offset feature whose value is always \texttt{1}
irrespective of the input). Since the \textsc{Root} vertex,
however, does not have a corresponding discourse segment in the RST
tree, we use the polarity scores predicted by the LBA classifier for
the whole message as features for this node.
Graph edges $\mathcal{E}$ connect random variables to their
corresponding features and also link every pair of vertices
$(v_{(k,\cdot)},v_{(i,\cdot)})$ if node $k$ appears as the parent of
node $i$ in the RST dependencies.\footnote{In fact, we use two edges
to connect each child to its parent: one for the
\textsc{Contrastive} relation and another one for the
\textsc{Non-Contrastive} link.} You can see an example of such an
automatically induced CRF tree in Figure~\ref{dasa:fig:latent-crf}.
\begin{figure*}[thb]
\centering \input{latent-crf}
\caption[Example of an RST-based Latent-CRF]{Example of an
automatically constructed RST-based latent-CRF tree\\ {\small
(random variables are shown as circles, fixed input parameters
appear as rectangles, and observed values are displayed in
gray)}}\label{dasa:fig:latent-crf}
\end{figure*}
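
A simplified sketch of this graph construction (assuming the RST
dependencies are already given as parent--child--relation triples)
might look as follows:
\begin{verbatim}
def build_crf_graph(rst_deps, edu_scores, msg_scores):
    # rst_deps:   (parent, child, relation) triples of the RST
    #             dependency tree; "root" marks the tree root
    # edu_scores: LBA (neg, neut, pos) scores of every EDU
    # msg_scores: LBA scores of the whole message (Root features)
    features = {"root": list(msg_scores) + [1.0]}  # offset = 1
    for i, scores in enumerate(edu_scores):
        features[i] = list(scores) + [1.0]
    # each parent-child link is typed as either Contrastive or
    # Non-Contrastive, selecting the respective edge parameters
    edges = [(parent, child, relation)
             for parent, child, relation in rst_deps]
    return features, edges
\end{verbatim}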
%% Figure~\ref{dasa:fig:latent-crf} shows a real example of such
%% automatically induced CRF tree where we can already notice a few
%% tendencies regarding the obtained discourse graph: First of all, our
%% segmenter clearly tends to oversegment its input, also considering
%% conjoined predicates and adverbial subordinate clauses as separate
%% discourse units. Even though this behavior violates the principles of
%% standard RST, it actually comes advantageous to our particular
%% sentiment application as it allows the base classifier to be more
%% fine-grained (and consequently more precise) in its predictions. At
%% the same time, we again can see that the automatic parser has
%% difficulties with determining the correct nuclearity status of
%% discourse segments, putting the segment ``f\"uhlt sich fast an''
%% (\textit{almost feels}) in the top-most position, which we can hardly
%% call the right decision. Finally, we also can observe that despite an
%% incorrect prediction of the polarity of the whole tweet (the LBA
%% system considers it as a negative message, although human experts
%% regarded it as neutral) our base classifier might still have better
%% guesses for single EDUs, giving us at least a hypothetical possibility
%% to overcome its general error.
Now before we describe the training of our model, let us briefly
recall that in the standard CRF optimization we typically try to find
optimal parameters $\boldsymbol{\theta}^*$ that maximize the
log-likelihood of all label sequences $\mathbf{y}^{(i)}$ on the
training set $\mathcal{D}=\left\{\left(\mathbf{x}^{(i)},
\mathbf{y}^{(i)}\right)\right\}_{i=1}^{N}$, \ienocomma:
\begin{equation*}
\boldsymbol{\theta}^* = \argmax_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}) = \argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log\left(p\left(\mathbf{y}^{(i)}\vert\mathbf{x}^{(i)}; \boldsymbol{\theta}\right)\right),\label{dasa:eq:crf-objective}
\end{equation*}
where the conditional likelihood is normally estimated as:
\begin{equation*}
p\left(\mathbf{y}^{(i)}\vert\mathbf{x}^{(i)}; \boldsymbol{\theta}\right) =
\frac{\exp\left(\sum_{t=1}^{T_i}\sum_k\boldsymbol{\theta}_k\mathbf{f}_k\left(\mathbf{x}^{(i)}_t,\mathbf{y}^{(i)}_{t-1},\mathbf{y}^{(i)}_{t}\right)\right)}{Z}.
\end{equation*}
Adapting this equation to our RST-based CRF structures, we obtain:
\begin{equation}
p\left(\mathbf{y}^{(i)}\vert\mathbf{x}^{(i)}; \boldsymbol{\theta}\right) =
\frac{\exp\left(\sum_{t=0}^{T_i}\left[%
\sum_v\boldsymbol{\theta}_v\mathbf{f}_v\left(\mathbf{x}^{(i)}_t,\mathbf{y}^{(i)}_{t}\right)
+ \sum_{c\in
ch(t)}\sum_e\boldsymbol{\theta}_e\mathbf{f}_e\left(\mathbf{y}^{(i)}_{t},
\mathbf{y}^{(i)}_{c}\right)\right]\right)}{Z},\label{dasa:eq:tree-crf}
\end{equation}
where $ch(t)$ denotes the children of node $t$, $v$ stands for the
indices of node features, and $e$ represents the indices of edge
attributes.
A crucial problem with this formulation though is that in our task,
only a small subset of labels from $\mathbf{y}^{(i)}$ (namely those of
the root node) are actually observed at the training time, whereas the
rest of the tags (those which pertain to EDUs) are unknown. We will
denote these observed and hidden subsets as $\mathbf{y}_o^{(i)}$ and
$\mathbf{y}_h^{(i)}$ respectively. Using this notation, we can
redefine the training objective of our model as finding such
parameters $\boldsymbol{\theta}^*$ that maximize the log-likelihood of
\emph{observed} labels, \ienocomma:
\begin{equation*}
\boldsymbol{\theta}^* =
\argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log\left(p\left(\mathbf{y}_o^{(i)}\vert\mathbf{x}^{(i)};
\boldsymbol{\theta}\right)\right).
\end{equation*}
With this formulation, however, it is still unclear what we should do
with hidden tags $\mathbf{y}_h^{(i)}$, because the values of their
features remain undefined.
One possible way to approach the problem of unobserved states in the
input is to assume that any label sequence $\mathbf{y}_h^{(i)}$ might
be true, and then optimize the parameters along the path that
leads to the maximum probability of the correct observed tag,
\ienocomma:
\begin{align}
\begin{split}
\mathbf{y}^{(i)}&=[\mathbf{y}_o^{(i)}, \mathbf{y}_h^{*(i)}]\text{, where}\\\label{dasa:eq:y_i}
\mathbf{y}_h^{*(i)}&=\argmax_{\mathbf{y}_h^{(i)}}p\left(\left[\mathbf{y}_o^{(i)}, \mathbf{y}_h^{(i)}\right]\vert\mathbf{x}^{(i)}\right),
\end{split}
\end{align}
and which we can easily find using standard Viterbi decoding.
Unfortunately, if we simply consider label sequence $\mathbf{y}^{(i)}$
from Equation~\ref{dasa:eq:y_i} as the ground truth and penalize all
labels that disagree with this sequence, we might overly commit
ourselves to the model's guess of unknown tags and unduly discriminate
against other possible hidden label assignments. To mitigate this
effect, we can instead penalize only one other sequence, namely the
one that maximizes the probability of an incorrect label at the
observed state:
\begin{align*}
\mathbf{y}^{'(i)}&=\argmax_{\mathbf{y}_o^{'(i)}\neq\mathbf{y}_o^{(i)}}p\left([\mathbf{y}_o^{'(i)},
\mathbf{y}_h^{*(i)}]\vert\mathbf{x}^{(i)}\right)\text{,
where}\\
\mathbf{y}_h^{*(i)}&=\argmax_{\mathbf{y}_h^{(i)}}p\left(\left[\mathbf{y}_o^{'(i)}, \mathbf{y}_h^{(i)}\right]\vert\mathbf{x}^{(i)}\right).
\end{align*}
Correspondingly, we reformulate our objective: instead of
maximizing the log-likelihood of the training set, we now maximize
the difference between the log-probabilities of the correct and the most
likely wrong assignments:
\begin{align}
\begin{split}
\boldsymbol{\theta}^* &= \argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log\left(p\left(\mathbf{y}^{(i)}\vert\mathbf{x}^{(i)}\right)\right) - \log\left(p\left(\mathbf{y}^{'(i)}\vert\mathbf{x}^{(i)}\right)\right)\\
&= \argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log\left(\frac{\exp\left(\boldsymbol{\theta}^{\top}\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)})\right)}{Z}\right) - \log\left(\frac{\exp\left(\boldsymbol{\theta}^{\top}\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})\right)}{Z}\right)\\
&= \argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\boldsymbol{\theta}^{\top}\left(\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)}) - \mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})\right),\label{dasa:eq:hcrf-objective}
\end{split}
\end{align}
where $\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)})$ and
$\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})$ mean all features
associated with label sequences $\mathbf{y}^{(i)}$ and
$\mathbf{y}^{'(i)}$ respectively.
The only thing that we still need to add to the above objective is a
regularization term
$\frac{1}{2}\norm{\boldsymbol{\theta}}^2$, which prevents its
divergence to infinity in the cases when
$\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)})$ and
$\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})$ are perfectly
separable. This brings us to the final formulation:
\begin{align}
\boldsymbol{\theta}^* &=
\argmin_{\boldsymbol{\theta}}\frac{1}{2}\norm{\boldsymbol{\theta}}^2 -
\sum_{i=1}^{N}\boldsymbol{\theta}^{\top}\left(\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)})
- \mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})\right).
\end{align}
At this point, we can notice that the resulting function is identical
to the unconstrained minimization problem of structural
SVM~\cite{Taskar:03}, and we can indeed piggyback on one of the many
efficient SVM-optimization techniques to learn the parameters of our
model. In particular, we use the block-coordinate Frank-Wolfe
algorithm~\cite{Lacoste-Julien:13}, running it for 1,000 epochs or
until convergence, whichever of these events occurs first.
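
To summarize the procedure, the following perceptron-style sketch
performs one update under this objective; \texttt{decode} (a Viterbi
search over the CRF tree with the root label clamped) and
\texttt{features} (the joint feature map, returning a NumPy vector)
are hypothetical helpers, and the actual model is optimized with
block-coordinate Frank--Wolfe rather than with this simple update:
\begin{verbatim}
LABELS = ("negative", "neutral", "positive")

def margin_update(theta, x, y_root, decode, features, lr=0.1):
    # best completion of the hidden EDU labels given the
    # observed (correct) root label
    y_good = decode(theta, x, root=y_root)
    # most probable labeling with a *wrong* root label
    rivals = [decode(theta, x, root=l)
              for l in LABELS if l != y_root]
    y_bad = max(rivals, key=lambda y: theta @ features(x, y))
    # move theta towards the correct assignment and away
    # from the best wrong one
    return theta + lr * (features(x, y_good) - features(x, y_bad))
\end{verbatim}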
\subsection{Latent-Marginalized CRF}
Another way to tackle unobserved labels is to estimate the probability
of observed tags by marginalizing (summing) out hidden variables from
the joint distribution, \ienocomma:
\begin{align*}
p\left(\mathbf{Y}_o{=}\mathbf{y}_o\right) &=
\sum_{\mathbf{y}_h} p\left(\mathbf{Y}_o{=}\mathbf{y}_o,
\mathbf{Y}_h{=}\mathbf{y}_h\right).
\end{align*}
Applying this formula to Equation~\ref{dasa:eq:tree-crf}, we get:
\begin{align*}
\begin{split}