% FILE: main.tex Version 2.1
% AUTHOR:
% Universität Duisburg-Essen, Duisburg campus
% Research group of Prof. Dr. Günter Törner
% Verena Gondek, Andy Braune, Henning Kerstan
% Department of Mathematics
% Lotharstr. 65, 47057 Duisburg
% created within the DFG project DissOnlineTutor
% in collaboration with the
% Humboldt-Universität zu Berlin
% Electronic Publishing Group
% Joanna Rycko
% and the
% DNB - Deutsche Nationalbibliothek
\chapter{Discourse-Aware Sentiment Analysis}\label{chap:discourse}
Although message-level sentiment analysis methods do a fairly good job
at classifying the overall polarity of a message,
%% putting their best leg forward to incorporate the compositional
%% principle into that prediction,
a crucial limitation of all these systems is that they completely
overlook the structural nature of their input by either considering it
as a single whole (\eg{} bag-of-features approaches) or analyzing it
as a monotone sequence of equally important elements (\eg{} recurrent
neural methods). Unfortunately, both of these solutions violate the
hierarchical principle of language~\cite{Saussure:90,Hjelmslev:70},
which states that complex linguistic units are formed from smaller
language elements in a bottom-up fashion, \eg{} words are created by
putting together morphemes, sentences are made of several words, and
discourses are composed of multiple coherent sentences. Moreover,
apart from this inherent structural heterogeneity, even units of the
same linguistic level might play a different role and be of unequal
importance when joined syntagmatically into the higher-level whole.
For example, in words, the root morpheme typically conveys more
lexical meaning than the affixes; in sentences, the syntactic head
usually dominates its grammatical dependents; and, in discourse, one
of the sentences frequently expresses more relevant ideas than the
rest of the text.
%% At the same time, even auxiliary modifying elements might completely
%% overturn the meaning of the central part to its opposite (cf. \emph{to
%% like} vs. \emph{to dislike}; \emph{She enjoyed this song}
%% vs. \emph{She didn't enjoy this song}; \emph{Trump is a good
%% businessman} vs. \emph{Trump is a good businessman, but a terrible
%% employer}).
It was precisely the lack of discourse information that was one of the main reasons
for the misclassifications made by the systems of \citet{Severyn:15},
\citet{Baziotis:17}, and our own LBA method in
Examples~\ref{snt:cgsa:exmp:severyn-error},\ \ref{snt:cgsa:exmp:baziotis-error},
and~\ref{snt:cgsa:exmp:lba-error}. Since none of these approaches
explicitly took discourse structure into account, we decided to check
whether making the last of these solutions (the LBA classifier) aware
of discourse phenomena would improve its results. But before we
present these experiments, we first would like to make a short
digression into the theory of discourse and give an overview of the
most popular approaches to text-level analysis in the current
literature. Afterwards, in Section~\ref{sec:dasa:data}, we
will describe how we inferred discourse information for PotTS
and SB10k tweets. Then, in Section~\ref{sec:dasa:methods}, we will
summarize the current state of the art in discourse-aware sentiment
analysis (DASA) and also present our own methods, evaluating them on
the aforementioned datasets. After analyzing the effects of various
common factors (such as the impact of the underlying sentiment
classifier and the amenability of various discourse relation schemes
to different DASA approaches), we will recap the results and summarize
our findings in the last part of this chapter.
\section{Discourse Analysis}\label{sec:dasa:theory}
Since the main focus of our experiments will be on \emph{discourse
analysis}, we first need to clarify what discourse analysis actually
means and which common ways there are to represent and analyze
discourse automatically.
In a nutshell, discourse analysis is an area of research which
explores and analyzes language phenomena beyond the sentence
level~\cite{Stede:11}. Although the scope of this research can be
quite large, ranging from the use of pronouns in a sentence to the
logical composition of the whole document, in our work we will
primarily concentrate on the coherence structure of a text, \ie{} its
segmentation into \emph{elementary discourse units} (typically single
propositions) and induction of hierarchical \emph{coherence relations}
(semantic or pragmatic links) between these EDUs.
Although the idea of splitting the text into smaller meaningful pieces
and inferring semantic relationships between these parts is anything
but new, dating back to the very origins of general
linguistics~\cite{Aristotle:10} and in particular its structuralism
branch~\cite{Saussure:90}, an especially big surge of interest in this
field occurred in the 1970s with the fundamental works of
\citet{vanDijk:72} and \citet{vanDijk:83}, who introduced the notion
of local and global coherence, defining the former as a set of ``rules
and conditions for the well-formed concatenation of pairs of sentences
in a linearly ordered sequence'' and specifying the latter as
constraints on the macro-structure of the
narrative~\cite[see][]{Hoey:83}. Similar ideas were also proposed
by~\citet{Longacre:79,Longacre:96}, who considered the paragraph as a
unit of tagmemic grammar that was composed of multiple sentences
according to a predefined set of compositional principles. Almost
contemporary with these works, \citet{Winter:77} presented an
extensive study of various lexical means that could connect two
sentences and grouped these means into two major categories:
\textsc{Matching} and \textsc{Logical Sequence}, depending on whether
they introduced sentences that were giving more details on the
preceding content (\textsc{Matching}) or adding new information to the
narrative (\textsc{Logical Sequence}).
The increased interest of traditional linguistics in text-level
analysis has rapidly attracted the attention of the broader NLP
community. Among the first who stressed the importance of discourse
phenomena for automatic generation and understanding of texts was
\citet{Hobbs:79}, who argued that semantic ties between sentences were
one of the most important components for building a coherent discourse.
Similarly to \citeauthor{Winter:77}, \citeauthor{Hobbs:79} also
proposed a classification of inter-sentence relations, dividing them
into \textsc{Elaboration}, \textsc{Parallel}, and \textsc{Contrast}.
Although this taxonomy was obviously too small to accommodate all
possible semantic and pragmatic relationships that could exist between
two clauses, this division laid the foundations for many
successful approaches to automatic discourse analysis that appeared in
the following decades.
\paragraph{RST.}
One of the best-known such approaches, \emph{Rhetorical Structure
Theory} or \emph{RST}, was presented by~\citet{Mann:88}. Besides
revising \citeauthor{Hobbs:79}' inventory of discourse relations and
expanding it to 23 elements (including new items such as
\textsc{Antithesis}, \textsc{Circumstance}, \textsc{Evidence}, and
\textsc{Elaboration}), the authors also grouped all coherence links
into nucleus-satellite (hypotactic) and multinuclear (paratactic)
ones, depending on whether the arguments of these edges were of
different or equal importance to the content of the whole text. Based
on this grouping, they formally described each relation as a set of
constraints on the \emph{Nucleus} (N), the \emph{Satellite} (S), the
\emph{N+S combination}, and the \emph{effect} of the whole combination
on the reader (R), all formulated from the perspective of the writer
(W). An excerpt from the original definition of the
\textsc{Antithesis} relation is given in
Example~\ref{dasa:exmp:rst-evidence}.
\begin{example}[Definition of the \textsc{Antithesis} Relation]\label{dasa:exmp:rst-evidence}
\textbf{Relation Name:} \textsc{Antithesis}

\textbf{Constraints on N:} W has positive regard for the situation
presented in N

\textbf{Constraints on S:} None

\textbf{Constraints on the N+S Combination:} the situations presented
in N and S are in contrast (\ie{} are
\begin{inparaenum}[(a)]
\item comprehended as the same in many respects,
\item comprehended as differing in a few respects, and
\item compared with respect to one or more of these differences
\end{inparaenum}); because of an incompatibility that arises from the contrast, one
cannot have positive regard for both of the situations presented in N
and S\@; comprehending S and the incompatibility between the
situations presented in N and S increases R's positive regard for
the situation presented in N

\textbf{Effect:} R's positive regard for N is increased

\textbf{Locus of the Effect:} N
\end{example}
The authors then defined the general structure of discourse as a
projective (constituency) tree whose nodes were either elementary
discourse units or subtrees, which were connected to each other via
discourse relations.
You can see an example of such a discourse tree from the original
Rhetorical Structure Treebank~\cite{Carlson:01a} in
Figure~\ref{dasa:fig:rst-tree}.
\begin{figure*}[htb!]\label{dasa:fig:rst-tree}
\input{rst.tex}
\end{figure*}
Despite its immense popularity and practical utility~\cite[see
][]{Marcu:98,Yoshida:14,Bhatia:15,Goyal:16}, RST has often been
criticized for the rigidity of the imposed tree
structure~\cite{Wolf:05} and unclear distinction between discourse
relations~\cite{Nicholas:94,Miltsakaki:04}. As a result of this
criticism, two alternative approaches to automatic discourse analysis
were proposed in later works.
\paragraph{PDTB.}
One of these approaches, \emph{PDTB} (named so after the Penn
Discourse Treebank [\citeauthor{Prasad:04}, \citeyear{Prasad:04}]),
was developed by a research group at the University of
Pennsylvania~\cite{Miltsakaki:04,Miltsakaki:04a,Prasad:08}. Instead
of fully specifying the hierarchical structure of the whole text and
providing an all-embracing set of discourse relations, the authors of
this theory mainly focused on the grammatical and lexical means that
could connect two sentences (\emph{connectives}) and express a
semantic relationship (\emph{sense}) between these predicates.
Typical such means are coordinating or subordinating conjunctions
(\eg{} \emph{and}, \emph{because}, \emph{since}) and discourse
adverbials (\eg{} \emph{however}, \emph{otherwise}, \emph{as a
result}), which can denote a \textsc{Comparison}, a
\textsc{Contingency}, or some other sense\footnote{In particular, the
authors of PDTB distinguished four major senses
(\textsc{Comparison}, \textsc{Contingency}, \textsc{Expansion}, and
\textsc{Temporal}), and subdivided each of these categories into
further subtypes, \eg{} \textsc{Comparison} included
\textsc{Concession} and \textsc{Contrast}, whereas
\textsc{Contingency} sense was further divided into \textsc{Cause}
and \textsc{Condition}.} between two sentential arguments
(\textsc{Arg1} and \textsc{Arg2}).
%% The choice of these senses was explicitly restricted for each word:
%% for example, the set of possible senses for \emph{nonetheless}
%% included \textsc{Comparison}, \textsc{Conjunction},
%% \textsc{Contra-Expectation}, and \textsc{Contrast}.
Apart from \emph{explicitly} mentioned connectives, \citet{Prasad:04}
also allowed for situations where a connective was missing but could
be easily inferred from the text. They called such cases
\emph{implicit} discourse relations and demanded the arguments of such
structures be determined as well. Furthermore, if no implicit
connective could be inserted either, the authors of PDTB distinguished
three different possibilities:
\begin{itemize}
\item the coherence relation was either expressed by an alternative
lexical means, which made the connective redundant
(\textsc{AltLex}),
\item or it was achieved by referring to the same entities in both
arguments (\textsc{EntRel}),
\item or there was no coherence relation at all (\textsc{NoRel});
\end{itemize}
and also provided a special \textsc{Attribution} label for marking the
authors of reported speech.
Example~\ref{dasa:exmp:pdtb-analysis} shows the previous fragment of
the Rhetorical Treebank now annotated according to the PDTB scheme.
As we can see from the analysis, PDTB is indeed more flexible than
RST, as it allows its discourse units (arguments) to overlap, be
disjoint, or even be embedded into other segments. The assignment of
sense relations is also more straightforward and mainly determined by
the connectives that link the arguments. But, at the same time, the
structure of this annotation is completely flat so that we can neither
infer which of the sentences plays a more prominent role nor see the
modification scope of other supplementary statements.
\begin{example}[Example of PDTB Analysis]\label{dasa:exmp:pdtb-analysis}
\fbox{Analysts said,} \argone[1]{profit for the dozen or so big drug
makers, as a group, is estimated to have climbed between 11\% and
14\%.} \connective[1]{\textsc{implicit}$:=$in fact}
\argtwo[1]{\connective[2]{\textsc{explicit}$:=$While}
\argtwo[2]{that's not spectacular}}, \fbox{Neil Sweig, an analyst
with Prudential Bache, said} \argtwo[1]{\argone[2]{\argone[3]{that
the rate of growth will ``look especially good as compared to
other companies} \connective[3]{\textsc{explicit}:
if}\argtwo[3]{the economy turns downward}}}.''
\end{example}
\paragraph{SDRT.}
Another alternative to RST, \emph{Segmented Discourse Representation
Theory} or \emph{SDRT}, was proposed by \citet{Lascarides:01}.
Although developed from a completely different perspective (the
authors of SDRT mainly drew their inspiration from predicate logic,
dynamic semantics, and anaphora theory), this theory shares many of
its features with Rhetorical Structure Theory, as it also assumes a
graph-like structure of text and distinguishes between coordinating
and subordinating relations. However, unlike RST, Segmented Discourse
Representation Theory explicitly allows the text structure to be a
multigraph and not only a tree (\ie{} a discourse node can have multiple parents
and can also be connected via multiple links to the same vertex),
provided that it does not have crossing dependencies (\ie{} does not
violate the right-frontier constraint).
We can also notice the relatedness of the two theories by looking at
the SDRT analysis of the previous RST fragment in
Figure~\ref{dasa:fig:sdrt-graph}. Although the names of the
relations in the presented graph differ from those used in Rhetorical
Structure Theory, many of these links have the same (or at least
similar) meaning as the respective edges in the first analysis: for
example, the \textsc{Source} relation in SDRT almost completely
corresponds to the \textsc{Attribution} edge in
Example~\ref{dasa:fig:rst-tree}, and the \textsc{Contrast} link is
similar to the \textsc{Comparison} relation defined by
\citet{Carlson:01b}.
%% These discrepancies between paratactic dependencies in SDRT and
%% their hypotactic equivalents in RST account for the lion's share of
%% the differences between the two discourse representations in
%% Figures~\ref{dasa:fig:rst-tree} and \ref{dasa:fig:sdrt-graph}.
%% Another dissimilarity stems from the different scopes of the
%% commentary \texttt{While that's not spectacular} assigned by SDRT and
%% RST: while the SDRT graph suggests that this opinion primarily relates
%% to the actual statement of Neil Sweig, RST tree widens the
%% modification scope of this opinion also to the fact of making this
%% statement.
\begin{figure}[htbp]
\begin{center}
\begin{tikzpicture}[>=triangle 45,semithick]
\tikzstyle{edu}=[]; \tikzstyle{cdu}=[draw,shape=rectangle];
\node[edu] (1a) at (1,0) {$\pi_{1a}$}; \node[edu] (1b) at (1,-2)
{$\pi_{1b}$};
\node[edu] (p'') at (7,0) {$\pi''$};
\node[edu] (p') at (5.5,-2) {$\pi'$};
\node[edu] (1g) at (8.5,-2) {$\pi_{1g}$};
\node[edu] (1e) at (4,-4) {$\pi_{1e}$};
\node[edu] (1f) at (7,-4) {$\pi_{1f}$};
\node[edu] (1c) at (2,-2) {$\pi_{1c}$};
\node[edu] (1d) at (4,-2) {$\pi_{1d}$};
\draw[->] (1a) to node [auto] {Source} (1b);
\draw[->] (1a) to node [auto] {Narration} (p'');
\draw[-] (p'') to node [auto] {} (p');
\draw[-] (p'') to node [auto] {} (1g);
\draw[->] (p') to node [auto] {Precondition} (1g);
\draw[-] (p') to node [auto] {} (1e);
\draw[-] (p') to node [auto] {} (1f);
\draw[->] (1e) to node [auto] {Contrast} (1f);
\draw[->] (p'') to node [xshift=-8mm,yshift=-0.35mm] {Commentary} (1c);
\draw[->] (p'') to node [xshift=-0mm,yshift=0mm] {Source} (1d);
\end{tikzpicture}
\caption{Example of an SDRT graph}\label{dasa:fig:sdrt-graph}
\end{center}
\end{figure}
\paragraph{Final choice.}
Because it was unclear which of these approaches (RST, PDTB, or SDRT)
would be most amenable to our sentiment experiments, we made our
decision by considering the following theoretical and practical
aspects: From a theoretical perspective, we wanted to have a strictly
hierarchical discourse structure for each analyzed tweet so that we
could infer the semantic orientation of that message by recursively
accumulating polarity scores of its elementary discourse segments.
From a practical point of view, since there was no discourse parser
readily available for German, we wanted to have a maximal assortment
of such systems available for English so that we could pick one that
would be easiest to retrain on German data. Fortunately, both of
these concerns led us to the same solution---Rhetorical
Structure Theory, which was the only formalism that explicitly
guaranteed a single root for each analyzed text and also offered a
wide variety of open-source parsing
systems~\cite[\eg][]{Hernault:10,Feng:14,Ji:14,Yoshida:14,Joty:15}.
\section{Data Preparation}\label{sec:dasa:data}
To prepare the data for our experiments, we split all microblogs from
%% This figure was generated using the iPython notebook
%% `notebooks/dasa.ipynb`.
\begin{figure*}[htb]
\centering { \centering
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\linewidth]{img/dasa_potts_edu_distribution.png}
\caption{PotTS}\label{dasa:fig:data-distribution-potts}
\end{subfigure}
}
\centering
{
\centering
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\linewidth]{img/dasa_sb10k_edu_distribution.png}
\caption{SB10k}\label{dasa:fig:data-distribution-sb10k}
\end{subfigure}
}
\caption[EDU distribution in PotTS and SB10k]{Distribution
of elementary discourse units and polarity classes in the
training and development sets of PotTS and
SB10k}\label{dasa:fig:data-distribution}
\end{figure*}
the PotTS and SB10k corpora into elementary discourse units using the
ML-based discourse segmenter of \citet{Sidarenka:15}, which had been
previously trained on the Potsdam Commentary Corpus~\cite[PCC~2.0;
][]{Stede:14}. After filtering out all tweets that had only one
EDU,\footnote{Since the focus of this chapter is mainly on discourse
phenomena, we skip all messages that consist of a single discourse
segment, because their overall polarity is unaffected by the
discourse structure and can be normally determined with the standard
discourse-unaware sentiment techniques.} we obtained 4,771 messages
(12,137 segments) for PotTS and 3,763 posts (9,625 segments) for the
SB10k corpus. In the next step, we assigned polarity scores to the
segments of these microblogs with the help of our lexicon-based
attention classifier, analyzing each elementary unit in isolation,
independently of the rest of the tweet. We again used the same
70--10--20 split into training, development, and test sets as we did
in the previous chapters, considering message-level labels inferred
from the annotation of the second expert as gold standard for the
PotTS corpus and using the provided manual sentiment labels as
reference for the SB10k data.
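
For concreteness, the following minimal Python sketch outlines this
preparation pipeline. The functions \texttt{segment\_into\_edus} and
\texttt{lba\_polarity\_scores} are hypothetical stand-ins for the
actual segmenter and LBA classifier (the naive punctuation-based split
below merely serves as a placeholder):
\begin{verbatim}
import re
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tweet:
    text: str
    label: str  # gold message-level polarity

def segment_into_edus(text: str) -> List[str]:
    # placeholder for the ML-based discourse segmenter;
    # here we naively split on sentence-final punctuation
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def lba_polarity_scores(edu: str) -> Tuple[float, float, float]:
    # placeholder for the LBA classifier: (negative, neutral,
    # positive) scores of a single EDU analyzed in isolation
    return (1 / 3, 1 / 3, 1 / 3)

def prepare(tweets: List[Tweet]):
    """Keep multi-EDU tweets and score each EDU in isolation."""
    data = []
    for tw in tweets:
        edus = segment_into_edus(tw.text)
        if len(edus) < 2:   # single-EDU messages are skipped
            continue
        data.append((tw, edus,
                     [lba_polarity_scores(e) for e in edus]))
    return data
\end{verbatim}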
As we can see from the statistics in
Figure~\ref{dasa:fig:data-distribution}, most tweets that consist of
multiple EDUs typically have two or three segments, whereas messages
with more than three discourse units are extremely rare. This is
hardly surprising given that the maximum length of a microblog is
constrained to 140 characters. Nonetheless, even with this severe
length restriction, there still are a few messages that have up to 13
EDUs. Since it was somewhat surprising for us to see so many
segments in a single tweet, we decided to have a closer look at these
cases. As it turned out, such a high number of discourse units
typically resulted from spurious punctuation marks, which were
carelessly used by Twitter users and evidently confused the segmenter
(see Example~\ref{dasa:exmp:many-segments}).
\begin{example}[SB10k Tweet with 13 EDUs]\label{dasa:exmp:many-segments}
\noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
[Guinness on Wheelchairs :]$_1$ [Das .]$_2$ [Ist .]$_3$ [Verdammt
.]$_4$ [Noch .]$_5$ [Mal .]$_6$ [Einer .]$_7$ [Der .]$_8$
[Besten .]$_9$ [Werbespots .]$_{10}$ [Des .]$_{11}$ [Jahrzehnts
.]$_{12}$ [( Auch ...]$_{13}$ }\\
{\textup{[}Guinness on
Wheelchairs :\textup{]$_1$} \textup{[}This .\textup{]$_2$}
\textup{[}Is .\textup{]$_3$} \textup{[}Gosh .\textup{]$_4$}
\textup{[}Darn .\textup{]$_5$} \textup{[}It .\textup{]$_6$}
\textup{[}One .\textup{]$_7$} \textup{[}Of .\textup{]$_8$}
\textup{[}The best .\textup{]$_9$} \textup{[}Commercials
.\textup{]$_{10}$} \textup{[}Of .\textup{]$_{11}$} \textup{[}The
Decade .\textup{]$_{12}$} \textup{[}( Also ...\textup{]$_{13}$}}
\end{example}
Another noticeable trend that we can see in the data is that the
distribution of polarity classes in messages with multiple segments
largely corresponds to the frequencies of these polarities in the
complete datasets: For example, the positive semantic orientation
still dominates the PotTS corpus, whereas the neutral polarity
constitutes the vast majority of the SB10k set. At the same time,
negative microblogs again are the least represented class in both
cases and account for only 22\% of the former corpus and for 16\% of
the latter data.
To obtain RST trees for these messages, we retrained the DPLP
discourse parser of~\citet{Ji:14} on PCC, after converting all
discourse relations to the binary scheme $\{$\textsc{Contrastive},
\textsc{Non-Contrastive}$\}$ as suggested
by~\citet{Bhatia:15}.\footnote{See Table~\ref{dasa:tbl:rst-rel-sets}
for more details regarding this mapping.} In contrast to the
original DPLP implementation though, we did not use Brown
clusters~\cite{Brown:92}, because this resource was not available for
German, nor did we apply the linear projection of the features,
because the released parser code was missing this component as well.
In part due to these modifications, but mostly because of the
specifics of the German language (richer morphology, higher lexical
variety, and syntactic ambiguity) and a skewed distribution of
discourse relations, the results of the retrained model were
considerably lower than the figures reported for the English treebank,
amounting to 77.7, 51.2, and 39.6~\F{} for span, nuclearity, and
relation classification on PCC~2.0 versus the corresponding 82.08, 71.13,
and 61.63~\F{} on the RST Treebank.\footnote{Following \citet{Ji:14},
we use the span-based evaluation metric of~\citet{Marcu:00}.}
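
The relation mapping itself can be sketched in a few lines; note that
the set of contrastive relation names below is only an assumed
illustrative subset, the authoritative mapping being the one in
Table~\ref{dasa:tbl:rst-rel-sets}:
\begin{verbatim}
# Illustrative sketch of the binary relation scheme.  The set of
# contrastive relation names is an assumed subset; see the table
# referenced above for the complete mapping.
CONTRASTIVE = {"contrast", "antithesis", "concession"}

def binarize_relation(relation: str) -> str:
    if relation.lower() in CONTRASTIVE:
        return "Contrastive"
    return "Non-Contrastive"
\end{verbatim}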
\begin{figure*}[htb]
\input{twitter-rst.tex}
\end{figure*}
An example of an automatically induced RST tree is shown in
Figure~\ref{dasa:fig:twitter-rst-tree}. As we can see from this
picture, the adapted parser can correctly distinguish between
contrastive and non-contrastive relations in the analyzed tweet (even
though it only predicts the former class for two percent of all edges
on the PotTS and SB10k data [see
Figure~\ref{dasa:fig:relation-distribution}]), but apparently
struggles with the disambiguation of the nuclearity status, assigning
the highest importance in this example to the initial discourse
segment (``Mooooiiinn.'' [\emph{Hellloooo!}]), which is merely a
greeting, and weighing the second EDU (``Gegen solche N\"achte hilft
die beste Kur nicht.'' [\emph{Even the best cure won't help against
such nights.}]) less than the third one (``Aber Kaffee!''
[\emph{But coffee!}]), although traditional RST would rather consider
both units as equally relevant and join them via the multi-nuclear
\textsc{Contrast} link.
\begin{figure*}[bht]
\centering
{
\centering
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\linewidth]{img/dasa_potts_rel_distribution.png}
\caption{PotTS}\label{dasa:fig:relation-distribution-potts}
\end{subfigure}
}
\centering
{
\centering
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\linewidth]{img/dasa_sb10k_rel_distribution.png}
\caption{SB10k}\label{dasa:fig:relation-distribution-sb10k}
\end{subfigure}
}
\caption[Relation distribution in PotTS and
SB10k]{Distribution of discourse relations in the training
and development sets of PotTS and
SB10k}\label{dasa:fig:relation-distribution}
\end{figure*}
\section{Discourse-Aware Sentiment Analysis}\label{sec:dasa:methods}
% \done[inline]{\citet{Bickerstaffe:10}}
% \citet{Bickerstaffe:10} also considered the rating prediction task,
% addressing this problem with the minimum-spanning-tree (MST) SVM
% approach. In the initial step of this method, they constructed a
% strongly connected graph whose vertices were associated with the most
% representative example (determined via the average all-pairs Tanimoto
% coefficient) of each star rating and the edge weights represented the
% Tanimoto distances between those nodes. Afterwards, they determined
% the MST of this graph using the Kruskal's
% algorithm~\cite[see][pp.~567--574]{Cormen:09} and, finally,
% constructed a decision tree from this MST, replacing the MST vertices
% with binary SVM classifiers, which had to discern the respective
% rating groups. An evaluation on the four-star review corpus
% of~\citet{Pang:05} showed an improvement by up to~7\% over the
% previous state of the art, boosting it to 59.37\% average accuracy.
Now, before we use these data in our sentiment experiments, let us
first review the most prominent approaches to discourse-aware
sentiment analysis in the current literature.
As it turns out, even the very first works on opinion mining already
pointed out the importance of discourse phenomena for classification
of the overall polarity of a text. For example, in the seminal paper
of~\citet{Pang:02}, where the authors tried to predict the semantic
orientation of movie reviews, they quickly realized that it
was insufficient to rely on the mere presence or even the majority of
polarity clues in the text, because these clues could at any time be
reversed by a single counter-argument of the critic (see
Example~\ref{disc-snt:exmp-pang02}). This observation was also
confirmed by \citet{Polanyi:06}, who ranked discourse relations among
the most important factors that could significantly affect the
intensity and polarity of a sentiment. To prove this claim, they gave
several convincing examples, where a concessive statement considerably
weakened the strength of a polar opinion, and vice versa, an
elaboration notably increased its persuasiveness.
\citet{Pang:04} were also among the first who incorporated a
discourse-aware component into a document-level sentiment classifier.
For this purpose, they developed a two-stage system in which the first
predictor distinguished between subjective and objective statements by
constructing a graph of all sentences (linking each sentence to its
neighbors and also connecting it to two abstract polarity nodes) and
then partitioning this graph into two clusters (subjective and
objective) based on its minimum cut; the second classifier then
inferred the overall polarity of the text by only looking at the
sentences from the first (subjective) group. With this method,
\citeauthor{Pang:04} achieved a statistically significant improvement
(86.2\% versus 85.2\% for the Na\"{\i}ve Bayes system and 86.15\%
versus 85.45\% for SVM) over classifiers that analyzed all text
sentences at once, without any filtering.
%% (Later on, a similar approach was also proposed by
%% \citeauthor{Yessenalina:10}~[\citeyear{Yessenalina:10}], who used
%% an expectation-maximization algorithm to select a small subset of
%% the most indicative sentences and then classified the document [as
%% either positive or negative] with the help of this subset,
%% achieving 93.22\% accuracy on the aforementioned IMDB dataset.)
\begin{example}[Polarity reversal via discourse antithesis]\label{disc-snt:exmp-pang02}
\noindent\upshape This film should be brilliant. It sounds like a
great plot, the actors are first grade, and the supporting cast is
good as well, and Stallone is attempting to deliver a good
performance. However, it can't hold up.~\cite{Pang:02}
\end{example}
Although an oversimplification, the core idea that locally adjacent
sentences are likely to share the same subjective orientation
(\emph{local coherence}) dominated the subsequent DASA research
for almost a decade. For example, \citet{Riloff:03} also improved the
accuracy of their Na\"{\i}ve Bayes predictor of subjective expressions
by almost two percent after adding a set of local coherence features.
Similarly, \citet{Hu:04} could better disambiguate users' attitudes to
particular product attributes by taking the semantic orientation of
previous sentences into account.
At the same time, another line of discourse-aware sentiment research
concentrated on the joint classification of all opinions in the text,
where in addition to predicting each sentiment in isolation, the
authors also sought to maximize the ``total happiness'' (\emph{global
coherence}) of these assignments, ensuring that related subjective
statements received agreeing polarity scores. Notable works in this
direction were done by \citet{Snyder:07}, who proposed the Good Grief
algorithm for predicting users' satisfaction with different restaurant
aspects, and \citet{Somasundaran:08a,Somasundaran:08}, who introduced
the concept of \emph{opinion frames} (OF), a special data structure
for capturing the relations between opinions in discourse. Depending
on the type of these opinions (arguing~[\emph{A}] or
sentiment~[\emph{S}]), their polarity towards the target
(positive~[\emph{P}] or negative~[\emph{N}]), and semantic
relationship between these targets (alternative~[\emph{Alt}] or the
same~[\emph{same}]), the authors distinguished 32 types of possible
frames (\emph{SPSPsame}, \emph{SPSNsame}, \emph{APAPalt}, etc.),
dividing them into reinforcing and non-reinforcing ones. In later
works, \citet{Somasundaran:09a,Somasundaran:09b} also presented two
joint inference frameworks (one based on the iterative classification
and another one relying on integer linear programming) for determining
the best configuration of all frames in text, achieving 77.72\%
accuracy on frame prediction in the AMI meeting
corpus~\cite{Carletta:05}.
%% \done[inline]{\citet{Somasundaran:09a,Somasundaran:09b}}
%% In a later work, \citet{Somasundaran:09b,Somasundaran:09a} also
%% introduced a joint inference framework based on the Iterative
%% Classification Algorithm (ICA) and Integer Linear Programming (ILP)
%% for joinly predicting the best configuration of single opinions and
%% their frames. In this approach, the authors first applied a local SVM
%% classifier to compute the probabilities of polarity classes (positive,
%% negative, or neutral) of individual dialog acts and then harnessed the
%% ICA and ILP systems to determine which of the predicted opinions were
%% connected via opinion frames and whether these frames were reinforcing
%% or not. Given a perfect information about the opinion links, this
%% joint method outperformed the local classifier by more than 9
%% percentage points, reaching 77.72\% accuracy on the AMI meeting
%% corpus~\cite{Carletta:05}.
%% \done[inline]{\citet{Mao:06}}
%% \citet{Mao:06} proposed the idea of isotonic CRFs in which they
%% explicitly modeled the constraint that features which were stronger
%% associated with either polarity classes had to have higher
%% coefficients than less predictive attributes. After proving that this
%% formalism also allowed to directly model the ordinal scale of
%% sentiment scores (with lower CRF outputs indicating the negativity of
%% a sentence, and higher scores showing its positive class), the authors
%% used this approach to model the sentiment flow in a document. For
%% this purpose, they first predicted the polarity value for each
%% sentence of a document in isolation and then convolved these outputs
%% with a Gaussian kernel, getting a smoothed polarity curve for the
%% whole analyzed text at the end.
%% \done[inline]{\citet{Thomas:06}}
%% \citet{Thomas:06} enhanced an SVM-based sentiment classification
%% system for predicting speaker's attitude in political speeches with
%% information about the inter-speaker agreement, incorporating these
%% links into the global cost function. Thanks to this change, the
%% authors achieved $\approx$4\% improvement in accuracy (from 66.05 to
%% 70.81\%) over the baseline classifer which analyzed each utterance in
%% isolation.
An attempt to unite local and global coherence was made by
\citet{McDonald:07}, who tried to simultaneously predict the polarity
of a document and classify semantic orientations of its sentences.
For this purpose, the authors devised an undirected probabilistic
graphical model based on the structured linear
classifier~\cite{Collins:02}. Similarly to \citet{Pang:04}, they
connected the label nodes of each sentence to the labels of its
neighboring clauses and also linked these nodes to the overarching
vertex representing the polarity of the text. After optimizing this
model with the MIRA learning algorithm~\cite{Crammer:03},
\citeauthor{McDonald:07} achieved an accuracy of 82.2\% for
document-level classification and 62.6\% for sentence-level prediction
on a corpus of online product reviews, outperforming pure document and
sentence classifiers by up to four percent. A crucial limitation of
this system though was that its optimization required the gold labels
of sentences and documents to be known at training time, which
considerably limited its applicability to other domains with no such
data.
%% A similar approach was also suggested by~\citet{Sadamitsu:08}, who
%% attained 82.74\% accuracy on predicting the polarity of customer
%% reviews with the help of hidden conditional random fields.
Another significant drawback of all previous approaches is that they
completely ignored traditional discourse theory and, as a result,
severely oversimplified discourse structure. Among the first who
tried to overcome this omission were \citet{Voll:07}, who proposed two
discourse-aware enhancements of their lexicon-based sentiment
calculator (SO-CAL). In the first method, the authors let SO-CAL
analyze only the topmost nucleus EDU of each sentence, whereas in the
second approach, they expanded its input to all clauses that another
classifier had considered as relevant to the main topic of the
document. Unfortunately, the former solution did not work out as well
as expected, yielding 69\% accuracy on the corpus of Epinion
reviews~\cite{Taboada:06}, but the latter system could perform much
better, achieving 73\% on this two-class prediction task.
Other ways of adding discourse information to a sentiment system were
explored by \citet{Heerschop:11}, who experimented with three
different approaches:
\begin{inparaenum}[(i)]
\item increasing the polarity scores of words that appeared near the
end of the document,
\item assigning higher weights to nucleus tokens, and finally
\item learning separate scores for nuclei and satellites using a
genetic algorithm.
\end{inparaenum}
An evaluation of these methods on the movie review corpus
of~\citet{Pang:04} showed better performance of the first option
(60.8\% accuracy and 0.597 macro-\F), but the authors could
significantly improve the results of the last classifier by
adding an offset to the decision boundary of this method, which
increased both its accuracy and macro-averaged \F{} to 0.72.
Further notable contributions to RST-based sentiment analysis were
made by \citet{Zhou:11}, who used a set of heuristic rules to infer
polarity shifts of discourse units based on their nuclearity status
and outgoing relation links; \citet{Zirn:11}, who used a lexicon-based
sentiment system to predict the polarity scores of elementary
discourse units and then enforced consistency of these assignments
over the RST tree with the help of Markov logic constraints; and,
finally, \citet{Wang:13}, who determined the semantic orientation of a
document by taking a linear combination of the polarity scores of its
EDUs and multiplying these scores with automatically learned
coefficients.
%% \footnote{Similarly to the approach of~\citet{Zirn:11}, these
%% coefficients depended on the status of the segment in the RST
%% tree (whether nucleus or sattelite) and relation, which connected
%% the respective discourse node to the ancestor.} A similar system
%% was also described by \citet{Chenlo:13,Chenlo:14}, who used their
%% model to analyze user blog posts, achieving significantly better
%% results on the TREC corpus \cite{Macdonald:09} than any
%% discourse-unaware baselines.
Among the most recent advances in RST-aware sentiment research, we
should especially emphasize the work of \citet{Bhatia:15}, who
proposed two different DASA systems:
\begin{itemize}
\item discourse-depth reweighting (DDR)
\item and rhetorical recursive neural network (R2N2).
\end{itemize}
In the former approach, the authors estimated the relevance
$\lambda_i$ of each elementary discourse unit $i$ as:
\begin{equation*}
\lambda_i = \max\left(0.5, 1 - d_i/6\right),
\end{equation*}
where $d_i$ stands for the depth of the $i$-th EDU in the document's
discourse tree. Afterwards, they computed the sentiment score
$\sigma_i$ of that unit by taking the dot product of its binary
feature vector $\mathbf{w}_i$ (token unigrams) with polarity scores
$\boldsymbol{\theta}$ of these unigrams:
\begin{equation*}
\sigma_i = \boldsymbol{\theta}^{\top}\mathbf{w}_i;
\end{equation*}
and then calculated the overall semantic orientation of the
document~$\Psi$ as the sum of sentiment scores for all units,
multiplying these scores by their respective discourse-depth factors:
\begin{equation*}
\Psi = \sum_i\lambda_i\boldsymbol{\theta}^{\top}\mathbf{w}_i = \boldsymbol{\theta}^{\top}\sum_i\lambda_i\mathbf{w}_i.
\end{equation*}
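
Given precomputed EDU depths, unigram vectors, and polarity weights,
this computation can be sketched in a few lines of NumPy (a simplified
re-implementation for illustration, not the original code
of~\citeauthor{Bhatia:15}):
\begin{verbatim}
import numpy as np

def ddr_score(depths, W, theta):
    # depths: depth d_i of each EDU in the discourse tree
    # W:      binary unigram matrix, one row w_i per EDU
    # theta:  learned unigram polarity weights
    lambdas = np.maximum(0.5, 1.0 - np.asarray(depths) / 6.0)
    # Psi = sum_i lambda_i * theta^T w_i
    return float(lambdas @ (np.asarray(W) @ np.asarray(theta)))
\end{verbatim}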
In the R2N2 system, the authors largely adopted the RNN method
of~\citet{Socher:13} by recursively computing the polarity scores of
discourse units as:
\begin{equation*}
\psi_i = \tanh\left(K_n^{(r_i)} \psi_{n(i)} + K_s^{(r_i)}\psi_{s(i)} \right),
\end{equation*}
where $K_n^{(r_i)}$ and $K_s^{(r_i)}$ stand for the nucleus and
satellite coefficients associated with the rhetorical relation $r_i$,
and $\psi_{n(i)}$ and $\psi_{s(i)}$ represent sentiment scores of the
nucleus and satellite of the $i$-th vertex. This approach achieved
84.1\% two-class accuracy on the movie review corpus
of~\citet{Pang:04} and reached 85.6\% on the dataset
of~\citet{Socher:13}.
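
The recursive forward pass itself is straightforward to sketch; the
version below assumes a binarized RST tree with scalar node scores and
hypothetical coefficient dictionaries \texttt{K\_n} and \texttt{K\_s},
whereas the actual R2N2 model learns these coefficients jointly with
the lexical scores:
\begin{verbatim}
import math

def r2n2_score(node, K_n, K_s, leaf_scores):
    # node: either a leaf EDU index or a triple
    #       (relation, nucleus_subtree, satellite_subtree)
    if isinstance(node, int):          # a leaf EDU
        return leaf_scores[node]
    rel, nucleus, satellite = node
    return math.tanh(
        K_n[rel] * r2n2_score(nucleus, K_n, K_s, leaf_scores)
        + K_s[rel] * r2n2_score(satellite, K_n, K_s, leaf_scores))
\end{verbatim}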
For the sake of completeness, we should also note that there exist
discourse-aware sentiment approaches that build upon PDTB and SDRT\@.
For example, \citet{Trivedi:13} proposed a method based on latent
structural SVM~\cite{Yu:09}, where they represented each sentence as a
vector of features produced by a feature function $\mathbf{f}(y,
\mathbf{x}_i, h_i)$, in which $y\in\{-1, +1\}$ denotes the potential
polarity of the whole document, $h_i \in \{0, 1\}$ stands for the
assumed subjectivity class of sentence $i$, and $\mathbf{x}_i$
represents the surface form of that sentence; and then tried to infer
the most likely semantic orientation of the document $\hat{y}$ over
all possible assignments $\mathbf{h}$, \ie{}:
\begin{equation*}
\hat{y} =
\argmax_y\left(\max_{\mathbf{h}}\mathbf{w}^{\top}\mathbf{f}(y,
\mathbf{x}, \mathbf{h})\right).
\end{equation*}
To ensure that these assignments were still coherent, the authors
additionally extended their feature space with special
\emph{transitional} attributes, which indicated whether two adjacent
sentences were likely to share the same subjectivity given the
discourse connective between them. With the help of these features,
\citeauthor{Trivedi:13} could improve the accuracy of the
connector-unaware model on the movie review corpus of~\citet{Maas:11}
from 88.21 to 91.36\%.
The first step towards an SDRT-based sentiment approach was made by
\citet{Asher:08}, who presented an annotation scheme and a pilot
corpus of English and French texts that were analyzed according to the
SDRT theory and enriched with additional sentiment information.
Specifically, the authors asked the annotators to ascribe one of four
opinion categories (reporting, judgment, advice, or sentiment) along
with their subclasses (\eg{} inform, assert, blame, recommend) to each
discourse unit that had at least one opinionated word from a sentiment
lexicon. Afterwards, they showed that with a simple set of rules, one
could easily propagate opinions through SDRT graphs, increasing the
strengths or reversing the polarity of the sentiments, depending on
the type of the discourse relation that was linking two segments.
In general, however, PDTB- and SDRT-based sentiment systems are much
less common than RST-inspired solutions. Because of this fact and due
to the reasons described in Section~\ref{sec:dasa:theory}, we will
primarily concentrate on RST-based methods. In particular, for
the sake of comparison, we replicated the linear combination approach
of \citet{Wang:13} and also reimplemented the DDR and R2N2 systems
of~\citet{Bhatia:15}. Furthermore, to see how these techniques would
perform in comparison with much simpler baselines, we additionally
created two methods that predicted the polarity of a message by only
considering its last or topmost nucleus EDU (henceforth \textsc{Last}
and \textsc{Root}), and also estimated the results of our original LBA
classifier without any discourse-related modifications (henceforth
\textsc{No-Discourse}).
Apart from the above baselines and existing methods, we propose
several novel DASA solutions, which will be briefly described below.
\subsection{Latent CRF}
In the first of these solutions, called \emph{Latent Conditional
Random Fields} or \emph{LCRFs}, we consider the problem of
message-level sentiment analysis as an inference task over an
undirected graphical model, where the nodes of the model represent
polarity probabilities of elementary discourse units and the structure
of the graph reflects the RST dependency tree of the
message.\footnote{Drawing on the work of~\citet{Bhatia:15}, we obtain
this representation using the DEP-DT algorithm of~\citet{Hirao:13}
with a minor modification that we do not follow any satellite
branches while computing the heads of abstract RST nodes in Step 1
of this procedure~\cite[see][pp.~1516--1517]{Hirao:13}.} In
particular, we define CRF graph $\mathcal{G}=(\mathcal{V},
\mathcal{E})$ as a set of vertices $\mathcal{V}=
\mathcal{Y}\cup\mathcal{X}$, in which $\mathcal{Y}=\{y_{(i, j)}\mid
i\in\{\text{\textsc{Root}}, 1, 2, \ldots, T\}, j
\in\{\text{\textsc{Negative}, \textsc{Neutral},
\textsc{Positive}}\}\}$ represents (partially observed) random
variables (with $T$ standing for the number of EDUs in the tweet), and
$\mathcal{X}=\{x_{(i, j)}\mid i\in\{\text{\textsc{Root}}, 1, 2,
\ldots, T\}, j \in\{0, 1, 2, 3\}\}$ denotes the respective features of
these nodes (three polarity scores returned by the LBA classifier plus
an additional offset feature whose value is always \texttt{1}
irrespective of the input). Since the \textsc{Root} vertex,
however, does not have a corresponding discourse segment in the RST
tree, we use the polarity scores predicted by the LBA classifier for
the whole message as features for this node.
Graph edges $\mathcal{E}$ connect random variables to their
corresponding features and also link every pair of vertices
$(v_{(k,\cdot)},v_{(i,\cdot)})$ if node $k$ appears as the parent of
node $i$ in the RST dependencies.\footnote{In fact, we use two edges
to connect each child to its parent: one for the
\textsc{Contrastive} relation and another one for the
\textsc{Non-Contrastive} link.} You can see an example of such an
automatically induced CRF tree in Figure~\ref{dasa:fig:latent-crf}.
\begin{figure*}[thb]
\centering \input{latent-crf}
\caption[Example of an RST-based Latent-CRF]{Example of an
automatically constructed RST-based latent-CRF tree\\ {\small
(random variables are shown as circles, fixed input parameters
appear as rectangles, and observed values are displayed in
gray)}}\label{dasa:fig:latent-crf}
\end{figure*}
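
A simplified sketch of this graph construction (assuming the RST
dependencies are already given as parent--child--relation triples)
might look as follows:
\begin{verbatim}
def build_crf_graph(rst_deps, edu_scores, msg_scores):
    # rst_deps:   (parent, child, relation) triples of the RST
    #             dependency tree; "root" marks the tree root
    # edu_scores: LBA (neg, neut, pos) scores of every EDU
    # msg_scores: LBA scores of the whole message (Root features)
    features = {"root": list(msg_scores) + [1.0]}  # offset = 1
    for i, scores in enumerate(edu_scores):
        features[i] = list(scores) + [1.0]
    # each parent-child link is typed as either Contrastive or
    # Non-Contrastive, selecting the respective edge parameters
    edges = [(parent, child, relation)
             for parent, child, relation in rst_deps]
    return features, edges
\end{verbatim}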
%% Figure~\ref{dasa:fig:latent-crf} shows a real example of such
%% automatically induced CRF tree where we can already notice a few
%% tendencies regarding the obtained discourse graph: First of all, our
%% segmenter clearly tends to oversegment its input, also considering
%% conjoined predicates and adverbial subordinate clauses as separate
%% discourse units. Even though this behavior violates the principles of
%% standard RST, it actually comes advantageous to our particular
%% sentiment application as it allows the base classifier to be more
%% fine-grained (and consequently more precise) in its predictions. At
%% the same time, we again can see that the automatic parser has
%% difficulties with determining the correct nuclearity status of
%% discourse segments, putting the segment ``f\"uhlt sich fast an''
%% (\textit{almost feels}) in the top-most position, which we can hardly
%% call the right decision. Finally, we also can observe that despite an
%% incorrect prediction of the polarity of the whole tweet (the LBA
%% system considers it as a negative message, although human experts
%% regarded it as neutral) our base classifier might still have better
%% guesses for single EDUs, giving us at least a hypothetical possibility
%% to overcome its general error.
Now before we describe the training of our model, let us briefly
recall that in the standard CRF optimization we typically try to find
optimal parameters $\boldsymbol{\theta}^*$ that maximize the
log-likelihood of all label sequences $\mathbf{y}^{(i)}$ on the
training set $\mathcal{D}=\left\{\left(\mathbf{x}^{(i)},
\mathbf{y}^{(i)}\right)\right\}_{i=1}^{N}$, \ienocomma:
\begin{equation*}
\boldsymbol{\theta}^* = \argmax_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}) = \argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log\left(p\left(\mathbf{y}^{(i)}\vert\mathbf{x}^{(i)}; \boldsymbol{\theta}\right)\right),\label{dasa:eq:crf-objective}
\end{equation*}
where the conditional likelihood is normally estimated as:
\begin{equation*}
p\left(\mathbf{y}^{(i)}\vert\mathbf{x}^{(i)}; \boldsymbol{\theta}\right) =
\frac{\exp\left(\sum_{t=1}^{T_i}\sum_k\boldsymbol{\theta}_k\mathbf{f}_k\left(\mathbf{x}^{(i)}_t,\mathbf{y}^{(i)}_{t-1},\mathbf{y}^{(i)}_{t}\right)\right)}{Z}.
\end{equation*}
Adapting this equation to our RST-based CRF structures, we obtain:
\begin{equation}
p\left(\mathbf{y}^{(i)}\vert\mathbf{x}^{(i)}; \boldsymbol{\theta}\right) =
\frac{\exp\left(\sum_{t=0}^{T_i}\left[%
\sum_v\boldsymbol{\theta}_v\mathbf{f}_v\left(\mathbf{x}^{(i)}_t,\mathbf{y}^{(i)}_{t}\right)
+ \sum_{c\in
ch(t)}\sum_e\boldsymbol{\theta}_e\mathbf{f}_e\left(\mathbf{y}^{(i)}_{t},
\mathbf{y}^{(i)}_{c}\right)\right]\right)}{Z},\label{dasa:eq:tree-crf}
\end{equation}
where $ch(t)$ denotes the children of node $t$, $v$ stands for the
indices of node features, and $e$ represents the indices of edge
attributes.
A crucial problem with this formulation though is that in our task,
only a small subset of labels from $\mathbf{y}^{(i)}$ (namely those of
the root node) are actually observed at the training time, whereas the
rest of the tags (those which pertain to EDUs) are unknown. We will
denote these observed and hidden subsets as $\mathbf{y}_o^{(i)}$ and
$\mathbf{y}_h^{(i)}$ respectively. Using this notation, we can
redefine the training objective of our model as finding such
parameters $\boldsymbol{\theta}^*$ that maximize the log-likelihood of
\emph{observed} labels, \ienocomma:
\begin{equation*}
\boldsymbol{\theta}^* =
\argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log\left(p\left(\mathbf{y}_o^{(i)}\vert\mathbf{x}^{(i)};
\boldsymbol{\theta}\right)\right).
\end{equation*}
With this formulation, however, it is still unclear what we should do
with hidden tags $\mathbf{y}_h^{(i)}$, because the values of their
features remain undefined.
One possible way to approach the problem of unobserved states in the
input is to assume that any label sequence $\mathbf{y}_h^{(i)}$ might
be true, and then optimize the parameters along the path that
leads to the maximum probability of the correct observed tag,
\ienocomma:
\begin{align}
\begin{split}
\mathbf{y}^{(i)}&=[\mathbf{y}_o^{(i)}, \mathbf{y}_h^{*(i)}]\text{, where}\\\label{dasa:eq:y_i}
\mathbf{y}_h^{*(i)}&=\argmax_{\mathbf{y}_h^{(i)}}p\left(\left[\mathbf{y}_o^{(i)}, \mathbf{y}_h^{(i)}\right]\vert\mathbf{x}^{(i)}\right),
\end{split}
\end{align}
and which we can easily find using standard Viterbi decoding.
Unfortunately, if we simply consider label sequence $\mathbf{y}^{(i)}$
from Equation~\ref{dasa:eq:y_i} as the ground truth and penalize all
labels that disagree with this sequence, we might overly commit
ourselves to the model's guess of unknown tags and unduly discriminate
against other possible hidden label assignments. To mitigate this
effect, we can instead penalize only one other sequence, namely the
one that maximizes the probability of an incorrect label at the
observed state:
\begin{align*}
\mathbf{y}^{'(i)}&=\argmax_{\mathbf{y}_o^{'(i)}\neq\mathbf{y}_o^{(i)}}p\left([\mathbf{y}_o^{'(i)},
\mathbf{y}_h^{*(i)}]\vert\mathbf{x}^{(i)}\right)\text{,
where}\\
\mathbf{y}_h^{*(i)}&=\argmax_{\mathbf{y}_h^{(i)}}p\left(\left[\mathbf{y}_o^{'(i)}, \mathbf{y}_h^{(i)}\right]\vert\mathbf{x}^{(i)}\right).
\end{align*}
Correspondingly, we reformulate our objective: instead of
maximizing the log-likelihood of the training set, we now maximize
the difference between the log-probabilities of the correct and the most
likely wrong assignments:
\begin{align}
\begin{split}
\boldsymbol{\theta}^* &= \argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log\left(p\left(\mathbf{y}^{(i)}\vert\mathbf{x}^{(i)}\right)\right) - \log\left(p\left(\mathbf{y}^{'(i)}\vert\mathbf{x}^{(i)}\right)\right)\\
&= \argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\log\left(\frac{\exp\left(\boldsymbol{\theta}^{\top}\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)})\right)}{Z}\right) - \log\left(\frac{\exp\left(\boldsymbol{\theta}^{\top}\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})\right)}{Z}\right)\\
&= \argmax_{\boldsymbol{\theta}}\sum_{i=1}^{N}\boldsymbol{\theta}^{\top}\left(\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)}) - \mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})\right),\label{dasa:eq:hcrf-objective}
\end{split}
\end{align}
where $\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)})$ and
$\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})$ mean all features
associated with label sequences $\mathbf{y}^{(i)}$ and
$\mathbf{y}^{'(i)}$ respectively.
The only thing that we still need to add to the above objective is a
regularization term
$\frac{1}{2}\norm{\boldsymbol{\theta}}^2$, which prevents its
divergence to infinity in the cases when
$\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)})$ and
$\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})$ are perfectly
separable. This brings us to the final formulation:
\begin{align}
\boldsymbol{\theta}^* &=
\argmin_{\boldsymbol{\theta}}\frac{1}{2}\norm{\boldsymbol{\theta}}^2 -
\sum_{i=1}^{N}\boldsymbol{\theta}^{\top}\left(\mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{(i)})
- \mathbf{f}(\mathbf{x}^{(i)},\mathbf{y}^{'(i)})\right).
\end{align}
At this point, we can notice that the resulting function is identical
to the unconstrained minimization problem of structural
SVM~\cite{Taskar:03}, and we can indeed piggyback on one of the many
efficient SVM-optimization techniques to learn the parameters of our
model. In particular, we use the block-coordinate Frank-Wolfe
algorithm~\cite{Lacoste-Julien:13}, running it for 1,000 epochs or
until convergence, whichever of these events occurs first.
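
To summarize the procedure, the following perceptron-style sketch
performs one update under this objective; \texttt{decode} (a Viterbi
search over the CRF tree with the root label clamped) and
\texttt{features} (the joint feature map, returning a NumPy vector)
are hypothetical helpers, and the actual model is optimized with
block-coordinate Frank--Wolfe rather than with this simple update:
\begin{verbatim}
LABELS = ("negative", "neutral", "positive")

def margin_update(theta, x, y_root, decode, features, lr=0.1):
    # best completion of the hidden EDU labels given the
    # observed (correct) root label
    y_good = decode(theta, x, root=y_root)
    # most probable labeling with a *wrong* root label
    rivals = [decode(theta, x, root=l)
              for l in LABELS if l != y_root]
    y_bad = max(rivals, key=lambda y: theta @ features(x, y))
    # move theta towards the correct assignment and away
    # from the best wrong one
    return theta + lr * (features(x, y_good) - features(x, y_bad))
\end{verbatim}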
\subsection{Latent-Marginalized CRF}
Another way to tackle unobserved labels is to estimate the probability
of observed tags by marginalizing (summing) out hidden variables from
the joint distribution, \ienocomma:
\begin{align*}
p\left(\mathbf{Y}_o{=}\mathbf{y}_o\right) &=
\sum_{\mathbf{y}_h} p\left(\mathbf{Y}_o{=}\mathbf{y}_o,
\mathbf{Y}_h{=}\mathbf{y}_h\right).
\end{align*}
Applying this formula to Equation~\ref{dasa:eq:tree-crf}, we get:
\begin{align*}
\begin{split}