% FILE: sentiment_fgsa.tex Version 0.0.1
% AUTHOR: Uladzimir Sidarenka
% This is a modified version of the file main.tex developed by the
% University Duisburg-Essen, Duisburg, AG Prof. Dr. Günter Törner
% Verena Gondek, Andy Braune, Henning Kerstan Fachbereich Mathematik
% Lotharstr. 65., 47057 Duisburg entstanden im Rahmen des
% DFG-Projektes DissOnlineTutor in Zusammenarbeit mit der
% Humboldt-Universitaet zu Berlin AG Elektronisches Publizieren Joanna
% Rycko und der DNB - Deutsche Nationalbibliothek
\chapter{Fine-Grained Sentiment Analysis}\label{chap:fgsa}
The task of fine-grained sentiment analysis (FGSA) is to automatically
recognize subjective evaluative opinions (\emph{sentiments}), holders
of these opinions (\emph{sources}), and their respective evaluated
entities (\emph{targets}) in text. Since an accurate automatic
prediction of these elements would allow us to track the public's attitude
towards virtually any object (\eg{} a product, a service, or a
political decision), FGSA is traditionally considered one of the
most attractive and necessary, but, unfortunately, also most
challenging objectives in the opinion-mining field.
Researchers usually interpret this goal as a sequence labeling (SL)
objective and address it with one of the two most popular SL
techniques: conditional random fields (CRFs) or recurrent neural
networks (RNNs).  The former approach is a discriminative
probabilistic graphical model that relies on an extensive set of
hand-crafted features, whereas the latter methods use a recurrent
computational loop and learn their feature representations completely
automatically.
In this chapter, we are going to evaluate each of these solutions in
detail in order to find out which of the two algorithms is better
suited to the domain of German Twitter.  But before we proceed with
our experiments, we first briefly discuss the evaluation metrics that
we are going to use to estimate the quality of these systems.
%% \section{Definition of the Sentiment, Target, and Source Spans}
%% Despite some notable advances and an ongoing active research on
%% fine-grained opinion extraction, the crucial task of defining the
%% exact boundaries of sentiment spans and the spans of their respective
%% targets and sources has not been addressed in the literature with the
%% due attention yet. Researchers typically overlook this problem,
%% leaving its solution to the discretion of their annotators
%% \cite[see][]{Wiebe:05,Klinger:13}.
%% In contrast to these works, instead of relying on rather intuitive
%% decisions of our coders, we explicitly provided a rule for determining
%% opinions' boundaries by telling the experts to assign the
%% \textsc{sentiment} label to ``\emph{minimal complete syntactic or
%% discourse-level units that included both the target of an opinion
%% and its actual evaluation}.''
%% % According to this instruction, during the annotation, linguists first
%% % had to identify evaluated objects (targets) in text, then find the
%% % respective evaluative expressions of these objects (usually but not
%% % necessarily polar terms), and, finally, determine the smallest
%% % syntactic components (typically noun or verb phrases) or discourse
%% % units (clauses or sentences) where both of these entities appeared
%% % together.
%% A sample annotation analyzed in compliance with this rule is shown in
%% Example~\ref{snt:fgsa:exmp:sent-anno1}:
%% \begin{example}[Annotation of a Sentiment Span]\label{snt:fgsa:exmp:sent-anno1}
%% \upshape\sentiment{Der neue Papst gilt als
%% bescheidener, zur\"uckgenommener Typ.}\\[0.8em]
%% \noindent\sentiment{The new Pope is believed to be a sober, modest
%% man.}
%% \end{example}
%% \noindent In this sentence, an expert had to label the complete
%% sentence as a sentiment, since this unit was the minimal syntactic
%% constituent which included both the object of the evaluation---``der
%% neue Papst'' (\textit{the new pope})---and the evaluation
%% itself---``bescheidener, zur\"uckgenommener Typ'' (\textit{a sober,
%% modest man}).
%% We applied the same principles of minimality and completeness to the
%% annotation of targets and sources, requiring the main components of
%% these elements (typically nouns or verbs) to be labeled along with all
%% their syntactic dependents. Accordingly, the correct annotation of
%% the target in the previous example had to look as follows:
%% \begin{example}[Annotation of a Target Span]\label{snt:fgsa:exmp:sent-anno2}
%% \upshape\sentiment{\target{Der neue Papst} gilt als
%% bescheidener, zur\"uckgenommener Typ.}\\[0.8em]
%% \noindent\sentiment{\target{The new Pope} is believed to be a sober,
%% modest man.}
%% \end{example}
%% \noindent with the \textsc{target} span assigned to the whole noun
%% phrase---``der neue Papst'' (\textit{the new pope})---and not only its
%% main word.
%% Similarly, source elements had to cover complete syntactic structures
%% as shown in Example~\ref{snt:fgsa:exmp:src-anno1}:
%% \begin{example}[Annotation of a Source Span]\label{snt:fgsa:exmp:src-anno1}
%% \upshape\sentiment{Die Homosexuellenehe war f\"ur \source{den Kardinal, der jetzt Papst ist,} eine Zerst\"orung von Gottes Plan}\\[0.8em]
%% \noindent\sentiment{For \source{the cardinal, who is the Pope now,}
%% the same-sex marriage was a destruction of God's plan.}
%% \end{example}
%% \noindent This time, again, the whole noun phrase including the
%% dependent attributive clause---``den Kardinal, der jetzt Papst ist,''
%% (\textit{the cardinal, who is the Pope now,})---had to be labeled with
%% the \textsc{source} tag because this constituent was the only
%% \emph{minimal complete} syntactic node which encompassed both the
%% immediate holder of the opinion---``Kardinal'' \textit{cardinal}---and
%% its grammatical dependents, without including any of its parental
%% elements.
\section{Evaluation Metrics}
Because fine-grained sentiment analysis operates on \emph{spans} of
sentiment labels, which typically consist of multiple contiguous tags,
we cannot straightforwardly apply metrics that are used for evaluation
of single independent instances to this objective, as it is unclear
which instances should be measured---single tokens or complete
spans---and how partial matches should be counted in the latter case.
One possibility to estimate the quality of FGSA prediction is to
compute precision, recall, and \F{}-scores of predicted spans by using
\emph{binary-overlap} or \emph{exact-match}
metrics~\cite[see][]{Choi:06,Breck:07}. The first method considers an
automatically labeled span as correct if it has at least one token in
common with a labeled element from the gold annotation. The second
metric only regards an automatic span as true positive if its
boundaries are absolutely identical with the span annotated by the
human expert.  Unfortunately, both of these approaches are
problematic to a certain extent: binary overlap might be overly
optimistic, assigning perfect scores even to automatic spans that
cover the whole sentence, whereas exact match might, vice versa, be
too strict, considering the whole assignment as false if only one
(possibly irrelevant) token is classified incorrectly.
Instead of relying on these measures, we decided to use a ``golden
mean'' solution proposed by \citet{Johansson:10a}, in which they
penalize predicted spans proportionally to the number of tokens whose
labels are different from the gold annotation. More precisely, given
two sets of manually and automatically tagged spans ($\mathcal{S}$ and
$\widehat{\mathcal{S}}$, respectively), \citeauthor{Johansson:10a}
estimate the precision of automatic assignment as:
\begin{equation}\label{eq:fgsa:jmmetric}
P(\mathcal{S}, \widehat{\mathcal{S}}) = \frac{C(\mathcal{S},
\widehat{\mathcal{S}})}{|\widehat{\mathcal{S}}|},
\end{equation}
where $C(\mathcal{S},\widehat{\mathcal{S}})$ stands for the cumulative
token overlap between all pairs of manually ($s_i$) and automatically
($s_j$) annotated spans:
\begin{equation*}
C(\mathcal{S}, \widehat{\mathcal{S}}) = \sum_{s_i \in
\mathcal{S}}\sum_{s_j \in \widehat{\mathcal{S}}}c(s_i, s_j),
\end{equation*}
and the $|\widehat{\mathcal{S}}|$ term denotes the total number of
spans automatically labeled with the given tag.
Similarly, the recall of this assignment is estimated as:
\begin{equation*}
R(\mathcal{S}, \widehat{\mathcal{S}}) = \frac{C(\mathcal{S},
\widehat{\mathcal{S}})}{|\mathcal{S}|}.
\end{equation*}
Using these two values, one can compute the \F{}-measure as usual:
\begin{equation*}
F_1 = 2\times\frac{P \times R}{P + R}.
\end{equation*}
Because this estimation adequately accommodates both extrema of
automatic annotation (too long and too short spans) and also penalizes
erroneous labels, we will rely on this measure throughout our
subsequent experiments.
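To make this computation concrete, the following minimal Python sketch
implements the soft precision, recall, and \F{}-measure defined above,
assuming that each span is represented as a set of token indices and
that the token overlap is normalized by the length of the predicted
span for precision and by the length of the gold span for recall:
\begin{verbatim}
def coverage(spans_a, spans_b):
    """Sum of |a & b| / |b| over all span pairs."""
    return sum(len(a & b) / len(b) for a in spans_a for b in spans_b)

def soft_prf(gold, pred):
    """Soft precision, recall, and F1 for two lists of token-index sets."""
    p = coverage(gold, pred) / len(pred) if pred else 0.0
    r = coverage(pred, gold) / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example: the gold sentiment span covers tokens 0-3, the predicted
# span covers tokens 2-5; both measures are penalized for the
# non-overlapping tokens.
print(soft_prf([{0, 1, 2, 3}], [{2, 3, 4, 5}]))  # -> (0.5, 0.5, 0.5)
\end{verbatim}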
\section{Data Preparation}\label{snt:fgsa:subsec:data}
In order to evaluate CRFs and RNNs on our dataset, we split the
complete corpus annotated by the second annotator, which we will
henceforth consider as gold standard in all subsequent experiments,
into three parts, using 70\% of it for training, 10\% as development
data, and the remaining 20\% as a test set. We tokenized all tweets
with the same adjusted version of Potts' tokenizer that we used
previously while creating the initial corpus files, and preprocessed
these microblogs with the rule-based normalization pipeline of
\citet{Sidarenka:13}. In this procedure, we:
\begin{itemize}
\item \emph{unified Twitter-specific phenomena} such as @-mentions,
hyperlinks, and e-mail addresses by replacing these entities with
special tokens that represented their semantic classes (\eg{}
``\%Username'' for @-mentions, ``\%URI'' for hyperlinks). We
removed these elements from the input if they were grammatically
independent of the rest of the tweet and did not play a potential
role in the expression of sentiments (\eg{} we stripped off all
retweet mentions and hyperlinks appearing at the very end of the
microblog if they were not preceded by a preposition). Furthermore,
we substituted all emoticons with special placeholders representing
their semantic orientation (\eg{} \smiley{} $\rightarrow$
``\%PosSmiley,'' \frownie{} $\rightarrow$ ``\%NegSmiley,''
\texttt{:-O} $\rightarrow$ ``\%Smiley''), and removed the hash sign
(\#) from all hashtags (\eg{} ``\#gl\"ucklich'' $\rightarrow$
``gl\"ucklich'');
\item in addition to this, we \emph{corrected frequent misspellings}
  (\eg{} ``zuguckn'' $\rightarrow$ ``zugucken'' [\emph{to watch}],
  ``Tach'' $\rightarrow$ ``Tag'' [\emph{day}]), using a set of
  manually defined heuristic rules;
\item and, finally, \emph{replaced frequent slang terms and
  abbreviations with their standard-language equivalents} (\eg{} ``n
  bissl'' $\rightarrow$ ``ein bisschen'' [\emph{a bit of}], ``iwie''
  $\rightarrow$ ``irgendwie'' [\emph{somehow}], ``nix'' $\rightarrow$
  ``nichts'' [\emph{nothing}]).  A simplified sketch of such
  normalization rules is shown after this list.
\end{itemize}
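The following Python snippet gives a simplified illustration of the
kind of substitution rules involved; the regular expressions and the
slang lookup table are hypothetical stand-ins for the considerably
more elaborate rule set of \citet{Sidarenka:13}:
\begin{verbatim}
import re

# Illustrative normalization rules (hypothetical regexes).
RULES = [
    (re.compile(r"@\w+"), "%Username"),          # @-mentions
    (re.compile(r"https?://\S+"), "%URI"),       # hyperlinks
    (re.compile(r"[:;]-?[)D]"), "%PosSmiley"),   # positive emoticons
    (re.compile(r"[:;]-?[(\[]"), "%NegSmiley"),  # negative emoticons
    (re.compile(r"#(\w+)"), r"\1"),              # strip the hash sign
]
SLANG = {"n bissl": "ein bisschen", "iwie": "irgendwie", "nix": "nichts"}

def normalize(tweet: str) -> str:
    for pattern, repl in RULES:
        tweet = pattern.sub(repl, tweet)
    for slang, standard in SLANG.items():
        tweet = tweet.replace(slang, standard)
    return tweet

print(normalize("@merkel iwie nix Neues :( #gluecklich"))
# -> "%Username irgendwie nichts Neues %NegSmiley gluecklich"
\end{verbatim}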
%% During the normalization, Twitter-specific phenomena like @-mentions,
%% retweets, and URIs that were not syntactically integrated in any
%% sentence of the message were removed from the tweets and those
%% elements which played an integral syntactic role were replaced with
%% the special artificial tokens \%User, \%Link etc. Emoticons like :-),
%% \smiley{}, \frownie{} etc. were also replaced with the placeholders
%% \%PosSmiley, \%NegSmiley, or simply \%Smiley depending on their prior
%% polarity. Furthemore, out-of-vocabulary words which could be
%% converted to in-vocabulary terms with a pre-defined set of
%% transformations were also normalized.
Afterwards, we labeled all normalized sentences with part-of-speech
tags using \textsc{TreeTagger}\footnote{In particular, we used
\textsc{TreeTagger} Version~3.2 with the German parameter file
UTF-8.}~\cite{Schmid:95}, and parsed them with the \textsc{Mate}
dependency parser\footnote{We used \textsc{Mate} Version \texttt{3.61}
with the German parameter model 3.6.}
\cite{Bohnet:13}.\footnote{The choice of these tools was motivated by
their better results in our evaluation study, which we conducted
while working on the normalization module \cite{Sidarenka:13}.}
Finally, since \texttt{MMAX2} did not provide straightforward support
for character offsets of annotated tokens, and because the
automatically tokenized data could disagree with the original corpus
tokenization, we aligned the manual annotation with the automatically
split words using the Needleman-Wunsch
algorithm~\cite{Needleman:70}.
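As an illustration of this alignment step, the following sketch
implements a toy token-level variant of the Needleman-Wunsch
algorithm; the scoring parameters are purely illustrative and do not
reflect the exact settings used in our pipeline:
\begin{verbatim}
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token sequences; returns (i, j) index pairs,
    with None marking a gap on either side."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    pairs, i, j = [], n, m          # traceback
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    return pairs[::-1]

gold = ["Der", "neue", "Papst", "gilt", "als", "bescheidener", "Typ", "."]
auto = ["Der", "neue", "Papst", "gilt", "als", "bescheidener", "Typ."]
print(needleman_wunsch(gold, auto))
\end{verbatim}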
\section{Conditional Random Fields}
The first method that we evaluated using the obtained data was
conditional random fields. First introduced by \citet{Lafferty:01},
CRFs rapidly grew in popularity, turning into one of the most widely
used probabilistic frameworks and dominating the NLP field for almost
a decade.
The main reasons for the success of this model are:
\begin{enumerate}[1)]
\item the \emph{structural nature} of CRFs, which, in contrast to
  single-entity classifiers such as logistic regression or SVM, make
  their predictions over structured input, trying to find the most
  likely label assignment to the whole structure (typically a chain or
  a tree) and not only to its individual elements;
\item the \emph{discriminative power} of this framework, which, in
contrast to generative probabilistic models such as HMMs
\cite{Rabiner:86}, optimizes conditional probability
$P(\boldsymbol{Y}|\boldsymbol{X})$ instead of joint distribution
$P(\boldsymbol{X},\boldsymbol{Y})$ and consequently can efficiently
deal with overlapping and correlated features;
%% \begin{example}[Overlapping and Correlated Features]
%% In order to demonstrate the different effects of correlated and
%% overlapping features on generative and discriminative models, let us
%% go through an example where we need to predict whether a tweet
%% mentioning ``Merkel'' and ``Steinmeier'' is about the Christian
%% Democratic Union (\texttt{CDU}) or Social Democratic Party of
%% Germany (\texttt{SPD}).
%% As features for this task, we will use lexical unigrams appearing in
%% the training data. Assuming that our training set consists of three
%% messages mentioning ``Merkel'' and one microblog mentioning
%% ``Steinmeier'' which are labeled as \texttt{CDU}, plus one tweet
%% mentioning ``Merkel'' and three posts mentioning ``Steinmeier''
%% which are annotated as \texttt{SPD}, the generative Na\"{i}ve Bayes
%% model would estimate the probability of the two competing classes
%% as:
%% \begin{align*}
%% P(\mathbf{x}, CDU) =& P(\textrm{Merkel},\textrm{Steinmeier}|CDU)\times P(CDU)\\
%% =& P(\textrm{Merkel}|CDU)\times P(\textrm{Steinmeier}|CDU) \times P(CDU)\\
%% =&\frac{3}{4}\times\frac{1}{4}\times\frac{4}{8}\approx 0.0938\\
%% P(\mathbf{x}, SPD) =& P(\textrm{Merkel},\textrm{Steinmeier}|SPD)\times P(SPD)\\
%% =& P(\textrm{Merkel}|SPD)\times P(\textrm{Steinmeier}|SPD) \times P(SPD)\\
%% =&\frac{1}{4}\times\frac{3}{4}\times\frac{4}{8}\approx 0.0938.\\
%% \end{align*}
%% After normalizing these probabilities, we would get equal 50\%
%% chances for each of the parties, which is fair regarding the token
%% distribution in our corpus. However, if we replace ``Merkel'' with
%% ``von der Leyen'' both in the training data and test example, and
%% rerun this experiment once again, the probability would get
%% significantly skewed towards the CDU class:
%% \begin{align*}
%% P(\mathbf{x}, CDU) =& P(\textrm{von},\textrm{der},\textrm{Leyen},\textrm{Steinmeier}|CDU)\times P(CDU)\\
%% =& P(\textrm{von}|CDU)\times P(\textrm{der}|CDU)\times P(\textrm{Leyen}|CDU)\\
%% &\times P(\textrm{Steinmeier}|CDU) \times P(CDU)\\
%% =&\frac{3}{4}\times\frac{3}{4}\times\frac{3}{4}\times\frac{1}{4}\times\frac{4}{8}\approx 0.0527\\
%% P(\mathbf{x}, SPD) =& P(\textrm{von},\textrm{der},\textrm{Leyen},\textrm{Steinmeier}|SPD)\times P(SPD)\\
%% =& P(\textrm{von}|SPD)\times P(\textrm{der}|SPD)\times P(\textrm{Leyen}|SPD)\\
%% &\times P(\textrm{Steinmeier}|SPD) \times P(SPD)\\
%% =&\frac{1}{4}\times\frac{1}{4}\times\frac{1}{4}\times\frac{3}{4}\times\frac{4}{8}\approx 0.0059,\\
%% \end{align*}
%% which, after normalization, would result in 90\% chances for
%% \texttt{CDU}, and a 10\% score for \texttt{SPD}, even though we only
%% changed the name of the politician.
%% A different situation can be observed with discriminative models
%% such as maximum entropy classifier: Instead of optimizing the joint
%% distribution $P(\mathbf{x}, y)$ as it is done in the generative
%% frameworks, discriminative systems seek to optimize the conditional
%% likelihood $P(y|\mathbf{x})$ by maximizing the total probability of
%% the training set $\sum_{i=1}^N\log P(y_i|\mathbf{x}_i, \mathbf{w})$.
%% This probability is usually estimated using the sigmoid function
%% $\frac{1}{1 + e^{-(\mathbf{x}_i, \mathbf{w})}}$, where
%% $\mathbf{x}_i$ denotes the input features of the $i$-th training
%% instance, and the vector $\mathbf{w}$ stands for the respective
%% weights of these features. By optimizing this function using
%% gradient descent, we will arrive at the optimal solution
%% $w_1 \approx 0.5$ for the feature ``Merkel'' and $w_2 \approx -0.5$
%% for the feature ``Steinmeier'' for the first example, which would
%% again result in equal 50\% chances for both classes. In the second
%% example, however, all three features ``von,'' ``der,'' and ``Leyen''
%% would get an equal weight of $\approx 0.3$, and the ``Steinmeier''
%% feature would receive a coefficient of $\approx -0.4$, which would
%% result in 60\% probability for the test message being about the CDU,
%% and 40\% that the tweet is about the SPD. Even though this still
%% means a slight skewness towards \texttt{CDU}; this time, the effect
%% of correlated features is much less dramatic than in the generative
%% case.
%% \end{example}
\item and, finally, the \emph{avoidance of the label bias problem},
  which other discriminative classifiers, such as maximum entropy
  Markov models~\cite{McCallum:00}, are known to be susceptible to.
\begin{example}[Label Bias Problem]
The label bias problem arises in the cases where a locally optimal
decision outweighs globally superior solutions. Consider, for
example, the sentence ``Aber gerade Erwachsene haben damit
Schwierigkeiten.'' (\textit{But especially adults have
difficulties with it.}), for which we need to compute the most
probable sequence of part-of-speech tags.
\begin{center}
\begin{tikzpicture}[node distance=5cm]
\tikzstyle{tag}=[circle split,draw=gray!50,%
minimum size=2.5em,inner ysep=2,inner xsep=0,%
circle split part fill={yellow!20,blue!30}]
\tikzstyle{word}=[draw=none,inner sep=10pt]
\node[word] (A) at (1, 1) {Aber};
\node[tag] (B) at (1, 3) {\footnotesize KON \nodepart{lower} 1.};
\node[word] (D) at (3, 1) {gerade};
\node[tag] (E) at (3, 2) {\footnotesize ADJA \nodepart{lower} .5};
\node[tag] (F) at (3, 4) {\footnotesize ADV \nodepart{lower} .5} ;
\node[word] (G) at (7, 1) {Erwachsene};
\node[tag] (I) at (7,2) {\footnotesize ADJA \nodepart{lower} .5} ;
\node[tag] (H) at (7,4) {\footnotesize NN \nodepart{lower} .5};
\node[word] (J) at (9,1) {haben};
\node[tag] (K) at (9,3) {\footnotesize VA \nodepart{lower}\small 1.};
\node[word] (J) at (11,1) {\ldots};
\path [-] (B) edge node[below] {$.5$} (E);
\path [-] (B) edge node[above] {$.5$} (F);
\path [-] (E) edge node[below] {$.3$} (I);
\path [-] (E) edge node[below left=0.4] {$.7$} (H);
\path [-] (F) edge node[above left=0.4] {$.8$} (I);
\path [-] (F) edge node[above] {$.2$} (H);
\path [-] (I) edge node[below] {$.1$} (K);
\path [-] (H) edge node[above] {$.9$} (K);
\end{tikzpicture}
\captionof{figure}{Example of a CRF graph}\label{fig:snt:memm-crf}
\end{center}
Using the feature weights shown in Figure~\ref{fig:snt:memm-crf}, we
will first estimate the probability of the correct label sequence
for the initial part of this sentence using the Maximum Entropy
Markov Model (MEMM)---the predecessor of the Conditional Random
Fields. According to the MEMM's definition, the probability of
correct labeling ($KON-ADV-NN-VA$) is equal to:
\begin{align*}
P(KON, ADV, NN, VA) &= P(KON)\times P(ADV|KON)\\
&\times P(NN|ADV)\times P(VA|NN)\\
&=\frac{\exp(1)}{\exp(1)}\times\frac{\exp(0.5 + 0.5)}{\exp(0.5 + 0.5) + \exp(0.5 + 0.5)}\\%
&\times\frac{\exp(0.2 + 0.5)}{\exp(0.2 + 0.5) + \exp(0.8 + 0.5)}\\
&\times\frac{\exp(0.9 + 1.)}{\exp(0.9 + 1.)} \approx 0.177
\end{align*}
At the same time, the probability of the wrong variant
($KON-ADV-ADJA-VA$) amounts to $\approx$ 0.323 and will therefore
be preferred by the automatic tagger.
A different situation is observed with CRFs, where the normalizing
factor in the denominator is computed over the whole input
sequence without factorizing into individual terms for each
transition as is done in the MEMM\@.  In this way, the probability of
the correct labeling amounts to:
\begin{align*}
P(KON, ADV, NN, VA) =& P(KON)\times
P(ADV|KON)\times P(NN|ADV)\\
&\times P(VA|NN)\\ =&\frac{\exp(1 + 0.5
\times 3 + 0.2 + 0.9 + 1)}{Z} \approx 0.252,
\end{align*}
where $Z = \exp(1 + 0.5 \times 3 + 0.2 + 0.9 + 1) + \exp(1 + 0.5
\times 3 + 0.8 + 0.1 + 1) + \exp(1 + 0.5 \times 3 + 0.7 + 0.9 + 1)
+ \exp(1 + 0.5 \times 3 + 0.3 + 0.1 + 1)$ is the total score of
all possible label assignments; the incorrect alternative
($KON-ADV-ADJA-VA$), however, will only get a probability score of
$\approx$ 0.207, which is less than the score of the correct labeling
(both figures are verified numerically in the short sketch following
this list).
\end{example}
\end{enumerate}
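For readers who would like to verify these figures, the following
short Python sketch recomputes the globally normalized CRF
probabilities of the example above directly from the node and edge
scores of the figure (ADJA1 and ADJA2 denote the ADJA readings of
``gerade'' and ``Erwachsene,'' respectively):
\begin{verbatim}
from math import exp

node = {"KON": 1.0, "ADV": 0.5, "ADJA1": 0.5, "ADJA2": 0.5,
        "NN": 0.5, "VA": 1.0}
edge = {("KON", "ADV"): 0.5, ("KON", "ADJA1"): 0.5,
        ("ADV", "NN"): 0.2, ("ADV", "ADJA2"): 0.8,
        ("ADJA1", "NN"): 0.7, ("ADJA1", "ADJA2"): 0.3,
        ("NN", "VA"): 0.9, ("ADJA2", "VA"): 0.1}

paths = [("KON", "ADV", "NN", "VA"), ("KON", "ADV", "ADJA2", "VA"),
         ("KON", "ADJA1", "NN", "VA"), ("KON", "ADJA1", "ADJA2", "VA")]

def path_score(p):
    return sum(node[t] for t in p) + sum(edge[a, b] for a, b in zip(p, p[1:]))

Z = sum(exp(path_score(p)) for p in paths)       # global CRF normalization
print(round(exp(path_score(paths[0])) / Z, 3))   # correct labeling  -> 0.252
print(round(exp(path_score(paths[1])) / Z, 3))   # wrong alternative -> 0.207
\end{verbatim}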
\paragraph{Training.}
CRFs have these useful properties due to a neatly formulated objective
function in which they seek to optimize the global log-likelihood of
gold labels $\mathbf{Y}$ conditioned on training data $\mathbf{X}$.
In particular, given a set of training instances $\mathcal{D} =
\{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^N$, where
$\mathbf{x}^{(n)}$ stands for the covariates of the $n$-th instance,
and $\mathbf{y}^{(n)}$ denotes its respective gold labels, CRFs try to
find feature coefficients $\mathbf{w}$ that maximize the
log-probabilities $\ell$ of $\mathbf{y}^{(n)}$ given
$\mathbf{x}^{(n)}$ over the whole corpus:
\begin{equation}\label{snt:fgsa:eq:crf-w}
\mathbf{w} = \argmax_{\mathbf{w}}\sum_{n=1}^N\ell
\left(\mathbf{y}^{(n)}|\mathbf{x}^{(n)}\right).
\end{equation}
The log-likelihood $\ell(\mathbf{y}^{(n)}|\mathbf{x}^{(n)})$ in this
equation is commonly estimated as the logarithm of the globally (\ie{}
w.r.t\@. the whole instance) normalized softmax function:
\begin{equation}\label{snt:fgsa:eq:crf-ell}
\ell\left(\mathbf{y}^{(n)}|\mathbf{x}^{(n)}\right) =
\ln\left(P(\mathbf{y}^{(n)}|\mathbf{x}^{(n)})\right) =
\ln\left(\frac{ \exp\left(\sum_{m=1}^{M}\sum_{j}w_{j} \cdot f_j(x_{m},
y_{m-1}, y_{m})\right)}{Z}\right),
\end{equation}
in which $M$ stands for the length of the $n$-th training example;
$f_j(x_{m}, y_{m-1}, y_{m})$ denotes the value of the $j$-th feature
function $f$ at position $m$; $w_j$ represents the corresponding
weight of this feature; and $Z$ is a normalization factor calculated
over all possible label assignments:
\begin{equation*}
Z \defeq
\sum_{y'\in\mathcal{Y},y''\in\mathcal{Y}}\exp\left(\sum_{m=1}^{M}\sum_{j}w_{j}
\cdot f_j(x_{m}, y'_{m-1}, y''_{m})\right).
\end{equation*}
Since this normalizing term appears in the denominator and couples
together all feature weights that need to be optimized, it becomes
prohibitively expensive to find the best solution to
Equation~\ref{snt:fgsa:eq:crf-w} analytically in a single shot.  A
possible remedy to this problem is to resort to iterative optimization
techniques, such as gradient descent, where feature weights are
successively updated in the direction of the gradient until the
objective converges.
From Equation~\ref{snt:fgsa:eq:crf-ell}, we can see that the partial
derivative of log-likelihood w.r.t\@. a single feature weight $w_j$ is:
\begin{equation*}
\frac{\partial}{\partial w_j}\ell =%
\sum_{n=1}^N\sum_{m=1}^{M}f_j(x_{m}, y_{m-1}, y_{m}) -%
\sum_{n=1}^N\sum_{m=1}^{M}\sum_{y'\in\mathcal{Y},y''\in\mathcal{Y}}f_j(x_{m},%
y'_{m-1}, y''_{m})P(y',y''|\mathbf{x}^{(n)}),
\end{equation*}
which, after dividing both sides of the equation by the constant term
$N$ (the size of the corpus), can be transformed into:
\begin{equation*}
\frac{1}{N}\frac{\partial}{\partial w_j}\ell = \E[f_j(\mathbf{x},
\mathbf{y})] - \E_{\mathbf{w}}[f_j(\mathbf{x}, \mathbf{y})],
\end{equation*}
where the first term ($\E[f_j(\mathbf{x}, \mathbf{y})]$) is the
expectation of feature $f_j$ under empirical distribution, and the
second term ($\E_{\mathbf{w}}[f_j(\mathbf{x}, \mathbf{y})]$) is the
same expectation under the model's parameters $\mathbf{w}$.  In other
words, the optimal solution to the log-likelihood objective in
Equation~\ref{snt:fgsa:eq:crf-ell} is achieved when the model's
expectation of the features matches their (true) empirical expectation
on the corpus.
The marginal probabilities of these features, which are required for
computing their expectations, can be estimated dynamically using the
forward-backward (FB) algorithm~\cite{Rabiner:90}, which is a
particular case of the more general belief-propagation
method~\cite[see][p.~81]{Barber:12}.
The only modification that one usually makes to
Equation~\ref{snt:fgsa:eq:crf-w} in practice, before applying it to
the provided training set, is the addition of so-called
\emph{regularization terms} (L1 and L2), which penalize excessively
high feature weights, thus preventing the model from overfitting the
training data, \ie{} we no longer seek feature weights that simply
maximize the probability of observed data, but we also want these
weights to be as small as possible:
\begin{equation}\label{snt:fgsa:eq:crf-w-regularization}
\mathbf{w} = \argmax_{\mathbf{w}}\sum_{n=1}^N\ell
\left(\mathbf{y}^{(n)}|\mathbf{x}^{(n)}\right) -
\lambda_1\lVert\mathbf{w}\rVert_1 - \lambda_2\lVert\mathbf{w}\rVert_2,
\end{equation}
where $\lambda_1$ and $\lambda_2$ are manually set hyper-parameters,
which control the amount of penalty that we want to impose on the L1
and L2 norms of the weights.
In our experiments, we also adopted this enhanced objective, picking
hyper-parameter values that yielded the best results on the held-out
development set. Furthermore, in order to reduce the noise that is
typically introduced by rare, sporadic features, we only optimized the
weights of features that occurred two or more times in the training
corpus, ignoring all singleton attributes from these data.
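Schematically, such a training configuration can be expressed, for
instance, with the \texttt{sklearn-crfsuite} wrapper around CRFsuite;
the snippet below is only an illustration of the regularized objective
and the minimum feature frequency described above, not a description
of the actual implementation used in our experiments:
\begin{verbatim}
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",   # gradient-based optimization of the log-likelihood
    c1=0.1,              # lambda_1: weight of the L1 penalty
    c2=0.1,              # lambda_2: weight of the L2 penalty
    min_freq=2,          # ignore singleton features, as described above
    max_iterations=200,
)
# X_train: list of sentences, each a list of per-token feature dicts;
# y_train: list of label sequences (e.g. "SNT", "SRC", "TRG", "NON").
# crf.fit(X_train, y_train)
\end{verbatim}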
\paragraph{Inference.}
Once optimal feature weights have been learned, one can efficiently
compute the most likely label assignment for a new instance by using
the Viterbi algorithm~\cite{Viterbi:67}, which
effectively corresponds to the forward pass of the FB method with the
summation over the alternative preceding states replaced by the
maximum operator (hence the other name for this algorithm,
``max-product'').
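A compact illustration of this decoding step (using toy per-token
state scores and a transition matrix instead of the full feature-based
potentials) might look as follows:
\begin{verbatim}
import numpy as np

def viterbi(state_scores, trans_scores):
    """Most likely label sequence for one sentence.

    state_scores: (T, L) array of per-token label scores;
    trans_scores: (L, L) array of transition scores (prev -> cur)."""
    T, L = state_scores.shape
    delta = np.zeros((T, L))            # best score ending in each label
    back = np.zeros((T, L), dtype=int)  # back-pointers for the traceback
    delta[0] = state_scores[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans_scores + state_scores[t]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy decoding: three tokens, labels 0=NON, 1=SNT (illustrative scores).
states = np.array([[2.0, 0.5], [0.1, 1.5], [0.2, 1.8]])
trans = np.array([[0.5, -0.5], [-0.5, 1.0]])
print(viterbi(states, trans))   # -> [0, 1, 1]
\end{verbatim}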
\paragraph{Features.}
A crucial component that accounts for a large part of the success (or
failure) of CRFs is the set of features that are provided to this
classifier as input.
Traditionally, feature functions in CRFs are divided into transition-
and state-based ones. Transition features represent real- or
binary-valued functions $f(\mathbf{x}, y'', y')\rightarrow\mathbb{R}$
associated with some data predicate
$\phi(\mathbf{x})\rightarrow\mathbb{R}$ and two labels $y''$
(typically the label of the previous token) and $y'$ (usually the
label of the current word). The value of this function at position
$m$ in sequence $\mathbf{x}$ is then defined as:
\begin{equation*}
f(\mathbf{x}_m, y'', y') = \begin{cases} \phi(\mathbf{x}_m), &
\mbox{if } \mathbf{y}_{m-1} = y''\mbox{ and }\mathbf{y}_{m} =
y'\\ 0, & \mbox{otherwise;}
\end{cases}
\end{equation*}
where predicate~$\phi$ usually represents a simple unit function:
$\phi(\mathbf{x}_m)\mapsto 1$, $\forall\mathbf{x}_m$.
In contrast to ternary transition features, state attributes are
typically associated with binary predicates, whose output depends on
the input data at the given position and label $y'$ at the respective
state:
\begin{equation*}
f(\mathbf{x}_m, y') = \begin{cases} \phi(\mathbf{x}_m), & \mbox{if }
\mathbf{y}_{m} = y'\\ 0, & \mbox{otherwise.}
\end{cases}
\end{equation*}
This time, predicate~$\phi$ is usually much more sophisticated and
reflects various properties of the input, such as whether the current
token is capitalized or whether it begins with a specific prefix or
ends with a certain suffix.  This type of feature commonly accounts
for the overwhelming majority of all attributes in CRFs.
As state attributes in our experiments, we used the following
features, which, for simplicity, are listed in groups:
\begin{itemize}
\item\emph{formal}, which included the initial three characters of
each token (\eg{} $\phi_{abc}(\mathbf{x}_m) = 1\mbox{ if
}\mathbf{x}_m\sim\mbox{ /\textasciicircum{}abc/ else } 0$), its last
three characters, and the spelling class of that word (\eg{}
alphanumeric, digit, or punctuation);
\item\emph{morphological}, which encompassed part-of-speech tags of
analyzed tokens, grammatical case and gender of inflectable PoS
types, degree of comparison for adjectives, as well as mood, tense,
and person forms for verbs;
\item\emph{lexical}, which comprised the actual lemma and form of the
  analyzed token (using one-hot encoding), as well as its polarity
  class (positive, negative, or neutral), which we obtained from the
  Zurich Polarity Lexicon~\cite{Clematide:10};
\item and, finally, \emph{syntactic} features, which reflected the
dependency relation via which token $x_m$ was connected to its
parent. In addition to this, we also used two binary attributes
that showed whether the previous token in the sentence was the
parent (first feature) or a child (second feature) of the current
word. Apart from that, we devised two more features, one of which
encoded the dependency relation of the previous token in the
sentence to its parent + the dependency relation of the current
token to its ancestor; another feature reflected the dependency link
of the next token + the dependency relation of the current token to
its parent.
\end{itemize}
Besides the above attributes, we also introduced a set of complex
\emph{lexico-syntactic} features, which simultaneously reflected
several semantic and syntactic traits (both feature groups are
illustrated schematically in the sketch after this list).  These were:
\begin{itemize}
\item the lemma of the syntactic parent;
\item the part-of-speech tag and polarity class of the grandparent in
the syntactic tree;
\item the lemma of the child node + the dependency relation between
the current token and its child;
\item the PoS tag of the child node + its dependency relation + the
PoS tag of the current token;
\item the lemma of the child node + its dependency relation + the
lemma of the current token;
\item the overall polarity of syntactic children, which was computed
by summing up the polarity scores of all immediate dependents, and
checking whether the resulting value was greater, less than, or
equal to zero.\footnote{We again used the Zurich Polarity Lexicon
of~\citet{Clematide:10} for computing these scores.}
\end{itemize}
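Schematically, the extraction of such state attributes for a single
token can be sketched as follows; the field names of the token
dictionaries are hypothetical and merely mimic the \textsc{TreeTagger}
and \textsc{Mate} output used in our experiments:
\begin{verbatim}
def token_features(sent, i):
    """Per-token state features; sent is a list of dicts with the keys
    form, lemma, pos, deprel, head (head == -1 for the root)."""
    tok = sent[i]
    feats = {
        # formal
        "init3": tok["form"][:3],
        "trail3": tok["form"][-3:],
        "is_digit": tok["form"].isdigit(),
        # morphological / lexical
        "pos": tok["pos"],
        "lemma": tok["lemma"],
        "polarity": tok.get("polarity", "neutral"),
        # syntactic
        "deprel": tok["deprel"],
    }
    if tok["head"] >= 0:              # lexico-syntactic: parent lemma
        feats["prntLemma"] = sent[tok["head"]]["lemma"]
    if i > 0:                         # previous deprel + current deprel
        feats["prev_deprel|deprel"] = sent[i - 1]["deprel"] + "|" + tok["deprel"]
    return feats
\end{verbatim}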
\paragraph{Results.}
The results of our experiments are shown in
Table~\ref{snt-fgsa:tbl:crf-res}.  As we can see from the table, with
the given set of features, the CRF can fit the training data almost
perfectly, achieving a macro-averaged \F-score of~0.904.  This model,
however, can only partially generalize to unseen messages, where its
macro-\F{} reaches merely~0.287, despite the fact that the training
corpus is more than 3.5 times the size of the test set (5,616 versus
1,584 tweets).
%% There are 5,616 training tweets, 792 development and 1,584 test microblogs.
%% ./scripts/cmp_features --min-cnt=2 lingtmp/sentiment/crf/train/ lingtmp/sentiment/crf/devtest/
%% Features in first dataset: 34597
%% Features in second dataset: 28579
%% Common features: 11317
%% Total features: 51859
%% ./scripts/cmp_features --min-cnt=2 lingtmp/sentiment/crf/train/ lingtmp/sentiment/crf/test/
%% Features in first dataset: 34597
%% Features in second dataset: 49626
%% Common features: 15440
%% Total features: 68783
%% Another notable tendency that can be observed both on the training and
%% test sets, is that the recall of the CRF system is generally lower
%% than its precision. This again can be attributed to the high variance
%% of the classifier.
\begin{table*}
\begin{center}
\bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
\begin{tabular}{p{0.162\columnwidth} % first columm
*{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
*{1}{>{\centering\arraybackslash}p{0.136\columnwidth}}} % last two columns
\toprule
\multirow{2}*{\bfseries Data Set} & \multicolumn{3}{c}{\bfseries Sentiment} & %
\multicolumn{3}{c}{\bfseries Source} & %
\multicolumn{3}{c}{\bfseries Target} & %
\multirow{2}{0.136\columnwidth}{\bfseries\centering Macro\newline \F{}}\\\cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}
& Precision & Recall & \F{} & %
Precision & Recall & \F{} & %
Precision & Recall & \F{} &\\\midrule
Training Set & 0.949 & 0.908 & 0.928 & 0.903 & 0.87 & 0.886 & %
0.933 & 0.865 & 0.898 & 0.904\\
Test Set & 0.37 & 0.28 & 0.319 & 0.305 & 0.244 & 0.271 & 0.304 & %
0.244 & 0.271 & 0.287\\\bottomrule
\end{tabular}
\egroup{}
\caption{Results of fine-grained sentiment analysis with the
first-order linear-chain CRFs}\label{snt-fgsa:tbl:crf-res}
\end{center}
\end{table*}
\subsection{Feature Analysis}
To estimate the effect of different features on the net results of the
CRF system, we performed an ablation test, removing one group of state
attributes at a time and rechecking the performance of the model on
the development data.
\begin{table}[hbt]
\begin{center}
\bgroup\setlength\tabcolsep{0.47\tabcolsep}\scriptsize
\begin{tabular}{p{0.14\columnwidth} % first columm
*{6}{>{\centering\arraybackslash}p{0.13\columnwidth}}} % next five columns
\toprule
\multirow{2}{0.2\columnwidth}{\bfseries Element} &
\multirow{2}{0.1\columnwidth}{\bfseries Original\newline \F-Score} &
\multicolumn{5}{c}{\bfseries \F-Score after Feature Removal}\\\cline{3-7}
& & Formal & Morphological & Lexical & Syntactic & Complex\\\midrule
Sentiment & 0.346 & 0.343\negdelta{0.003} & 0.344\negdelta{0.002} & 0.326\negdelta{0.02} & 0.345\negdelta{0.001} & 0.324\negdelta{0.022}\\
Source & 0.309 & 0.321\posdelta{0.012} & 0.313\posdelta{0.004} & 0.265\negdelta{0.044} & 0.359\posdelta{0.05} & 0.271\negdelta{0.038}\\
Target & 0.26 & 0.282\posdelta{0.022} & 0.252\negdelta{0.008} & 0.263\posdelta{0.003} & 0.233\negdelta{0.027} & 0.263\posdelta{0.003}\\\bottomrule
\end{tabular}
\egroup{}
\caption[Results of the feature ablation tests for the CRF
model]{Results of the feature ablation tests for the CRF
model\\{\small\itshape (negative changes w.r.t\@. the original
scores on the development set are shown in
\textsuperscript{\textcolor{red3}{red}}; positive changes are
depicted in \textsuperscript{\textcolor{seagreen}{green}}
superscripts)\footnotemark}}\label{tbl:ablation}
\end{center}
\end{table}
As we can see from the results in
Table~\ref{tbl:ablation},\footnotetext{Negative changes indicate good
features in this context, since their removal leads to a degradation
of the results.} all feature groups are useful for predicting
\markable{sentiment}s, as their removal leads to a degradation of its
scores. This quality drop, however, is usually quite small,
suggesting that other features can easily make up for the removed
attributes. A different situation is observed with \markable{source}s
and \markable{target}s though. In the former case, removing formal,
morphological, and syntactic features shows a strong positive effect,
improving the \F{}-scores for \markable{source}s by up to five
percentage points.  Removing lexical and lexico-syntactic features, on
the contrary, worsens these results, pulling the \F-measure down by
4.4 percentage points.
Except for the formal group, all these attributes behave completely
differently when applied to \markable{target}s, which benefit from
morphological and syntactic features, but apparently get confused by
lexical and complex attributes.
\begin{table}[hbt]
\begin{center}
\bgroup\setlength\tabcolsep{0.47\tabcolsep}\scriptsize
\begin{tabular}{%
>{\centering\arraybackslash}p{0.045\columnwidth} % first columm
>{\centering\arraybackslash}p{0.3\columnwidth} % second columm
>{\centering\arraybackslash}p{0.1\columnwidth} % third columm
>{\centering\arraybackslash}p{0.23\columnwidth} % fourth columm
>{\centering\arraybackslash}p{0.1\columnwidth}} % next four columns
\toprule
\multirow{2}{0.2\columnwidth}{Rank} &
\multicolumn{2}{c}{\bfseries State Features} &
\multicolumn{2}{c}{\bfseries Transition Features}\\\cmidrule(lr){2-3}\cmidrule(lr){4-5}
& Feature & Score & Feature & Score\\\midrule
1 & prntLemma=meiste $\rightarrow$ TRG & 18.68 & NON $\rightarrow$ TRG & -7.01\\
2 & prntLemma=rettungsschirme $\rightarrow$ TRG & 18.3 & NON $\rightarrow$ SRC & -6.85\\
3 & initChar=sty $\rightarrow$ NON & -16.04 & NON $\rightarrow$ SNT & -5.39\\
4 & form=meisten $\rightarrow$ NON & 15.99 & TRG $\rightarrow$ SRC & -2.99\\
5 & prntLemma=urlauberin $\rightarrow$ SNT & 14.74 & NON $\rightarrow$ NON & 2.69\\
6 & lemma=anfechten $\rightarrow$ SNT & 14.07 & SRC $\rightarrow$ NON & -2.59\\
7 & form=thomasoppermann $\rightarrow$ TRG & 13.44 & SNT $\rightarrow$ SNT & 2.54\\
8 & form=bezeichnete $\rightarrow$ SNT & 13.25 & TRG $\rightarrow$ TRG & 2.31\\
9 & deprel[0]|deprel[1]=NK|AMS $\rightarrow$ NON & 12.92 & SRC $\rightarrow$ SRC & 2.19\\
10 & trailChar=te. $\rightarrow$ NON & 12.77 & SRC $\rightarrow$ TRG & -2.07\\\bottomrule
\end{tabular}
\egroup{}
\caption[Top-10 state and transition features learned by the CRF
model]{Top-10 state and transition features learned by the CRF
model\\{\small (sorted by the absolute values of their
weights)}}\label{fgsa:tbl:ablation}
\end{center}
\end{table}
In order to get a better insight into the learned model's parameters,
we additionally extracted top-ten state and transition features,
ranked by the absolute values of their weights. As we can see from
the statistics in Table~\ref{fgsa:tbl:ablation}, three of the five
top-ranked state attributes (``meiste'' [\emph{most}],
``rettungsschirme'' [\emph{bailout}], and ``urlauberin''
[\emph{holidaymaker}]) are complex features that reflect the lemma of
the syntactic parent.  Another common group of features is the lemma
and form of the current token: here, we again encounter the word
``meisten'' (\emph{most}), which, however, indicates the absence of
any sentiments this time, and we can also see two other attributes
(``anfechten'' [\emph{doubt}] and ``bezeichnete''
[\emph{called}]) that represent the so-called \emph{direct speech
events} and correlate with \markable{sentiment}s.  The remaining
feature (``thomasoppermann'') is a person name, which frequently
appears as the \markable{target} of sentiments in our corpus.
An interesting pattern can be observed with transition features: As we
can see from the results, the top three of these attributes indicate
a strong belief that an objective token is very unlikely to be
followed by a \markable{target}, \markable{source}, or
\markable{sentiment} tag (hence the high negative weights of
transitions emanating from \textsc{NON}). It is, however, quite
common that a \textsc{NON} tag will precede another \textsc{NON} (as
we can see from line 5 of the table). Other transitions also mainly
reflect plausible regularities: It is, for instance, uncommon that a
\markable{target} of an opinion will appear immediately before a
source (\textsc{TRG}$\rightarrow$\textsc{SRC} $= -2.99$); in the same
vein, it is fairly improbable that an \textsc{SRC} tag will precede a
\markable{TRG} element (\textsc{SRC}$\rightarrow$\textsc{TRG} $=
-2.07$); nonetheless, it is perfectly acceptable that the same tag
will continue over multiple words (\eg{}
\textsc{SNT}$\rightarrow$\textsc{SNT} $= 2.54$,
\textsc{TRG}$\rightarrow$\textsc{TRG} $= 2.31$).
In order to better understand the reason for the observed overfitting
of the weights to the training data, we also compared all features
that appeared in the training set with the attributes that occurred in
the test part of the corpus.  As it turned out, more than two thirds
of all unique test features (34,186 out of 49,626) had never been
observed during training and consequently had no meaningful model
weights.
%% This figure was generated using the iPython notebook `notebooks/cgsa.ipynb`.
\begin{figure*}[bht]
{
\centering
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=\linewidth]{img/fgsa_lambda1.png}
\caption{$\lambda_1$}\label{fgsa:fig:crf-lambda1}
\end{subfigure}%
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=\linewidth]{img/fgsa_lambda2.png}
\caption{$\lambda_2$}\label{fgsa:fig:crf-lambda2}
\end{subfigure}
}
\caption[CRF results for different regularization values]{Results of
the linear-chain CRFs with different values of regularization
parameters}\label{fgsa:fig:crf-regularization}
\end{figure*}
Another factor that could significantly affect the generalization of
the CRF system was the regularization parameters $\lambda_1$ and
$\lambda_2$, which controlled the amount of penalty imposed on
excessively large learned feature weights (see
Equation~\ref{snt:fgsa:eq:crf-w-regularization}). Because we chose
these parameters based on the model's results on the held-out
development data, a possible reason for rather low scores on the test
set could be a considerable difference between the distribution of
\markable{sentiment}s, \markable{source}s, and \markable{target}s in
the development and test parts of the corpus. To see whether it
indeed was the case, we recomputed the \F{}-scores on the development
and test data, using different $\lambda$ values, and present the
results of this computation in
Figure~\ref{fgsa:fig:crf-regularization}.  As is evident from the
figure, the model's \F-measure on the development set largely
correlates with its performance on the test corpus and almost
monotonically decreases with larger $\lambda$s.
\subsection{Error Analysis}
Besides looking into the model's parameters, we also decided to
analyze some errors made by the CRF system in order to understand the
reasons for its misclassifications.
\begin{example}[An Error Made by the CRF System]\label{snt:fgsa:exmp:crf-error-1}
%% TARGET O TARGET überall
%% TARGET O TARGET npd
%% TARGET O TARGET plakat
%% SENTIMENT O SENTIMENT %negsmiley
%% LABELS = {
%% 0: TARGET
%% 1: O
%% 2: SENTIMENT
%% 3: SOURCE
%% }
%% state[0][0] = -2.816934
%% state[0][1] = 6.514906
%% state[0][2] = -3.183343
%% state[0][3] = -3.923175
%% state[1][0] = 7.019282
%% state[1][1] = 5.362580
%% state[1][2] = -0.696150
%% state[1][3] = 1.651658
%% state[2][0] = 3.253026
%% state[2][1] = 0.130018
%% state[2][2] = -1.490348
%% state[2][3] = -5.405246
%% state[3][0] = -5.770071
%% state[3][1] = -1.818361
%% state[3][2] = 2.438365
%% state[3][3] = -9.148613
\noindent\textup{\bfseries\textcolor{darkred}{Gold Labels:}} {\upshape \"Uberall/TRG NPD/TRG Plakate/TRG \%NegSmiley/SNT}\\
\noindent Everywhere/TRG NPD/TRG posters/TRG \%NegSmiley/SNT\\[\exampleSep]
\noindent\textup{\bfseries\textcolor{darkred}{Predicted Labels:}} {\upshape \"Uberall/NON NPD/NON Plakate/NON \%NegSmiley/NON}\\
\noindent Everywhere/NON NPD/NON posters/NON \%NegSmiley/NON\\[\exampleSep]
\end{example}
One such error is shown in Example~\ref{snt:fgsa:exmp:crf-error-1}.
In this case, the classifier has erroneously overlooked a negative
emoticon, which expresses the author's attitude towards election
posters of the National Democratic Party of Germany (NPD), and
assigned the NON
(none) tags to all tokens of the tweet. As it turns out, despite this
incorrect assignment, the state potentials of the smiley still achieve
their highest scores with the correct SNT (\markable{sentiment}) tag.
Moreover, the state scores of the word ``Plakate'' (\emph{posters})
also reach their maximum value (3.25 in the logarithmic domain) with
the correct TRG (\markable{target}) label.  Unfortunately, these good
guesses of single tags are overruled by the extremely high score of
the NON label (6.515) that is assigned to the first word of this
message (``\"uberall'' [\emph{everywhere}]) and is reinforced by the
transition features, which prefer contiguous runs of \textsc{NON}s.
This kind of mistake is by far the most common type of error that we
have observed on the development set, followed by spans with different
boundaries and invalid label sequences similar to the one shown in
Example~\ref{snt:fgsa:exmp:crf-error-2}, where the classifier assigned
only SNT tags to all input tokens, although a \markable{sentiment} in
our original corpus annotation could only appear in the presence of a
\markable{target} element.
\begin{example}[An Error Made by the CRF System]\label{snt:fgsa:exmp:crf-error-2}
%% O SENTIMENT O so
%% O SENTIMENT O müssen
%% O O O die
%% O O O sein
%% O O O \%possmiley
%% O O O piraten+
%% LABELS = {
%% 0: TARGET
%% 1: O
%% 2: SENTIMENT
%% 3: SOURCE
%% }
%% state[0][0] = -3.038067
%% state[0][1] = 1.681227
%% state[0][2] = -0.927401
%% state[0][3] = -13.950086
%% state[1][0] = -6.345182
%% state[1][1] = -5.990096
%% state[1][2] = 1.520121
%% state[1][3] = -11.317471
%% state[2][0] = 1.443866
%% state[2][1] = -1.640298
%% state[2][2] = -1.497256
%% state[2][3] = -6.206925
%% state[3][0] = 4.883781
%% state[3][1] = 6.219811
%% state[3][2] = 5.258309
%% state[3][3] = -6.819373
%% state[4][0] = -3.866373
%% state[4][1] = -0.408577
%% state[4][2] = -0.374299
%% state[4][3] = -3.834682
%% state[5][0] = -7.603441
%% state[5][1] = -1.632574
%% state[5][2] = -5.785819
%% state[5][3] = -8.896403
%% SENTIMENT:0.905768
%% SENTIMENT:0.922549
%% O:0.291830
%% O:0.713491
%% O:0.864184
%% O:0.951727
\noindent\textup{\bfseries\textcolor{darkred}{Gold Labels:}}
{\upshape So/SNT muss/SNT das/SNT sein/SNT
\%PosSmiley/SNT piraten+/TRG}\\
\noindent That/SNT 's/SNT the/SNT way/SNT how/SNT it/SNT 's/SNT\\
supposed/SNT to/SNT be/SNT \%PosSmiley/SNT
piraten+/TRG\\[\exampleSep]
\noindent\textup{\bfseries\textcolor{darkred}{Predicted Labels:}}
{\upshape So/SNT muss/SNT das/SNT sein/SNT
\%PosSmiley/SNT piraten+/SNT}\\
\noindent That/SNT 's/SNT the/SNT way/SNT how/SNT it/SNT 's/SNT\\
supposed/SNT to/SNT be/SNT \%PosSmiley/SNT
piraten+/SNT\\[\exampleSep]
\end{example}
\section{Recurrent Neural Networks}
A competitive alternative to CRFs is deep recurrent neural networks
(RNNs).  Introduced in the nineties~\cite{Hochreiter:97}, RNNs have
become one of the most popular trends in the raging tsunami of deep
learning applications, demonstrating superior results on many
important NLP tasks including part-of-speech
tagging~\cite{Wang:15:pos}, dependency parsing~\cite{Kiperwasser:16a},
and machine
translation~\cite{Kalchbrenner:13,Bahdanau:14,Sutskever:14}. Key
factors that account for this success are
\begin{enumerate}[1)]
\item \emph{the ability of RNN systems to learn optimal feature
representations automatically}, which favorably sets them apart from
traditional supervised machine-learning frameworks, such as SVMs or
CRFs, where all features need to be defined by the user; and
\item \emph{the ability to deal with arbitrary sequence lengths},
which advantageously distinguishes these methods from other NN
architectures, such as plain feed-forward networks or convolutional
systems without pooling, where the size of the input layer has to be
constant.
\end{enumerate}
The main component that underlies any modern RNN approach is a
fixed-size hidden vector $\vec{h}$, which is recurrently updated
during the analysis of an input sequence $\mathbf{x}$ and is meant to
encode the meaning of that sequence. The general form of this vector
at input state $t$ is usually defined as:
\begin{align*}
\vec{h}^{(t)} = f(\vec{h}^{(t-1)}, \mathbf{x}^{(t)});
\end{align*}
where $f$ represents some non-linear transformation function,
$\vec{h}^{(t-1)}$ denotes the state of the hidden vector at the
previous time step, and $\mathbf{x}^{(t)}$ is the input vector at
position $t$.
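A minimal numerical sketch of this recurrent update, assuming a
$\tanh$ non-linearity and randomly initialized weights, is shown
below:
\begin{verbatim}
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One recurrent update h_t = f(h_{t-1}, x_t) with a tanh non-linearity."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
d_h, d_x = 4, 3                       # hidden and input dimensions (toy sizes)
W_h = rng.normal(size=(d_h, d_h))     # recurrent weights
W_x = rng.normal(size=(d_h, d_x))     # input weights
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)): # run over a toy sequence of length 5
    h = rnn_step(h, x_t, W_h, W_x, b)
print(h)                              # final hidden state encoding the sequence
\end{verbatim}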
\paragraph{LSTM.}
A fundamental problem that arises from the above definition is that
the gradients of the model's parameters rapidly vanish to zero or
explode to infinity (depending on whether the repeatedly multiplied
recurrent derivatives are less or greater than one in magnitude) as
the length of the input sequence increases.  In order to solve this
issue, \citet{Hochreiter:97}
proposed the long short-term memory mechanism (LSTM), in which they
explicitly incorporated the goal of keeping the gradients within an
appropriate range. In particular, given an input sequence