\documentclass[11pt,a4paper]{article}
\usepackage[hyperref]{acl2018}
\usepackage{times}
\usepackage{latexsym}
\usepackage{natbib}
\usepackage{url}
%% Title information
\title{Entity Discovery and Linking} %% [Short Title] is optional;
%% when present, will be used in
%% header instead of Full Title.
%\subtitle{Subtitle} %% \subtitle is optional
%% Author information
%% Each author should be introduced by \author, followed by
%% \affiliation, and \email.
%% Author with single affiliation.
%\author{Johannes Ahnlide}
%\affiliation{
% \institution{F13, Lund University, Sweden} %% \institution is required
%}
%\email{[email protected]} %% \email is recommended
%
%\author{Dwight Lidman}
%\affiliation{
% \institution{D13, Lund University, Sweden} %% \institution is required
%}
%\email{[email protected]} %% \email is recommended
\aclfinalcopy{}
\author{Johannes Ahnlide \\
Lund University\\
{\tt [email protected]} \\\And
Dwight Lidman \\
Lund University\\
{\tt [email protected]} \\}
\date{\today}
%\setcopyright{none}
%% Bibliography style
%\bibliographystyle{ACM-Reference-Format}
\usepackage[acronym]{glossaries}
\usepackage{booktabs}
%\makeglossaries
\pagenumbering{gobble}
\newcommand{\namper}{NAM-PER}
\newcommand{\nomper}{NOM-PER}
\newcommand{\linking}{strong typed all match}
\newcommand{\mention}{strong typed mention match}
\begin{document}
\newacronym{edl}{EDL}{Entity Discovery and Linking}
\newacronym{md}{MD}{Mention Detection}
\newacronym{ner}{NER}{Named Entity Recognition}
\newacronym{el}{EL}{Entity Linking}
\newacronym{bilstm}{BiLSTM}{Bidirectional Long Short-Term Memory}
\newacronym{lstm}{LSTM}{Long Short-Term Memory}
\newacronym{crf}{CRF}{Conditional Random Fields}
\newacronym{oov}{OOV}{Out of Vocabulary}
\newacronym{pos}{POS}{Part of Speech}
\newacronym{gui}{GUI}{Graphical User Interface}
\newacronym{tedl}{TEDL}{Trilingual Entity Discovery and Linking}
\newacronym{tac}{TAC}{Text Analysis Conference}
\newacronym{kbp}{KBP}{Knowledge Base Population}
\newacronym{nam}{NAM}{Named Entity}
\newacronym{nom}{NOM}{Nominal Entity}
\newacronym{per}{PER}{Person}
\newacronym{gpe}{GPE}{Geopolitical Entity}
\newacronym{org}{ORG}{Organization}
\newacronym{loc}{LOC}{Location}
\newacronym{fac}{FAC}{Facility}
\newacronym{ttl}{TTL}{Title}
\maketitle
\begin{abstract}
In this paper we outline an attempt to build a system for \acrfull{tedl} based on a paper by \citet{YangThe2017}.
\acrlong{edl} is the problem of finding and identifying names of people, organizations, locations and so on (i.e. \textit{entities}) in arbitrary unstructured bodies of text.
The solution to this problem can be divided into two distinct parts: \acrfull{md} -- the process of detecting and capturing sequences of words as references to an entity, and \acrfull{el} -- the \textit{linking} of a detected mention in the text to an existing entity. This latter step requires disambiguating the detected mention among a list of candidate entities based on some set of heuristics.
Our model for detecting entity mentions is based on a deep neural network consisting of two \acrfull{bilstm} layers. The entity linking is accomplished by creating a mapping from words to Wikipedia articles and generating a list of candidates from that mapping for each detected entity.
We also describe a graphical interface for demonstrating our system where news articles are fetched and annotated with entity classifications and Wikipedia links.
We evaluate the system using neleval \cite{Nothman2014Neleval} on data from the \acrfull{tac} 2017 evaluation and achieve F1 scores of 0.692 in \mention{} and 0.494 in \linking{}.
Finally, as this is a work in progress, we discuss some planned improvements for the future of this project.
\end{abstract}
\section{Introduction}
%Entity discovery and linking
\acrfull{edl} is the process of finding mentions of entities in a text and linking these entities to the corresponding entry in a knowledge base. \acrfull{md} and \acrfull{el} are the primary constituent steps of this process and in this paper we will describe the mechanisms by which our system performs these steps, both as separate processes and as a unified process (EDL).
In this project we attempt to reproduce the results achieved by \citet{YangThe2017}, using the methods outlined in their paper.
Their system, the \textit{TAI system}, was submitted to the \acrfull{tedl} track at \acrfull{tac} 2017. %\acrfull{kbp} % TODO
Our course of action for developing this system has been to reproduce the model from \citet{YangThe2017} iteratively, evaluating the model and adding features at each step.
We implemented the majority of our system in Python using Keras to develop the neural network \citep{Francois2015Keras}.
%\textit{Something about reproducing the model in iterative steps, comparing the performance as more features and layers are added?}
\section{Mention Detection}
In reproducing the TAI system we reduce the task of mention detection to detecting named (\acrshort{nam}) and nominal entities (\acrshort{nom}) in text and classifying them into six predetermined categories, namely: \acrfull{per}, \acrfull{org}, \acrfull{gpe}, \acrfull{loc}, \acrfull{fac} and \acrfull{ttl}.
As a first step in our iteration we leverage readily available features from CoreNLP together with word embeddings to evaluate baseline performance in a basic reconstruction of the network described in \citet{YangThe2017}.
CoreNLP provides us with basic \acrfull{ner} and \acrfull{pos} tagging and we use GloVe as a pretrained embedding for English \cite{ManningTheToolkit,PenningtonGloVe:Representation}.
\subsection{Features}
We considered the following features, both as per \citet{YangThe2017} and as heuristics of our own, arrived at through trial and error.
\subsubsection{Word form}
Based on term frequency in our data set, we rank word forms and include only the most frequent ones in order to reduce noise arising from infrequent terms.
We then match each word with its corresponding GloVe vector while replacing \acrshort{oov} instances with a specific \acrshort{oov} vector.
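To make this concrete, the sketch below shows one way to build such a frequency-ranked vocabulary and look up GloVe vectors with a dedicated \acrshort{oov} fallback; the function and variable names are illustrative rather than taken from our implementation.
<< wordform_sketch, engine = "python", eval = FALSE >>=
# Sketch: frequency-ranked vocabulary with a
# dedicated OOV fallback vector (illustrative).
from collections import Counter
import numpy as np

def build_vocab(tokens, max_size):
    # Keep only the most frequent word forms.
    counts = Counter(t.lower() for t in tokens)
    return {w for w, _ in counts.most_common(max_size)}

def embed_tokens(tokens, glove, dim, vocab):
    oov = np.zeros(dim)  # shared OOV vector
    return np.array([
        glove.get(t.lower(), oov)
        if t.lower() in vocab else oov
        for t in tokens])
@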
\subsubsection{Part of speech}
Using CoreNLP, we predict \acrshort{pos} tags and use these predictions as features.
\subsubsection{Preemptive entity type}
In a similar manner, we use CoreNLP to detect entity boundaries and a preemptive entity type and use these as features.
\subsubsection{Capitalization signifier}
In English and Spanish, the first letter of a name is capitalized. Furthermore, acronyms of organizations and the like are often written in uppercase. For that reason, we use a Boolean signifier, which is true if the word contains an uppercase letter, as a feature.
\subsubsection{Special tokens}
We include a separate feature for ``special'' tokens in English, like prepositions and possessive signifiers -- as well as often recurring words like \textit{president} -- in order to give greater weight to these tokens during training.
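A minimal sketch of these two untrainable features follows; the list of special tokens is a small illustrative sample, not our full list.
<< bool_features_sketch, engine = "python", eval = FALSE >>=
# Sketch of the two Boolean features
# (the special-token list is illustrative).
SPECIAL_TOKENS = {"of", "the", "'s", "president"}

def capital_feature(token):
    # 1 if the token contains an uppercase letter.
    return int(any(c.isupper() for c in token))

def special_feature(token):
    # 1 if the token is on the special list.
    return int(token.lower() in SPECIAL_TOKENS)
@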
\subsection{Network Architecture}
We use the terms \textit{input layer}, \textit{representative layer} and \textit{output layer} to help give a high-level description of our architecture.
Figure~\ref{fig:neural_net} shows an overview of the network where each node in the graph represents a layer in the network.
The input layer consists of three trainable embeddings representing the word form, the \acrshort{pos} and the named entity type. In addition to the embeddings, there are two untrainable input layers representing word capitalization and special tokens, respectively.
The embeddings are concatenated with the untrainable input layers and fed into the representative layer.
The model is centered around the representative layer which consists of two consecutive \acrfull{bilstm} layers.
The concatenated input is routed through a dropout layer with a dropout rate of 0.25 which in turn feeds into the \acrshort{bilstm} layers.
The output from the first \acrshort{bilstm} layer and the output from the previously mentioned dropout layer are concatenated to form the input for the second \acrshort{bilstm} layer.
To speed up training we used CuDNNLSTM, available in Keras, as the implementation of the \acrshort{bilstm}s. We chose the output dimensionality of each \acrshort{bilstm} to be 100.
Our output layer consists of a simple dense layer which receives the concatenated output from the representative layer and is activated by a softmax function.
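The sketch below wires the layers described above together in Keras; vocabulary sizes, embedding dimensions and the size of the output tag set are placeholders rather than our exact configuration.
<< architecture_sketch, engine = "python", eval = FALSE >>=
# Sketch of the network in Keras. Sizes are
# placeholders; the form embedding would be
# initialized with GloVe weights in practice.
from keras.layers import (Input, Embedding,
    Concatenate, Dropout, Bidirectional,
    CuDNNLSTM, Dense)
from keras.models import Model

n_classes = 13  # placeholder tag-set size

form_in = Input(shape=(None,))
pos_in = Input(shape=(None,))
ne_in = Input(shape=(None,))
cap_in = Input(shape=(None, 1))
spec_in = Input(shape=(None, 1))

form_emb = Embedding(20000, 100)(form_in)
pos_emb = Embedding(50, 10)(pos_in)
ne_emb = Embedding(10, 10)(ne_in)

x = Concatenate()(
    [form_emb, pos_emb, ne_emb, cap_in, spec_in])
x = Dropout(0.25)(x)
h1 = Bidirectional(
    CuDNNLSTM(100, return_sequences=True))(x)
h2_in = Concatenate()([x, h1])
h2 = Bidirectional(
    CuDNNLSTM(100, return_sequences=True))(h2_in)
out = Dense(n_classes, activation="softmax")(h2)

model = Model(
    [form_in, pos_in, ne_in, cap_in, spec_in], out)
model.compile(optimizer="adam",
              loss="categorical_crossentropy")
@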
<< neural_net, engine = "dot", fig.lp="fig:", fig.cap = "A high-level representation of our implemented neural network.", cache=TRUE, echo=FALSE >>=
digraph G {
concentrate=True;
rankdir=TB;
node [shape=record];
140648102684096 [label="Form: Input"];
140648102683032 [label="POS: Input"];
140648179719976 [label="NE: Input"];
140648102685328 [label="Embedding"];
140648102684432 [label="Embedding"];
140648179720032 [label="Embedding"];
140648245566880 [label="Capital: Input"];
140648102684264 [label="Special: Input"];
140648102684376 [label="Concatenate"];
140648093942784 [label="Dropout"];
140648093678000 [label="Bidirectional(CuDNNLSTM)"];
140648094104656 [label="Concatenate"];
140647917305696 [label="Bidirectional(CuDNNLSTM)"];
140647916844200 [label="Dense"];
140648102684096 -> 140648102685328;
140648102683032 -> 140648102684432;
140648179719976 -> 140648179720032;
140648102685328 -> 140648102684376;
140648102684432 -> 140648102684376;
140648179720032 -> 140648102684376;
140648245566880 -> 140648102684376;
140648102684264 -> 140648102684376;
140648102684376 -> 140648093942784;
140648093942784 -> 140648093678000;
140648093942784 -> 140648094104656;
140648093678000 -> 140648094104656;
140648094104656 -> 140647917305696;
140647917305696 -> 140647916844200;
}
@
%\begin{figure}
%\centering
%\includegraphics[width=0.5\textwidth]{neural_net2.png}
%\caption{A high-level representation of our implemented neural network.}
%\label{fig:neural_net}
%\end{figure}
\subsection{Training}
We trained the model on data from the \acrshort{tedl} task from previous years (2015 and 2016) and used early stopping to prevent overfitting.
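In Keras this amounts to adding an early-stopping callback to the training call; the patience, batch size and number of epochs below are placeholders.
<< training_sketch, engine = "python", eval = FALSE >>=
# Sketch: training with early stopping
# (hyperparameters are placeholders).
from keras.callbacks import EarlyStopping

stop = EarlyStopping(monitor="val_loss", patience=3)
# x_train / x_dev are lists of the five inputs.
model.fit(x_train, y_train,
          validation_data=(x_dev, y_dev),
          batch_size=32, epochs=50,
          callbacks=[stop])
@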
\subsection{XML Data}
The TAC dataset includes some annotated structured data. The \texttt{author} attribute of XML \texttt{<post>} tags is one example; its values are annotated as names of persons (\namper{}). To capture these we added a simple regular expression.
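A sketch of such a pattern is shown below; the exact expression in our system may differ, and the position of the attribute in the \texttt{<post>} tag is assumed.
<< xml_author_sketch, engine = "python", eval = FALSE >>=
# Sketch: capture post authors in the XML
# discussion-forum files as NAM-PER mentions.
import re

AUTHOR = re.compile(r'<post\s+author="([^"]+)"')

def author_mentions(xml_text):
    # Return (start, end, text) for each
    # author attribute value.
    return [(m.start(1), m.end(1), m.group(1))
            for m in AUTHOR.finditer(xml_text)]
@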
\section{Entity Linking}
We created a mapping from words to Wikipedia pages by iterating over all anchors in all Wikipedia pages in the respective languages. For all anchors sharing the same text, the most frequent target page was selected as the target entity in the dictionary.
Due to the size of Wikipedia we used Apache Spark to perform the processing and wrote that part of the code in Scala to maximize performance on Spark. In the implementation we also used Docria, which implements an input specification for Map-Reduce jobs \cite{Zaharia2016ApacheSpark,Klang2018Docria}.
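The core of the dictionary construction can be sketched in plain Python as follows; the actual job is written in Scala and runs on Spark over Docria documents.
<< link_dict_sketch, engine = "python", eval = FALSE >>=
# Sketch: map each anchor text to its most
# frequent target page (plain-Python version
# of the logic; the real job runs on Spark).
from collections import Counter, defaultdict

def build_link_dict(anchor_pairs):
    # anchor_pairs: iterable of
    # (anchor_text, target_page) tuples.
    counts = defaultdict(Counter)
    for text, target in anchor_pairs:
        counts[text.lower()][target] += 1
    return {text: targets.most_common(1)[0][0]
            for text, targets in counts.items()}
@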
%\subsection{Candidate Generation}
%\subsection{Candidate Ranking}
\section{Evaluation}
We evaluated the system's performance by applying the neleval \texttt{tac2016} evaluation script to our system output against the TAC 2017 gold standard. %TODO: ref?
The results of this evaluation for some chosen metrics are presented in tables~\ref{tab:mention} and~\ref{tab:linkning}.
\begin{table}[]
\centering
\begin{tabular}{lrrr}
\toprule
Language & Precision & Recall & F1 \\
\midrule
All & 0.844 & 0.586 & 0.692 \\
ENG & 0.832 & 0.689 & 0.754 \\
SPA & 0.864 & 0.564 & 0.682 \\
CMN & 0.832 & 0.529 & 0.647 \\
\bottomrule
\end{tabular}
\caption{The mention detection (\mention) results from running the neleval script on the output from our system.}
\label{tab:mention}
\end{table}
\begin{table}[]
\centering
\begin{tabular}{lrrr}
\toprule
Language & Precision & Recall & F1 \\
\midrule
All & 0.602 & 0.418 & 0.494 \\
ENG & 0.551 & 0.456 & 0.499 \\
SPA & 0.666 & 0.434 & 0.526 \\
CMN & 0.596 & 0.379 & 0.463 \\
\bottomrule
\end{tabular}
\caption{The linking (\linking) results from running the neleval script on the output from our system.}
\label{tab:linkning}
\end{table}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{plot_tmm.png}
\caption{Scatter plot of the \textit{strong typed mention match} from the results of all submitted runs for the TEDL task at TAC-KBP 2017. The red and green dots signify our results and the Tencent AI team's results, respectively. The contour lines show the F1-score.}
\label{fig:tmm}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{plot_tam.png}
\caption{Scatter plot of the \textit{strong typed all match} from the results of all submitted runs for the TEDL task at TAC-KBP 2017. The red and green dots signify our results and the Tencent AI team's results, respectively. The contour lines show the F1-score.}
\label{fig:tam}
\end{figure}
In figures~\ref{fig:tmm} and~\ref{fig:tam}
we have plotted our own results as well as the results of all the systems that participated in the \acrshort{tedl} task at TAC-KBP 2017.
As can be seen in these plots, our system performs comparatively well against the other entries.
\section{Graphical User Interface}
\begin{figure*}
\centering
\includegraphics[width=\textwidth]{news_ui3.png}
\caption{The web-based \acrshort{gui} showing the result of running the network on the description of a news article acquired from \href{https://newsapi.org}{newsapi.org}.}
\label{fig:gui}
\end{figure*}
We developed a web-based \acrfull{gui}, shown in Figure~\ref{fig:gui}, using Flask for the backend and \href{https://elm-lang.org}{Elm} for the frontend. The \acrshort{gui} fetches descriptions of news articles and feeds them to the neural network and the entity linker. Each entity that is successfully linked becomes a link to Wikipedia. When the user hovers over such an entity, a summary is shown including its Wikidata label, an image and a short description.
The news information is fetched from \href{https://newsapi.org}{newsapi.org} and the summary data is acquired using the Wikidata SPARQL Query Service.
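At its core, the backend exposes a single endpoint that runs the pipeline on a piece of text. The sketch below illustrates its shape, where the route name and the helper functions \texttt{detect\_mentions} and \texttt{link\_entity} are hypothetical wrappers around our model and the link dictionary.
<< gui_backend_sketch, engine = "python", eval = FALSE >>=
# Sketch of the Flask backend; detect_mentions
# and link_entity are hypothetical wrappers
# around the model and the link dictionary.
from flask import Flask, request, jsonify

app = Flask(__name__)

def annotate():
    text = request.get_json()["text"]
    mentions = detect_mentions(text)
    for m in mentions:
        m["wikipedia"] = link_entity(m["text"])
    return jsonify(mentions)

app.add_url_rule("/annotate", "annotate",
                 annotate, methods=["POST"])
@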
\section{Future work}
Our initial goal was to investigate whether the results were reproducible and, additionally, to add support for further languages while seeking similar performance. As we underestimated the workload, we did not have time to extend the system described in \citet{YangThe2017}; we therefore leave extensions of the system as future work.
\subsection{Mention Detection}
In the future we intend to replace our dense output layer with a \acrfull{crf} layer as per \citet{YangThe2017} because we expect it to yield significant performance increases in \acrshort{ner} and classification.
For Chinese we currently use CoreNLP for word segmentation, but this might introduce errors since there are no word delimiters in Chinese. One way to avoid this problem is to work directly on the character sequence. When we attempted this, the performance of our system decreased; however, unlike \citet{YangThe2017}, we did not add the CoreNLP word boundaries as a feature.
In order to speed up training while simultaneously reducing the network's dependence on external tools and giving us greater control over our embeddings, we intend to pre-train a separate network to replace the features for which we currently depend on CoreNLP.
A possible improvement to our word embeddings would be to train a dedicated \textit{unknown}-word embedding and to concatenate character-derived embeddings with the word embeddings in order to resolve \acrshort{oov} problems, as described by \citet{Lample2016NeuralRecognition}.
\subsection{Entity Linking}
\citet{YangThe2017}
used LambdaMART ranking with several handcrafted features, such as the number of languages a page exists in on Wikipedia and a metric based on PageRank over Wikipedia, to achieve better linking. We would like to implement this kind of ranking and at least a few of these features \cite{Burges2010FromOverview}.
\subsection{Graphical User Interface}
To make the \acrshort{gui} more usable we would like to optimize the speed as well as the memory footprint of the server.
To achieve this we will look into switching to another data structure for containing the mapping from entity name to Wikidata entity number.
If fast enough performance can be achieved we would like to create an interface where the user can write text and \acrshort{edl} is performed in real time as the text is typed.
Another feature that we plan to introduce is the annotation of websites. The user would provide a URL and select a language, and the \acrshort{gui} would render the website annotated, as in the current interface, using the predictions of our system.
\section{Conclusion}
Reproducing the performance of the system built by \citet{YangThe2017} proved to be a difficult task, in part due to critical details being left out of the paper, our lack of access to the same training data, and our relative inexperience in working with machine learning models.
Despite these difficulties, we still managed to achieve relatively good results.
Our system is particularly good at \acrlong{md} while our results in \acrlong{el} are still unsatisfactory.
Our intention for the future is to reach feature parity with the TAI system and to extend the language support beyond the current trilingual domain.
Some major features that we predict would yield significant improvements to our \acrfull{md} are the previously mentioned character-to-word embeddings as per \citet{Lample2016NeuralRecognition}, and a \acrshort{crf} layer.
A major feature we would like to add to our graphical interface is an option to annotate any site or article given a URL and the language it is written in.
%% Acknowledgments
\section*{Acknowledgments}
We would like to thank Pierre Nugues and Marcus Klang for their guidance and support throughout the project.
A special thanks to Marcus for creating Docria and providing us with the data we needed in accessible formats.
%% Bibliography
\bibliography{ref}
\bibliographystyle{acl_natbib}
\end{document}