Improving methods to learn word representations
===============================================
for efficient semantic similarities computations
================================================
ABOUT
This repository contains the PhD thesis of Julien Tissier, entitled
"Improving methods to learn word representations for efficient
semantic similarities computations".
It also contains all the source materials used to produce the thesis,
including the LaTeX .tex source files, the images and the source files
used to generate or modify them (either LibreOffice Draw files or
Python code), and the slides of the PhD defense.
CONTENT
This repository is composed of:
- chapters/: this folder contains all the chapters of the thesis, as
".tex" source files. There are 10 chapters (from 00-introduction.tex
to 09-software.tex), a cover page (000-garde.tex) and the
bibliography (99-bibliography.bib).
- images/: this folder contains all the images used in the thesis
(i.e. those included with the \includegraphics{} command in the .tex
files), either as PNG or PDF.
- images-code/: this folder contains the Python code used to generate
some of the plots and illustrations of the thesis with the Matplotlib
library.
- images-src/: this folder contains the source files of some of the
illustrations used in the thesis, as LibreOffice Draw files
(.odg).
- PhD-Defense-Julien-Tissier.pdf: the defense presentation as PDF, 48
slides.
- PhD-Thesis-Julien-Tissier.pdf: the thesis as PDF, 127 pages.
- makefile: used to generate the thesis from source files. Use the
command `make` at the root of this repository to produce it. You
will need the following tools: make, pdflatex and bibtex.
- phd-thesis.tex: the main .tex file, which declares the LaTeX packages
to use and the chapters to include.
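Before running `make`, it can help to confirm that the three required
tools are installed. The snippet below is a minimal sketch of such a
check and is not part of this repository; the function name
`missing_tools` is hypothetical, and only the tool names come from the
list above.

```python
import shutil

# Tool names taken from the makefile entry above.
REQUIRED_TOOLS = ("make", "pdflatex", "bibtex")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All build tools found; run `make` at the repository root.")
```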
SUMMARY
Many natural language processing applications rely on word embeddings
(also called word representations) to achieve state-of-the-art results.
These numerical representations of the language should encode both
syntactic and semantic information to perform well in downstream tasks.
However, common models (word2vec, GloVe) learn them from generic corpora
such as Wikipedia, so the resulting embeddings lack specific semantic
information. Moreover, storing them requires a large amount of memory
because the number of representations to save can be in the order of a
million.
The topic of my thesis is to develop new learning algorithms that both
improve the semantic information encoded within the representations
and reduce the memory space they require for storage and for use in
NLP tasks.
The first part of my work is to improve the semantic information
contained in word embeddings. I developed dict2vec, a model that uses
additional information from online lexical dictionaries when learning
word representations. The dict2vec word embeddings perform ∼15% better
than the embeddings learned by other models on word semantic
similarity tasks.
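As an illustration of what a word semantic similarity task measures,
the sketch below scores word pairs with cosine similarity, a common
choice for comparing embeddings. It does not reproduce the dict2vec
evaluation: the three-dimensional vectors are made up, and the helper
name `cosine_similarity` is an assumption. In such tasks the model
scores are usually compared with human similarity ratings, for example
through a rank correlation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration (not taken from dict2vec).
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(embeddings["car"], embeddings["automobile"])
sim_far = cosine_similarity(embeddings["car"], embeddings["banana"])

# Related words should receive a higher score than unrelated ones.
print(f"car / automobile: {sim_close:.3f}")
print(f"car / banana:     {sim_far:.3f}")
```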
The second part of my work is to reduce the memory size of the
embeddings. I developed an architecture based on an autoencoder to
transform commonly used real-valued embeddings into binary embeddings,
reducing their size in memory by 97% with a loss of only ∼2% in accuracy
in downstream NLP tasks.
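The arithmetic behind such a reduction can be sketched as follows. The
figures used here are assumptions chosen for illustration, not values
stated in this README: one million words, 300-dimensional vectors
stored as 32-bit floats, and 256-bit binary codes. Under these
assumptions the saving comes out close to the 97% quoted above.

```python
def embedding_bytes(n_vectors, bits_per_vector):
    """Memory needed to store n_vectors of bits_per_vector bits each."""
    return n_vectors * bits_per_vector // 8


# All four values below are illustrative assumptions.
N_WORDS = 1_000_000   # "in the order of a million" representations
REAL_DIM = 300        # dimensions of a real-valued embedding
FLOAT_BITS = 32       # bits per real-valued component
BINARY_BITS = 256     # bits per binary embedding

real_bytes = embedding_bytes(N_WORDS, REAL_DIM * FLOAT_BITS)
binary_bytes = embedding_bytes(N_WORDS, BINARY_BITS)
reduction = 1 - binary_bytes / real_bytes

print(f"real-valued: {real_bytes / 1e9:.2f} GB")   # 1.20 GB
print(f"binary:      {binary_bytes / 1e6:.0f} MB")  # 32 MB
print(f"reduction:   {reduction:.1%}")              # 97.3%
```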
AUTHOR
Written by Julien Tissier <[email protected]>.
COPYRIGHT
This thesis and all the files in this repository are licensed under the
"Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Public License". By using or downloading this repository, you agree to
the following terms:
1. NonCommercial - You may not use the material for commercial
purposes.
2. Attribution - You must give appropriate credit, provide a link to
the licensor, and indicate if changes were made. You may do so in
any reasonable manner, but not in any way that suggests the
licensor endorses you or your use.
3. ShareAlike - If you remix, transform, or build upon the material,
you must distribute your contributions under the same license as the
original.
4. No additional restrictions - You may not apply legal terms or
technological measures that legally restrict others from doing
anything the license permits.
For more details, see https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode