The Shape of Word Embeddings: Quantifying Non-Isometry With Topological Data Analysis

Scripts used for the analysis in the paper "The Shape of Word Embeddings: Quantifying Non-Isometry With Topological Data Analysis" by Ondřej Draganov and Steven Skiena.

Most of the code is in Python, with shell scripts connecting the various Python scripts together. The final analysis is in a Jupyter Notebook.

The code is split into three parts:

A-word_embeddings_to_persistence_diagrams
B-from_persistence_diagrams_to_language_distances
C-tree_construction_and_evaluation

Each folder has its own README file with more details.

To run the analysis, follow the numbered files:

1-download_word_embeddings.sh
2-point_clouds_to_persistence_bars.sh
3-compute_language_distances.sh
4-construct_and_evaluate_trees.ipynb
5-summary_figures.ipynb
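
The run order above can be sketched in Python as a small planner. This is only an illustration, not part of the repository: `plan_commands` is a hypothetical helper, and it assumes that each numbered file exists exactly once somewhere under the repository root and is meant to be run from its own folder. Notebooks are executed here with `jupyter nbconvert --execute`; opening them interactively works just as well.

```python
from pathlib import Path

# The numbered files, in the order given above.
STEPS = [
    "1-download_word_embeddings.sh",
    "2-point_clouds_to_persistence_bars.sh",
    "3-compute_language_distances.sh",
    "4-construct_and_evaluate_trees.ipynb",
    "5-summary_figures.ipynb",
]


def plan_commands(repo_root):
    """Return (working_dir, command) pairs for the numbered files, in order."""
    repo_root = Path(repo_root)
    plan = []
    for name in STEPS:
        # Assumption: the file is located by name, wherever it sits in the repo.
        matches = sorted(repo_root.rglob(name))
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one {name}, found {len(matches)}"
            )
        path = matches[0]
        if path.suffix == ".sh":
            cmd = ["bash", path.name]
        else:
            # Execute the notebook and write the outputs back into the same file.
            cmd = ["jupyter", "nbconvert", "--to", "notebook",
                   "--execute", "--inplace", path.name]
        plan.append((path.parent, cmd))
    return plan
```

Each pair can then be passed to `subprocess.run(cmd, cwd=working_dir, check=True)`, so that the run stops at the first failing step. Remember to copy the data between steps before starting the next one.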

The folder structure is already set up in the repository, with the folders left empty. Between steps, copy the contents of each folder into the folder with the same name in the next step. Alternatively, edit the paths in the scripts to avoid duplicating data.
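
A minimal sketch of the copy step, assuming each part keeps its data in direct subfolders and that a folder counts as shared when a folder of the same name exists in both parts. The helper `copy_matching_folders` is hypothetical and not part of the repository; the actual data folder names are described in the per-folder README files.

```python
import shutil
from pathlib import Path


def copy_matching_folders(src_stage, dst_stage):
    """Copy the contents of each subfolder of src_stage into the
    same-named subfolder of dst_stage. Returns the copied folder names."""
    src_stage, dst_stage = Path(src_stage), Path(dst_stage)
    copied = []
    for src_dir in sorted(p for p in src_stage.iterdir() if p.is_dir()):
        if src_dir.name.startswith("."):
            continue  # skip hidden folders such as .ipynb_checkpoints
        dst_dir = dst_stage / src_dir.name
        if not dst_dir.is_dir():
            continue  # only folders that exist in both parts are copied
        # dirs_exist_ok (Python 3.8+) merges into the existing folder.
        shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
        copied.append(src_dir.name)
    return copied
```

For example, `copy_matching_folders("A-word_embeddings_to_persistence_diagrams", "B-from_persistence_diagrams_to_language_distances")` would carry the outputs of the first part over to the second. Existing files with the same name in the destination are overwritten.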

Questions

If you have any questions, do not hesitate to contact me (see the GitHub profile) or open an issue in the repository.

Cite

To cite the code, please cite the paper:

Ondřej Draganov and Steven Skiena. 2024. The Shape of Word Embeddings: Quantifying Non-Isometry with Topological Data Analysis. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12080–12099, Miami, Florida, USA. Association for Computational Linguistics.

BibTeX:

@inproceedings{draganov-skiena-2024-shape,
    title = {The Shape of Word Embeddings: Quantifying Non-Isometry with Topological Data Analysis},
    author = {Draganov, Ond{\v{r}}ej and Skiena, Steven},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
    month = {11},
    year = {2024},
    address = {Miami, Florida, USA},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/2024.findings-emnlp.705},
    pages = {12080--12099}
}

OnDraganov/shape-of-word-embeddings
