jkitchin/litdb

litdb - a literature and document database

./litdb.png

litdb concept

litdb is a tool to help you curate and use your collection of scientific literature. You use it to collect and search papers. You can use it to collect older articles, and to keep up with newer articles. litdb uses https://openalex.org for searching the scientific literature, and https://turso.tech/libsql to store results in a local database.

The idea is you add papers to your database, and then you can search it with natural language queries, and interact with it via an ollama GPT application. It will show you the papers that best match your query. You can read those papers, get bibtex entries for them, or add new papers based on the references, papers that cite that paper, or related papers. You can also set up filters that you update when you want to get new papers created since the last time you checked.

videos

  1. https://www.youtube.com/live/e-J3Bh2Uti4 Introduction to litdb
  2. https://www.youtube.com/live/teW68WogulU local files (volume is very low for some reason)
  3. https://youtube.com/live/3LltpiiQaR8 CrossRef, reviewer suggestions, COA
  4. https://youtube.com/live/ZkKKuvVUWkE litdb and Emacs
  5. https://youtube.com/live/j7rItPwWDaY litdb and Jupyter Lab
  6. https://youtube.com/live/SUtvtc7l6y0 litdb + GPT enhancements

installation

litdb is on PyPI.

pip install litdb

To get the cutting edge package, you can install it directly from GitHub.

pip install git+https://github.com/jkitchin/litdb

configuration

You have to create a toml configuration file. This file is called litdb.toml. The directory this file is in is considered the root directory. All commands will start in the current working directory and look up to find this file. You can put this file in your home directory, or you can have sub-directories, e.g. a per project litdb.

There are a few choices you have to make. You have to choose a SentenceTransformer model, and specify the size of the vectors it makes. You also have to specify the chunk_size and chunk_overlap settings that are used to break documents up to compute document level embedding vectors.

You will need an OpenAlex premium key if you want to use the update-filters feature.

[embedding]
model = 'all-MiniLM-L6-v2'
cross-encoder = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
chunk_size = 1000
chunk_overlap = 200


[openalex]
email = "you@example.com"
api_key = "..."

[gpt]
model = "llama2"

You can define an environment variable that points to the root of your default litdb project. This should be a directory with a litdb.toml file in it.

export LITDB_ROOT="/path/to/your/default/litdb"

When you run a litdb command, it will look for a dominating litdb.toml file, which means you are running the command in a litdb project. If one is not found, it will check for the LITDB_ROOT environment variable and use that if it is found. Finally, if that does not exist, it will prompt you to make a new project in the current directory.
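The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code litdb actually runs; the function name find_litdb_root and its arguments are made up for this example.

```python
import os
from pathlib import Path


def find_litdb_root(start, environ=None):
    """Return the litdb project directory for `start`, or None.

    Hypothetical helper that mirrors the lookup order in the text.
    """
    environ = os.environ if environ is None else environ
    start = Path(start).resolve()

    # 1. Look for a dominating litdb.toml: the starting directory first,
    #    then each parent up to the filesystem root.
    for directory in (start, *start.parents):
        if (directory / "litdb.toml").is_file():
            return directory

    # 2. Fall back to LITDB_ROOT if it points at a directory with litdb.toml.
    fallback = environ.get("LITDB_ROOT")
    if fallback and (Path(fallback) / "litdb.toml").is_file():
        return Path(fallback).resolve()

    # 3. Nothing found: this is where litdb would prompt for a new project.
    return None
```

Calling find_litdb_root(".") from inside a project sub-directory should return the directory that holds litdb.toml.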

Using litdb

Your litdb starts out empty. You have to add articles that are relevant to you. The best way to build a litdb is an open question, and the answer surely depends on what your aim is. You have to compromise between breadth and depth given the database size. The CLI makes it pretty easy to do this.

litdb has a CLI with an entry command of litdb and subcommands (like git) for interacting with it. You can see all the options with this command.

litdb --help

Searching the web

You have to start somewhere. You can use this to open a search in OpenAlex.

litdb web query

You can also open searches with these options:

option                  source
-g, --google            Google
-gs, --google-scholar   Google Scholar
-ar, --arxiv            Arxiv
-pm, --pubmed           Pubmed
-cr, --chemrxiv         ChemRxiv
-br, --biorxiv          BioRxiv
-a, --all               All

You can find starting points this way.

Fine-tuned search in OpenAlex

This is a default query in OpenAlex. It does not change your litdb, it just does a simple text search query on works.

litdb openalex query

You can get more specific with a filter:

litdb openalex -f 'author.orcid:https://orcid.org/0000-0003-2625-9232'

You can also search other endpoints and use filters. Here we perform a search on Sources for display_names that contain the word discovery.

litdb openalex -e sources -f display_name.search:discovery

One-time additions of articles to litdb

You add an article by its DOI. There are optional arguments to also add references, citing and related articles.

litdb add doi --references --citing --related

To add an author, use their orcid. You can use litdb author-search firstname lastname to find an orcid for a person.

litdb add orcid

To add entries from a bibtex file, use the path to the file.

litdb add /path/to/bibtex.bib

You can provide more than one source and even mix them like this.

litdb add doi1 doi2 orcid

These are all one-time additions.

Adding filters

litdb provides several convenient ways to add queries to update your litdb in the future.

Follow an author

To get new papers by an author, you can follow them.

litdb follow orcid

Watch a query

litdb watch "filter to query"

Citations on a paper

litdb citing doi

Related papers

litdb related doi

A custom filter

A filter is used in OpenAlex to search for relevant articles. Here is an example of adding a filter for articles in the journal Digital Discovery. This doesn’t add any entries directly, it simply stores the filter in the database. The main difference of this vs the watch command above is the explicit description.

litdb add-filter "primary_location.source.id:https://openalex.org/S4210202120" -d "Digital Discovery"

Managing and updating the filters

You can get a list of your filters like this.

litdb list-filters

You can update the filters like this.

litdb update-filters

This adds papers that have been created since the last time you ran the filter. You need an OpenAlex premium API key for this. This will update the last_updated field.

You can remove a filter like this:

litdb rm-filter "filter-string"

Review your litdb

I find it helpful to review my litdb regularly. To get a list of articles added in the last week, you can run this command.

litdb review -s "1 week ago"

This works best when you update your litdb regularly. You might want to redirect that into a file so you can review it in an editor of your choice.

Searching litdb

There are several search options.

vector search

The main way litdb was designed to be searched is with natural language queries. The way this works is your query is converted to a vector using SentenceTransformers, and then a vector search identifies entries in the database that are similar to your query.

litdb vsearch "natural language query" 

The default number of entries returned is 3. You can change that with an optional argument:

litdb vsearch "natural language query" -n 5

There is an iterative version of vsearch called isearch. This finds the closest entries, then downloads the citations, references and related entries for each one, and repeats the query until you tell it to stop, or it doesn’t find any new results.

litdb isearch "some query"

full text search

There is a full text search (full on the text in litdb) available. The command looks like this.

litdb fulltext "query"

See https://sqlite.org/fts5.html for information on what the query might look like. The search is done with this SQL command:

select source, text from fulltext where text match ? order by rank

The default number of entries returned is 3. You can change that with an optional argument:

litdb fulltext "natural language query" -n 5
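To see how the SQL above behaves, here is a small self-contained sketch using Python's built-in sqlite3 module and a throwaway in-memory FTS5 table. The table layout, the sample rows, and the fulltext_search helper are assumptions for illustration. It also assumes your Python's SQLite was built with FTS5, which is common but not guaranteed.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Assumed layout: an FTS5 virtual table with the two columns used in the query.
con.execute("create virtual table fulltext using fts5(source, text)")
con.executemany(
    "insert into fulltext(source, text) values (?, ?)",
    [
        ("doi:a", "Machine learning for catalysis and surface science"),
        ("doi:b", "Automated laboratories for soft materials"),
        ("doi:c", "Catalysis on alloy surfaces"),
    ],
)


def fulltext_search(con, query, n=3):
    """Return up to n (source, text) rows matching an FTS5 query, best first."""
    sql = (
        "select source, text from fulltext "
        "where text match ? order by rank limit ?"
    )
    return con.execute(sql, (query, n)).fetchall()


# FTS5 queries support boolean operators such as AND, OR and NOT.
hits = fulltext_search(con, "catalysis AND learning")
for source, text in hits:
    print(source, text)
```

The query string follows the FTS5 syntax linked above, so it is a keyword expression rather than a natural language sentence.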

hybrid search

Vector and full text search have complementary strengths and weaknesses. We combine them in the hybrid-search subcommand. This performs two searches on two different queries, and combines them with a unified score that is used to rank all the matches. This ensures you get some results that match the full text search, and some that match the vector search. It is worth trying if you aren’t finding what you want by vector or text search alone.

litdb hybrid-search "vector query" "text query"
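This README does not spell out how the unified score is computed. One common way to merge two ranked lists is reciprocal rank fusion, sketched below. This is a swapped-in technique to illustrate the idea, not necessarily what hybrid-search does; the function name and the constant k=60 are assumptions.

```python
def reciprocal_rank_fusion(vector_hits, text_hits, k=60):
    """Merge two ranked lists of sources into one ranking.

    Each list contributes 1 / (k + rank) to a source's score, so a source
    found by both searches outranks one found by only a single search.
    """
    scores = {}
    for hits in (vector_hits, text_hits):
        for rank, source in enumerate(hits, start=1):
            scores[source] = scores.get(source, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda s: (-scores[s], s))


merged = reciprocal_rank_fusion(["a", "b", "c"], ["b", "d"])
print(merged)
```

Here "b" appears in both lists, so it ends up first even though it was not the top vector hit.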

ollama GPT

You can use litdb as a RAG source for ollama. This looks up the three most related papers to your query, and uses them as context in a prompt to ollama (with the llama2 model). I find this quite slow (it can be minutes to generate a response on an old Intel Mac). I also find it makes up things like references, and that it is usually necessary to actually read the three papers. The three papers come from the same vector search described above.

litdb gpt "what is the state of the art in automated laboratories for soft materials"

search with audio

This command will record audio, transcribe that audio to text, and then do a vector search on that text. You will be prompted when the recording starts, and you press return to stop it. litdb will show you what it heard, and ask if you want to do a vector search on it.

litdb audio -p

I haven’t found the transcription to be that good on technical scientific terms. This is a proof of concept capability.

Note that you need to install these libraries for this feature to work:

pyaudio, playsound, SpeechRecognition

These are not trivial to install, and pyaudio relies on external libraries like portaudio that may not be easy to install. These are currently commented out in pyproject.toml because of these difficulties.

search from a screenshot

You can copy a screenshot to the clipboard, and then use OCR to extract text from it, and do a vector search on that text.

litdb screenshot

If you can copy and paste text, you should do that instead. This is helpful to get text from images, or pdfs where the text is stored in an image, maybe from videos, or screen share from online meetings, etc.

Eventually, if images get integrated into litdb, this is also an entry point for image searches.

Tagging entries

litdb supports tagging entries so you can group them. To tag a source with tag1 and tag2, use this syntax.

litdb add-tag source -t tag1 -t tag2

You can remove tags like this.

litdb rm-tag source -t tag1 -t tag2

You can delete a tag from the database.

litdb delete-tag tag1

To see all the tags do this.

litdb list-tags

To see entries with a tag:

litdb show-tag tag1

You can use this to export tagged entries into bibtex entries like this.

litdb show-tag workflow -f '{{ source }}' | litdb bibtex

Exporting entries

You can use these commands to export bibtex entries or citation strings.

Get a bibtex entry

This command will try to generate a bibtex entry for entries in your litdb.

litdb bibtex doi1 doi2

The output can be redirected to a file.

You can also use a search like this and pipe the output to litdb bibtex.

litdb vsearch "machine learning in catalysis" -f "{{ source }}" | litdb bibtex

Get a citation string

This command will output a citation for the sources. It is mostly a convenience function. There is not currently a way to customize the citation.

litdb citation doi1 doi2

You can also use a search like this and pipe the output to litdb citation.

litdb vsearch "machine learning in catalysis" -f "{{ source }}" | litdb citation

Find free pdfs

You can use litdb to find freely available PDFs via https://unpaywall.org/.

litdb unpaywall doi

These do not always work, and sometimes you get a version from arxiv or pubmed.

Low-level interaction with litdb

litdb is just a sqlite database (although you need to use the libsql executable for vector search). There is a CLI way to run a sql command. For example, to find all entries with a null bibtex field and their types use a query like this.

litdb sql "select source, json_extract(extra, '$.type'), json_extract(extra, '$.bibtex') as bt from sources where bt is null"

You might also use this for very specific queries. For example, here I search the citation strings for my name.

litdb sql "select source, json_extract(extra, '$.citation') as citation from sources where citation like '%kitchin%'"
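The same kind of query can be tried from Python with the standard sqlite3 module. The sketch below builds a tiny in-memory stand-in for the sources table instead of opening a real litdb file; the column list is simplified, and the DOIs and JSON contents are made up. It assumes your SQLite build includes the JSON functions.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")

# Simplified stand-in for the sources table (no embedding column here).
con.execute(
    "create table sources(source text, text text, extra text, date_added text)"
)
con.executemany(
    "insert into sources values (?, ?, ?, ?)",
    [
        (
            "https://doi.org/10.0000/example-1",  # made-up DOI
            "First example entry",
            json.dumps({"type": "article", "bibtex": "@article{example1}"}),
            "2024-11-30",
        ),
        (
            "https://doi.org/10.0000/example-2",  # made-up DOI
            "Second example entry",
            json.dumps({"type": "preprint"}),  # no bibtex field
            "2024-12-01",
        ),
    ],
)

# json_extract returns NULL when the key is missing from the JSON document.
missing = con.execute(
    "select source, json_extract(extra, '$.type') from sources "
    "where json_extract(extra, '$.bibtex') is null"
).fetchall()
print(missing)
```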

Adding local files

The idea of using local files is that it is likely you have collected information in the form of files on your hard drive, and you want to be able to find information in those files.

It is possible to add any file that can be turned into text to litdb. That includes:

  • pdf
  • docx
  • pptx
  • html
  • ipynb
  • org / md
  • bib
  • url

This limits portability because you need a path if you want to be able to open that file.

The same vector, fulltext and gpt search commands are available for local file entries. These tend to be longer documents than the OpenAlex entries, and I am not sure how well the search works at the document level embeddings. Search at a chunk level is very precise; odds are you want paragraph level similarity to your query.

An early version of litdb stored each chunk. This is possible, but I used another table for it. You could munge the source to be something like f.pdf::chunk-1 so each one is unique, but that seems more complicated and you would need to do some experiments to see if it is warranted.

You can combine this with the OpenAlex entries in a single database.

You can walk a directory and add files from it with this command.

litdb index dir1

This directory is saved and you can update all the previously indexed directories like this.

litdb reindex

Some annoying things that may happen are duplicate content, e.g. because you have the same file in multiple formats like docx and pdf, or because you have literal copies of files in multiple places.

You should also be careful sharing a litdb that has indexed local files. It may have sensitive information that you don’t want others to be able to find.

Emacs integration

Of course there is some Emacs integration. I made a new link for litdb.

litdb:https://doi.org/10.1021/jp047349j

The links export as \cite{source}, and there is a function litdb-generate-bibtex to export bibtex entries for all links in the buffer. These entries are not guaranteed to be valid, most likely because of the keys (some DOIs are probably invalid bibtex keys).

You can easily insert a link like this:

M-x litdb

See ./litdb.el for details. This is not a package on MELPA yet. You should just load the .el file in your config. You can also use litdb-fulltext, litdb-vsearch, and litdb-gpt from Emacs to interact with your litdb.

litdb.el is under active development, and will be an alternative UI to the terminal eventually. It is too early to tell if it will replace org-ref. It has potential, but that would be a very large undertaking.

Database design

litdb uses a sqlite database with libsql. libsql is a sqlite fork with additional capabilities, most notably integrated vector search.

The main table in litdb is called sources.

  • sources
    • source (url to source location)
    • text (the text for the source)
    • extra (json data)
    • embedding (float32 blob in bytes)
    • date_added string

This table has an embedding_idx index for vector search.

There is also a virtual table fulltext for fulltext search.

  • fulltext
    • source
    • text

And a table called queries.

  • queries
    • filter
    • description
    • last_updated

This database is automatically created when you use litdb.

Limitations

The text that is stored for each entry comes from OpenAlex and is typically limited to the title and abstract. The first line of the text in each entry is typically a citation including the title, and the rest is the abstract if there is one. I feel like I see more and more entries with no abstract. This will certainly limit the quality of search, and could bias results towards entries with more text in them.

The quality of the vector search depends on several things. First, litdb stores a document level embedding vector that is computed by averaging the embedding vectors of overlapping chunks. We use Sentence Transformers to compute these. There are many choices to make on the model, and these have not been tested exhaustively. So far ‘all-MiniLM-L6-v2’ works well enough. There are other models you could consider like getting embeddings from ollama, but at the moment litdb can only use SentenceTransformers.
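The chunk-and-average idea can be sketched in plain Python. This is an illustration under assumptions: it splits on characters using chunk_size and chunk_overlap, which may differ from the splitter litdb uses, and embed is a stand-in for whatever function turns a chunk into a vector.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def document_embedding(text, embed, **chunk_options):
    """Average the vectors from embed(chunk) into one document vector."""
    vectors = [embed(chunk) for chunk in chunk_text(text, **chunk_options)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy embedding for demonstration only: (chunk length, number of "e" characters).
def toy_embed(chunk):
    return [len(chunk), chunk.count("e")]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
vector = document_embedding("abcdefghij", toy_embed, chunk_size=4, chunk_overlap=2)
print(chunks, vector)
```

With SentenceTransformers, embed could be something like lambda chunk: model.encode(chunk) for a loaded model, and the averaging would more naturally be done with numpy.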

I guess that document level embeddings are less effective on longer documents. The title+abstract from OpenAlex is pretty short, and so far there isn’t evidence this is a problem.

Second, we rely on defaults in libsql for the vector search, notably finding the top k nearest vectors based on cosine similarity. There are other distance metrics you could use like L2, but we have not considered these.
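For intuition, a top-k search by cosine similarity can be written as a brute-force loop. libsql uses an index instead of scanning every vector, so this sketch only shows what is being ranked, not how libsql does it. The function names and toy vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, entries, k=3):
    """Return the k (source, score) pairs most similar to query_vector.

    entries is a list of (source, vector) pairs.
    """
    scored = [
        (source, cosine_similarity(query_vector, vector))
        for source, vector in entries
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


entries = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
nearest = top_k([1.0, 0.0], entries, k=2)
print(nearest)
```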

The query is based on vector similarity between your query and the texts. So, you should write the query so it looks like what you want to find, rather than as a question. It is less clear how you should structure your query if you are using the GPT capability. It is more natural to ask a question, or give instructions. The RAG is still done by similarity though.

Finally, the search can only find things that are in your database. If you haven’t added it there, you won’t find it. That definitely means you will miss some papers. I try to use a mesh of approaches to cover the most likely papers. This includes:

  1. Follow authors
  2. Add references, related, and citing papers to the most relevant papers.
  3. Use text search filters
  4. Add papers I find from X, bluesky, LinkedIn, etc. (and their references, related, etc)
  5. If I read a paper in litdb that is good, add its references, related, etc.

It is an iterative process, and you have to make a judgment call about when to stop it. You can always come back later. There might even be newer papers to find.

Local file limitations

Similar limitations exist for local files. There are additionally the following known limitations:

  1. The quality of document to text influences the ultimate embedding. This varies by type of document, and the library used to convert it.
  2. Local files tend to be longer documents and this can lead to hundreds of text chunks per document. These chunk embeddings are averaged into one embedding. It is not obvious this is as effective as vector search on each chunk, but it is more memory efficient.

For PDF to text we use pymupdf4llm which works for this proof of concept. There is a Pro version of that package which supports more file formats. It is not obvious what it would cost to use that. I used docling in an early prototype. It also worked pretty well, but it was a little slower I think, and would occasionally segfault so I stopped using it. spaCy is integrating PDF to structured data using docling (https://explosion.ai/blog/pdfs-nlp-structured-data). There is plenty of room for improvement in this dimension, with trade-offs in performance and accuracy.

There is a new package from Microsoft to convert Office files to Markdown (https://github.com/microsoft/markitdown) that they specifically mention using in the context of LLMs.

The embedding model we use is trained on text. It is probably not as good at finding code, and the gpt we use is also probably not good at generating code. I guess you would need another table in the database for code, and a different model for embedding and generation. This only matters if you index jupyter notebooks (and later if other code files are supported).

sqlite + sqlite-vec vs libsql

Vector search is the core requirement for litdb. There are many ways to achieve this. I only considered local solutions so the options are:

vectorlite aims to be faster than sqlite-vec, but it relies on hnsw for vector search, and I was uncomfortable figuring out how to set the size of the db for this application.

sqlite-vec is nice, and early versions of litdb used it and its precursor. This approach requires a virtual table for the embeddings. This is installed as an extension, and is still considered in early stages of adoption.

libsql is a fork of sqlite with integrated vector search, and potential for using it as a cloud database. It is supported by a company, with freemium cloud services. In libsql you store the vectors in a regular table, and search on an embedding index. The code is on GitHub, and can also be used locally.

Roadmap

These are ideas for future expansion.

PDFs and notes

I am not sure what the best way to do this is. The records in litdb are stored by the source, often a url, or path. The PDFs would be stored outside the database, and we would need some way to link them. The keys aren’t suitable for naming, but maybe a hash of the keys would be suitable. This would add a fuller opportunity to search larger, local documents too. In org-ref, I only had one pdf per entry. I guess here I would have a new table, so you could have multiple documents linked to an entry, although it won’t be easy to tell what they are from the hash-based filenames.

Notes on the other hand, might be small enough that they could be stored in the database. Then they would be easily searchable. They could also be stored externally to make them easy to edit. I haven’t found the notes feature in org-ref that helpful, and usually I take notes in various places. What I should do is add a search to find the litdb links in your org-files. This is already a feature of org-db.

Jupyter lab integration

An alternative to the CLI and Emacs would be to run this in Jupyter Lab with magic commands and rich output.

graph visualization

It might be helpful to have a graph representation of a paper that shows nodes of citing, references, and related papers, with edge length related to a similarity score, and node size related to number of citations.

ResearchRabbit and Litmaps do this pretty well.

ollama and agents

There might be a way to get better results using agents and / or tools. For example, you might have a tool that can lookup new articles on OpenAlex, or augment with google search somehow. Or there might be some iterative prompt building tool that refines the search for related articles based on output results.

Here are some references for when I get back to this.

I don’t use llamaindex (maybe I should see what it does), but it has this section on agents https://docs.llamaindex.ai/en/stable/understanding/agent/

web app / fast-api

It might be nice to have a flask app with an API. This would facilitate interaction with Emacs.

async operations

Almost everything is done synchronously and it blocks the program. At least some things could be done asynchronously I think, and that might speed things up (especially for local files), or at least let you do other things while it happens.

The only thing to be careful about is not exceeding rate limits to OpenAlex. This is handled in the synchronous code.

application specific encoders

I use a generic embedding model, and there are others that are better suited for specific tasks. For example:

These might have a variety of uses with litdb that range from extracting data, named entity recognition, specific searches on materials, etc.

It is not essential to use SentenceTransformers for embedding, they are just easy to use. An alternative is something like ollama embeddings (https://ollama.com/blog/embedding-models) or llama.cpp https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#embeddings. The main reason to use one of these would be performance, and maybe better integration with a chat llm.

It is not that easy to just switch models; you would need to either add new columns and compute embeddings for everything, or update all the embeddings for a new model. The SPECTER embedding is much bigger than the all-MiniLM-L6-v2 embedding.

from sentence_transformers import SentenceTransformer

m = SentenceTransformer('allenai-specter')
print(m.encode(['test']).shape)

merge databases

I have set up litdb to be project based. There may come a time when it is desirable to merge some set of databases. It might not be necessary; I think you can attach databases in sqlite (https://www.sqlitetutorial.net/sqlite-attach-database/) to achieve basically the same effect. litdb doesn’t store version info at the moment, so it could be tricky to ensure compatibility.

Still it might be interesting to sync two databases, e.g. https://www.sqlite.org/rsync.html. I don’t know if this works with libsql, but it might allow there to be a central db that users pull from.
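As a rough illustration of the attach idea mentioned above, the sketch below creates two small throwaway sqlite files and queries them together. The schema is heavily simplified (no embeddings), the sources are made up, and I have not checked how this interacts with libsql vector indexes, so treat it as a plain sqlite demonstration only.

```python
import os
import sqlite3
import tempfile


def make_db(path, rows):
    """Create a throwaway database with a simplified sources table."""
    con = sqlite3.connect(path)
    con.execute("create table sources(source text primary key, text text)")
    con.executemany("insert into sources values (?, ?)", rows)
    con.commit()
    con.close()


tmp = tempfile.mkdtemp()
db_a = os.path.join(tmp, "a.db")
db_b = os.path.join(tmp, "b.db")
make_db(db_a, [("doi:1", "first"), ("doi:2", "second")])
make_db(db_b, [("doi:2", "second"), ("doi:3", "third")])

con = sqlite3.connect(db_a)
# Attach the second file under the schema name "other".
con.execute("attach database ? as other", (db_b,))

# union removes the duplicate source that exists in both databases.
merged = con.execute(
    "select source from main.sources "
    "union select source from other.sources order by source"
).fetchall()
con.close()
print(merged)
```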

remote db

The first version of litdb with libsql used a fully remote db on their cloud. The main benefit of that is you can update the db from another machine, keeping your working machine load low. A secondary benefit would be using the db from different machines more easily. Right now I use Dropbox to sync it; that mostly works but I get some conflict files here and there if I change it on one machine while it is open on another machine. It is a little more complex to set up though, and I got several api errors on long running scripts, and with network issues, so I switched to this local setup. I think you could specify this in the litdb.toml file and have it do the right thing on a project basis.

image and text models

One day it might be possible to include images in this (https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#image-text-models). At the moment, OpenAlex entries do not have any images, but other web resources and local files could. I have an image database in org-db, but I don’t use it a lot.

combine full text and vector search

Vector search might miss some things. Full text search is hard to do with meaning. There are several ways to do a hybrid search, e.g. do a full text search on keywords, and a vector search, and use some kind of union on those results.

https://www.meilisearch.com/blog/full-text-search-vs-vector-search

tag system

It could be useful to have a tag system where you could label entries, or they could be auto-tagged when updating filters. This would allow you to tag entries by a project, or select entries for some kind of bulk action like update, export to bibtex, or delete.

You might also build a scoring system, e.g. for like/dislike tags.

litdb tag doi -t "tag1" "tag2"  # add tags
litdb tag doi -r "tag1" "tag2"  # rm tags

Integrate with audio input

This would use your microphone to record and transcribe a query for search.

Integrate with screenshot + OCR

Do the search from the results. I did this with tesseract (https://pypi.org/project/pytesseract/)

import pyautogui
import pyscreeze

# Prompt the user to move the mouse to the first corner and press Enter
input("Move the mouse to the first corner and press Enter...")
x1, y1 = pyautogui.position()

# Prompt the user to move the mouse to the opposite corner and press Enter
input("Move the mouse to the opposite corner and press Enter...")
x2, y2 = pyautogui.position()

# Calculate the region
left = min(x1, x2)
top = min(y1, y2)
width = abs(x2 - x1)
height = abs(y2 - y1)

region = (left, top, width, height)
print(f"Selected region: {region}")

# Capture the selected region and save it for OCR
im = pyscreeze.screenshot(region=region)
im.save('screenshot.png')

see mss also.

from PIL import Image
import pytesseract

# Open an image file
img = Image.open('screenshot.png')

# Use Tesseract to extract text
text = pytesseract.image_to_string(img)

# Print the extracted text
print(text)

This might be nice later when we have image embeddings.

review process

litdb review --since '1 week ago'

You need to have a way to review what comes in to litdb; it is part of learning about what is current. I currently do this with Emacs and scimax-org-feed. You could integrate review with update-filters, or by entries added in the past few days, or some other kind of query. Then you just need to add some format information to get what you want, e.g. org, maybe html?

select source, date_added from sources where date(date_added) > '2024-11-28' limit 5
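Here is a sketch of how a query like the one above could be driven from Python with a computed cutoff date. The table is a simplified in-memory stand-in, the rows are made up, and the date_added format is an assumption (an ISO-style string that sqlite's date() can parse). The helper name added_since is hypothetical.

```python
import sqlite3
from datetime import date, timedelta

con = sqlite3.connect(":memory:")
con.execute("create table sources(source text, date_added text)")
con.executemany(
    "insert into sources values (?, ?)",
    [
        ("doi:old", "2024-11-20 10:00:00"),
        ("doi:new", "2024-12-01 09:30:00"),
    ],
)


def added_since(con, since, limit=5):
    """Return (source, date_added) rows added after the given date."""
    sql = (
        "select source, date_added from sources "
        "where date(date_added) > ? limit ?"
    )
    return con.execute(sql, (since.isoformat(), limit)).fetchall()


# "1 week ago" relative to a fixed day, so the example is reproducible.
today = date(2024, 12, 5)
recent = added_since(con, today - timedelta(weeks=1))
print(recent)
```

In real use you would replace the fixed day with date.today().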

semantic similarity

litdb uses cosine similarity as the distance metric for the nearest neighbors. It might be useful to re-rank these with cross-encoding.

https://www.sbert.net/examples/applications/cross-encoder/README.html

Related projects

LitSuggest
https://www.ncbi.nlm.nih.gov/research/litsuggest/
  • Browser tool that suggests literature for you based on positive and negative PMIDs. Hosted by NIH.
paper-qa
https://github.com/Future-House/paper-qa
  • This project by Andrew White uses LLM+RAG to explore a paper.
ColBERT
https://github.com/stanford-futuredata/ColBERT
  • ColBERT is a fast retrieval model for large text collections. In theory it can probably be integrated with litdb. litdb is so simple, and works well enough so far without it.

Many of these projects require you to make an account. There are freemium levels in each one.

ResearchRabbit
https://www.researchrabbit.ai/
  • This is a browser tool to navigate the scientific literature graphically. You can make collections, and papers that are related by citations are shown in a graph
LitMaps
https://www.litmaps.com/
  • Another browser tool to graphically interact with scientific literature
Keenious
https://keenious.com/explore
  • Browser / Google Docs and Word plugin. Finds related articles to the text in your document. I like Keenious when in Google Docs.
scite.ai
https://scite.ai/
  • Browser tool that integrates GPT with the scientific literature, integration with Zotero
Scopus AI
https://www.scopus.com/search/form.uri?display=basic#scopus-ai
  • Sponsored by Elsevier
Dimensions AI
https://app.dimensions.ai/discover/publication
  • Seems similar to Scopus AI
khoj
https://khoj.dev/
  • This is a desktop app that can be totally local, or in the cloud. It can index your files, and then you can chat with them. There is a freemium level.
AnythingLLM
https://anythingllm.com/
  • Another tool that runs LLMs locally, and says it can index your files so you can chat with them.
gpt4all
https://www.nomic.ai/gpt4all
  • Another tool that runs LLMs locally, and says it can index your files so you can chat with them.

With all these options, why does litdb exist? There are a lot of answers to that. First, I wanted to make it. I learned a lot about vector search by doing it. Second, I wanted a free, extensible solution for literature search that could also work for my local files while never putting data in the cloud, and that would work in Emacs. The projects above are very nice, easy to use, no or low-code solutions, and if that is what you are looking for, look there! If you want to hack on things yourself, look here.