Skip to content
This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Latest commit

 

History

History
140 lines (110 loc) · 6.67 KB

README.rst

File metadata and controls

140 lines (110 loc) · 6.67 KB

REPO IS ARCHIVED

After experimenting for a while, we ultimately decided not to use Intake catalogs to distribute our data, so this repository is no longer maintained.

See PUDL Data Access for more information on how to access the data we publish.

SQLite Driver for Intake Data Catalogs

Tox-PyTest Status Codecov Test Coverage Read the Docs Build Status PyPI Latest Version conda-forge Version Supported Python Versions Any color you want, so long as it's black.

This package provides a (very) thin wrapper around the more general intake-sql driver, which can be used to generate Intake data catalogs from SQL databases.

The intake-sql driver takes an SQL Alchemy database URL and uses it to connect to and extract data from the database. This works with just fine with SQLite databases, but only when the database file is stored locally and can be referenced with a simple path.

For example this path:

/home/zane/code/catalyst/pudl-work/sqlite/pudl.sqlite

would correspond to this SQL Alchemy database URL:

sqlite:///home/zane/code/catalyst/pudl-work/sqlite/pudl.sqlite

But you can't access a remote SQLite DB this way.

Why access a remote SQLite DB?

  • SQLite databases are great standalone, standardized containers for relational data, that can be accessed using a huge variety of tools on almost any computer platform. They are even accepted as an archival format by the US Library of Congress!
  • Data evolves over time, and it's often useful to have easy access to several different versions of it, and to know exactly which version you're working with.
  • Cloud object storage is extremely cheap and convenient, and makes it easy to publish and store historical file-based data releases.
  • Managing your own bespoke local filesystem hierarchy filled with data -- and coordinating with colleagues so that everyone is using the same filesystem organizational scheme -- is a pain.
  • Intake catalogs can provide easy access to metadata and let you manage data versions just like software versions. Installing a new version of the data catalog points you at the new version of the data.
  • The overhead and cost associated with setting up and maintaining a database that uses a client-server model is relatively large compared to distributing a few files that change infrequently, are essentially read-only resources, and only take up a few gigabytes of space.

How does it work?

Rather than using an SQL Alchemy database URL to reference the SQLite DB, this intake driver takes a local path or a remote URL, like:

  • ../pudl-work/sqlite/pudl.sqlite
  • https://global-power-plants.datasettes.com/global-power-plants.db
  • s3://cloudy-mc-cloudface-databucket/v1.2.3/mydata.db

For local paths, it resolves the path and prepends sqlite:// before handing it off to intake-sql to do all the hard work.

For remote URLs it uses fsspec to cache a local copy of the database, and then gives intake-sql a database URL that points to the cached copy.

import intake_sqlite

gpp_cat = intake_sqlite.SQLiteCatalog(
    urlpath="https://global-power-plants.datasettes.com/global-power-plants.db",
    storage_options={"simplecache": {"cache_storage": "/home/zane/.cache/intake"}},
)

list(gpp_cat)

# ['global-power-plants',
#  'global-power-plants_fts',
#  'global-power-plants_fts_config',
#  'global-power-plants_fts_data',
#  'global-power-plants_fts_docsize',
#  'global-power-plants_fts_idx']

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.

Contact Us

  • For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions
  • If you'd like to get occasional updates about our projects sign up for our email list.
  • Want to schedule a time to chat with us one-on-one? Join us for Office Hours
  • Follow us on Twitter: @CatalystCoop
  • More info on our website: https://catalyst.coop
  • For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: [email protected]