A multi-table synthetic data generator based on OpenAI’s GPT-3 APIs
This project is an experiment in LLM-based workflows. Functionally, my goal is to be able to quickly create semi-realistic synthetic data.
I want to be able to...
- Start from scratch: Most synthetic data generators work by taking a sample of real data, and generating a fake dataset that has similar properties. I want to generate (aka "hallucinate") data starting from just an idea.
- Cover any topic: I want to be able to generate data related to many different topics.
- Generate a database, not just a table: I don't just want to generate a table. I want to generate a realistic-feeling database, with multiple tables and realistic use of things like foreign keys, ENUMs, and timestamps.
- Pass the "Enhance That!" test: At the end of the day, I want to generate data that "feels authentic." This is squishy, but sometimes that's what it comes down to when we're evaluating content. (More on this test below.)
Here are some example datasets generated by Sample Peyote, using topics from recent episodes of Stuff You Should Know.
- Rubik's Cube Solving Times Data: the average time it takes to solve a Rubik's Cube, as well as the fastest and slowest times achieved by individual players
- Nintendo Video Game Sales Data: the sales records for all of Nintendo's video games, including information like the title, release date, region it was released in, number of copies sold and gross revenue
- Polar Bear Population Data: population counts and trends of polar bears in various regions, as well as information on their habitats and behavior
- Strike Participant Data: information about the participants in the Atlanta Washer Woman Strike, including their names, ages, occupations, and any other relevant demographic information
- Birthday Probability Data: the probability of two or more people in a given group having the same birthday
...and here are some generated from topics on the front page of the Wall Street Journal.
- Microsoft Partner Network Data: information about the different partners and vendors that work with Microsoft, including their contact information, services provided, and customer ratings
- Federal Reserve Bank Loan Data: loan information from each of the 12 Federal Reserve Banks, including loan type, amount, interest rate and maturity date
- Nuclear Fusion Reactor Performance Data: information on the performance of different nuclear fusion reactors, such as energy output, reactor temperature, and efficiency
- Xi Jinping's Political Policies Data: records of the political policies proposed and implemented by President Xi Jinping since he came to power in 2012
- Supply Chain Inventory Data: detailed information on the quantity of inventory at each stage in the supply chain, from suppliers to retailers
You can find more example runs in the data/ directory in this repo.
Installation
git clone git@github.com:abegong/sample_peyote.git
cd sample_peyote
pip install .
You'll also need a valid OPENAI_API_KEY configured in your environment variables.
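For example, here's a quick sanity check you could run before your first invocation (a sketch; the variable name follows the standard openai client convention):

```python
import os

# Fail fast if the key isn't set, e.g. via `export OPENAI_API_KEY=<your key>`.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set")
```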
Basic usage:
sample_peyote
Data from this run is included in the data/ folder in this repo:
- Star Wars Character Data: a comprehensive list of all the characters featured in the Star Wars universe, with information such as their name, species, homeworld, and any other relevant details
Sample Peyote will
- ask you for a topic,
- generate some ideas for you,
- ask you to choose one of the ideas,
- generate tables and samples for you.
You can specify a topic to skip step 1: sample_peyote --topic quadrilaterals
For multi-word topics, please use quotes: sample_peyote --topic "The Beatles"
If you specify -n 1 (only generate a single idea), it'll skip step 3: sample_peyote -n 1
If you specify --silent or -s, it will suppress print output. Combined with --topic and -n 1, this allows headless generation of datasets: sample_peyote --topic "The Beatles" -n 1 --silent
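If you want to batch this, here's a minimal sketch that drives the headless mode from Python (the topic list is made up, and it assumes sample_peyote is on your PATH with OPENAI_API_KEY set; only the documented flags above are used):

```python
# A sketch: generate one dataset per topic, headlessly.
import subprocess

topics = ["The Beatles", "quadrilaterals", "polar bears"]  # placeholders
for topic in topics:
    subprocess.run(
        ["sample_peyote", "--topic", topic, "-n", "1", "--silent"],
        check=True,  # raise if a run fails (e.g. on a parsing error)
    )
```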
On each run, Sample Peyote will generate a directory that looks like this:
├── summary-beatles-song-lyrics-data.html # An HTML file containing the dataset idea, data samples, descriptions, and the full history of prompts and responses from API calls
├── dataset_ideas.json # A JSON file containing dataset ideas related to the specified topic
├── tables-beatles-song-lyrics-data.jl # A JSON-lines file containing table descriptions and columns for the chosen dataset
└── samples # Contains the data samples themselves
    ├── albums.csv
    ├── artists.csv
    ├── genres.csv
    ├── performances.csv
    ├── song-lyrics.csv
    ├── tracks.csv
    └── writers.csv
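The generated CSVs are plain files, so you can load them however you like. For example, a quick pandas sketch (the run directory name below is hypothetical; substitute the path from your own run):

```python
# A sketch: read every generated sample table into a pandas DataFrame.
from pathlib import Path

import pandas as pd

run_dir = Path("data/beatles-song-lyrics-data")  # hypothetical run directory
tables = {p.stem: pd.read_csv(p) for p in sorted((run_dir / "samples").glob("*.csv"))}

for name, df in tables.items():
    print(name, df.shape)
```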
This program relies on regex parsing of replies from OpenAI's Davinci text model. I've done some basic prompt engineering to make Davinci more likely to return well-formatted responses, but it's not perfect. I'd guess it fails about 10% of the time, though that's not based on anything scientific.
Since I'm only using this for demo purposes, I haven't bothered to trap and log those errors or set up retry logic for them. If you were going to use Sample Peyote for real, you'd want to make it more reliable. (PRs welcome!)
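If you did want to harden it, the shape of that retry logic might look roughly like this. This is a sketch, not Sample Peyote's actual code: parse_columns, the regex, and the model name are all assumptions, and it targets the 2022-era (0.x) openai Python client.

```python
# A sketch of retry-on-parse-failure; not the project's actual code.
import re

import openai  # assumes the legacy (0.x, 2022-era) client


def parse_columns(text: str) -> list[tuple[str, str]]:
    """Hypothetical parser: expects one 'name: type' pair per line."""
    columns = re.findall(r"^\s*(\w+)\s*:\s*(\w+)\s*$", text, flags=re.MULTILINE)
    if not columns:
        raise ValueError("regex failed to match the model's response")
    return columns


def complete_with_retries(prompt: str, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        response = openai.Completion.create(
            model="text-davinci-003",  # assumption: a Davinci-class text model
            prompt=prompt,
            max_tokens=512,
        )
        try:
            return parse_columns(response["choices"][0]["text"])
        except ValueError as err:
            print(f"Attempt {attempt}/{max_attempts} failed: {err}")
    raise RuntimeError("model never returned a parseable response")
```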
Is this useful?
Well, it helps you hallucinate data samples.
Use of synthetic data is on the rise: it can boost the size of training sets for ML models, and it lets you test algorithms and data systems with less risk of exposing sensitive data. I also have a hunch that sufficiently realistic datasets could be useful for classes and bootcamps for data scientists and engineers; it's hard to find realistic datasets without working inside a real organization.
In other words, there's a possibility that Sample Peyote might actually be useful to somebody. If that somebody is you, please have at it! I've open sourced Sample Peyote under the Apache 2.0 license. I'm also happy to accept PRs, but please don't expect quick turnaround; I'm not planning to invest a ton of time in this project.
The "Enhance That!" test
Let's say you're watching a Hollywood blockbuster featuring some kind of data-related MacGuffin. There's a tense scene where the characters are looking at a screen: "Oh no! The Jackal is hacking the power grid database!" or "Captain, we've unencrypted the alien message. It's a SQLite database."
You freeze playback, and output from Sample Peyote appears on the screen. If you showed the screen to a professional data analyst/scientist/engineer, would they cringe? Or look thoughtful and say, "hey, that's actually pretty good!"
Why did I build this?
My real goal for this project was to explore how AI-based workflows are likely to evolve. As a data guy and entrepreneur, I naturally gravitated in this direction. I built the core of Sample Peyote in a weekend, then spent a few more hours packaging it up to share via GitHub.
I'm very bullish on the potential for AI-based tools to unlock a ton of productivity and creativity, but I also believe they're going to cause sweeping changes that we aren't ready for as a society. I'm doing a lot of thinking about how those changes will play out within the world of data, analytics, and epistemology (i.e., learning and reasoning together based on evidence).
If this stuff interests you too, please reach out! As of Dec 2022, I'm active on Twitter under the handle @abegong.
TODO
- Set up testing via GH actions
- Bugfix: Detect failed regex matching
- Add better error trapping in general
- Add logging
- Parallelize API calls to create Samples, for faster execution