A multi-table synthetic data generator based on OpenAI’s GPT-3 APIs
This project is an experiment in LLM-based workflows. Functionally, my goal is to be able to quickly create semi-realistic synthetic data.
I want to be able to...
- Start from scratch: Most synthetic data generators work by taking a sample of real data, and generating a fake dataset that has similar properties. I want to generate (aka "hallucinate") data starting from just an idea.
- Cover any topic: I want to be able to generate data related to many different topics.
- Generate a database, not just a table: I don't just want to generate a table. I want to generate a realistic-feeling database, with multiple tables and realistic use of things like foreign keys, ENUMs, and timestamps.
- Pass the "Enhance That!" test: At the end of the day, I want to generate data that "feels authentic." This is squishy, but sometimes that's what it comes down to when we're evaluating content. (More on this test below.)
Here are some example datasets generated by Sample Peyote, using topics from recent episodes of Stuff You Should Know.
- Rubik's Cube Solving Times Data: the average time it takes to solve a Rubik's Cube, as well as the fastest and slowest times achieved by individual players
- Nintendo Video Game Sales Data: the sales records for all of Nintendo's video games, including information like the title, release date, region it was released in, number of copies sold and gross revenue
- Polar Bear Population Data: population counts and trends of polar bears in various regions, as well as information on their habitats and behavior
- Strike Participant Data: information about the participants in the Atlanta Washer Woman Strike, including their names, ages, occupations, and any other relevant demographic information
- Birthday Probability Data: the probability of two or more people in a given group having the same birthday
...and here are some generated from topics on the front page of the Wall Street Journal.
- Microsoft Partner Network Data: information about the different partners and vendors that work with Microsoft, including their contact information, services provided, and customer ratings
- Federal Reserve Bank Loan Data: loan information from each of the 12 Federal Reserve Banks, including loan type, amount, interest rate and maturity date
- Nuclear Fusion Reactor Performance Data: information on the performance of different nuclear fusion reactors, such as energy output, reactor temperature, and efficiency
- Xi Jinping's Political Policies Data: records of the political policies proposed and implemented by President Xi Jinping since he came to power in 2012
- Supply Chain Inventory Data: detailed information on the quantity of inventory at each stage in the supply chain, from suppliers to retailers
You can find more example runs in the data/ directory in this repo.
Installation
git clone git@github.com:abegong/sample_peyote.git
cd sample_peyote
pip install .
You'll also need a valid OPENAI_API_KEY configured in your environment variables.
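For example, here's a quick sanity check you could run before your first invocation (a sketch; the variable name follows the standard openai client convention):

```python
import os

# Fail fast if the key isn't set, e.g. via `export OPENAI_API_KEY=<your key>`.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set")
```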
Basic usage:
sample_peyote
Data from this run is included in the data/ folder in this repo:
- Star Wars Character Data: a comprehensive list of all the characters featured in the Star Wars universe, with information such as their name, species, homeworld, and any other relevant details
Sample Peyote will
- ask you for a topic,
- generate some ideas for you,
- ask you to choose one of the ideas,
- generate tables and samples for you.
You can specify a topic to skip step 1: sample_peyote --topic quadrilaterals
For multi-word topics, please use quotes: sample_peyote --topic "The Beatles"
If you specify -n 1 (only generate a single idea), it'll skip step 3: sample_peyote -n 1
If you specify --silent or -s, it will suppress print output. Combined with --topic and -n 1, this allows headless generation of datasets: sample_peyote --topic "The Beatles" -n 1 --silent
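If you want to batch this, here's a minimal sketch that drives the headless mode from Python (the topic list is made up, and it assumes sample_peyote is on your PATH with OPENAI_API_KEY set; only the documented flags above are used):

```python
# A sketch: generate one dataset per topic, headlessly.
import subprocess

topics = ["The Beatles", "quadrilaterals", "polar bears"]  # placeholders
for topic in topics:
    subprocess.run(
        ["sample_peyote", "--topic", topic, "-n", "1", "--silent"],
        check=True,  # raise if a run fails (e.g. on a parsing error)
    )
```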
On each run, Sample Peyote will generate a directory that looks like this:
├── summary-beatles-song-lyrics-data.html # An HTML file containing the dataset idea, data samples, descriptions, and the full history of prompts and responses from API calls
├── dataset_ideas.json # A JSON file containing dataset ideas related to the specified topic
├── tables-beatles-song-lyrics-data.jl # A JSON-lines file containing table descriptions and columns for the chosen dataset
└── samples # Contains the data samples themselves
    ├── albums.csv
    ├── artists.csv
    ├── genres.csv
    ├── performances.csv
    ├── song-lyrics.csv
    ├── tracks.csv
    └── writers.csv
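The generated CSVs are plain files, so you can load them however you like. For example, a quick pandas sketch (the run directory name below is hypothetical; substitute the path from your own run):

```python
# A sketch: read every generated sample table into a pandas DataFrame.
from pathlib import Path

import pandas as pd

run_dir = Path("data/beatles-song-lyrics-data")  # hypothetical run directory
tables = {p.stem: pd.read_csv(p) for p in sorted((run_dir / "samples").glob("*.csv"))}

for name, df in tables.items():
    print(name, df.shape)
```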
This program relies on regex parsing of replies from OpenAI's Davinci text model. I've done some basic prompt engineering to make Davinci more likely to return well-formatted responses, but it's not perfect. I'd guess it fails about 10% of the time, though that's not based on anything scientific.
Since I'm only using this for demo purposes, I haven't bothered to trap and log those errors or set up retry logic for them. If you were going to use Sample Peyote for real, you'd want to make it more reliable. (PRs welcome!)
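If you did want to harden it, the shape of that retry logic might look roughly like this. This is a sketch, not Sample Peyote's actual code: parse_columns, the regex, and the model name are all assumptions, and it targets the 2022-era (0.x) openai Python client.

```python
# A sketch of retry-on-parse-failure; not the project's actual code.
import re

import openai  # assumes the legacy (0.x, 2022-era) client


def parse_columns(text: str) -> list[tuple[str, str]]:
    """Hypothetical parser: expects one 'name: type' pair per line."""
    columns = re.findall(r"^\s*(\w+)\s*:\s*(\w+)\s*$", text, flags=re.MULTILINE)
    if not columns:
        raise ValueError("regex failed to match the model's response")
    return columns


def complete_with_retries(prompt: str, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        response = openai.Completion.create(
            model="text-davinci-003",  # assumption: a Davinci-class text model
            prompt=prompt,
            max_tokens=512,
        )
        try:
            return parse_columns(response["choices"][0]["text"])
        except ValueError as err:
            print(f"Attempt {attempt}/{max_attempts} failed: {err}")
    raise RuntimeError("model never returned a parseable response")
```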
Is this useful?
Well, it helps you hallucinate data samples.
Use of synthetic data is on the rise: it can boost the size of training sets for ML models, and it lets you test algorithms and data systems with less risk of exposing sensitive data. I also have a hunch that sufficiently realistic datasets could be useful for classes and bootcamps for data scientists and engineers; it's hard to find realistic datasets without working inside a real organization.
In other words, there's a possibility that Sample Peyote might actually be useful to somebody. If that somebody is you, please have at it! I've open sourced Sample Peyote under the Apache 2.0 license. I'm also happy to accept PRs, but please don't expect quick turnaround; I'm not planning to invest a ton of time in this project.
The "Enhance That!" test
Let's say you're watching a Hollywood blockbuster featuring some kind of data-related MacGuffin. There's a tense scene where the characters are looking at a screen: "Oh no! The Jackal is hacking the power grid database!" or "Captain, we've unencrypted the alien message. It's a SQLite database."
You freeze playback, and output from Sample Peyote appears on the screen. If you showed the screen to a professional data analyst/scientist/engineer, would they cringe? Or look thoughtful and say, "hey, that's actually pretty good!"
Why did I build this?
My real goal for this project was to explore how AI-based workflows are likely to evolve. As a data guy and entrepreneur, I naturally gravitated in this direction. I built the core of Sample Peyote in a weekend, then spent a few more hours packaging it up to share via GitHub.
I'm very bullish on the potential for AI-based tools to unlock a ton of productivity and creativity, but I also believe they're going to cause sweeping changes that we aren't ready for as a society. I'm doing a lot of thinking about how those changes will play out within the world of data, analytics, and epistemology (i.e., learning and reasoning together based on evidence).
If this stuff interests you too, please reach out! As of Dec 2022, I'm active on Twitter under the handle @abegong.
TODO
- Set up testing via GH actions
- Bugfix: Detect failed regex matching
- Add better error trapping in general
- Add logging
- Parallelize API calls to create Samples, for faster execution