huggingface · Wauplin · Sep 12, 2024 · Aug 19, 2024 · Aug 20, 2024 · Aug 21, 2024
diff --git a/docs/api-inference/_redirects.yml b/docs/api-inference/_redirects.yml
@@ -0,0 +1,5 @@
+quicktour: overview
+detailed_parameters: parameters
+parallelism: TODO
+usage: getting_started
+faq: overview
diff --git a/docs/api-inference/_toctree.yml b/docs/api-inference/_toctree.yml
@@ -0,0 +1,17 @@
+- sections:
+  - local: index
+    title: Serverless Inference API
+  - local: overview
+    title: Overview
+  - local: getting_started
+    title: Getting Started
+  - local: rate_limits
+    title: Rate Limits
+  title: title
+- sections:
+  - local: parameters
+    title: Parameters
+  - sections:
+    - local: tasks/fill_mask
+      title: Fill Mask
+  title: Parameters
diff --git a/docs/api-inference/getting_started.md b/docs/api-inference/getting_started.md
@@ -0,0 +1,3 @@
+# Getting Started
+
+TODO:
diff --git a/docs/api-inference/index.md b/docs/api-inference/index.md
@@ -0,0 +1,50 @@
+# Serverless Inference API
+
+**Instant Access to 800,000+ ML Models for Fast Prototyping**
+
+Explore the most popular models for text, image, speech, and more — all with a simple API request. Build, test, and experiment without worrying about infrastructure or setup.
+
+---
+
+## Why use the Inference API?
+
+The Serverless Inference API offers a fast and free way to explore thousands of models for a variety of tasks. Whether you're prototyping a new application or experimenting with ML capabilities, this API gives you instant access to high-performing models across multiple domains:
+
+* **Text Generation:** Generate and experiment with high-quality responses from large language models, including tool-calling prompts.
+* **Image Generation:** Easily create customized images, including LoRAs for your own styles.
+* **Document Embeddings:** Build search and retrieval systems with SOTA embeddings.
+* **Classical AI Tasks:** Ready-to-use models for text classification, image classification, speech recognition, and more.
+
+TODO: add some flow chart image
+
+⚡ **Fast and Free to Get Started**: The Inference API is free with rate limits. For production needs, explore [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) for dedicated resources, autoscaling, advanced security features, and more.
+
+---
+
+## Key Benefits
+
+- 🚀 **Instant Prototyping:** Access powerful models without setup.
+- 🎯 **Diverse Use Cases:** One API for text, image, and beyond.
+- 🔧 **Developer-Friendly:** Simple requests, fast responses.
+
+---
+
+## Contents
+
+The documentation is organized into two sections:
+
+* **Getting Started:** Learn the basics of how to use the Inference API.
+* **Parameters:** Dive into task-specific settings and parameters.
+
+---
+
+## Looking for custom support from the Hugging Face team?
+
+<a target="_blank" href="https://huggingface.co/support">
+    <img alt="HuggingFace Expert Acceleration Program" src="https://cdn-media.huggingface.co/marketing/transformers/new-support-improved.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
+</a><br>
+
+## Hugging Face is trusted in production by over 10,000 companies
+
+<img class="block dark:hidden !shadow-none !border-0 !rounded-none" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/inference-api/companies-light.png" width="600">
+<img class="hidden dark:block !shadow-none !border-0 !rounded-none" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/inference-api/companies-dark.png" width="600">
diff --git a/docs/api-inference/overview.md b/docs/api-inference/overview.md
@@ -0,0 +1,49 @@
+# Overview
+
+## Main Features
+
+* Leverage 800,000+ models from different open-source libraries (transformers, sentence transformers, adapter transformers, diffusers, timm, etc.).
+* Use models for a variety of tasks, including text generation, image generation, document embeddings, NER, summarization, image classification, and more.
+* Accelerate your prototyping by using GPU-powered models.
+* Run very large models that are challenging to deploy in production.
+* Benefit from built-in automatic scaling, load balancing, and caching.
+
+## Eligibility
+
+Given the fast-paced nature of the open ML ecosystem, the Inference API serves models that have large community interest and are actively being used (based on recent likes, downloads, and usage). Because of this, deployed models can be swapped without prior notice.
+
+You can find:
+
+* **[Warm models](https://huggingface.co/models?inference=warm&sort=trending):** models ready to be used.
+* **[Cold models](https://huggingface.co/models?inference=cold&sort=trending):** models that are not loaded but can be used.
+* **[Frozen models](https://huggingface.co/models?inference=frozen&sort=trending):** models that currently can't be run with the API.
+
+TODO: add screenshot
+
+## GPU vs CPU
+
+By default, the Inference API uses GPUs to run large models. For small models that run well on CPU, such as small text classification and text embedding models, the API automatically switches to CPU to save costs.
+
+## Inference for PRO
+
+In addition to thousands of public models available on the Hub, PRO and Enterprise users get free access and higher rate limits to the following models:
+
+
+| Model                          | Size                                                                                                                                                                                       | Context Length | Use                                                          |
+|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|--------------------------------------------------------------|
+| Meta Llama 3.1 Instruct        | [8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), [70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)                                                   | 128k tokens    | High-quality multilingual chat model with large context length |
+| Meta Llama 3 Instruct          | [8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), [70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)                                                       | 8k tokens      | One of the best chat models                                  |
+| Llama 2 Chat                   | [7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), [70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | 4k tokens      | One of the best conversational models                        |
+| Bark                           | [0.9B](https://huggingface.co/suno/bark)                                                                                                                                                   | -              | Text to audio generation                                     |
+
+
+## FAQ
+
+### Running Private Models
+
+The free Serverless API is designed to run popular public models. If you have a private model, you can use [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) to deploy it.
+
+### Fine-tuning Models
+
+To automatically fine-tune a model on your data, please try [AutoTrain](https://huggingface.co/autotrain). It’s a no-code solution for automatically training and deploying a model; all you have to do is upload your data!
+
diff --git a/docs/api-inference/parameters.md b/docs/api-inference/parameters.md
@@ -0,0 +1,16 @@
+# Parameters
+
+Table with 
+- Domain
+- Task
+- Whether it's supported in Inference API
+- Supported libraries (not sure)
+- Recommended model
+- Link to model specific page
+
+
+
+## Additional parameters (different page?)
+
+- Controlling cache
+- Modifying the task used by a model (Which task is used by this model?)
diff --git a/docs/api-inference/rate_limits.md b/docs/api-inference/rate_limits.md
@@ -0,0 +1,11 @@
+# Rate Limits
+
+The Inference API has temporary rate limits based on the number of requests. These rate limits are subject to change in the future to be compute-based or token-based. 
+
+The Serverless API is not meant to be used for heavy production applications. If you need higher rate limits, consider using [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) to have dedicated resources.
+
+| User Tier                | Rate Limit             |
+|--------------------------|------------------------|
+| Unregistered Users       | 1 request per hour     |
+| Signed-up Users          | 300 requests per hour  |
+| PRO and Enterprise Users | 1000 requests per hour |
diff --git a/docs/api-inference/tasks/fill_mask.md b/docs/api-inference/tasks/fill_mask.md
@@ -0,0 +1,6 @@
+## Fill Mask
+
+Mask filling is the task of predicting the right word (token to be precise) in the middle of a sequence. 
+
+Automated docs below
+