New api docs structure #1379

Merged (42 commits), Sep 12, 2024
Commits:

- 88b8af1 Add draft of docs structure (osanseviero, Aug 19, 2024)
- f558bdd Add index page (osanseviero, Aug 20, 2024)
- 8b6230f Prepare overview and rate limits (osanseviero, Aug 21, 2024)
- 6380dfe Manage redirects (osanseviero, Aug 21, 2024)
- 9df929a Clean up (osanseviero, Aug 21, 2024)
- 60ad476 Apply suggestions from code review (osanseviero, Aug 21, 2024)
- a93f0dc Apply suggestions from review (osanseviero, Aug 21, 2024)
- 4069586 Merge branch 'new_api_docs' of github.com:huggingface/hub-docs into n… (osanseviero, Aug 21, 2024)
- f2610b7 Add additional headers (osanseviero, Aug 23, 2024)
- c0bee69 Apply suggestions from code review (osanseviero, Aug 26, 2024)
- 6294514 Incorporate reviewer's feedback (osanseviero, Aug 26, 2024)
- 12ba289 First draft for text-to-image, image-to-image + generate script (#1384) (Wauplin, Aug 27, 2024)
- eb6171e Merge branches 'main' and 'new_api_docs' of github.com:huggingface/hu… (osanseviero, Aug 27, 2024)
- 9b1e735 Add getting started (osanseviero, Aug 27, 2024)
- fb57a2d Add draft of docs structure (osanseviero, Aug 19, 2024)
- bad42b0 Add index page (osanseviero, Aug 20, 2024)
- d656272 Prepare overview and rate limits (osanseviero, Aug 21, 2024)
- 01983fc Manage redirects (osanseviero, Aug 21, 2024)
- dfdc02d Clean up (osanseviero, Aug 21, 2024)
- abe2d4f Apply suggestions from review (osanseviero, Aug 21, 2024)
- 042a0e4 Apply suggestions from code review (osanseviero, Aug 21, 2024)
- d774816 Add additional headers (osanseviero, Aug 23, 2024)
- a097022 Apply suggestions from code review (osanseviero, Aug 26, 2024)
- 9bf223e Incorporate reviewer's feedback (osanseviero, Aug 26, 2024)
- 51750bf First draft for text-to-image, image-to-image + generate script (#1384) (Wauplin, Aug 27, 2024)
- 0c34106 Add getting started (osanseviero, Aug 27, 2024)
- cc9b363 Merge branch 'new_api_docs' of github.com:huggingface/hub-docs into n… (Wauplin, Aug 27, 2024)
- ac640c8 Update docs/api-inference/getting_started.md (osanseviero, Aug 28, 2024)
- b785d8b Draft to add text-generation parameters (#1393) (Wauplin, Aug 28, 2024)
- 22c6bae Filter out frozen models from API docs for tasks (#1396) (Wauplin, Aug 29, 2024)
- 4039c7e New api docs suggestions (#1397) (Wauplin, Aug 29, 2024)
- 49e8f67 Add comment header on each task page (#1400) (Wauplin, Aug 30, 2024)
- 20c17d0 Add even more tasks: token classification, translation and zero shot … (Wauplin, Aug 30, 2024)
- 528ea95 regenerate (Wauplin, Aug 30, 2024)
- f267d86 pull from main (Wauplin, Aug 30, 2024)
- ed5e37b coding style (Wauplin, Sep 4, 2024)
- 2e1e64d Update _redirects.yml (osanseviero, Sep 4, 2024)
- bf973e0 Rename all tasks '_' to '-' (#1405) (Wauplin, Sep 4, 2024)
- 2b6f051 Update docs/api-inference/index.md (Wauplin, Sep 5, 2024)
- 92baadc Apply feedback for "new_api_docs" (#1408) (Wauplin, Sep 5, 2024)
- e9eff75 Fixes new docs (#1413) (osanseviero, Sep 12, 2024)
- c65a120 Merge branch 'main' into new_api_docs (Wauplin, Sep 12, 2024)
5 changes: 5 additions & 0 deletions docs/api-inference/_redirects.yml
quicktour: index
detailed_parameters: parameters
parallelism: TODO
usage: getting_started
faq: index
52 changes: 52 additions & 0 deletions docs/api-inference/_toctree.yml
- sections:
- local: index
title: Serverless Inference API
- local: getting_started
title: Getting Started
- local: supported_models
title: Supported Models
- local: rate_limits
title: Rate Limits
title: Getting Started
- sections:
- local: parameters
title: Parameters
- sections:
- local: tasks/audio_classification
title: Audio Classification
- local: tasks/automatic_speech_recognition
title: Automatic Speech Recognition
- local: tasks/chat_completion
title: Chat Completion
- local: tasks/feature_extraction
title: Feature Extraction
- local: tasks/fill_mask
title: Fill Mask
- local: tasks/image_classification
title: Image Classification
- local: tasks/image_segmentation
title: Image Segmentation
- local: tasks/image_to_image
title: Image to Image
- local: tasks/object_detection
title: Object Detection
- local: tasks/question_answering
title: Question Answering
- local: tasks/summarization
title: Summarization
- local: tasks/table_question_answering
title: Table Question Answering
- local: tasks/text_classification
title: Text Classification
- local: tasks/text_generation
title: Text Generation
- local: tasks/text_to_image
title: Text to Image
- local: tasks/token_classification
title: Token Classification
- local: tasks/translation
title: Translation
- local: tasks/zero_shot_classification
title: Zero Shot Classification
title: Detailed Task Parameters
title: API Reference
78 changes: 78 additions & 0 deletions docs/api-inference/getting_started.md
# Getting Started

The Serverless Inference API allows you to easily run inference on a wide range of models and tasks. You can make requests with your favorite tools (Python, cURL, etc.), and we also provide a Python SDK (`huggingface_hub`) to make it even easier.

We'll walk through a minimal example using a [sentiment classification model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest). See the [API Reference](./parameters.md) for task-specific parameters and further documentation.

## Getting a Token

Using the Serverless Inference API requires passing a user token in the request headers. You can get a token by signing up on the Hugging Face website and then going to the [tokens page](https://huggingface.co/settings/tokens). We recommend creating a `Fine-grained` token with the scope to `Make calls to the serverless Inference API`.

TODO: add screenshot
For more details about user tokens, check out [this guide](https://huggingface.co/docs/hub/en/security-tokens).
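As a quick sanity check, here is a small sketch (ours, not part of the official docs) of how the token is turned into the `Authorization` header used in every request below; the commented-out `whoami` call requires network access and a real token:

```python
def auth_headers(token: str) -> dict:
    """Return the request headers expected by the Serverless Inference API."""
    return {"Authorization": f"Bearer {token}"}

print(auth_headers("hf_***"))

# To verify a token end-to-end (requires `pip install huggingface_hub` and network):
# from huggingface_hub import whoami
# print(whoami(token="hf_***")["name"])
```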

## cURL

```bash
curl https://api-inference.huggingface.co/models/cardiffnlp/twitter-roberta-base-sentiment-latest \
-X POST \
-d '{"inputs": "Today is a nice day"}' \
-H "Authorization: Bearer hf_***" \
-H "Content-Type: application/json"
```

## Python

You can use the `requests` library to make a request to the Inference API.

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/cardiffnlp/twitter-roberta-base-sentiment-latest"
headers = {"Authorization": "Bearer hf_***"}

payload = {"inputs": "Today is a nice day"}
response = requests.post(API_URL, headers=headers, json=payload)
response.json()
```
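For text classification, the returned JSON is typically a list of label/score dictionaries per input (the exact shape below is an assumption for illustration, not taken from the API docs). A small sketch for picking the top label:

```python
def top_label(result):
    """Pick the highest-scoring label from a text-classification response.

    Assumes the API returns either [[{"label": ..., "score": ...}, ...]]
    (one inner list per input) or a flat list of label/score dicts.
    """
    scores = result[0] if result and isinstance(result[0], list) else result
    return max(scores, key=lambda s: s["score"])["label"]

sample = [[{"label": "positive", "score": 0.95},
           {"label": "neutral", "score": 0.04},
           {"label": "negative", "score": 0.01}]]
print(top_label(sample))  # positive
```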

Hugging Face also provides an [`InferenceClient`](https://huggingface.co/docs/huggingface_hub/guides/inference) that handles inference, caching, async requests, and more. Make sure to install it first with `pip install huggingface_hub`:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="cardiffnlp/twitter-roberta-base-sentiment-latest", token="hf_***")
client.text_classification("Today is a nice day")
```

## JavaScript

```js
import fetch from "node-fetch";

async function query(data) {
const response = await fetch(
"https://api-inference.huggingface.co/models/cardiffnlp/twitter-roberta-base-sentiment-latest",
{
method: "POST",
headers: {
Authorization: `Bearer hf_***`,
"Content-Type": "application/json",
},
body: JSON.stringify(data),
}
);
const result = await response.json();
return result;
}

query({
inputs: "Today is a nice day"
}).then((response) => {
console.log(JSON.stringify(response, null, 2));
});
```

## Next Steps

Now that you know the basics, you can explore the [API Reference](./parameters.md) to learn more about task-specific settings and parameters.
60 changes: 60 additions & 0 deletions docs/api-inference/index.md
# Serverless Inference API

**Instant Access to 800,000+ ML Models for Fast Prototyping**

Explore the most popular models for text, image, speech, and more — all with a simple API request. Build, test, and experiment without worrying about infrastructure or setup.

---

## Why use the Inference API?

The Serverless Inference API offers a fast and free way to explore thousands of models for a variety of tasks. Whether you're prototyping a new application or experimenting with ML capabilities, this API gives you instant access to high-performing models across multiple domains:

* **Text Generation:** Generate and experiment with high-quality responses from large language models, including tool-calling prompts.
* **Image Generation:** Easily create customized images, including with LoRAs for your own styles.
* **Document Embeddings:** Build search and retrieval systems with SOTA embeddings.
* **Classical AI Tasks:** Ready-to-use models for text classification, image classification, speech recognition, and more.

TODO: add some flow chart image

⚡ **Fast and Free to Get Started**: The Inference API is free with higher rate limits for PRO users. For production needs, explore [Inference Endpoints](https://ui.endpoints.huggingface.co/) for dedicated resources, autoscaling, advanced security features, and more.

---

## Key Benefits

- 🚀 **Instant Prototyping:** Access powerful models without setup.
- 🎯 **Diverse Use Cases:** One API for text, image, and beyond.
- 🔧 **Developer-Friendly:** Simple requests, fast responses.

---

## Main Features

* Leverage over 800,000 models from different open-source libraries (transformers, sentence-transformers, adapter-transformers, diffusers, timm, etc.).
* Use models for a variety of tasks, including text generation, image generation, document embeddings, NER, summarization, image classification, and more.
* Accelerate your prototyping by using GPU-powered models.
* Run very large models that are challenging to deploy in production.
* Production-grade platform without the hassle: built-in automatic scaling, load balancing and caching.

---

## Contents

The documentation is organized into two sections:

* **Getting Started:** Learn the basics of how to use the Inference API.
* **API Reference:** Dive into task-specific settings and parameters.

---

## Looking for custom support from the Hugging Face team?

<a target="_blank" href="https://huggingface.co/support">
<img alt="HuggingFace Expert Acceleration Program" src="https://cdn-media.huggingface.co/marketing/transformers/new-support-improved.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
</a><br>

## Hugging Face is trusted in production by over 10,000 companies

<img class="block dark:hidden !shadow-none !border-0 !rounded-none" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/inference-api/companies-light.png" width="600">
<img class="hidden dark:block !shadow-none !border-0 !rounded-none" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/inference-api/companies-dark.png" width="600">
154 changes: 154 additions & 0 deletions docs/api-inference/parameters.md
# Parameters

Table with
- Domain
- Task
- Whether it's supported in Inference API
- Supported libraries (not sure)
- Recommended model
- Link to model specific page



## Additional Options

### Caching

There is a cache layer on the Inference API to speed up requests when the inputs are exactly the same. Many models, such as classifiers and embedding models, are deterministic, so their cached results can be reused as is. However, if you use a nondeterministic model, you can disable the cache mechanism, forcing a genuinely new query each time.

To do this, add `x-use-cache: false` to the request headers. For example:

<inferencesnippet>

<curl>
```diff
curl https://api-inference.huggingface.co/models/MODEL_ID \
-X POST \
-d '{"inputs": "Can you please let us know more details about your "}' \
-H "Authorization: Bearer hf_***" \
-H "Content-Type: application/json" \
+ -H "x-use-cache: false"
```
</curl>

<python>
```diff
import requests

API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
"Authorization": "Bearer hf_***",
"Content-Type": "application/json",
+ "x-use-cache": "false"
}
data = {
"inputs": "Can you please let us know more details about your "
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())
```

</python>

<js>
```diff
import fetch from "node-fetch";

async function query(data) {
const response = await fetch(
"https://api-inference.huggingface.co/models/MODEL_ID",
{
method: "POST",
headers: {
Authorization: `Bearer hf_***`,
"Content-Type": "application/json",
+ "x-use-cache": "false"
},
body: JSON.stringify(data),
}
);
const result = await response.json();
return result;
}

query({
inputs: "Can you please let us know more details about your "
}).then((response) => {
console.log(JSON.stringify(response, null, 2));
});

```

</js>

</inferencesnippet>

### Wait for the model

When a model is warm, it is ready to be used and you will get a response relatively quickly. However, some models are cold and need to be loaded before they can be used; in that case, you will get a 503 error. Rather than repeatedly polling until the model is loaded, you can wait for it by adding `x-wait-for-model: true` to the request headers. We suggest using this flag only once you are sure the model is cold: first try the request without it, and only if you get a 503 error, retry with the flag.


<inferencesnippet>

<curl>
```diff
curl https://api-inference.huggingface.co/models/MODEL_ID \
-X POST \
-d '{"inputs": "Can you please let us know more details about your "}' \
-H "Authorization: Bearer hf_***" \
-H "Content-Type: application/json" \
+ -H "x-wait-for-model: true"
```
</curl>

<python>
```diff
import requests

API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
"Authorization": "Bearer hf_***",
"Content-Type": "application/json",
+ "x-wait-for-model": "true"
}
data = {
"inputs": "Can you please let us know more details about your "
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())
```

</python>

<js>
```diff
import fetch from "node-fetch";

async function query(data) {
const response = await fetch(
"https://api-inference.huggingface.co/models/MODEL_ID",
{
method: "POST",
headers: {
Authorization: `Bearer hf_***`,
"Content-Type": "application/json",
+ "x-wait-for-model": "true"
},
body: JSON.stringify(data),
}
);
const result = await response.json();
return result;
}

query({
inputs: "Can you please let us know more details about your "
}).then((response) => {
console.log(JSON.stringify(response, null, 2));
});

```

</js>

</inferencesnippet>
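The try-without-flag-then-retry logic described above can be sketched in Python (the helper names here are ours, not part of the API):

```python
import requests

def with_wait_header(headers: dict) -> dict:
    """Return a copy of `headers` with x-wait-for-model enabled."""
    return {**headers, "x-wait-for-model": "true"}

def query_with_cold_start(api_url: str, headers: dict, payload: dict):
    """First try without the flag; on a 503 (cold model), retry once,
    blocking until the model has loaded."""
    response = requests.post(api_url, headers=headers, json=payload)
    if response.status_code == 503:
        response = requests.post(
            api_url, headers=with_wait_header(headers), json=payload
        )
    return response
```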
11 changes: 11 additions & 0 deletions docs/api-inference/rate_limits.md
# Rate Limits

The Inference API has rate limits based on the number of requests. These rate limits may change in the future to become compute-based or token-based.

The Serverless API is not meant for heavy production applications. If you need higher rate limits, consider [Inference Endpoints](https://huggingface.co/docs/inference/endpoints) to have dedicated resources.

| User Tier | Rate Limit |
|---------------------|---------------------------|
| Unregistered Users | 1 request per hour |
| Signed-up Users | 300 requests per hour |
| PRO and Enterprise Users | 1000 requests per hour |
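When you exceed your limit, the API typically responds with an HTTP 429 error (this status code and the retry strategy below are our assumption, not an official client); a simple client-side approach is exponential backoff:

```python
import time
import requests

def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... capped."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]

def post_with_backoff(api_url: str, headers: dict, payload: dict):
    """POST to the API, sleeping and retrying whenever a 429 is returned."""
    for delay in backoff_delays():
        response = requests.post(api_url, headers=headers, json=payload)
        if response.status_code != 429:
            return response
        time.sleep(delay)
    return response  # last attempt, possibly still rate limited
```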