Home
When a model is loaded in local.ai, a server is started (by default on port 8000) that can act as a text completion endpoint, closely mirroring OpenAI's own completion API. This allows external applications and tools to send prompts to an LLM and receive its responses.
The available APIs are:
- POST /v1/completions: Similar to the instruct completion API; supports only streaming SSE requests. See https://github.com/louisgv/local.ai/issues/75
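As a rough illustration of calling this endpoint, the sketch below builds a POST request and reads the server-sent events from the response using only the Python standard library. It assumes the server is reachable at `http://localhost:8000`, that the body is JSON, and that `stream` is sent as a JSON boolean; the helper names (`build_request`, `iter_sse_data`, `stream_completion`) are hypothetical, and the content of each event payload is not specified here, so it is returned as raw text.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Assumption: the server runs locally on the default port.
BASE_URL = "http://localhost:8000"


def build_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /v1/completions with streaming enabled."""
    # Assumption: the body is JSON and `stream` is a boolean.
    body = json.dumps({"prompt": prompt, "stream": True}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
        method="POST",
    )


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the data payload of each server-sent event from raw response lines."""
    buffer = []
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # In the SSE format, one leading space after the colon is dropped.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line terminates the current event.
            yield "\n".join(buffer)
            buffer = []
    # Lenient: emit a trailing event even if the final blank line is missing.
    if buffer:
        yield "\n".join(buffer)


def stream_completion(prompt: str) -> Iterator[str]:
    """Send the request and yield event payloads as they arrive.

    Requires a running local.ai server with a model loaded.
    """
    with urllib.request.urlopen(build_request(prompt)) as response:
        yield from iter_sse_data(response)
```

With a server running, `for chunk in stream_completion("Hello"): print(chunk)` would print each event payload as it arrives. The parsing is kept separate from the network call so it can be checked against canned input.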
Responses from local.ai's completion API can be adjusted using the parameters below:
- prompt: The prompt, including template, that the model is to complete or answer. To add a new line in your prompt, use \\n.
- sampler: A string which controls selection of the most likely token.
- stream: Specifies whether to stream the response back in chunks or to receive the complete response at once. When set to "true", the response is streamed as a series of messages, each containing a partial completion.
- max_tokens: Specifies the maximum number of tokens used in the response, controlling its length. The model stops generating tokens once this limit is reached, even if the completion is not finished.
- seed: An optional seed value used to generate pseudo-random numbers in the model. Providing the same seed produces the same completion result, allowing for reproducibility.
- temperature: Controls the randomness of the model's output. A higher temperature value (e.g., 0.8) makes the output more diverse and creative, while a lower value (e.g., 0.2) makes it more focused and deterministic.
- top_k: Controls the number of top tokens to consider during sampling. It restricts the sampling pool to the k most probable tokens. A smaller value generates more focused and deterministic responses.
- top_p: Limits sampling to the most probable tokens whose cumulative probability reaches the specified threshold. The size of the token pool is therefore determined dynamically, allowing for more coherent responses with varied lengths.
- frequency_penalty: Specifies a penalty factor to discourage the model from repeating the same tokens. A higher penalty value (e.g., 0.8) reduces the likelihood of repetitive completions.
- presence_penalty: Specifies a penalty factor to discourage the model from focusing on specific words or phrases. It encourages the model to explore alternative completions and generate more diverse responses.
- stop_sequences: A list of sequences that, when encountered, immediately stop the generation of tokens in the completion. The model will not produce any tokens beyond the stop sequences.
- stop: Should provide the same function as the stop_sequences option.
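To make the temperature, top_k and top_p descriptions concrete, here is a minimal, generic sketch of how these three controls are commonly applied to a toy token distribution. It is an illustration of the general technique only, not local.ai's actual sampling code; the function names and the example tokens are made up for this sketch.

```python
import math
from typing import Dict, List


def apply_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature (must be > 0)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep only the k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


# Hypothetical next-token distribution used only for illustration.
toy = {"cat": 0.5, "dog": 0.25, "fish": 0.125, "bird": 0.125}
print(sorted(top_k_filter(toy, 2)))    # tokens kept with top_k = 2
print(sorted(top_p_filter(toy, 0.7)))  # tokens kept with top_p = 0.7
```

A lower temperature concentrates probability on the highest-scoring token, while top_k and top_p shrink the pool of candidates before a token is drawn.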
Only the prompt is required; every other parameter listed above is optional.
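One way to take advantage of the optional parameters is to build the request body from only the values you actually set. The sketch below assumes the body is JSON and uses the parameter names from the list above; the value types and the `build_payload` helper are assumptions for illustration, not part of local.ai.

```python
import json

# Parameter names taken from the list above; value types are assumptions.
KNOWN_PARAMETERS = {
    "prompt", "sampler", "stream", "max_tokens", "seed", "temperature",
    "top_k", "top_p", "frequency_penalty", "presence_penalty",
    "stop_sequences", "stop",
}


def build_payload(prompt: str, **options) -> str:
    """Return a JSON body containing the prompt and any options that were set."""
    unknown = set(options) - KNOWN_PARAMETERS
    if unknown:
        # Catch misspelled names early instead of sending them to the server.
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    payload = {"prompt": prompt}
    # Options left as None are omitted so the server's defaults apply.
    payload.update({key: value for key, value in options.items() if value is not None})
    return json.dumps(payload)


body = build_payload("Q: What is an LLM?\\nA:", temperature=0.2, max_tokens=64, seed=None)
print(body)
```

Here `seed=None` is dropped from the body, so only `prompt`, `temperature` and `max_tokens` are sent.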