OpenAI GPT’s Chat Completion API Parameters 101

Jericho Siahaya
8 min read · Nov 30, 2023

Once upon a time, there was a boy who tried to build an AI-based application using OpenAI’s GPT service. He knew he had a brilliant idea. However, the execution was terrible, and the boy ran into problems with the large language model (LLM), resulting in unwanted output. He wanted the output to be exactly the way he envisioned it: concise, straightforward, with minimal repetition. Even after trying many different prompts, the output sometimes came close to what he wanted and sometimes was far from perfect.

So, what could the boy do to tackle this problem? Besides prompt design, which is essential for every good output, there is another way to steer the LLM toward the desired output.

GPT’s chat completion API actually has a few parameters that can be used to adjust the model’s behavior. Many people, including myself, tend to skip these parameters because they seem hard to understand. However, after taking another look and making a genuine effort to understand them, here is my explanation of some of these parameters.

  • max_tokens

max_tokens is simply the maximum number of tokens the model is allowed to generate for its reply. You can use this setting to make GPT stop creating an answer after a specific number of tokens; when the limit cuts the answer short, the response’s finish_reason will be "length". The default is 4096 tokens.

curl  -X POST \
'https://domain.com/openai/deployments/{deployment_name}/chat/completions?api-version={your-api-version}' \
--header 'Accept: */*' \
--header 'api-key: your-api-key' \
--header 'Content-Type: application/json' \
--data-raw '{
"messages": [
{
"role": "user",
"content": "Tell me a story about Indonesian rendang."
}
],
"max_tokens": 50
}'
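
Note that max_tokens counts tokens, not words. If you want to know how many tokens a prompt uses before choosing a budget, here is a minimal sketch using the tiktoken library (an assumption on my part: it needs to be installed separately, and the prompt string is just the one from the example above):

# A minimal sketch of counting tokens with the tiktoken library,
# so you can pick a sensible max_tokens budget.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "Tell me a story about Indonesian rendang."
tokens = encoding.encode(prompt)

print(len(tokens))   # number of tokens the prompt itself uses
print(tokens[:5])    # the first few token IDs
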
  • temperature

temperature is how you adjust the model’s level of creativity, from straightforward to imaginative. The default is 1, and it can be set between 0 and 2. If you’ve crafted a prompt for a simple YES or NO output, you can lower the temperature to 0.2 to stick to those responses. If you want more creative answers, you can set it higher, like 0.8.

Temperature affects the sampling choice of the next predicted token. In short, the lower the temperature, the more deterministic the results: the most probable next token is chosen more and more consistently (at 0 it is essentially always picked). Increasing the temperature adds randomness, encouraging more diverse or creative outputs.

curl  -X POST \
'https://domain.com/openai/deployments/{deployment_name}/chat/completions?api-version={your-api-version}' \
--header 'Accept: */*' \
--header 'api-key: your-api-key' \
--header 'Content-Type: application/json' \
--data-raw '{
"messages": [
{
"role": "user",
"content": "Tell me a story about Indonesian rendang."
}
],
"max_tokens": 50,
"temperature": 0.0
}'
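
To make the effect more concrete, here is a rough sketch of what temperature does under the hood: the model’s logits are divided by the temperature before the softmax, which sharpens or flattens the distribution. The three logit values below are made up purely for illustration:

# A rough sketch of how temperature rescales the next-token distribution.
import math

def softmax_with_temperature(logits, temperature):
    # Divide every logit by the temperature, then apply a standard softmax.
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # ~[0.99, 0.01, 0.00] -> almost deterministic
print(softmax_with_temperature(logits, 1.0))  # the model's "normal" distribution
print(softmax_with_temperature(logits, 2.0))  # flatter -> more random choices
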
  • top_p

top_p, or nucleus sampling, serves a similar purpose to temperature in language models. Top-p sampling selects tokens from the smallest set whose cumulative probability surpasses a given threshold, denoted as ‘p’. Rather than sampling from the entire distribution of possible tokens, it focuses on the subset with the highest probabilities. The default value is 1, and setting top_p to 0.1 means only the tokens making up the top 10% of probability mass are considered. It is generally not recommended to tune temperature and top_p at the same time.
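
To illustrate the idea, here is a toy sketch of nucleus sampling over a hypothetical distribution (the token names and probabilities are invented; a real model does this across its entire vocabulary, but the mechanics are the same):

# A toy sketch of nucleus (top-p) sampling over a made-up distribution.
import random

def nucleus_sample(token_probs, p):
    # Sort candidate tokens from most to least probable.
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # stop once the nucleus covers probability mass p
            break
    # Renormalize within the nucleus and sample from it.
    total = sum(prob for _, prob in nucleus)
    tokens = [t for t, _ in nucleus]
    weights = [prob / total for _, prob in nucleus]
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"rendang": 0.55, "rice": 0.25, "sate": 0.12, "noodles": 0.08}  # made-up values
print(nucleus_sample(probs, p=0.1))  # only "rendang" qualifies -> always chosen
print(nucleus_sample(probs, p=0.9))  # "rendang", "rice", and "sate" are all candidates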

  • frequency_penalty

frequency_penalty helps reduce the chance of repeatedly sampling the same sequences of tokens. This parameter instructs the language model not to use the same words too frequently. For example, if the model has already used ‘awesome’ in the current output, it will try to avoid using that word again in the following tokens. The mechanism works by directly adjusting the logits (un-normalized log-probabilities): a penalty proportional to how often the token has already appeared is subtracted from its logit.

For instance, suppose the model produces the word “rice” for the first time, and the log-probability for that word is, let’s say, -1.0. The next time “rice” is a candidate, the frequency_penalty (e.g., 0.5) is subtracted from its log-probability, so it becomes -1.0 - 0.5 = -1.5. This drop in log-probability discourages the model from using the same word too often in the generated text.

frequency_penalty is a contribution that is proportional to how often a particular token has already been sampled.

The default value is 0, the accepted range is -2.0 to 2.0, and suitable penalty coefficients typically range from 0.1 to 1. A higher value means the model will be less inclined to reuse the same words, while a lower (or negative) value implies a greater likelihood of repetitive word usage.
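
Here is a rough sketch of that adjustment with hypothetical numbers, matching the “rice” example above:

# A rough sketch of the frequency penalty adjustment described above.
# The logit values are hypothetical; the real model works over its whole vocabulary.
def apply_frequency_penalty(logit, times_already_used, frequency_penalty):
    # The more often a token has already been sampled, the more is subtracted.
    return logit - times_already_used * frequency_penalty

print(apply_frequency_penalty(-1.0, 0, 0.5))  # -1.0 -> "rice" not used yet, no change
print(apply_frequency_penalty(-1.0, 1, 0.5))  # -1.5 -> used once, less likely now
print(apply_frequency_penalty(-1.0, 3, 0.5))  # -2.5 -> heavy repetition, strongly discouraged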

curl  -X POST \
'https://domain.com/openai/deployments/{deployment_name}/chat/completions?api-version={your-api-version}' \
--header 'Accept: */*' \
--header 'api-key: your-api-key' \
--header 'Content-Type: application/json' \
--data-raw '{
"messages": [
{
"role": "user",
"content": "Tell me a story about Indonesian rendang."
}
],
"max_tokens": 50,
"temperature": 0.0,
"frequency_penalty": 1.0
}'
  • presence_penalty

presence_penalty can be used to encourage the model to use a diverse range of tokens in the generated text. It nudges the language model toward words it has not used yet, promoting variety in the outputs. Imagine you have a friend who likes to say the word “banana” a lot. The first time they say “banana,” nothing happens. But once the word has been said, you decide that from now on “banana” gets a small deduction every time it comes up as an option.

Say the word’s original score is -1.0 and you pick a penalty of 0.2. After “banana” has been said at least once, its score becomes -1.0 - 0.2 = -1.2 whenever it is considered again.

Crucially, the deduction does not grow the more your friend says it: whether they have said “banana” twice or twenty times, the same flat 0.2 is taken off (that is what distinguishes it from frequency_penalty). Words that have never been used keep their full scores, so reaching for something new becomes relatively more attractive.

presence penalty is a one-off additive contribution that applies to all tokens that have been sampled at least once.

A higher presence_penalty value will result in the model being more likely to generate tokens that have not yet been included in the generated text.
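
To contrast the two penalties, here is a rough sketch that loosely follows the adjustment described in OpenAI’s documentation: both values are subtracted from a token’s logit, but only the frequency term grows with the repetition count (the numbers below are hypothetical):

# A rough sketch contrasting frequency_penalty and presence_penalty.
# Both are subtracted from the token's logit; only the frequency term
# grows with the repetition count, while the presence term is a flat,
# one-off amount applied once the token has appeared at all.
def adjust_logit(logit, count, frequency_penalty, presence_penalty):
    frequency_term = count * frequency_penalty             # grows with every repetition
    presence_term = presence_penalty if count > 0 else 0   # flat, once the token has appeared
    return logit - frequency_term - presence_term

# Hypothetical logit of -1.0 for the token "banana":
print(adjust_logit(-1.0, 0, 0.5, 0.8))  # -1.0 -> never used, untouched
print(adjust_logit(-1.0, 1, 0.5, 0.8))  # -2.3 -> both penalties kick in
print(adjust_logit(-1.0, 4, 0.5, 0.8))  # -3.8 -> only the frequency term keeps growing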

curl  -X POST \
'https://domain.com/openai/deployments/{deployment_name}/chat/completions?api-version={your-api-version}' \
--header 'Accept: */*' \
--header 'api-key: your-api-key' \
--header 'Content-Type: application/json' \
--data-raw '{
"messages": [
{
"role": "user",
"content": "Tell me a story about Indonesian rendang."
}
],
"max_tokens": 50,
"temperature": 0.0,
"frequency_penalty": 1.0,
"presence_penalty": 0.8
}'
  • logit_bias

The logit_bias parameter acts as a tool to discourage the model from generating specific, unwanted tokens or words (or, with positive values, to push it toward chosen ones). By using this parameter, you can control or exclude certain words in GPT’s output. It takes a JSON object that maps token IDs to bias values ranging from -100 to 100. Essentially, this bias is mathematically added to the logits produced by the model before the sampling process.

In short, assigning -100 to a selected token will effectively ban that token from being generated, while assigning 100 will ensure an exclusive selection of that token in the generated output.
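
The token IDs in the example below are specific to the model’s tokenizer, so you should compute them yourself rather than copy mine. Here is a sketch of how you might build a logit_bias map with the tiktoken library (again assuming it is installed):

# A sketch of looking up token IDs for a logit_bias map with tiktoken.
# The exact IDs depend on the model's tokenizer, so always verify them
# for the model you are actually calling.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
phrase = "Once upon a time"
token_ids = encoding.encode(phrase)

# Build a bias map that strongly favors every token in the phrase.
# The API expects the JSON keys to be strings.
logit_bias = {str(token_id): 100 for token_id in token_ids}
print(logit_bias)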

curl  -X POST \
'https://domain.com/openai/deployments/{deployment_name}/chat/completions?api-version={your-api-version}' \
--header 'Accept: */*' \
--header 'api-key: your-api-key' \
--header 'Content-Type: application/json' \
--data-raw '{
"messages": [
{
"role": "user",
"content": "Tell me a story about Indonesian rendang."
}
],
"max_tokens": 50,
"temperature": 0.0,
"presence_penalty": 2.0,
"frequency_penalty": 1.0,
"logit_bias": {
"12805": 100,
"5304": 100,
"264": 100,
"892": 100
}
}'

Response:

{
"id": "",
"object": "chat.completion",
"created": ,
"model": "gpt-35-turbo",
"choices": [
{
"index": 0,
"finish_reason": "length",
"message": {
"role": "assistant",
"content": "Once upon a time a a time a a time a time a time a time a time a time a time a time a time a time a time a time upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon a"
}
}
],
"usage": {
"prompt_tokens": 17,
"completion_tokens": 50,
"total_tokens": 67
}
}

See, if you set the bias to 100 for the tokens that make up the phrase “once upon a time,” it leads to an almost exclusive selection of that particular set of tokens in the generated output.

  • seed

seed is still a beta feature, but it allows you to obtain (mostly) consistent results for the same input submitted to GPT. For instance, if you ask GPT to narrate a story about Indonesian rendang, the generated story may vary each time you ask. Passing a seed prompts the system to make its best attempt at deterministic sampling, so repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, though; the response includes a system_fingerprint field you can watch to detect backend changes that may affect it.

curl  -X POST \
'https://domain.com/openai/deployments/{deployment_name}/chat/completions?api-version={your-api-version}' \
--header 'Accept: */*' \
--header 'api-key: your-api-key' \
--header 'Content-Type: application/json' \
--data-raw '{
"messages": [
{
"role": "user",
"content": "Tell me a story about Indonesian rendang."
}
],
"max_tokens": 50,
"temperature": 0.0,
"presence_penalty": 2.0,
"frequency_penalty": 1.0,
"logit_bias": {
"12805": 100,
"5304": 100,
"264": 100,
"892": 100
},
"seed": 123456789
}'
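
If you prefer the Python client, here is a sketch of how you might sanity-check the seed behaviour by sending the same request twice. The API key and model name are placeholders for your own setup, and this assumes the openai package (v1 or later) is installed:

# A sketch of checking seed behaviour with the openai Python package (v1 client).
from openai import OpenAI

client = OpenAI(api_key="your-api-key")
messages = [{"role": "user", "content": "Tell me a story about Indonesian rendang."}]

first = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=messages, max_tokens=50, seed=123456789
)
second = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=messages, max_tokens=50, seed=123456789
)

# With the same seed and parameters the two replies should usually match,
# but determinism is best-effort, not guaranteed.
print(first.choices[0].message.content == second.choices[0].message.content)
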
  • function_call

The function_call parameter lets you describe functions the model can ask for before it generates an answer to the user’s input (in newer API versions it has been superseded by tools and tool_choice, which the example below uses). A common example is a weather application where the model lacks current weather data. In such cases, you gather information from a weather API, incorporate that data into the prompt context, and then let the model generate an answer using that context.


For instance, if the user input is “What is the weather like in Jakarta?” the language model identifies “Jakarta” as the location argument. The model does not call the weather service itself; it returns a structured function call (get_current_weather with location=Jakarta), and your application performs the actual request, maybe something like weather.api.com?location=Jakarta. Importantly, you don’t have to explicitly control when the language model should call the weather function; with tool_choice set to auto it recognizes on its own when the input is asking for weather information about a specific location.

curl https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "user",
"content": "What is the weather like in Jakarta?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}'

Response:

{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1699896916,
"model": "gpt-3.5-turbo-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\n\"location\": \"Jakarta\"\n}"
}
}
]
},
"finish_reason": "tool_calls",
}
],
"usage": {
"prompt_tokens": 82,
"completion_tokens": 17,
"total_tokens": 99
}
}
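
The response above does not contain an answer yet; the model is asking you to run the function. Here is a sketch of the full round trip with the openai Python package (v1 client). get_current_weather is a hypothetical stand-in that returns hard-coded data; in a real app you would call an actual weather service:

# A sketch of handling a tool call end to end with the openai Python package.
import json
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def get_current_weather(location, unit="celsius"):
    # Hypothetical stand-in: a real app would query a weather API here.
    return json.dumps({"location": location, "temperature": 31, "unit": unit})

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather like in Jakarta?"}]
response = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=messages, tools=tools, tool_choice="auto"
)

# The model asks for a tool call; run the function ourselves.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)  # e.g. {"location": "Jakarta"}

# Hand the result back to the model so it can answer with that context.
messages.append(response.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": get_current_weather(**args),
})

final = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(final.choices[0].message.content)  # answer grounded in the weather data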
