Prompt Chain

Enhance Function Calling Accuracy with Fine-Tuning

Fine-tune AI models to significantly improve function calling accuracy and reduce errors, especially when dealing with complex tasks and multiple functions.

Works with openai

91
Spark score
out of 100
Updated 3 months ago
Version 1.0.0

Add to Favorites

Why it matters

Improve the reliability and accuracy of AI models for function calling, especially in complex scenarios with numerous functions. This asset helps reduce errors and token usage by fine-tuning models on custom datasets.

Outcomes

What it gets done

01

Evaluate baseline function calling performance.

02

Generate synthetic training data using a more capable model.

03

Fine-tune a model for improved function calling.

04

Evaluate the performance of the fine-tuned model.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-finetuningforfunctioncalling | bash

Steps

Steps in the chain

01
Improve function definitions and prompt engineering

Before fine tuning for function calling, begin with: Improvements to the function definitions. Make them more clear, and more distinct from one another. Experiment with prompt engineering: often a more detailed prompt can help the model call the correct function.

02
Assess baseline function calling performance

Evaluate an out-of-the-box gpt-3.5-turbo model on your given functions. Test with straightforward feasible prompts and impossible requests to establish baseline performance metrics.

03
Define utility functions for API calls

Define utility functions for making calls to the Chat Completions API, one to get the completion and one to get the function call.

04
Generate all possible function invocations

Generate every invocation of every function to have full coverage of all potential invocations. Account for required parameters and handle functions with fixed enums and integer values. Use placeholders for values that will be filled later.

05
Generate prompts for each function invocation

Use gpt-4o to generate realistic prompts that would result in each function invocation. This creates prompt-function invocation pairs for training data.

06
Generate rejection prompts

Use gpt-4o to generate prompts that are nearly possible but should result in the reject_request function being called. These are requests related to but not quite possible with the given functions.

07
Format training data

Format all training examples properly according to OpenAI's fine-tuning data formatting requirements for function calling. Combine all training examples together.

08
Create and run fine-tuning job

Kick off the fine-tuning job using the formatted training data. You can also list existing jobs, retrieve job status, or cancel jobs as needed.

09
Evaluate fine-tuned model performance

Test the fine-tuned model on your evaluation set, particularly on prompts that should be rejected. Compare rejection rates and token usage against the baseline model.

Overview

Fine tuning with function-calling

What it does

This notebook covers how to fine-tune to increase function calling accuracy and reliability. You can find more information on function calling [here](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_call_functions_with_chat_models.ipynb), and on fine tuning [here](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_finetune_chat_models.ipynb)

Source README

Fine tuning with function-calling

This notebook covers how to fine-tune to increase function calling accuracy and reliability. You can find more information on function calling here, and on fine tuning here

For context, from the function calling notebook above:

tools is an optional parameter in the Chat Completion API which can be used to provide function specifications. The purpose of this is to enable models to generate function arguments which adhere to the provided specifications. Note that the API will not actually execute any function calls. It is up to developers to execute function calls using model outputs.

Function calling is a very powerful tool when it functions as intended. However, we have seen that as the number of functions increases, and the complexity of the task at hand increases, function calling becomes less accurate (e.g.: more hallucinated invocations, and incorrect invocations).

Before fine tuning for function calling, it's best to begin with:

  • Improvements to the function definitions. Make them more clear, and more distinct from one another.
  • Experiment with prompt engineering: often a more detailed prompt can help the model call the correct function.

If the steps above fail to improve function calling to a satisfactory level, then you can try fine tuning for function calling.

Overview

This notebook contains three sections

  • Assessing baseline function calling performance: Evaluating an out-of-the-box gpt-3.5-turbo model on our given function (let's assume that for latency + cost reasons we cannot use gpt-4o for a drone copilot)
  • Generating synthetic data: Using gpt-4o to create 'golden' set of prompts and function invocations to use as training data
  • Fine-tuning: Running the fine tuning job, and evaluating the fine-tuned model

Note: This notebook provides an example of how to create synthetic training data for fine tuning for function calling given just a list of functions. While real-world production test evals are preferable, this method produces strong results and can be used in conjunction with real-world training data.

Getting baseline function calling performance

Utilities

Let's define utility functions for making calls to the Chat Completions API, one to get the completion and one to get the function call.

Baseline testing

Let's build an intelligent drone co-pilot. We want to be able to give the co-pilot commands, and have it either call the function
for that command, or deny that request if the command is unfeasible.
We can first define a system prompt for the copilot.

Now let's define functions for all of the actions the copilot can take.

For starters, let's see how function calling performs with some straight forward feasible prompts, and then couple of obviously impossible requests which call the 'reject_request' function.

Nice! The model performs quite well with these requests. Now let's try some more difficult requests: requests that are almost feasible and are drone-related, but that the drone cannot actually do, and the pilot should reject.

Now we run into some problems.
The model here should reject all of these requests, as they are impossible/conflicting/ambiguous given the functions, however instead the model calls functions that are somewhat related to the request, but incorrect. For example, the model sets follow_me_mode when asked to initiate following on social media.


In this simple case, more prompt engineering may resolve some of these issues, but for the purpose of this example we will demonstrate how fine tuning can be used to improve performance. Additionally, while this case is relatively straightforward, as the number of and complexity of the functions increases, fine tuning becomes more and more impactful.

Again, our goal here is to improve performance and use less tokens, so fine-tuning allows us to:

  • Omit function and parameter descriptions: remove the description field from function and parameters
  • Omit parameters: remove the entire properties field from the parameters object
  • Omit function entirely: remove the entire function object from the functions array

Generating synthetic data

Helper functions

We want to generate every invocation of every function, so that we have
full coverage of all potential invocations to create synthetic data for. Then, we will use gpt-4o to come up with prompts that would call each invocation, and we will use that prompt - function invocation pair as training data.

Generating every invocation for a function with fixed enums is more simple, but for a function such as
control_gimbal we need to set the tilt and pan integer values, so to generate those synthetic invocations we will first set a placeholder, and then later use gpt-4o to come up with reasonable values.

The functions below take in all the functions from the function list, and look
at all the potential invocations of those functions given each function's parameters.
The functions also account for required parameters, so that all the invocations
are actually feasible.

Let's generate every invocation for every function first

Prompts:

In the below snippet, we generate the invocation of each function except for the reject_request function.

To perform effective fine-tuning we need correctly labeled data. We could manually come up with examples and label the data,
or we can generate synthetic data with the help of gpt-4o

Empirically, gpt-4o needs a bit more help to get good realistic examples of prompts that would generate the reject_request function, so we'll do that next...

Now that we have all the invocations, let's use gpt-4o to generate prompts that would result in those invocations

Now let's format the training examples properly. For more documentation on the proper training data formatting for fine tuning for function calling, see here: https://platform.openai.com/docs/guides/fine-tuning/fine-tuning-examples

Now, back to the rejection function. Let's generate some prompts that are nearly possible, but should result in the reject_request function being called. To do so, we queried gpt-4o asking for requests that are related to, but not quite possible with, the given list of functions.

Now combine all the training examples together

Fine tuning

Finally, we can kick off the fine-tuning job

In addition to creating a fine-tuning job, you can also list existing jobs, retrieve the status of a job, or cancel a job.

After a fine-tuning job has finished, you can also see metrics around how the training process went by querying a fine-tuning job, extracting a file ID from the result_files, and then retrieving that files content. Each results CSV file has the following columns: step, train_loss, train_accuracy, valid_loss, and valid_mean_token_accuracy. While metrics can he helpful, evaluating samples from the fine-tuned model provides the most relevant sense of model quality.

Evaluations

Great! We trained a fine-tuned model for function calling. Let's see how it does on our evaluation set for prompts that the drone assistant
should automatically reject.

Great! While the original model only rejected 60%, the fine tuned model rejected 100% requests and used less tokens to do so.

Conclusion

Congratulations! You are now ready to fine tune your model for function calling. We can't wait to see what you build.

Step 1: Improve function definitions and prompt engineering

Before fine tuning for function calling, begin with: Improvements to the function definitions. Make them more clear, and more distinct from one another. Experiment with prompt engineering: often a more detailed prompt can help the model call the correct function.

Step 2: Assess baseline function calling performance

Evaluate an out-of-the-box gpt-3.5-turbo model on your given functions. Test with straightforward feasible prompts and impossible requests to establish baseline performance metrics.

Step 3: Define utility functions for API calls

Define utility functions for making calls to the Chat Completions API, one to get the completion and one to get the function call.

Step 4: Generate all possible function invocations

Generate every invocation of every function to have full coverage of all potential invocations. Account for required parameters and handle functions with fixed enums and integer values. Use placeholders for values that will be filled later.

Step 5: Generate prompts for each function invocation

Use gpt-4o to generate realistic prompts that would result in each function invocation. This creates prompt-function invocation pairs for training data.

Step 6: Generate rejection prompts

Use gpt-4o to generate prompts that are nearly possible but should result in the reject_request function being called. These are requests related to but not quite possible with the given functions.

Step 7: Format training data

Format all training examples properly according to OpenAI's fine-tuning data formatting requirements for function calling. Combine all training examples together.

Step 8: Create and run fine-tuning job

Kick off the fine-tuning job using the formatted training data. You can also list existing jobs, retrieve job status, or cancel jobs as needed.

Step 9: Evaluate fine-tuned model performance

Test the fine-tuned model on your evaluation set, particularly on prompts that should be rejected. Compare rejection rates and token usage against the baseline model.

Discussion

Questions & comments · 0

Sign In Sign in to leave a comment.