Build Reliable Agents with Memory and Compaction

Build reliable AI agents for long-running tasks using memory and compaction to manage context and retain workflow insights across sessions.

What you get

  • Implement compaction to handle growing conversation context in long-running agents.
  • Integrate memory to store and reuse workflow lessons from previous agent runs.
  • Develop a sandbox agent for evidence review in compliance investigations.
  • Combine agent tools, compaction, and memory for robust, iterative analysis.

Building Reliable Agents with Memory and Compaction

This Cookbook shows how to build an evidence review agent for a synthetic compliance investigation using the OpenAI Agents SDK.

You will start with a simple sandbox agent, then add two reliability primitives:

  • Compaction lets you support long-running conversations despite finite context windows by carrying forward the state needed for later turns while reducing context size.
  • Memory lets future sandbox-agent runs reuse workflow lessons from prior runs without replaying every previous turn.

The reliability pattern is straightforward: compaction helps the current run continue, memory helps later runs start with useful workflow guidance, and the generated memo remains the human-reviewed source of truth for the investigation.

Use Case: Evidence Review Agent for a Compliance Investigation

A compliance team is investigating whether a vendor exception followed internal policy. The evidence arrives as a small set of files: policy language, exception notes, audit observations, approval records, and remediation plans.

The agent's job is not to become the investigation record. Its job is to help a reviewer move through the evidence, keep track of what changed, and write a concise memo that separates supported findings from open questions.

This makes the example a good fit for demonstrating memory and compaction, because the investigation has three traits that show up in real work:

  • The record changes over time. Later documents may narrow or supersede an earlier assumption.
  • The conversation can become long-running. A reviewer may ask follow-up questions, request revisions, and return to the same work later.
  • The final artifact needs provenance. The memo should cite evidence and preserve uncertainty instead of flattening the review into a confident but unsupported conclusion.

Where This Pattern Applies

Although this notebook uses a compliance review, the same pattern applies anywhere knowledge workers review evolving context and produce a human-auditable artifact.

Good fits include:

  • Customer support teams applying new policy updates to open escalations.
  • Security teams reviewing incident evidence and writing incident summaries.
  • Finance teams reconciling exceptions across policies, approvals, and audit notes.
  • Product teams updating competitive positioning after new launches or model releases.
  • Legal or procurement teams reviewing contracts, emails, and approval histories.
  • M&A teams absorbing new business rules, operating procedures, and diligence notes.

In each case, compaction keeps the active review viable as context grows, memory carries reusable workflow lessons forward, and the final artifact remains the reviewed output.

What You'll Build

The use case is a compliance evidence review. A team receives policy documents, exception notes, audit findings, approvals, and remediation plans over time. The agent helps review the evidence, preserve uncertainty, and produce a concise memo with citations.

You will build:

  1. A synthetic evidence workspace with a folder structure, manifest, and output directory.
  2. A simple SandboxAgent that can inspect files and write a memo.
  3. A compaction checkpoint for long-running work.
  4. SDK memory generation for reusable workflow lessons.
  5. A combined run that uses sandbox tools, compaction, memory, and generated artifacts together.

You can inspect the notebook without making model calls because RUN_AGENT defaults to False. Set RUN_AGENT = True only when you want to execute the live sandbox workflow.

Prerequisites

To run the live agent workflow, you need:

  • Python 3.10 or later.
  • The openai-agents package.
  • An OpenAI API key available as OPENAI_API_KEY.
  • A local Unix-like environment for UnixLocalSandboxClient. The notebook uses synthetic files created from the sandbox Manifest, so no external dataset is required.

The notebook is safe to inspect without credentials because RUN_AGENT defaults to False. Set RUN_AGENT = True only when you want to execute the model-backed sandbox run.
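
The prerequisites above can be checked before a live run. The following is a minimal sketch using only the standard library; the `preflight` helper and its messages are hypothetical and are not part of the notebook or the Agents SDK.

```python
from __future__ import annotations

import os
import sys
from collections.abc import Mapping

RUN_AGENT = False  # the notebook's default: inspect without model calls


def preflight(
    run_agent: bool,
    env: Mapping[str, str] | None = None,
    version: tuple[int, int] = sys.version_info[:2],
) -> list[str]:
    """Return problems that would block a live run (an empty list means ready)."""
    env = os.environ if env is None else env
    if not run_agent:
        return []  # a read-only walkthrough needs no credentials
    problems = []
    if version < (3, 10):
        problems.append("Python 3.10 or later is required")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in preflight(RUN_AGENT):
    print(f"blocked: {problem}")
```

With RUN_AGENT left at False the loop prints nothing, which matches the notebook's behavior of being safe to inspect without credentials.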

Setup

The notebook writes all synthetic files under examples/agents_sdk/.tmp/evidence_review_memory_compaction/. It does not require external data.

By default, tracing is disabled because some organizations use Zero Data Retention (ZDR), where trace ingestion may be blocked. For synthetic data or non-ZDR environments, you can set ENABLE_TRACING = True to inspect traces while developing.

Using Agents SDK in This Notebook

A sandbox agent is an Agents SDK agent that runs with a controlled workspace. In this notebook, that workspace contains synthetic evidence files, a manifest.csv, an output folder, and SDK-generated memory files.

The sandbox gives the agent a bounded place to inspect files and write artifacts. Instead of pasting every document into the prompt, the application creates a workspace and lets the agent use capabilities such as:

  • Filesystem() to read and write workspace files.
  • Shell() to list files, inspect documents, and search across batches.
  • Compaction() to support long-running reviews when the active context grows.
  • Memory() to store reusable workflow lessons for future sandbox-agent runs.

The memo remains the human-reviewed artifact. Tools help the agent work, compaction helps it continue, memory helps future runs improve, and generated artifacts hold the reviewable output.

Memory vs. Compaction

A useful way to separate the concepts is to ask what each one is allowed to carry forward.

| Question | Compaction | Memory |
| --- | --- | --- |
| What does it help with? | Continuing one long-running run when context grows. | Improving future runs with reusable workflow lessons. |
| What does it summarize? | The active conversation and working state. | Patterns, preferences, and process lessons worth reusing. |
| Should it store investigation conclusions? | No. It can preserve working state, but the memo is the reviewed artifact. | No. Store workflow lessons, not case-specific facts. |
| When is it useful? | Mid-review, especially before later batches or follow-up turns. | Across repeated reviews of similar evidence workflows. |

For this notebook, the compliance memo is the source of truth for the investigation output. Memory is intentionally scoped to reviewer preferences and workflow habits, such as using the manifest first, preserving uncertainty, and keeping superseded assumptions visible.

Folder Structure and Manifest

The agent works from a small file workspace. The folder structure is simple on purpose: evidence files are grouped by batch, generated outputs go under outputs/, and the manifest gives the agent a compact map of the available documents.

The Manifest Feature

In the Agents SDK, a Manifest is the fresh-session workspace contract for a sandbox agent. It describes the files, directories, mounts, environment, users, groups, and related workspace configuration that should exist when a new sandbox session starts.

The local SDK implementation defines these core fields:

| Manifest field | What it controls | How to use it in this Cookbook |
| --- | --- | --- |
| root | Workspace root path. Defaults to /workspace. | Keep the default unless a sandbox provider expects a different root. |
| entries | Files, directories, local files, local directories, repos, or mounts to materialize. | Put README.md, manifest.csv, input documents, and outputs/ here. |
| environment | Environment variables available when the sandbox starts. | Use only for non-secret runtime configuration. Keep credentials out of prompts and committed notebooks. |
| users / groups | Sandbox-local OS accounts and groups for providers that support them. | Usually unnecessary for a Cookbook, useful for production isolation. |
| extra_path_grants | Additional path grants, especially useful for Unix-local workflows. | Use sparingly when a sandbox needs scoped read/write access to host paths. |
| remote_mount_command_allowlist | Commands allowed against remote mounts. | Keep narrow when mounting external storage or data rooms. |

Manifest entry paths should be workspace-relative. Avoid absolute paths and .. escapes so the same agent can move between Unix-local, Docker, and hosted sandbox providers.
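
The path rule above is easy to check before a workspace is built. This is a small sketch of such a check, not the SDK's own validation; `is_workspace_relative` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def is_workspace_relative(path: str) -> bool:
    """True when path is non-empty, relative, and has no '..' component."""
    if not path:
        return False
    parsed = PurePosixPath(path)
    return not parsed.is_absolute() and ".." not in parsed.parts


# Hypothetical entry paths, used only to show the check.
for candidate in ["batch_1/policy.md", "/workspace/batch_1/policy.md", "../secrets.env"]:
    status = "ok" if is_workspace_relative(candidate) else "rejected"
    print(f"{status:8} {candidate}")
```

Rejecting these paths up front keeps a manifest portable, since an absolute or escaping path may resolve differently on each sandbox provider.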

Folder and Manifest Best Practices

  • Put source documents, manifests, helper files, and output directories in the Manifest instead of pasting large content into the prompt.
  • Put longer task instructions in workspace files such as README.md, task.md, or AGENTS.md; keep agent instructions focused on behavior and boundaries.
  • Use stable document IDs and a machine-readable manifest file so generated memos can cite sources and reviewers can inspect the path back to evidence.
  • Let Memory() manage its own memory artifacts. By default, sandbox memory uses memories/ and sessions/ under the workspace.
  • Keep generated artifacts under outputs/ so the application can inspect, copy, validate, or archive them after the run.
  • Keep mount scopes narrow. If you mount a data room, mount only what the agent should read or write.
  • Treat secrets as runtime configuration injected by your application or sandbox provider, not as prompt text or committed manifest content.
  • Prefer a small synthetic File(...) or Dir(...) entry for a tutorial, then switch to LocalDir, GitRepo, or storage mounts for production-sized datasets.

Prepare a Small Evidence Workspace

A Manifest describes the starting files in a fresh sandbox workspace. For this tutorial, the workspace includes:

  • a manifest.csv listing documents by batch and document ID,
  • three small document batches,
  • an output directory for the review memo.

The only memory primitive we attach later is the SDK's Memory() capability. Investigation findings stay in the generated reviewer memo, where they can be cited and inspected.
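
To make the workspace layout concrete, here is a sketch that writes an equivalent structure to a local directory with the standard library. It does not use the SDK's Manifest class. The batch names, document IDs, file names, CSV columns, and document text are all hypothetical placeholders, not the notebook's actual synthetic data.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical documents: (batch, document ID, file name, text).
DOCUMENTS = [
    ("batch_1", "DOC-001", "policy_excerpt.md", "Vendor exceptions require written approval."),
    ("batch_2", "DOC-002", "exception_note.md", "Exception requested for a synthetic vendor."),
    ("batch_3", "DOC-003", "audit_observation.md", "Approval record not yet located."),
]


def build_workspace(root: Path) -> Path:
    """Write batch folders, an outputs/ directory, and manifest.csv under root."""
    (root / "outputs").mkdir(parents=True, exist_ok=True)
    rows = []
    for batch, doc_id, name, text in DOCUMENTS:
        rel_path = f"{batch}/{name}"  # workspace-relative on purpose
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text + "\n", encoding="utf-8")
        rows.append({"batch": batch, "doc_id": doc_id, "path": rel_path})
    manifest_path = root / "manifest.csv"
    with manifest_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["batch", "doc_id", "path"])
        writer.writeheader()
        writer.writerows(rows)
    return manifest_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(build_workspace(Path(tmp)).read_text(encoding="utf-8"))
```

A stable `doc_id` column gives the memo something to cite, and the relative `path` column gives a reviewer the route back to the evidence.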

Step 1: Start With a Simple Agent Configuration

First, build the agent without memory or compaction. The goal is to make the baseline behavior clear before adding primitives.

One subtle point: SandboxAgent defaults can include built-in capabilities. To keep this baseline explicit, pass the exact capability list you want. Here we include only the workspace tools the agent needs to inspect files and write an artifact: Filesystem() and Shell(). We intentionally do not attach Compaction() or Memory() yet.

  • Filesystem() gives the sandbox agent file-oriented workspace access so it can read staged evidence and write the memo artifact. The Sandbox Agents guide describes capabilities as the way to attach sandbox-native behavior and tools to a SandboxAgent.
  • Shell() lets the agent inspect the workspace with terminal commands such as listing files, opening evidence documents, and searching for terms across batches. The Sandbox Agents guide notes that Shell() is one of the default capabilities, and the Shell tool guide explains that shell gives models a terminal environment for hosted or local execution.

For this baseline, these two capabilities are enough: Filesystem() handles workspace reads and writes, while Shell() handles deterministic inspection and search. Memory and compaction are added only after the baseline harness is clear.

Step 2: Add Compaction

Compaction is for long-running work. As a conversation grows, compaction reduces context size while preserving the state needed for later turns. There are three ways to trigger it:

  1. Automatic compaction with Compaction(): attach the capability and let the SDK compact when context pressure requires it.
  2. Threshold-based compaction with StaticCompactionPolicy: set an explicit threshold for environments where you want more predictable context-size behavior.
  3. Forced checkpoint compaction with OpenAIResponsesCompactionSession.run_compaction({"force": True}): compact at an application-defined phase boundary, such as after a major review phase and before the next evidence batch.

This notebook uses a forced checkpoint because the synthetic dataset is intentionally small. In production, automatic compaction is often the simplest starting point, and threshold-based compaction is useful when you want a tighter operational policy.

Best practices

  • Compact at meaningful workflow boundaries, not after every turn.
  • Preserve enough working state for the next phase to make sense.
  • Keep cited facts in generated artifacts, not only in compacted conversation state.

How Compaction Gets Triggered

With the Compaction() capability, server-side compaction is eligible to run when the active context grows large enough. That is the current default behavior: attach the capability and let the SDK manage context pressure.

For small tutorials, automatic compaction can be hard to see because the run may never get close to the model context limit. A lower StaticCompactionPolicy threshold can help, but compaction still runs only if the rendered context crosses that threshold.

For a small evidence set, a forced checkpoint is the clearest operational pattern. The OpenAIResponsesCompactionSession wrapper stores session history and lets the application call run_compaction({"force": True}) at a phase boundary. That makes compaction visible without inflating the evidence set.
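
The difference between threshold-based and forced compaction can be shown with a toy model. This sketch does not call the Agents SDK and is not how the SDK implements compaction. `ToyHistory`, its character-based threshold, and the placeholder summary string are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHistory:
    """Illustrative stand-in for session history. Not an Agents SDK class."""

    threshold_chars: int  # hypothetical size budget, measured in characters
    keep_last: int = 2    # recent items carried forward verbatim
    items: list[str] = field(default_factory=list)

    def size(self) -> int:
        return sum(len(item) for item in self.items)

    def maybe_compact(self, force: bool = False) -> bool:
        """Compact if forced or over threshold. Return True when compaction ran."""
        if not force and self.size() <= self.threshold_chars:
            return False
        split = len(self.items) - self.keep_last
        if split <= 0:
            return False  # nothing older than the retained tail
        # A real system would carry forward model-produced state here.
        summary = f"[summary of {split} earlier items]"
        self.items = [summary, *self.items[split:]]
        return True


history = ToyHistory(threshold_chars=1000)
history.items.extend(f"turn {i}" for i in range(5))
history.maybe_compact()            # under the threshold: nothing happens
history.maybe_compact(force=True)  # phase boundary: compaction runs anyway
```

On a small history, the threshold path never fires but the forced path does, which is why a forced checkpoint is easier to observe on a small evidence set.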

Step 3: Attach Memory

Memory is for reuse across runs. In this example, memory should capture workflow lessons, not investigation facts.

Good memory candidates include:

  • Use the manifest first when reviewing a file-based evidence workspace.
  • Preserve uncertainty in the memo instead of guessing.
  • Keep earlier assumptions visible when later evidence narrows them.

Bad memory candidates include:

  • "Northwind Logistics violated policy."
  • "ACME's Finance Ops process is deficient."
  • Any case-specific conclusion that belongs in the memo.

Best practices

  • Use memory for stable process lessons and user preferences.
  • Keep case-specific facts in reviewed artifacts such as the memo.
  • Inspect generated memory before relying on it in future runs.
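
One way to inspect generated memory is a simple screen for case-specific names before a human reads the file. The sketch below is an assumption about how such a screen might look. `flag_case_specific` is a hypothetical helper, and the term list reuses the synthetic names from the bad-candidate examples above.

```python
import re

# Case-specific names from this synthetic investigation (adjust per case).
CASE_TERMS = ("Northwind", "ACME")


def flag_case_specific(memory_text: str, terms=CASE_TERMS) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that mention any case-specific term."""
    if not terms:
        return []
    pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(memory_text.splitlines(), start=1)
        if pattern.search(line)
    ]


sample = (
    "- Use the manifest first.\n"
    "- Northwind Logistics violated policy.\n"
    "- Preserve uncertainty in the memo.\n"
)
for number, line in flag_case_specific(sample):
    print(f"review line {number}: {line}")
```

A keyword screen only catches names you already know, so it supports human review of the memory file and does not replace it.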

Step 4: Run With Both Compaction and Memory

Now combine the pieces:

  • Filesystem() and Shell() let the agent navigate the evidence workspace.
  • Compaction() keeps the active review viable as context grows.
  • Memory() captures reusable workflow lessons after the run.
  • The final memo remains the investigation artifact.

The task below asks the agent to review the synthetic evidence, write a memo, then read the memo back to verify it preserved the required structure and uncertainty.
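
The read-back step can also be backed by a deterministic check in the application. As a sketch, the function below compares document IDs cited in a memo against the IDs listed in the manifest. The `DOC-NNN` ID format, the bracketed citation style, and the `check_citations` helper are hypothetical and are not taken from the notebook.

```python
import re

DOC_ID = re.compile(r"\bDOC-\d{3}\b")  # hypothetical document ID format


def check_citations(memo_text: str, known_ids: set[str]) -> dict[str, list[str]]:
    """Report cited IDs missing from the manifest and manifest IDs never cited."""
    cited = set(DOC_ID.findall(memo_text))
    return {
        "unknown": sorted(cited - known_ids),
        "uncited": sorted(known_ids - cited),
    }


memo = (
    "Supported finding: written approval was required [DOC-001].\n"
    "Open question: the approval record is missing [DOC-003]; see also [DOC-009].\n"
)
report = check_citations(memo, {"DOC-001", "DOC-002", "DOC-003"})
print(report)
```

An unknown ID suggests a citation the reviewer cannot trace, while an uncited ID may simply mean the document was not relevant, so the two lists deserve different handling.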

Inspect Generated Artifacts

The final agent response is useful, but the reliability pattern becomes clearer when you inspect the files the sandbox run produced. This section makes the normally hidden state visible:

  • the reviewer-facing memo in outputs/compliance_review_memo.md,
  • generated SDK memory files such as memories/MEMORY.md and memories/memory_summary.md,
  • the workspace files produced by the run, including the session log.

The generated memory artifact is not the compliance memo and should not be treated as investigation truth. It is reusable workflow memory. The Task Group heading is the memory system's own grouping label, and the memory generator is steered with MemoryGenerateConfig.extra_prompt so it stores workflow lessons rather than ACME-specific findings.

If RUN_AGENT = False, this section displays the expected output shape instead of live sandbox artifacts.
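
A quick presence check over the artifact paths named above might look like the following sketch. The `artifact_report` helper is hypothetical, and the workspace location is whatever directory your application copies sandbox output into.

```python
import tempfile
from pathlib import Path

# Artifact paths named in this section, relative to the workspace root.
EXPECTED_ARTIFACTS = (
    "outputs/compliance_review_memo.md",
    "memories/MEMORY.md",
    "memories/memory_summary.md",
)


def artifact_report(workspace: Path) -> dict[str, bool]:
    """Map each expected artifact path to whether it exists as a file."""
    return {rel: (workspace / rel).is_file() for rel in EXPECTED_ARTIFACTS}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        memo = workspace / "outputs" / "compliance_review_memo.md"
        memo.parent.mkdir(parents=True)
        memo.write_text("# Memo\n", encoding="utf-8")
        for rel, present in artifact_report(workspace).items():
            print(f"{'ok' if present else 'missing':8} {rel}")
```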

Common Pitfall

Do not treat Memory() as an unreviewed fact database.

Memory should help the next run remember how to work. It should not become a shadow compliance record. If a conclusion matters, write it into a reviewed artifact with citations.

Conclusion

You now have the building blocks for a reliable long-running agent workflow:

  • A sandbox workspace for controlled file access.
  • A manifest that helps the agent route across documents.
  • Compaction for finite context windows.
  • Memory for reusable workflow lessons.
  • A generated memo as the reviewed investigation artifact.

The main design choice is separation of responsibility: context helps the agent work, memory helps future agents work better, and reviewed artifacts hold the facts that people will rely on.
