## Summary
- record a realtime close developer message when a new realtime session
replaces an active one
- assert the replacement marker through the mocked responses request
path
---------
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Charles Cunningham <ccunningham@openai.com>
## Summary
This changes `custom_tool_call_output` to use the same output payload
shape as `function_call_output`, so freeform tools can return either
plain text or structured content items.
The main goal is to let `js_repl` return image content from nested
`view_image` calls in its own `custom_tool_call_output`, instead of
relying on a separate injected message.
## What changed
- Changed `custom_tool_call_output.output` from `string` to
`FunctionCallOutputPayload`
- Updated freeform tool plumbing to preserve structured output bodies
- Updated `js_repl` to aggregate nested tool content items and attach
them to the outer `js_repl` result
- Removed the old `js_repl` special case that injected `view_image`
results as a separate pending user image message
- Updated normalization/history/truncation paths to handle multimodal
`custom_tool_call_output`
- Regenerated app-server protocol schema artifacts
## Behavior
Direct `view_image` calls still return a `function_call_output` with
image content.
When `view_image` is called inside `js_repl`, the outer `js_repl`
`custom_tool_call_output` now carries:
- an `input_text` item if the JS produced text output
- one or more `input_image` items from nested tool results
So the nested image result now stays inside the `js_repl` tool output
instead of being injected as a separate message.
## Compatibility
This is intended to be backward-compatible for resumed conversations.
Older histories that stored `custom_tool_call_output.output` as a plain
string still deserialize correctly, and older histories that used the
previous injected-image-message flow also continue to resume.
Added regression coverage for resuming a pre-change rollout containing:
- string-valued `custom_tool_call_output`
- legacy injected image message history
#### [git stack](https://github.com/magus/git-stack-cli)
- 👉 `1` https://github.com/openai/codex/pull/12948
## Summary
- bundle contextual prompt injection into at most one developer message
plus one contextual user message in both:
- per-turn settings updates
- initial context insertion
- preserve `<model_switch>` across compaction by rebuilding it through
canonical initial-context injection, instead of relying on
strip/reattach hacks
- centralize contextual user fragment detection in one shared definition
table and reuse it for parsing/compaction logic
- keep `AGENTS.md` in its natural serialized format:
- `# AGENTS.md instructions for {dirname}`
- `<INSTRUCTIONS>...</INSTRUCTIONS>`
- simplify related tests/helpers and accept the expected snapshot/layout
updates from bundled multi-part messages
## Why
The goal is to converge toward a simpler, more intentional prompt shape
where contextual updates are consistently represented as one developer
envelope plus one contextual user envelope, while keeping parsing and
compaction behavior aligned with that representation.
## Notable details
- the temporary `SettingsUpdateEnvelope` wrapper was removed; these
paths now return `Vec<ResponseItem>` directly
- local/remote compaction no longer rely on model-switch strip/restore
helpers
- contextual user detection is now driven by shared fragment definitions
instead of ad hoc matcher assembly
- AGENTS/user instructions are still the same logical context; only the
synthetic `<user_instructions>` wrapper was replaced by the natural
AGENTS text format
## Testing
- `just fmt`
- `cargo test -p codex-app-server
codex_message_processor::tests::extract_conversation_summary_prefers_plain_user_messages
-- --exact`
- `cargo test -p codex-core
compact::tests::collect_user_messages_filters_session_prefix_entries
--lib -- --exact`
- `cargo test -p codex-core --test all
'suite::compact::snapshot_request_shape_pre_turn_compaction_strips_incoming_model_switch'
-- --exact`
- `cargo test -p codex-core --test all
'suite::compact_remote::snapshot_request_shape_remote_pre_turn_compaction_strips_incoming_model_switch'
-- --exact`
- `cargo test -p codex-core --test all
'suite::client::includes_apps_guidance_as_developer_message_when_enabled'
-- --exact`
- `cargo test -p codex-core --test all
'suite::client::includes_developer_instructions_message_in_request' --
--exact`
- `cargo test -p codex-core --test all
'suite::client::includes_user_instructions_message_in_request' --
--exact`
- `cargo test -p codex-core --test all
'suite::client::resume_includes_initial_messages_and_sends_prior_items'
-- --exact`
- `cargo test -p codex-core --test all
'suite::review::review_input_isolated_from_parent_history' -- --exact`
- `cargo test -p codex-exec --test all
'suite::resume::exec_resume_last_respects_cwd_filter_and_all_flag' --
--exact`
- `cargo test -p core_test_support
context_snapshot::tests::full_text_mode_preserves_unredacted_text --
--exact`
## Notes
- I also ran several targeted `compact`, `compact_remote`,
`prompt_caching`, `model_visible_layout`, and `event_mapping` tests
while iterating on prompt-shape changes.
- I have not claimed a clean full-workspace `cargo test` from this
environment because local sandbox/resource conditions have previously
produced unrelated failures in large workspace runs.
Increase `IMAGE_BYTES_ESTIMATE` from 340 bytes to 7,373 bytes so the
existing 4-bytes/token heuristic yields an image estimate of ~1,844
tokens instead of ~85. This makes auto-compaction more conservative for
image-heavy transcripts and avoids underestimating context usage, which
can otherwise cause compaction to fail when there is not enough free
context remaining. The new value was chosen because that's the image
resolution cap used for our latest models.
Follow-up to [#12419](https://github.com/openai/codex/pull/12419).
Refs [#11845](https://github.com/openai/codex/issues/11845).
## Summary
Introduces the initial implementation of Feature::RequestPermissions.
RequestPermissions allows the model to request that a command be run
inside the sandbox, with additional permissions, like writing to a
specific folder. Eventually this will include other rules as well, and
the ability to persist these permissions, but this PR is already quite
large - let's get the core flow working and go from there!
<img width="1279" height="541" alt="Screenshot 2026-02-15 at 2 26 22 PM"
src="https://github.com/user-attachments/assets/0ee3ec0f-02ec-4509-91a2-809ac80be368"
/>
## Testing
- [x] Added tests
- [x] Tested locally
- [x] Feature
Fixes#11845.
Adjust context/token estimation for inline image `data:*;base64,...`
URLs so we
do not count the raw base64 payload as model-visible text.
What changed:
- keep the existing JSON-length estimator as the baseline
- detect only inline base64 `data:` image URLs in message and
function-call
output content items
- subtract only the base64 payload bytes (preserving data URL prefix +
JSON
overhead)
- add a fixed per-image estimate of 340 bytes (~85 tokens at the repo’s
4-bytes/token heuristic)
This avoids large overestimates from MCP image tool outputs while
leaving normal
image URLs (`https://`, `file://`, non-base64 `data:` URLs) unchanged.
Tests:
- message image data URL estimate regression
- function-call output image data URL estimate regression
- non-base64 image URLs unchanged
- non-base64 `data:` URLs unchanged
- `data:application/octet-stream;base64,...` adjusted
- multiple inline images apply multiple fixed costs
- text-only items unchanged
## Summary
- move regular-turn context diff/full-context persistence into
`run_turn` so pre-turn compaction runs before incoming context updates
are recorded
- after successful pre-turn compaction, rely on a cleared
`reference_context_item` to trigger full context reinjection on the
follow-up regular turn (manual `/compact` keeps replacement history
summary-only and also clears the baseline)
- preserve `<model_switch>` when full context is reinjected, and inject
it *before* the rest of the full-context items
- scope `reference_context_item` and `previous_model` to regular user
turns only so standalone tasks (`/compact`, shell, review, undo) cannot
suppress future reinjection or `<model_switch>` behavior
- make context-diff persistence + `reference_context_item` updates
explicit in the regular-turn path, with clearer docs/comments around the
invariant
- stop persisting local `/compact` `RolloutItem::TurnContext` snapshots
(only regular turns persist `TurnContextItem` now)
- simplify resume/fork previous-model/reference-baseline hydration by
looking up the last surviving turn context from rollout lifecycle
events, including rollback and compaction-crossing handling
- remove the legacy fallback that guessed from bare `TurnContext`
rollouts without lifecycle events
- update compaction/remote-compaction/model-visible snapshots and
compact test assertions (including remote compaction mock response
shape)
## Why
We were persisting incoming context items before spawning the regular
turn task, which let pre-turn compaction requests accidentally include
incoming context diffs without the new user message. Fixing that exposed
follow-on baseline issues around `/compact`, resume/fork, and standalone
tasks that could cause duplicate context injection or suppress
`<model_switch>` instructions.
This PR re-centers the invariants around regular turns:
- regular turns persist model-visible context diffs/full reinjection and
update the `reference_context_item`
- standalone tasks do not advance those regular-turn baselines
- compaction clears the baseline when replacement history may have
stripped the referenced context diffs
## Follow-ups (TODOs left in code)
- `TODO(ccunningham)`: fix rollback/backtracking baseline handling more
comprehensively
- `TODO(ccunningham)`: include pending incoming context items in
pre-turn compaction threshold estimation
- `TODO(ccunningham)`: inject updated personality spec alongside
`<model_switch>` so some model-switch paths can avoid forced full
reinjection
- `TODO(ccunningham)`: review task turn lifecycle
(`TurnStarted`/`TurnComplete`) behavior and emit task-start context
diffs for task types that should have them (excluding `/compact`)
## Validation
- `just fmt`
- CI should cover the updated compaction/resume/model-visible snapshot
expectations and rollout-hydration behavior
- I did **not** rerun the full local test suite after the latest
resume-lookup / rollout-persistence simplifications
## Summary
- add `previous_context_item: Option<TurnContextItem>` to
`ContextManager`
- expose session/state accessors for reading and updating the stored
previous context item
- switch settings diffing to use `TurnContextItem` instead of
`TurnContext`
- remove submission-loop local `previous_context` and persist the
previous context item in history
## Testing
- `just fmt`
- `just fix -p codex-core`
- `cargo test -p codex-core --test all model_switching::`
- `cargo test -p codex-core --test all collaboration_instructions::`
- `cargo test -p codex-core --test all personality::`
- `cargo test -p codex-core --test all
permissions_messages::permissions_message_not_added_when_no_change`
## Summary
This PR centralizes model-visible state diffing for turn context updates
into one module, while keeping existing behavior and call sites stable.
### What changed
- Added `core/src/context_updates.rs` with the consolidated diffing
logic for:
- environment context updates
- permissions/policy updates
- collaboration mode updates
- model-instruction switch updates
- personality updates
- Added `BuildSettingsUpdateItemsParams` so required dependencies are
passed explicitly.
- Updated `Session::build_settings_update_items` in `core/src/codex.rs`
to delegate to the centralized module.
- Reused the same centralized `personality_message_for` helper from
initial-context assembly to avoid duplicated logic.
- Registered the new module in `core/src/lib.rs`.
## Why
This is a minimal, shippable step toward the model-visible-state design:
all state diff decisions for turn-context update items now live in one
place, improving reviewability and reducing drift risk without expanding
scope.
## Behavior
- Intended to be behavior-preserving.
- No protocol/schema changes.
- No call-site behavior changes beyond routing through the new
centralized logic.
## Testing
Ran targeted tests in this worktree:
- `cargo test -p codex-core
build_settings_update_items_emits_environment_item_for_network_changes`
- `cargo test -p codex-core collaboration_instructions --test all`
Both passed.
## Codex author
`codex resume 019c540f-3951-7352-a3fa-6f07b834d4ce`
- Make `ContextManager::for_prompt` modality-aware and strip input_image
content when the active model is text-only.
- Added a test for multi-model -> text-only model switch
## Summary
- add targeted remote-compaction failure diagnostics in compact_remote
logging
- log the specific values needed to explain overflow timing:
- last_api_response_total_tokens
- estimated_tokens_of_items_added_since_last_successful_api_response
- estimated_bytes_of_items_added_since_last_successful_api_response
- failing_compaction_request_body_bytes
- simplify breakdown naming and remove
last_api_response_total_bytes_estimate (it was an approximation and not
useful for debugging)
## Why
When compaction fails with context_length_exceeded, we need concrete,
low-ambiguity numbers that map directly to:
1) what the API most recently reported, and
2) what local history added since then.
This keeps the failure logs actionable without adding broad, noisy
metrics.
## Testing
- just fmt
- cargo test -p codex-core
## Summary
- change compaction pre-check accounting to include all items added
after the last model-generated item, not only trailing codex-generated
outputs
- use that boundary consistently in get_total_token_usage() and
get_total_token_usage_breakdown()
- update history tests to cover user/tool-output items after the last
model item
## Why
last_token_usage.total_tokens is API-reported for the last successful
model response. After that point, local history may gain additional
items (user messages, injected context, tool outputs). Compaction
triggering must account for all of those items to avoid late compaction
attempts that can overflow context.
## Testing
- just fmt
- cargo test -p codex-core
## Summary
This PR fixes a deterministic mismatch in remote compaction where
pre-trim estimation and the `/v1/responses/compact` payload could use
different base instructions.
Before this change:
- pre-trim estimation used model-derived instructions
(`model_info.get_model_instructions(...)`)
- compact payload used session base instructions
(`sess.get_base_instructions()`)
After this change:
- remote pre-trim estimation and compact payload both use the same
`BaseInstructions` instance from session state.
## Changes
- Added a shared estimator entry point in `ContextManager`:
- `estimate_token_count_with_base_instructions(&self, base_instructions:
&BaseInstructions) -> Option<i64>`
- Kept `estimate_token_count(&TurnContext)` as a thin wrapper that
resolves model/personality instructions and delegates to the new helper.
- Updated remote compaction flow to fetch base instructions once and
reuse it for both:
- trim preflight estimation
- compact request payload construction
- Added regression coverage for parity and behavior:
- unit test verifying explicit-base estimator behavior
- integration test proving remote compaction uses session override
instructions and trims accordingly
## Why this matters
This removes a deterministic divergence source where pre-trim could
think the request fits while the actual compact request exceeded context
because its instructions were longer/different.
## Scope
In scope:
- estimator/payload base-instructions parity in remote compaction
Out of scope:
- retry-on-`context_length_exceeded`
- compaction threshold/headroom policy changes
- broader trimming policy changes
## Codex author:
`codex fork 019c2b24-c2df-7b31-a482-fb8cf7a28559`
Took over the work that @aaronl-openai started here:
https://github.com/openai/codex/pull/10397
Now that app-server clients are able to set up custom tools (called
`dynamic_tools` in app-server), we should expose a way for clients to
pass in not just text, but also image outputs. This is something the
Responses API already supports for function call outputs, where you can
pass in either a string or an array of content outputs (text, image,
file):
https://platform.openai.com/docs/api-reference/responses/create#responses_create-input-input_item_list-item-function_tool_call_output-output-array-input_image
So let's just plumb it through in Codex (with the caveat that we only
support text and image for now). This is implemented end-to-end across
app-server v2 protocol types and core tool handling.
## Breaking API change
NOTE: This introduces a breaking change with dynamic tools, but I think
it's ok since this concept was only recently introduced
(https://github.com/openai/codex/pull/9539) and it's better to get the
API contract correct. I don't think there are any real consumers of this
yet (not even the Codex App).
Old shape:
`{ "output": "dynamic-ok", "success": true }`
New shape:
```
{
"contentItems": [
{ "type": "inputText", "text": "dynamic-ok" },
{ "type": "inputImage", "imageUrl": "data:image/png;base64,AAA" }
]
"success": true
}
```
Two fixes:
1. Include trailing tool output in the total context size calculation.
Otherwise when checking whether compaction should run we ignore newly
added outputs.
2. Trim trailing tool output/tool calls until we can fit the request
into the model context size. Otherwise the compaction endpoint will fail
to compact. We only trim items that can be reproduced again by the model
(tool calls, tool call outputs).
### What
add wiring for `phase` field on `ResponseItem::Message` to lay
groundwork for differentiating model preambles and final messages.
currently optional.
follows pattern in #9698.
updated schemas with `just write-app-server-schema` so we can see type
changes.
### Tests
Updated existing tests for SSE parsing and hydrating from history
## Summary
Support updating Personality mid-Thread via UserTurn/OverwriteTurn. This
is explicitly unused by the clients so far, to simplify PRs - app-server
and tui implementations will be follow-ups.
## Testing
- [x] added integration tests
## Summary
#9555 is the start of a rename, so I'm starting to standardize here.
Sets up `model_instructions` templating with a strongly-typed object for
injecting a personality block into the model instructions.
## Testing
- [x] Added tests
- [x] Ran locally
## What
Record a model-visible `<turn_aborted>` marker in history when a turn is
interrupted, and treat it as a session prefix.
## Why
When a turn is interrupted, Codex emits `TurnAborted` but previously did
not persist anything model-visible in the conversation history. On the
next user turn, the model can’t tell the previous work was aborted and
may resume/repeat earlier actions (including duplicated side effects
like re-opening PRs).
Fixes: https://github.com/openai/codex/issues/9042
## How
On `TurnAbortReason::Interrupted`, append a hidden user message
containing a `<turn_aborted>…</turn_aborted>` marker and flush.
Treat `<turn_aborted>` like `<environment_context>` for session-prefix
filtering.
Add a regression test to ensure follow-up turns don’t repeat side
effects from an aborted turn.
## Testing
`just fmt`
`just fix -p codex-core`
`cargo test -p codex-core -- --test-threads=1`
`cargo test --all-features -- --test-threads=1`
---------
Co-authored-by: Skylar Graika <sgraika127@gmail.com>
Co-authored-by: jif-oai <jif@openai.com>
Co-authored-by: Eric Traut <etraut@openai.com>
## Summary
We have a variety of things we refer to as instructions in the code
base: our current canonical terms are:
- base instructions (raw string)
- developer instructions (has a type in protocol)
- user instructions
We also have `instructions` floating around in various places. We should
standardize on the above, and start using types to prevent them from
ending up in the wrong place. There will be additional PRs, but I'm
going to keep these small so we can easily follow them!
## Testing
- [x] Tests pass, this is purely a file move
## Before
When we detect an `InvalidImageRequest`, we replace the image by a
placeholder and keep going
## Now
In such `InvalidImageRequest`, we check if the image is due to a user
message or a tool call output. For tool call output we still replace it
with a placeholder to avoid breaking the agentic loop bu tif this is
because of a user message, we send an error to the user
> // todo(aibrahim): why are we passing model here while it can change?
we update it on each turn with `.with_model`
> //TODO(aibrahim): run CI in release mode.
although it's good to have, release builds take double the time tests
take.
> // todo(aibrahim): make this async function
we figured out another way of doing this sync
- Merge ModelFamily into ModelInfo
- Remove logic for adding instructions to apply patch
- Add compaction limit and visible context window to `ModelInfo`
Add `thread/rollback` to app-server to support IDEs undo-ing the last N
turns of a thread.
For context, an IDE partner will be supporting an "undo" capability
where the IDE (the app-server client) will be responsible for reverting
the local changes made during the last turn. To support this well, we
also need a way to drop the last turn (or more generally, the last N
turns) from the agent's context. This is what `thread/rollback` does.
**Core idea**: A Thread rollback is represented as a persisted event
message (EventMsg::ThreadRollback) in the rollout JSONL file, not by
rewriting history. On resume, both the model's context (core replay) and
the UI turn list (app-server v2's thread history builder) apply these
markers so the pruned history is consistent across live conversations
and `thread/resume`.
Implementation notes:
- Rollback only affects agent context and appends to the rollout file;
clients are responsible for reverting files on disk.
- If a thread rollback is currently in progress, subsequent
`thread/rollback` calls are rejected.
- Because we use `CodexConversation::submit` and codex core tracks
active turns, returning an error on concurrent rollbacks is communicated
via an `EventMsg::Error` with a new variant
`CodexErrorInfo::ThreadRollbackFailed`. app-server watches for that and
sends the BAD_REQUEST RPC response.
Tests cover thread rollbacks in both core and app-server, including when
`num_turns` > existing turns (which clears all turns).
**Note**: this explicitly does **not** behave like `/undo` which we just
removed from the CLI, which does the opposite of what `thread/rollback`
does. `/undo` reverts local changes via ghost commits/snapshots and does
not modify the agent's context / conversation history.
Normally, all tool calls within a saved session should have a response,
but there are legitimate reasons for the response to be missing. This
can occur if the user canceled the call or there was an error of some
sort during the rollout. We shouldn't panic in this case.
This is a partial fix for #7990
If an image can't be read by the API, it will poison the entire history,
preventing any new turn on the conversation.
This detect such cases and replace the image by a placeholder
- The total token used returned from the api doesn't account for the
reasoning items before the assistant message
- Account for those for auto compaction
- Add the encrypted reasoning effort in the common tests utils
- Add a test to make sure it works as expected
- This PR is to make it on path for truncating by tokens. This path will
be initially used by unified exec and context manager (responsible for
MCP calls mainly).
- We are exposing new config `calls_output_max_tokens`
- Use `tokens` as the main budget unit but truncate based on the model
family by Introducing `TruncationPolicy`.
- Introduce `truncate_text` as a router for truncation based on the
mode.
In next PRs:
- remove truncate_with_line_bytes_budget
- Add the ability to the model to override the token budget.