mirror of
https://github.com/openai/codex.git
synced 2026-05-02 04:11:39 +03:00
core: box wrapper futures to reduce stack pressure (#13429)
Follow-up to [#13388](https://github.com/openai/codex/pull/13388). This uses the same general fix pattern as [#12421](https://github.com/openai/codex/pull/12421), but in the `codex-core` compact/resume/fork path. ## Why `compact_resume_after_second_compaction_preserves_history` started overflowing the stack on Windows CI after `#13388`. The important part is that this was not a compaction-recursion bug. The test exercises a path with several thin `async fn` wrappers around much larger thread-spawn, resume, and fork futures. When one `async fn` awaits another inline, the outer future stores the callee future as part of its own state machine. In a long wrapper chain, that means a caller can accidentally inline a lot more state than the source code suggests. That is exactly what was happening here: - `ThreadManager` convenience methods such as `start_thread`, `resume_thread_from_rollout`, and `fork_thread` were inlining the larger spawn/resume futures beneath them. - `core_test_support::test_codex` added another wrapper layer on top of those same paths. - `compact_resume_fork` adds a few more helpers, and this particular test drives the resume/fork path multiple times. On Windows, that was enough to push both the libtest thread and Tokio worker threads over the edge. The previous 8 MiB test-thread workaround proved the failure was stack-related, but it did not address the underlying future size. ## How This Was Debugged The useful debugging pattern here was to turn the CI-only failure into a local low-stack repro. 1. First, remove the explicit large-stack harness so the test runs on the normal `#[tokio::test]` path. 2. Build the test binary normally. 3. Re-run the already-built `tests/all` binary directly with progressively smaller `RUST_MIN_STACK` values. Running the built binary directly matters: it keeps the reduced stack size focused on the test process instead of also applying it to `cargo` and `rustc`. That made it possible to answer two questions quickly: - Does the failure still reproduce without the workaround? Yes. - Does boxing the wrapper futures actually buy back stack headroom? Also yes. After this change, the built test binary passes with `RUST_MIN_STACK=917504` and still overflows at `786432`, which is enough evidence to justify removing the explicit 8 MiB override while keeping a deterministic low-stack repro for future debugging. If we hit a similar issue again, the first places to inspect are thin `async fn` wrappers that mostly forward into a much larger async implementation. ## `Box::pin()` Primer `async fn` compiles into a state machine. If a wrapper does this: ```rust async fn wrapper() { inner().await; } ``` then `wrapper()` stores the full `inner()` future inline as part of its own state. If the wrapper instead does this: ```rust async fn wrapper() { Box::pin(inner()).await; } ``` then the child future lives on the heap, and the outer future only stores a pinned pointer to it. That usually trades one allocation for a substantially smaller outer future, which is exactly the tradeoff we want when the problem is stack pressure rather than raw CPU time. Useful references: - [`Box::pin`](https://doc.rust-lang.org/std/boxed/struct.Box.html#method.pin) - [Async book: Pinning](https://rust-lang.github.io/async-book/04_pinning/01_chapter.html) ## What Changed - Boxed the wrapper futures in `core/src/thread_manager.rs` around `start_thread`, `resume_thread_from_rollout`, `fork_thread`, and the corresponding `ThreadManagerState` spawn helpers so callers no longer inline the full spawn/resume state machine through multiple layers. - Boxed the matching test-only wrapper futures in `core/tests/common/test_codex.rs` and `core/tests/suite/compact_resume_fork.rs`, which sit directly on top of the same path. - Restored `compact_resume_after_second_compaction_preserves_history` in `core/tests/suite/compact_resume_fork.rs` to a normal `#[tokio::test]` and removed the explicit `TEST_STACK_SIZE_BYTES` thread/runtime sizing. - Simplified a tiny helper in `compact_resume_fork` by making `fetch_conversation_path()` synchronous, which removes one more unnecessary future layer from the test path. ## Verification - `cargo test -p codex-core --test all suite::compact_resume_fork::compact_resume_after_second_compaction_preserves_history -- --exact --nocapture` - `cargo test -p codex-core --test all suite::compact_resume_fork -- --nocapture` - Re-ran the built `codex-core` `tests/all` binary directly with reduced stack sizes: - `RUST_MIN_STACK=917504` passes - `RUST_MIN_STACK=786432` still overflows - `cargo test -p codex-core` - Still fails locally in unrelated existing integration areas that expect the `codex` / `test_stdio_server` binaries or hit the existing `search_tool` wiremock mismatches.
This commit is contained in:
@@ -312,8 +312,9 @@ impl ThreadManager {
|
||||
}
|
||||
|
||||
pub async fn start_thread(&self, config: Config) -> CodexResult<NewThread> {
|
||||
self.start_thread_with_tools(config, Vec::new(), false)
|
||||
.await
|
||||
// Box delegated thread-spawn futures so these convenience wrappers do
|
||||
// not inline the full spawn path into every caller's async state.
|
||||
Box::pin(self.start_thread_with_tools(config, Vec::new(), false)).await
|
||||
}
|
||||
|
||||
pub async fn start_thread_with_tools(
|
||||
@@ -322,12 +323,12 @@ impl ThreadManager {
|
||||
dynamic_tools: Vec<codex_protocol::dynamic_tools::DynamicToolSpec>,
|
||||
persist_extended_history: bool,
|
||||
) -> CodexResult<NewThread> {
|
||||
self.start_thread_with_tools_and_service_name(
|
||||
Box::pin(self.start_thread_with_tools_and_service_name(
|
||||
config,
|
||||
dynamic_tools,
|
||||
persist_extended_history,
|
||||
None,
|
||||
)
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -338,17 +339,16 @@ impl ThreadManager {
|
||||
persist_extended_history: bool,
|
||||
metrics_service_name: Option<String>,
|
||||
) -> CodexResult<NewThread> {
|
||||
self.state
|
||||
.spawn_thread(
|
||||
config,
|
||||
InitialHistory::New,
|
||||
Arc::clone(&self.state.auth_manager),
|
||||
self.agent_control(),
|
||||
dynamic_tools,
|
||||
persist_extended_history,
|
||||
metrics_service_name,
|
||||
)
|
||||
.await
|
||||
Box::pin(self.state.spawn_thread(
|
||||
config,
|
||||
InitialHistory::New,
|
||||
Arc::clone(&self.state.auth_manager),
|
||||
self.agent_control(),
|
||||
dynamic_tools,
|
||||
persist_extended_history,
|
||||
metrics_service_name,
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
pub async fn resume_thread_from_rollout(
|
||||
@@ -358,7 +358,7 @@ impl ThreadManager {
|
||||
auth_manager: Arc<AuthManager>,
|
||||
) -> CodexResult<NewThread> {
|
||||
let initial_history = RolloutRecorder::get_rollout_history(&rollout_path).await?;
|
||||
self.resume_thread_with_history(config, initial_history, auth_manager, false)
|
||||
Box::pin(self.resume_thread_with_history(config, initial_history, auth_manager, false))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -369,17 +369,16 @@ impl ThreadManager {
|
||||
auth_manager: Arc<AuthManager>,
|
||||
persist_extended_history: bool,
|
||||
) -> CodexResult<NewThread> {
|
||||
self.state
|
||||
.spawn_thread(
|
||||
config,
|
||||
initial_history,
|
||||
auth_manager,
|
||||
self.agent_control(),
|
||||
Vec::new(),
|
||||
persist_extended_history,
|
||||
None,
|
||||
)
|
||||
.await
|
||||
Box::pin(self.state.spawn_thread(
|
||||
config,
|
||||
initial_history,
|
||||
auth_manager,
|
||||
self.agent_control(),
|
||||
Vec::new(),
|
||||
persist_extended_history,
|
||||
None,
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
/// Removes the thread from the manager's internal map, though the thread is stored
|
||||
@@ -411,17 +410,16 @@ impl ThreadManager {
|
||||
) -> CodexResult<NewThread> {
|
||||
let history = RolloutRecorder::get_rollout_history(&path).await?;
|
||||
let history = truncate_before_nth_user_message(history, nth_user_message);
|
||||
self.state
|
||||
.spawn_thread(
|
||||
config,
|
||||
history,
|
||||
Arc::clone(&self.state.auth_manager),
|
||||
self.agent_control(),
|
||||
Vec::new(),
|
||||
persist_extended_history,
|
||||
None,
|
||||
)
|
||||
.await
|
||||
Box::pin(self.state.spawn_thread(
|
||||
config,
|
||||
history,
|
||||
Arc::clone(&self.state.auth_manager),
|
||||
self.agent_control(),
|
||||
Vec::new(),
|
||||
persist_extended_history,
|
||||
None,
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
pub(crate) fn agent_control(&self) -> AgentControl {
|
||||
@@ -474,14 +472,14 @@ impl ThreadManagerState {
|
||||
config: Config,
|
||||
agent_control: AgentControl,
|
||||
) -> CodexResult<NewThread> {
|
||||
self.spawn_new_thread_with_source(
|
||||
Box::pin(self.spawn_new_thread_with_source(
|
||||
config,
|
||||
agent_control,
|
||||
self.session_source.clone(),
|
||||
false,
|
||||
None,
|
||||
None,
|
||||
)
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -494,7 +492,7 @@ impl ThreadManagerState {
|
||||
metrics_service_name: Option<String>,
|
||||
inherited_shell_snapshot: Option<Arc<ShellSnapshot>>,
|
||||
) -> CodexResult<NewThread> {
|
||||
self.spawn_thread_with_source(
|
||||
Box::pin(self.spawn_thread_with_source(
|
||||
config,
|
||||
InitialHistory::New,
|
||||
Arc::clone(&self.auth_manager),
|
||||
@@ -504,7 +502,7 @@ impl ThreadManagerState {
|
||||
persist_extended_history,
|
||||
metrics_service_name,
|
||||
inherited_shell_snapshot,
|
||||
)
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -517,7 +515,7 @@ impl ThreadManagerState {
|
||||
inherited_shell_snapshot: Option<Arc<ShellSnapshot>>,
|
||||
) -> CodexResult<NewThread> {
|
||||
let initial_history = RolloutRecorder::get_rollout_history(&rollout_path).await?;
|
||||
self.spawn_thread_with_source(
|
||||
Box::pin(self.spawn_thread_with_source(
|
||||
config,
|
||||
initial_history,
|
||||
Arc::clone(&self.auth_manager),
|
||||
@@ -527,7 +525,7 @@ impl ThreadManagerState {
|
||||
false,
|
||||
None,
|
||||
inherited_shell_snapshot,
|
||||
)
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -540,7 +538,7 @@ impl ThreadManagerState {
|
||||
persist_extended_history: bool,
|
||||
inherited_shell_snapshot: Option<Arc<ShellSnapshot>>,
|
||||
) -> CodexResult<NewThread> {
|
||||
self.spawn_thread_with_source(
|
||||
Box::pin(self.spawn_thread_with_source(
|
||||
config,
|
||||
initial_history,
|
||||
Arc::clone(&self.auth_manager),
|
||||
@@ -550,7 +548,7 @@ impl ThreadManagerState {
|
||||
persist_extended_history,
|
||||
None,
|
||||
inherited_shell_snapshot,
|
||||
)
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -566,7 +564,7 @@ impl ThreadManagerState {
|
||||
persist_extended_history: bool,
|
||||
metrics_service_name: Option<String>,
|
||||
) -> CodexResult<NewThread> {
|
||||
self.spawn_thread_with_source(
|
||||
Box::pin(self.spawn_thread_with_source(
|
||||
config,
|
||||
initial_history,
|
||||
auth_manager,
|
||||
@@ -576,7 +574,7 @@ impl ThreadManagerState {
|
||||
persist_extended_history,
|
||||
metrics_service_name,
|
||||
None,
|
||||
)
|
||||
))
|
||||
.await
|
||||
}
|
||||
|
||||
|
||||
Reference in New Issue
Block a user