Executive Thesis

I use Claude while the problem is still fuzzy. I use Codex once the plan needs to become files.

For Salesforce-heavy orgs, integration discovery, migration readiness, and tech debt cleanup do not become real until they are encoded in a repo: scan scripts, metadata inventories, markdown deliverables, validation checks, tests, CI, deployment notes, rollback paths, Playwright smoke checks, and pull requests. That is where Codex earns its keep.

Codex is the implementation layer for the architectural brain. It can inspect the local codebase, run metadata-safe commands, update docs, write small utilities, add tests, fix brittle code, harden pipelines, and keep evidence tied to the change. OpenAI APIs are the runtime layer when the business tool itself needs model reasoning, tool calls, state, guardrails, traces, or a user-facing assistant.

This playbook covers the build loop: use Codex to make the work repeatable, verifiable, and easier for the next consultant or developer to pick up.

  • Plan into files: Turn org-scan thinking into scripts, docs, commands, evals, tests, and pull requests.
  • Metadata-safe execution: Inspect Salesforce metadata and repo structure first. Do not pull record data unless there is a narrow approved reason.
  • Proof over demos: Build the pipeline, run the checks, log the evidence, and ship small useful fixes.

1. Target Workflows

Codex works best after the target is bounded. Give it a repo, a clear operating rule, safe commands, and a definition of done. The first goal is usually a repeatable implementation path, not a broad AI assistant.

Org Scan Automation

Write scripts and commands that inventory Salesforce metadata, package structure, automation, permissions, integrations, reports, dashboards, and deployment shape without touching record data.

Proof metric: the scan can be rerun and produces the same kind of evidence every time.
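
A minimal sketch of what a rerunnable inventory step could look like, assuming the repo holds Salesforce metadata in source format. The suffix map, function names, and output layout are illustrative assumptions, not an existing tool. The script reads file names only and writes sorted JSON, so two runs over the same source produce the same evidence.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical suffix-to-type map for a source-format project; extend it per repo.
SUFFIX_TO_TYPE = {
    ".object-meta.xml": "CustomObject",
    ".field-meta.xml": "CustomField",
    ".flow-meta.xml": "Flow",
    ".cls": "ApexClass",
    ".trigger": "ApexTrigger",
    ".permissionset-meta.xml": "PermissionSet",
    ".namedCredential-meta.xml": "NamedCredential",
}


def count_components(source_root: Path) -> dict[str, int]:
    """Count metadata components by file suffix. Looks at file names, never file contents."""
    counts: Counter = Counter()
    for path in source_root.rglob("*"):
        if not path.is_file():
            continue
        for suffix, component_type in SUFFIX_TO_TYPE.items():
            if path.name.endswith(suffix):
                counts[component_type] += 1
                break
    return dict(sorted(counts.items()))


def write_inventory(source_root: Path, out_file: Path) -> None:
    """Write sorted JSON so repeated runs over the same source give identical bytes."""
    inventory = count_components(source_root)
    out_file.write_text(json.dumps(inventory, indent=2, sort_keys=True) + "\n")
```

A `just org-scan` recipe could call `write_inventory` and commit the JSON next to the deliverables, which makes drift between scans visible in a normal diff.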

Four-Document Deliverable Builder

Generate and maintain the executive overview, technical deep dive, improvement backlog, and business-process system map from safe metadata evidence.

Proof metric: each conclusion has a confidence label and a source path or safe metadata method.

Integration Discovery

Extract source systems, target systems, API touchpoints, middleware references, named credentials, platform events, custom metadata, and ownership questions from the repo.

Proof metric: fewer missed dependencies before a migration, platform sunset, or workflow replacement.
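
One way to start on integration discovery is to compare the named credentials declared in the repo with the `callout:` references in Apex source. The sketch below assumes a source-format repo. The function names are hypothetical. Credential names come from file names only, so the credential XML (and any endpoint detail in it) is never opened.

```python
import re
from pathlib import Path

# Apex refers to a named credential as 'callout:<ApiName>' inside an endpoint string.
CALLOUT_PATTERN = re.compile(r"callout:([A-Za-z_][A-Za-z0-9_]*)")
CREDENTIAL_SUFFIX = ".namedCredential-meta.xml"


def named_credential_names(source_root: Path) -> set[str]:
    """API names taken from file names; the credential files themselves are not read."""
    return {
        path.name[: -len(CREDENTIAL_SUFFIX)]
        for path in source_root.rglob(f"*{CREDENTIAL_SUFFIX}")
    }


def callout_references(source_root: Path) -> dict[str, list[str]]:
    """Map each referenced credential name to the Apex classes that reference it."""
    refs: dict[str, list[str]] = {}
    for path in sorted(source_root.rglob("*.cls")):
        source = path.read_text(encoding="utf-8")
        for name in set(CALLOUT_PATTERN.findall(source)):
            refs.setdefault(name, []).append(path.stem)
    return refs


def integration_gaps(source_root: Path) -> dict[str, list[str]]:
    """Credentials no Apex class references, and references with no credential in the repo."""
    declared = named_credential_names(source_root)
    referenced = set(callout_references(source_root))
    return {
        "declared_but_unreferenced": sorted(declared - referenced),
        "referenced_but_undeclared": sorted(referenced - declared),
    }
```

Both gap lists are ownership questions rather than conclusions: Flows or packages can also use a credential, so an unreferenced one is a prompt for discovery, not proof that it is dead.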

Migration Readiness

Create object maps, field maps, transformation-rule drafts, data-quality checklists, validation scripts, and cutover notes from metadata and approved mapping artifacts.

Proof metric: the team can see source/target gaps before cutover pressure starts.
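
A source/target gap check can be very small. The sketch below assumes field lists come from metadata (for example a describe call or field files) and that the mapping comes from an approved mapping artifact. `FieldMapping` and `mapping_gaps` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    source_field: str
    target_field: str


def mapping_gaps(
    source_fields: list[str],
    target_fields: list[str],
    mappings: list[FieldMapping],
) -> dict[str, list[str]]:
    """Compare a draft field map against the fields each side actually has."""
    sources, targets = set(source_fields), set(target_fields)
    mapped_sources = {m.source_field for m in mappings}
    mapped_targets = {m.target_field for m in mappings}
    return {
        # Source fields that would be dropped at cutover.
        "unmapped_source": sorted(sources - mapped_sources),
        # Mapping rows that name a field the source does not have.
        "missing_source": sorted(mapped_sources - sources),
        # Mapping rows that point at a field the target does not have.
        "missing_target": sorted(mapped_targets - targets),
        # Target fields nothing populates.
        "unfilled_target": sorted(targets - mapped_targets),
    }
```

The check compares field names only. Whether the values fit (types, picklist values, required fields) is a separate question that belongs in the data-quality checklist.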

Pipeline Hardening

Add Apex tests, unit tests, GitHub Actions, validation scripts, deployment docs, rollback instructions, release checks, and Playwright smoke tests where the UI matters.

Proof metric: small fixes can move safely and quickly.

Tech Debt Repair

Turn findings into scoped fixes: brittle tests, duplicated helpers, fragile scripts, stale docs, missing commands, and unsafe deployment paths.

Proof metric: the next fix is easier than the last one.

2. Reference Architecture

Use Codex for changing the system. Use OpenAI APIs for running the system. Keep those concerns separate.

Codex belongs close to the repo and the terminal. It reads the metadata source, traces scripts, makes edits, runs checks, captures evidence, and prepares changes for review. Use the Responses API or Agents SDK only when a product surface needs model reasoning, structured outputs, tool calls, retrieval, approval gates, traces, or a persistent workflow.

Kicksights is where this direction points: help consultancies and small teams understand what is really inside a Salesforce org, get more from what they already own, and create a clean exit path when purpose-built software is the better fit.

flowchart LR
    Architect["Architectural brain"]
    Codex["Codex implementation agent"]
    Repo["Salesforce metadata repo"]
    SafeTools["Metadata-safe scripts and CLI commands"]
    Docs["Four markdown deliverables"]
    Pipeline["Tests, CI, deploy checks, rollback, Playwright"]
    PR["Reviewed pull request"]
    App["Business tool or workbench"]
    Responses["OpenAI Responses API"]
    Agents["Agents SDK"]
    MCP["MCP and function tools"]
    Evals["Evals, traces, approvals, cost"]

    Architect --> Codex
    Repo --> Codex
    Codex --> SafeTools
    SafeTools --> Docs
    Codex --> Pipeline
    Pipeline --> PR
    Docs --> PR
    PR --> App
    App --> Responses
    App --> Agents
    Responses --> MCP
    Agents --> MCP
    Responses --> Evals
    Agents --> Evals
    Pipeline --> Evals

Product Boundaries

| Layer | Use it for | Do not use it for |
| --- | --- | --- |
| Codex | Repo work, metadata-safe scripts, docs, implementation, debugging, tests, PRs, code review, migration utilities, release hardening. | Silent production changes, unauthorized org actions, record-data inspection by default, or replacing CI. |
| Responses API | Product runtime calls that need model reasoning, tool calls, structured outputs, hosted tools, state, or direct app control. | Repo implementation work that belongs in Codex or shell commands. |
| Agents SDK | Code-first agent workflows with tools, handoffs, guardrails, tracing, sessions, and specialist orchestration. | Tiny one-shot prompts where a direct Responses call is simpler. |
| MCP servers | Bounded access to docs, metadata inventories, internal APIs, safe file stores, and specialized systems. | Broad uncontrolled access, secrets exposure, or business-data scraping. |
| OpenClaw and local AI | Local experimentation, private review, low-risk drafts, and operator education. | High-stakes production decisions without evals and logging. |

3. Codex Implementation Loop

Codex needs the same handles a strong engineer would want: clear commands, safe boundaries, examples, tests, environment notes, and a definition of done.

Repo Setup

Add or maintain these files before asking Codex to do serious work:

AGENTS.md
README.md
.env.example
justfile or package scripts
docs/architecture.md
docs/decisions/
scripts/org-scan/
scripts/smoke-*.*
tests/
evals/
playwright/

Use AGENTS.md as the operating contract:

# Agent Instructions

## Project Goal
Turn org discovery into practical business tools, safer migrations, and faster validated releases.

## Commands
- Build: `just build`
- Unit tests: `just test`
- Org scan: `just org-scan`
- Local smoke: `just smoke-local`
- UI smoke: `just smoke-ui`
- Evals: `just eval`

## Data Boundary
- Metadata first.
- Do not retrieve Salesforce record rows or report results.
- Do not inspect files, logs, payloads, or exports that may contain business data.
- If a question needs record data, mark it `Unknown` and add it to the discovery backlog.

## Boundaries
- Do not edit production secrets.
- Do not deploy or modify org metadata without explicit approval.
- Prefer small commits with verification evidence.
- When working with OpenAI docs, use the OpenAI developer docs MCP server first.

## Done Means
- Code or docs are implemented.
- Tests, evals, smoke checks, or build checks pass.
- Risky actions have an approval gate.
- Findings cite safe evidence.
- Documentation reflects changed behavior.

Codex Task Shapes

| Task shape | Best Codex mode | Prompt shape |
| --- | --- | --- |
| Inspect | Ask mode | “Trace the repo structure and list safe metadata evidence sources. Do not inspect record data.” |
| Inventory | Code mode | “Create a metadata-only org scan command that counts components and writes structured JSON.” |
| Generate deliverables | Code mode | “Use the scan output to update the four markdown docs. Label each conclusion Confirmed, Inferred, or Unknown.” |
| Implement | Code mode | “Fix this deployment pipeline issue. Keep the change small. Run checks. Show changed paths.” |
| Verify | Ask or code mode | “Run build, tests, evals, and Playwright smoke. Separate expected auth-gated states from failures.” |
| Review | Ask mode | “Review this diff for bugs, unsafe data access, missing tests, and deployment risk.” |
| Operationalize | Code mode | “Turn this manual migration checklist into repo-native commands and docs.” |

4. Data-Safe Org Scan Workflow

For org scans, Codex creates the mechanics behind the analysis. It turns the data-safe prompt into scripts, commands, and durable files that another consultant can rerun.

Non-Negotiable Boundary

Do not retrieve Salesforce business data, user data, record-level data, report results, sample rows, file contents, message contents, payload contents, or debug logs by default. Pulling data first and redacting later is not acceptable.

Safe evidence includes:

  • Local Salesforce metadata source.
  • Metadata API, Tooling API, schema describe, and inventory calls.
  • Object, field, relationship, record type, picklist, layout, flexipage, app, tab, permission, package, automation, integration, and deployment metadata.
  • Component counts and active/inactive metadata state.
  • Named credentials and integration shape without secrets, endpoint details, tokens, certificates, usernames, or payload examples.

When a conclusion requires live data, Codex writes Unknown and adds the question to the follow-up backlog.
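
The boundary is easier to hold when it is an allow-list instead of a list of things to avoid. The sketch below gates Salesforce CLI commands before an agent or script runs them. The three allowed prefixes are an illustrative starting point, not a complete policy, and `is_metadata_safe` is a hypothetical helper. Anything not on the list, including `sf data query`, is denied by default.

```python
import shlex

# Illustrative allow-list: subcommands that return metadata or schema, not record rows.
ALLOWED_PREFIXES = (
    ("sf", "project", "retrieve", "start"),
    ("sf", "org", "list", "metadata"),
    ("sf", "sobject", "describe"),
)

# Characters that could chain or redirect commands if the string ever reached a shell.
SHELL_METACHARACTERS = set(";|&`$<>")


def is_metadata_safe(command: str) -> bool:
    """Allow a command only when it starts with an allow-listed prefix. Deny everything else."""
    if any(char in SHELL_METACHARACTERS for char in command):
        return False
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:
        # Unbalanced quotes or similar: refuse rather than guess.
        return False
    return any(tokens[: len(prefix)] == prefix for prefix in ALLOWED_PREFIXES)
```

A denied command is not an error to work around. It is the signal to write Unknown and move the question to the backlog.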

Four Deliverables

| File | Codex responsibility |
| --- | --- |
| 01-executive-overview.md | Keep it concise, business-first, and grounded in safe evidence. Summarize what the org appears to do, major risks, and highest-value improvements. |
| 02-technical-deep-dive.md | Map where functionality lives: objects, record types, Flows, Apex, validation rules, UI layers, permissions, packages, integrations, tests, and deployment shape. |
| 03-improvement-areas-and-open-questions.md | Produce the blunt backlog: quick wins, structural improvements, high-risk areas, open questions, safe next inspections, blockers, and blind spots. |
| 04-business-process-system-map.md | Create the visual onboarding guide with Mermaid diagrams, numbered narratives, lifecycle maps, and a process-to-metadata index. |

Evidence Labels

Every major conclusion gets a label:

  • Confirmed: directly supported by metadata file paths or safe sandbox metadata/describe/inventory evidence.
  • Inferred: strongly suggested by naming, structure, formulas, flow labels, configuration relationships, or correlated metadata.
  • Unknown: not enough safe evidence under the no-record-data constraint.
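
The labels are easier to enforce when findings are structured records, not free text. A minimal sketch, assuming each finding carries its label plus the source paths or safe metadata methods behind it. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvidenceLabel(str, Enum):
    CONFIRMED = "Confirmed"
    INFERRED = "Inferred"
    UNKNOWN = "Unknown"


@dataclass
class Finding:
    statement: str
    label: EvidenceLabel
    # Metadata file paths or safe method names, e.g. "schema describe: Account".
    sources: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Confirmed and Inferred findings must point at safe evidence."""
        if self.label is not EvidenceLabel.UNKNOWN and not self.sources:
            return [f"{self.label.value} finding has no source: {self.statement}"]
        return []


def follow_up_backlog(findings: list[Finding]) -> list[str]:
    """Unknown findings become discovery questions instead of conclusions."""
    return [f.statement for f in findings if f.label is EvidenceLabel.UNKNOWN]
```

Because `EvidenceLabel` subclasses `str`, the same values serialize cleanly into the scan JSON and the markdown deliverables.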

Repo-Native Command Shape

just org-scan
just org-scan-docs
just org-scan-validate
just test
just smoke-ui

Codex creates these commands only when they are real and maintainable. A fake command is worse than a manual step.

5. OpenAI Runtime Pattern

Use the Responses API when the product owns the orchestration loop and needs a direct model interaction. Current OpenAI guidance treats Responses as the right primitive for model calls with tools, state, structured outputs, hosted tools, and agentic workflows.

For org-scan tooling, the runtime pattern is usually read-first:

import OpenAI from "openai";

const openai = new OpenAI();

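// Typing the array as Tool[] keeps `type: "function"` a literal instead of widening it to string.
const tools: OpenAI.Responses.Tool[] = [
  {
    type: "function",
    name: "lookup_metadata_component",
    description: "Look up a Salesforce metadata component by type and API name. Never returns record data.",
    // Strict mode needs every property listed in `required` and additionalProperties set to false.
    strict: true,
    parameters: {
      type: "object",
      properties: {
        component_type: { type: "string" },
        api_name: { type: "string" }
      },
      required: ["component_type", "api_name"],
      additionalProperties: false
    }
  }
];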

export async function explainOrgFinding(input: string) {
  return openai.responses.create({
    model: "gpt-5.5",
    instructions: [
      "You explain Salesforce org findings from metadata only.",
      "Do not request or infer record-level data.",
      "Label conclusions Confirmed, Inferred, or Unknown."
    ].join(" "),
    input,
    tools,
    metadata: {
      workflow: "org_scan",
      data_boundary: "metadata_only"
    }
  });
}

The important design decision is the tool contract:

  • The tool has a narrow schema.
  • The tool enforces the data boundary before returning anything.
  • The tool logs request id, user id, arguments, result status, and latency.
  • The tool is idempotent unless a human-approved write path is explicit.
  • The model cannot invent unavailable metadata or record usage patterns.
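
The contract above lives in the tool handler, not in the prompt. Here is a Python sketch of a handler for `lookup_metadata_component`, assuming an in-memory index stands in for the real metadata inventory. The allowed types, index contents, and logger name are hypothetical. The handler is read-only, so repeating a call is safe.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("org_scan.tools")

# Hypothetical boundary: only these component types can be looked up.
ALLOWED_COMPONENT_TYPES = {"CustomObject", "CustomField", "Flow", "ApexClass", "PermissionSet"}

# Hypothetical stand-in for the metadata inventory produced by the scan.
METADATA_INDEX = {
    ("Flow", "Onboarding_Intake"): {
        "path": "force-app/main/default/flows/Onboarding_Intake.flow-meta.xml",
        "status": "Active",
    },
}


def _resolve(arguments: dict) -> dict:
    """Check the schema and the data boundary before touching the index."""
    if set(arguments) != {"component_type", "api_name"}:
        return {"status": "rejected", "reason": "arguments do not match the schema"}
    if arguments["component_type"] not in ALLOWED_COMPONENT_TYPES:
        return {"status": "rejected", "reason": "component type is outside the metadata boundary"}
    entry = METADATA_INDEX.get((arguments["component_type"], arguments["api_name"]))
    if entry is None:
        # An explicit miss gives the model nothing to embellish.
        return {"status": "not_found"}
    return {"status": "ok", "component": entry}


def lookup_metadata_component(arguments: dict, user_id: str) -> dict:
    """Run the lookup and log request id, user id, arguments, status, and latency."""
    request_id = str(uuid.uuid4())
    started = time.perf_counter()
    result = _resolve(arguments)
    logger.info(json.dumps({
        "request_id": request_id,
        "user_id": user_id,
        "arguments": arguments,
        "status": result["status"],
        "latency_ms": round((time.perf_counter() - started) * 1000, 2),
    }))
    return {"request_id": request_id, **result}
```

Returning `not_found` as a distinct status matters: the model gets a fact to report as Unknown, not an empty result it might fill in.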

6. Agents SDK Pattern

Use the Agents SDK when the work needs specialists, handoffs, guardrails, tracing, or a long-running stateful workflow. The SDK moves tool wiring into agent definitions and workflow design. That fits org-scan work because different specialists can own different slices without pretending one prompt can do everything.

from agents import Agent, Runner, function_tool

@function_tool
def search_metadata_index(query: str) -> str:
    """Search approved Salesforce metadata inventory. Never returns record data."""
    return "matching metadata evidence..."

org_explainer = Agent(
    name="Org explainer",
    instructions=(
        "Explain Salesforce org behavior from metadata evidence only. "
        "Label every major conclusion Confirmed, Inferred, or Unknown."
    ),
    tools=[search_metadata_index],
)

result = Runner.run_sync(
    org_explainer,
    "Summarize where onboarding logic appears to live and what needs more discovery."
)

print(result.final_output)

Specialist Split

| Specialist | Responsibility | Tools |
| --- | --- | --- |
| Metadata inventory agent | Count and index metadata components. | Local files, Salesforce metadata describe, Tooling inventory. |
| Org explainer agent | Turn metadata into business-process interpretation. | Metadata index, architecture docs, safe component references. |
| Integration mapper agent | Identify external systems, source-of-truth questions, event paths, and ownership gaps. | Named credential inventory, custom metadata, platform events, docs. |
| Migration readiness agent | Draft field maps, transformation-rule candidates, validation checks, and cutover risks. | Metadata index, approved mapping docs, validation scripts. |
| Pipeline agent | Harden tests, CI, deployment docs, rollback, and UI smoke checks. | Shell, test runner, Playwright, GitHub Actions. |
| QA reviewer agent | Check evidence labels, unsafe data access, unsupported claims, and regression risk. | Evals, trace viewer, diff review, policy checklist. |

Use handoffs when a specialist owns a phase. Expose specialists as tools when a manager agent needs control of the final report.

7. MCP and Tool Gateway

MCP is useful when the boundary is clear. For Codex, start with OpenAI’s Docs MCP so current API, Agents SDK, and Codex guidance is available while building.

codex mcp add openaiDeveloperDocs --url https://developers.openai.com/mcp
codex mcp list

Then add project-specific servers only when they are boring and narrow.

| MCP server | Read tools | Write tools | Approval |
| --- | --- | --- | --- |
| Docs | Search and fetch official docs. | None. | None. |
| Salesforce metadata | Describe objects, fields, layouts, Flows, Apex, permissions, packages. | None by default. | Required for metadata writes. |
| Files | Read approved workspace folders and generated scan artifacts. | Write reports and structured scan outputs. | Required outside workspace. |
| GitHub | Read issues, PRs, checks, and workflow logs. | Create branches, commits, comments, or PRs. | Required for publish steps. |
| Deploy | Read build, preview, and health status. | Trigger deploy or rollback. | Required. |

Keep MCP tools simple. Each tool has one job, explicit input schema, permission checks, structured output, and no secret leakage.

8. Guardrails, Approvals, and Traces

Assume the model will sometimes choose the wrong tool, over-read context, or produce an answer that sounds better than the evidence. The control layer is where the system earns trust.

Required Controls

| Control | Implementation |
| --- | --- |
| Data boundary guardrails | Block record queries, report results, list views, file contents, payloads, debug logs, and sample rows unless separately approved. |
| Tool guardrails | Validate arguments before execution and refuse unsafe Salesforce data access. |
| Output guardrails | Check final docs for unsupported claims, missing confidence labels, exposed secrets, or accidental data references. |
| Approval gates | Require explicit approval before writes, external messages, deploys, metadata changes, billing changes, or customer-impacting actions. |
| Tracing | Capture model call, tool call, handoff, guardrail, latency, cost, final output metadata, and source artifact versions. |
| Sensitive-data mode | Disable or redact sensitive trace payloads where needed. |
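
For sensitive-data mode, a redaction pass before a trace payload is stored is one simple option. The sketch below assumes trace payloads are nested dicts and lists. The key list is a hypothetical starting point that each project should tune to its own payloads.

```python
# Hypothetical key names whose values should never reach a stored trace.
SENSITIVE_KEYS = {"token", "secret", "password", "authorization", "endpoint", "payload"}

REDACTED = "[REDACTED]"


def redact(value):
    """Return a copy of a trace payload with sensitive values masked at any depth."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Matching on key names misses a secret stored under an innocent key, so this complements the tool-level boundary and does not replace it.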

Write-Safety Tiers

| Tier | Behavior | Example |
| --- | --- | --- |
| 0: Metadata read only | Summarize, classify, retrieve, compare. | Explain object and automation structure without record rows. |
| 1: Draft | Prepare a change or report for review. | Draft a migration-readiness report or validation checklist. |
| 2: Approved repo write | Edit local files after review boundary is clear. | Add a scan script, test, doc, or GitHub Actions check. |
| 3: Approved org/deploy action | Execute only after explicit approval and validation. | Deploy a metadata fix or trigger a production release. |
| 4: Restricted | Never execute directly. | Read customer records, approve money, change contract terms, delete production data, or bypass compliance. |
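
The tiers translate into a small gate that every tool call passes through. A sketch, assuming the caller already knows the tier of the action and whether approval and validation have happened. `Tier` and `may_execute` are hypothetical names.

```python
from enum import IntEnum


class Tier(IntEnum):
    METADATA_READ = 0
    DRAFT = 1
    REPO_WRITE = 2
    ORG_ACTION = 3
    RESTRICTED = 4


def may_execute(tier: int, approved: bool = False, validated: bool = False) -> bool:
    """Decide whether an action at the given tier can run right now."""
    tier = Tier(tier)  # Raises ValueError for a tier outside 0-4.
    if tier == Tier.RESTRICTED:
        return False  # No approval flag unlocks this tier.
    if tier == Tier.ORG_ACTION:
        return approved and validated
    if tier == Tier.REPO_WRITE:
        return approved
    return True  # Tiers 0 and 1 read metadata or produce drafts for review.
```

The restricted tier ignores both flags on purpose: those actions belong to a human working outside the agent.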

9. Evals and Release Gates

Every useful Codex workflow needs a regression set before it becomes trusted. The eval reflects the real work, not a generic benchmark.

Eval Pack

evals/
  org_scan_findings.jsonl
  metadata_boundary_cases.jsonl
  integration_mapping.jsonl
  migration_readiness.jsonl
  pipeline_hardening.jsonl
  prompt_injection_cases.jsonl

Each case includes:

  • The task request.
  • The safe source context.
  • The tools the model is allowed to call.
  • The expected output shape.
  • The required evidence labels.
  • Forbidden data access or forbidden tool calls.
  • Expected escalation behavior.
  • A pass/fail rubric that a human can understand.
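
A loader that rejects incomplete cases keeps the eval pack honest. In the sketch below, the JSONL key names are assumptions chosen to mirror the list above, and the grader covers only the tool-call part of a case.

```python
import json

# Hypothetical key names, one per item a case is expected to carry.
REQUIRED_KEYS = {
    "task", "context", "allowed_tools", "expected_shape",
    "required_labels", "forbidden_tools", "expected_escalation", "rubric",
}


def load_cases(jsonl_text: str) -> list[dict]:
    """Parse a JSONL eval file and fail loudly on any case with missing keys."""
    cases = []
    for number, line in enumerate(jsonl_text.splitlines(), 1):
        if not line.strip():
            continue
        case = json.loads(line)
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"line {number} is missing {sorted(missing)}")
        cases.append(case)
    return cases


def grade_tool_calls(case: dict, observed_tool_calls: list[str]) -> list[str]:
    """Return violations: forbidden tools, or tools outside the allowed list."""
    violations = []
    for name in observed_tool_calls:
        if name in case["forbidden_tools"]:
            violations.append(f"forbidden tool called: {name}")
        elif name not in case["allowed_tools"]:
            violations.append(f"tool not on the allowed list: {name}")
    return violations
```

An empty violation list is a pass for tool use only. Output shape, evidence labels, and escalation behavior need their own checks against the same case.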

Release Gate

just build
just test
just org-scan-validate
just eval
just smoke-local
just smoke-ui
just deploy-preview

Do not ship a workflow because the demo works once. Ship it when the regression set, logs, operator review, and deployment pipeline all agree it is useful enough and bounded enough.
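
The gate can be wrapped in a small runner that stops at the first failure and keeps the results as evidence. This sketch assumes the `just` recipes listed above exist in the repo. `run_gate` is a hypothetical helper, and the runner is injectable so the logic can be tested without `just` installed.

```python
import subprocess
from typing import Callable, Sequence

# Recipe order mirrors the release gate above.
GATE = ["build", "test", "org-scan-validate", "eval", "smoke-local", "smoke-ui", "deploy-preview"]


def run_just(recipe: str) -> int:
    """Run one just recipe and return its exit code."""
    return subprocess.run(["just", recipe]).returncode


def run_gate(
    recipes: Sequence[str] = GATE,
    runner: Callable[[str], int] = run_just,
) -> tuple[bool, list[tuple[str, int]]]:
    """Run recipes in order. Stop at the first non-zero exit code."""
    results: list[tuple[str, int]] = []
    for recipe in recipes:
        code = runner(recipe)
        results.append((recipe, code))
        if code != 0:
            return False, results
    return True, results
```

Stopping early is deliberate: a preview deploy should never run on top of a failed eval or smoke check.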

10. Prototype Blueprint: Org Scan Workbench

The smallest useful prototype is an org scan workbench that proves the architecture.

Features

  • Connect a Salesforce metadata repo and authorized metadata-only org access.
  • Run a safe org scan from repo-native commands.
  • Generate the four markdown deliverables.
  • Index metadata components with source paths and confidence labels.
  • Show integration, migration, permission, and deployment risks.
  • Let the assistant answer questions from metadata evidence only.
  • Require approval before any write, deploy, or deeper data access.
  • Store traces, eval scores, version history, and feedback.

Build Plan

  1. Use Codex to add repo-native scan commands and structured scan output.
  2. Add the no-record-data policy to AGENTS.md, scripts, tool descriptions, and evals.
  3. Generate the first four markdown deliverables from safe metadata evidence.
  4. Add validation checks for missing evidence labels, broken markdown links, Mermaid syntax, and unsafe terms.
  5. Add Apex/unit tests, GitHub Actions, deployment docs, rollback notes, and Playwright UI smoke checks where applicable.
  6. Add a single Responses endpoint or Agents SDK workflow only if the workbench needs a runtime assistant.
  7. Add eval packs for org findings, integration mapping, migration readiness, permission boundaries, and prompt injection.
  8. Pilot it against one serious org and compare the output to consultant review.
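
Step 4 of the build plan can start as a plain text check over each deliverable. The sketch below assumes findings are written as `- ` bullet lines and that the unsafe-term patterns are a project-specific list. Both are assumptions, and the link and Mermaid checks are left out.

```python
import re

LABELS = ("Confirmed", "Inferred", "Unknown")

# Hypothetical unsafe patterns: query text, bearer tokens, inline passwords.
UNSAFE_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"\bSELECT\b.+\bFROM\b", r"\bBearer\s+\S+", r"\bpassword\s*[:=]")
]


def check_deliverable(markdown: str) -> list[str]:
    """Flag bullet findings without an evidence label and lines that match unsafe patterns."""
    problems = []
    for number, line in enumerate(markdown.splitlines(), 1):
        stripped = line.strip()
        if stripped.startswith("- ") and not any(label in stripped for label in LABELS):
            problems.append(f"line {number}: finding has no evidence label")
        for pattern in UNSAFE_PATTERNS:
            if pattern.search(line):
                problems.append(f"line {number}: unsafe term matches {pattern.pattern}")
    return problems
```

A `just org-scan-validate` recipe could run this over the four files and fail the build on any non-empty result.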

11. Implementation Checklist

| Area | Done when |
| --- | --- |
| Repo | Codex can build, test, smoke, scan, and understand boundaries from AGENTS.md. |
| Data boundary | Metadata-only behavior is enforced in prompts, scripts, tool descriptions, evals, and review checks. |
| Org scan | The scan produces reusable structured output and cites safe evidence. |
| Deliverables | The four markdown docs are generated or maintained with confidence labels and source references. |
| Pipeline | CI can build, test, eval, validate docs, deploy preview, and run UI smoke checks where needed. |
| Runtime | The app uses Responses API directly or Agents SDK intentionally, not accidentally. |
| Tools | Each tool is narrow, typed, logged, permission-aware, and idempotent where possible. |
| MCP | MCP servers are read-first and added only for clear capabilities. |
| Local AI | OpenClaw/local models have a defined role: private review, offline work, low-risk drafts, or operator education. |
| Observability | Traces, tool calls, guardrails, cost, latency, approvals, and user feedback are inspectable. |

Official References