ARC-AGI-3 · Autonomous Training Platform

A competition agent platform that runs its own experiments

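This deployment demonstrates the current end-to-end loop of the ARC3 project: the world-model agent imagines the outcomes of future actions, while the platform agent gathers material, calls DeepSeek through the local LiteLLM instance to propose experiments, runs training and evaluation, and generates a report.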

DeepSeek smoke passed · LiteLLM alias: opencode-go/deepseek-v4-flash · Submission offline check passed
Best score in smoke cycle
-3.157

The evaluation score from one short training round on the local mock ARC environment.

DeepSeek proposal
ε → 0.22

increase exploration

Generated report
HTML

The training report, metrics, and platform decisions have all been generated as local artifacts.

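How the platform agent works

1

Gather evidence

Reads the README, the algorithm notes, recent training metrics, and the previous cycle's platform decision.

2

Propose an experiment

Calls DeepSeek through the local LiteLLM instance; only training parameters within a safe range may be changed.

3

Run training

Runs the world-model agent and records replay, metrics, checkpoints, and imagined frames.

4

Evaluate and report

Generates the training report and writes the results to a structured log that the next cycle can read.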
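
The four stages above can be sketched as a single loop iteration. This is a minimal illustration under stated assumptions, not the project's actual code: `read_log`, `run_cycle`, the two stage callables, and the JSON-lines log layout are all hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable


def read_log(log_path: Path) -> list[dict]:
    """Stage 1: gather evidence, here just the records of earlier cycles."""
    if not log_path.exists():
        return []
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def run_cycle(
    log_path: Path,
    config: dict,
    propose: Callable[[list[dict], dict], dict],
    train_and_eval: Callable[[dict], dict],
) -> dict:
    """Run one gather -> propose -> train -> report cycle (hypothetical API)."""
    evidence = read_log(log_path)                 # 1. gather evidence
    proposal = propose(evidence, config)          # 2. propose an experiment
    new_config = {**config, **proposal["changes"]}
    metrics = train_and_eval(new_config)          # 3. run training
    record = {                                    # 4. evaluate and report
        "cycle": len(evidence),
        "proposal": proposal,
        "config": new_config,
        "metrics": metrics,
    }
    # Append one JSON object per line so the next cycle can read it back.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Passing the proposal and training stages in as callables keeps the loop itself testable without an LLM endpoint or a real training run.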

DeepSeek's suggestion for this cycle

{
  "rationale": "increase exploration",
  "changes": {
    "train.epsilon": 0.22
  },
  "expected_effect": "collect more diverse transitions",
  "risk": "short smoke run may be noisy"
}
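
One way to enforce the "safe range" rule on a proposal like this is to check every entry in `changes` against an allow-list before touching the config. The sketch below is an assumption about how such a guard could look; `SAFE_RANGES`, its bounds, and `apply_proposal` are hypothetical and not taken from the project.

```python
import json

# Hypothetical allow-list: parameter name -> (min, max). The bounds are
# illustrative only; the platform's real limits are not documented here.
SAFE_RANGES = {"train.epsilon": (0.0, 1.0)}


def apply_proposal(config: dict, proposal: dict) -> dict:
    """Return a copy of config with the proposed changes, or raise ValueError."""
    updated = dict(config)
    for key, value in proposal.get("changes", {}).items():
        if key not in SAFE_RANGES:
            raise ValueError(f"parameter not allowed: {key}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric value for {key}: {value!r}")
        low, high = SAFE_RANGES[key]
        if not low <= value <= high:
            raise ValueError(f"{key}={value} outside [{low}, {high}]")
        updated[key] = value
    return updated


proposal = json.loads(
    '{"rationale": "increase exploration",'
    ' "changes": {"train.epsilon": 0.22}}'
)
new_config = apply_proposal({"train.epsilon": 0.1}, proposal)
```

Rejecting unknown keys outright, rather than ignoring them, makes a malformed model response visible in the logs instead of silently dropping part of it.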

Command to reproduce the experiment

python3 -m arcagent.platform.agent_cli \
  --cycles 1 \
  --steps-per-cycle 80 \
  --model opencode-go/deepseek-v4-flash \
  --run-dir runs/platform_agent_deepseek_smoke2

Embedded training report snapshot

ARC3 Training Report

This report is generated from local logs only. It explains whether the agent is learning, how its world model is improving, and how the self-tuning controller changed training.

At a glance

episodes: 1 · metric rows: 5 · PBT events: 0

Episode return: latest -1.800, best -1.800, mean -1.800

World-model loss: latest 0.0814, best 0.0814

▁▅█

Beginner explanation

The player does not only react to the current grid. It first compresses the grid into a small latent state, predicts what each possible action may do next, and scores short imagined futures. The self-tuning platform then compares several training runs and keeps the settings that worked best.
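
The encode, predict, and score steps described above can be illustrated with a toy planner. Everything here is an assumption made for the example: the 8x8 grid, the four actions, the latent size, and the random untrained weights stand in for whatever the real world model learns, and exhaustive search over short action sequences is just one simple way to score imagined futures.

```python
import itertools

import numpy as np

N_ACTIONS = 4      # hypothetical action count
LATENT_DIM = 8     # hypothetical latent size
GRID_CELLS = 64    # flattened 8x8 grid, also an assumption

# Placeholder parameters; a trained world model would learn these.
rng = np.random.default_rng(0)
ENCODER = rng.normal(size=(LATENT_DIM, GRID_CELLS)) / np.sqrt(GRID_CELLS)
DYNAMICS = rng.normal(size=(N_ACTIONS, LATENT_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
REWARD_HEAD = rng.normal(size=LATENT_DIM)


def encode(grid: np.ndarray) -> np.ndarray:
    """Compress the grid into a small latent vector."""
    return np.tanh(ENCODER @ grid.astype(float).ravel())


def predict_next(latent: np.ndarray, action: int) -> np.ndarray:
    """Predict the latent state that follows taking `action`."""
    return np.tanh(DYNAMICS[action] @ latent)


def imagined_return(latent: np.ndarray, actions) -> float:
    """Roll the model forward in imagination and sum predicted rewards."""
    total = 0.0
    for action in actions:
        latent = predict_next(latent, action)
        total += float(REWARD_HEAD @ latent)
    return total


def plan(grid: np.ndarray, horizon: int = 3):
    """Score every action sequence of length `horizon`; return the first
    action of the best sequence together with the sequence itself."""
    latent = encode(grid)
    candidates = itertools.product(range(N_ACTIONS), repeat=horizon)
    best = max(candidates, key=lambda seq: imagined_return(latent, seq))
    return best[0], best


grid = np.arange(GRID_CELLS).reshape(8, 8) % 10
first_action, best_sequence = plan(grid)
```

Exhaustive search costs `N_ACTIONS ** horizon` rollouts, which is only practical for short horizons like the one used here.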

Recorded frames

Frame arrays are saved as compressed NumPy files for notebooks to animate. Latest files:

  • frames/episode_0000.npz
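
A notebook could open one of these files with `numpy.load`. The sketch below assumes nothing about the array names stored inside the archive (they are not documented here) and simply lists whatever is present; `load_frames` is a hypothetical helper.

```python
import numpy as np


def load_frames(path):
    """Return {array_name: array} for every array stored in a .npz file."""
    # NpzFile is a context manager; indexing inside the block reads each
    # array into memory before the archive is closed.
    with np.load(path) as archive:
        return {name: archive[name] for name in archive.files}


def describe(arrays):
    """Summarize each array's shape and dtype for a quick sanity check."""
    return {name: (arr.shape, str(arr.dtype)) for name, arr in arrays.items()}
```

For example, `describe(load_frames("frames/episode_0000.npz"))` would show which arrays the episode contains before any animation code is written.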