Richard Sutton on Experience

Jun 9, 2026 · Mountain View, California, United States

Exploring experiential learning, the “Era of Experience,” and the foundations of reinforcement learning

Note on method: These are real questions posed by me. The responses, compiled with AI assistance, are not authored by Richard Sutton; they are grounded in his documented positions and primary sources, with each answer citing the source it draws on. Quotation marks denote verbatim passages; everything else is paraphrase or analysis, flagged where it extrapolates beyond what Sutton states directly.

Primary sources referenced: Silver & Sutton, “Welcome to the Era of Experience” (2025); Sutton & Barto, “Reinforcement Learning: An Introduction” (2nd ed., 2018); “The Alberta Plan for AI Research” (2022); “Beyond Reward” (2011); the Dwarkesh Patel interview; and Sutton’s talk slides (incompleteideas.net).

Q1: If an agent’s only access to the world is the Internet — a corpus composed entirely of human input — how can it learn experientially?

The premise — “the agent’s access to the world is the Internet, which is only human input” — is the picture Sutton tells the field to abandon. A static corpus is not access to the world; it is access to a recording of other people talking about the world. Treating that corpus as the agent’s environment is what he calls the era of human data, which he argues is hitting a ceiling: “the pace of progress driven solely by supervised learning from human data is demonstrably slowing.” Imitating that text reaches competence, but “this approach in isolation has not and likely cannot achieve superhuman intelligence.”

The resolution is that the agent must no longer be confined to the corpus. Experience, in his framework, is “data that is generated by the agent interacting with its environment.” The defining move is action — an agent does things, observes what happens, and learns from that. Even on a network, an agent can run code and observe whether it executed, call an API and see the result, or interact with a user, a simulator, a market, or a tool and watch the consequence arrive in its own stream. None of that is a human pre-writing the answer; it is the world — or a digital piece of it — answering back. That action happens through the same interface as the agent’s other observations, not a privileged human-text channel, and any reward is grounded: a consequence of the agent’s action, not human prejudgement of it.

Supervised learning on the corpus has no experience in it — “supervised learning does not involve experience; it learns from special training data.” The instant the agent takes an action and the digital environment returns a result the agent did not write itself — code that fails, an API that responds, a user who reacts — there is a genuine action–observation–reward stream. The Internet — full of executable, interactive, consequence-returning systems — is one of the richest such environments once an agent stops reading it and starts acting in it. In Sutton’s view, human text is a usable starting point, but an agent that only imitates it is bounded by what its authors already knew — imitation alone, as quoted above, “has not and likely cannot” go beyond that to superhuman intelligence.
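The action–observation–reward stream described above can be sketched in a few lines of Python. This is a toy illustration, not code from Silver & Sutton: the candidate snippets, the `execute` environment, the target value, and the optimistic initial estimates are all assumptions chosen to keep the example small and deterministic. The only point is that the reward comes from what happens when the snippet runs, not from a human rating it beforehand.

```python
# Hypothetical candidate "programs" the agent can try; only the last one works.
CANDIDATES = ["1 / 0", "sum([1, 2, 3]", "sum([1, 2, 3])"]
TARGET = 6  # assumed task: produce the value 6


def execute(snippet):
    """Toy environment: run the snippet and return (observation, reward)."""
    try:
        result = eval(snippet, {"__builtins__": {"sum": sum}})
    except Exception as exc:
        # The failure itself is the observation; nobody pre-judged the action.
        return type(exc).__name__, 0.0
    return repr(result), 1.0 if result == TARGET else 0.0


def run_agent(steps=10):
    """Greedy agent with optimistic initial estimates (an illustrative choice)."""
    estimates = [1.0] * len(CANDIDATES)  # optimism makes every action get tried
    counts = [0] * len(CANDIDATES)
    for _ in range(steps):
        action = max(range(len(CANDIDATES)), key=lambda a: estimates[a])
        observation, reward = execute(CANDIDATES[action])  # act, then observe
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_agent()
```

After one failed attempt at each broken snippet, the agent settles on the snippet that actually runs, because only that action returned a reward from the environment.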

Source: “Era of Experience” (Silver & Sutton, 2025) for the quoted passages on data generated by interaction, slowing progress, and superhuman intelligence; “experience.pdf” talk slides for “supervised learning does not involve experience”; Dwarkesh Patel interview for the principle that an agent does things and sees what happens. The “era of human data is ending” framing and the digital-action examples paraphrase Silver & Sutton’s documented position.

Q2: How does Sutton reconcile grounded rewards with humans being part of the agent’s environment?

The Rewards section draws the boundary not at human vs. non-human, but at prejudgement vs. measured consequence.

The disallowed case (RLHF-style):

“Human-centric LLMs typically optimise for rewards based on human prejudgement: an expert observes the agent’s action and decides whether it is a good action… The fact that these rewards or preferences are determined by humans in absence of their consequences, rather than measuring the effect of those actions on the environment, means that they are not directly grounded in the reality of the world.”

The allowed case — a human in the environment reporting a consequence they experienced:

“Grounded rewards may arise from humans that are part of the agent’s environment. For example, a human user could report whether they found a cake tasty, how fatigued they are after exercising, or the level of pain from a headache… Such rewards measure the consequence of the agent’s actions within their environment, and should ultimately lead to better assistance than a human expert that prejudges a proposed cake recipe, exercise program, or treatment program.”

A human rating a proposed action is prejudgement (a ceiling); a human reporting an actual outcome they lived through is a grounded environmental signal. It is the same person — the difference is whether they are predicting how the action will turn out or reporting how it did. Sutton makes this a matter of principle in a footnote:

“Experience and human data are not exact opposites. For example, a dog learns entirely from experience, but human interaction is part of its experience.”

The mechanism — bi-level optimization:

“The reward function could be defined by a neural network that takes the agent’s interactions with both the user and the environment as input, and outputs a scalar reward… users could provide feedback during the learning process… This can also be understood as a bi-level optimisation process that optimises user feedback as the top-level goal, and optimises grounded signals from the environment at the low level. In this way, a small amount of human data may facilitate a large amount of autonomous learning.”

Human feedback sits at the top, defining and correcting what to optimize for, while the bulk of learning happens at the bottom from grounded environmental signals. Sutton frames this as a safety feature: misaligned reward functions “can often be incrementally corrected over time by trial and error” — his answer to the worry that an agent locked onto a badly specified goal pursues it forever. Experience can contain humans; it just cannot be replaced by them.
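One way to picture the two levels is a reward function whose weights over raw environmental signals are nudged by occasional user feedback. The sketch below is a heavy simplification under stated assumptions: the paper speaks of a neural network, whereas this uses a linear combination, and the signal layout, the least-mean-squares update, and the step size are all hypothetical.

```python
def learned_reward(weights, signals):
    """Low level: scalar reward computed from raw environmental signals."""
    return sum(w * s for w, s in zip(weights, signals))


def apply_user_feedback(weights, signals, user_rating, step_size=0.1):
    """Top level: move the reward function toward the outcome the user reported."""
    error = user_rating - learned_reward(weights, signals)
    return [w + step_size * error * s for w, s in zip(weights, signals)]


# Hypothetical interaction: two grounded signals, only the first one active.
weights = [0.0, 0.0]
signals = [1.0, 0.0]
for _ in range(100):  # repeated here only to show where the weights settle
    weights = apply_user_feedback(weights, signals, user_rating=1.0)
```

The user never scores individual actions here; the rating only reshapes how grounded signals are combined, and the agent would then optimise that combined signal on its own.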

Source: All quotes verbatim from Silver & Sutton, “Welcome to the Era of Experience” (2025), Rewards section, footnote 2, and Consequences section.

Q3: Is the next crucial innovation, then, maximizing the legibility of real-world consequences to the agent?

“Legibility” is an interpretive lens, not a phrase Sutton uses.

Where it matches the paper:

The paper’s own list of what this era must build lands close to the idea. In the Reinforcement Learning Methods section:

“This era will bring new ways to think about reward functions that are flexibly grounded in observational data. It will revisit value functions and methods to estimate them from long streams with as yet incomplete sequences… Novel approaches to world models will be developed that capture the complexities of grounded interactions. New methods for temporal abstraction will allow agents to reason, in terms of experience, over ever-longer time horizons.”

Read as the agent’s ability to connect an action to its often delayed, diffuse downstream effects, “legibility of environmental consequences” maps onto the credit-assignment + world-model + value-function cluster, which the paper names as the core technical work. The most direct match is the world-model passage: building a model that “predicts the consequences of the agent’s actions upon the world, including predicting reward.”

The tension the framing risks:

“Legibility” can be read two ways that point in opposite directions:

Making the agent better at extracting consequence-structure from raw signals itself — better value functions, world models, exploration, temporal abstraction. This is what the paper advocates.

Making consequences more legible to the agent by pre-digesting them — humans engineering clean, readable reward signals so the agent does not have to work them out from raw experience.

The second reading is, on Sutton’s own terms, the Bitter Lesson antipattern, and it reintroduces the very thing the Rewards section rules out: a hand-crafted “legible reward” is still a human deciding in advance what counts as good — prejudgement, relocated from the rating step to the reward design step. When the paper enumerates reward sources it deliberately lists raw quantities — “cost, error rates, hunger, productivity, health metrics, climate metrics, profit, sales, exam results” — and pushes the combining work onto the agent via a learned reward function, not a human-authored legible one.

Consequence-attribution is the bottleneck, and the paper treats value functions over long incomplete streams, world models of grounded interaction, and exploration as the hard, unsolved work. On Sutton’s position, the innovation is not increasing legibility of the environment, but increasing the agent’s capacity to learn legibility from raw experience. The moment “maximize legibility” becomes a human pre-processing step, it collides with both the Bitter Lesson and the anti-prejudgement principle.

Source: Silver & Sutton, “Era of Experience” (2025) — RL Methods, Planning and Reasoning, and Rewards sections (quotes verbatim). The “Bitter Lesson antipattern” judgment is a synthesis of Sutton’s documented positions; its application to “reward legibility” specifically is extrapolation, not a claim he makes verbatim.

Q4: What exactly does the value function — a single number — attribute?

The value function attributes, to a state (or a state–action pair), the total amount of future reward the agent can expect to accumulate from that point onward — under a given policy. It is not the immediate reward; it is a prediction of the whole future stream of reward, collapsed into one number.

Reward vs. value:

Signal	What it is	Time horizon
Reward	the immediate scalar signal from the environment at one step	now
Value	the expected cumulative (discounted) future reward from a state onward	the whole rest of the stream

“Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.”

“Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary… Nevertheless, it is values with which we are most concerned when making and evaluating decisions… we seek actions that bring about states of highest value, not highest reward.”

Formally:

State-value vπ(s) = Eπ[G_t | S_t = s]: the expected return (the discounted sum of future rewards) starting from state s and acting under policy π.

Action-value qπ(s,a) = Eπ[G_t | S_t = s, A_t = a]: the same, but committing to action a first.

The return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … is the actual future stream; the value function is its expectation. The number attributes how much total future reward this situation (or this action here) is worth.
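As a concrete check on the formula, the return for a finite reward sequence can be computed by working backwards with G_t = R_{t+1} + γG_{t+1}. The reward values and discount factor below are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t for the reward sequence R_{t+1}, R_{t+2}, ... (finite here)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g


# 1 + 0.5 * 0 + 0.25 * 2 = 1.5
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

A value function is the expectation of this quantity over the futures the policy and environment can produce; a single computed return is just one sample of it.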

Relation to consequence-attribution:

A consequence in an experiential stream is usually delayed and diffuse — an action now pays off (or costs) many steps later. The value function folds that long, diffuse future back into a single present scalar, so the agent can make a local decision (pick the highest-value action) that is nonetheless correct about long-run consequences. This is why the Era of Experience paper singles out “value functions and methods to estimate them from long streams with as yet incomplete sequences” as core unsolved work, and why Sutton treats learning to predict (temporal-difference learning, his 1988 contribution) as the engine.

Source: Sutton & Barto, “Reinforcement Learning: An Introduction” (2nd ed., 2018), Ch. 1 (reward-vs-value passages, verbatim/near-verbatim), Ch. 3 §3.3 (return), and Ch. 3 §3.5 (formal vπ, qπ). The tie to “long streams / incomplete sequences” is verbatim from Silver & Sutton, “Era of Experience.”

Q5: Is time horizon the only distinction between reward and value functions?

No. Time horizon is one axis of at least five. The one Sutton treats as most consequential is who produces the number.

Dimension	Reward	Value
Time horizon	immediate, one step	expected cumulative future return
Source	given by the environment — exogenous	estimated by the agent — internal
Epistemic status	an observed fact (it just arrives)	a prediction — can be wrong, continually corrected
Role	primary — defines the goal	secondary — a means to get more reward
Policy-dependence	a property of the problem R(s,a,s′)	defined relative to a policy (vπ, qπ)

Given vs. estimated:

“Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.”

Reward is handed to the agent; it does not compute it, cannot argue with it, and (in this framing) cannot be “wrong” about it — it is the ground truth that defines the problem. Value is something the agent must construct and keep correcting from experience, which is why TD learning, the engine of correction, is where almost all the algorithmic work lives.

Primary vs. secondary:

“Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward.”

Reward is the goal; value is instrumental. A reward can exist with no value function (a bandit, or a myopic agent); a meaningful value function cannot exist without a reward to predict.

Policy-dependence:

A value is only defined with respect to a policy: vπ(s) = Eπ[G_t | S_t = s]. The same state has different values under different policies. Reward has no such dependence — R is fixed by the environment; which rewards are encountered depends on behavior, but the reward function itself does not change with π.
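A small worked example makes the point. Assume a hypothetical problem (not from the book): from state A, the action "stay" ends the episode with reward 1, while "go" gives reward 0 and leads to state B, from which the episode ends with reward 10; take γ = 0.9. The reward function is identical in every line below; only the policy differs.

```latex
\begin{align*}
v_{\pi_{\text{stay}}}(A) &= 1, \\
v_{\pi_{\text{go}}}(A) &= 0 + \gamma \cdot 10 = 0.9 \times 10 = 9, \\
v_{\pi_p}(A) &= (1 - p)\cdot 1 + p \cdot 9
  \quad \text{for a policy that chooses ``go'' with probability } p.
\end{align*}
```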

Reward and value differ on at least five axes — horizon, source, epistemic status, role, and policy-dependence. The deepest in Sutton’s framing is given-by-the-environment vs. estimated-by-the-agent, because that is what makes value the thing learning algorithms actually have to work for.

Source: Sutton & Barto, “Reinforcement Learning: An Introduction” (2nd ed., 2018) — Ch. 1 §1.3 (both block quotes verbatim/near-verbatim) and Ch. 3 §3.5 (policy-dependence of vπ, qπ).

Q6: How is a policy defined?

A policy is the agent’s behavior rule: a mapping from states to actions (or to probabilities over actions) — given what the agent perceives, what it does next.

Intuitive definition:

“A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.”

Sutton treats it as central: “the policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior.”

Formal definition:

“Formally, a policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy π at time t, then π(a | s) is the probability that A_t = a if S_t = s.”

So π is not necessarily “state → one action.” In general it is “state → a distribution over actions”:

Type	Form	Meaning
Deterministic	π(s) = a	in state s, always take action a
Stochastic	π(a | s) ∈ [0,1], Σₐ π(a | s) = 1	in state s, take action a with probability π(a | s)

Deterministic is the special case where the distribution puts probability 1 on one action. Stochasticity matters for exploration and for problems where optimal behavior genuinely is to randomize.
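The two forms can be written down directly as data. In this sketch the state names, action names, and probabilities are all hypothetical; it only illustrates that a stochastic policy is a distribution per state and that a deterministic one is the degenerate case.

```python
import random

# Deterministic policy: one action per state.
DETERMINISTIC = {"s0": "left", "s1": "right"}

# Stochastic policy: a probability for each action in each state.
STOCHASTIC = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.0, "right": 1.0},  # probability 1 on one action
}


def sample_action(policy, state, rng):
    """Draw an action A_t with probability pi(a | s)."""
    actions = list(policy[state])
    probabilities = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=probabilities, k=1)[0]


rng = random.Random(0)
draws = [sample_action(STOCHASTIC, "s0", rng) for _ in range(1000)]
```

In state s1 the stochastic policy always returns "right", which is exactly what the deterministic table says for that state.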

How it connects:

Value is defined relative to a policy. vπ(s) and qπ(s,a) mean “expected return if the agent behaves according to π.” Change π and the values change.

The goal of RL is to find a good policy. Reward defines the goal; value functions are estimated as a means; the point of those estimates is to improve the policy toward the optimal policy π*, which maximizes expected return from every state. The standard loop is policy evaluation → policy improvement: estimate vπ, make π greedier with respect to it, repeat.
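The evaluation and improvement loop can be sketched on a tiny tabular problem. Everything specific here is an assumption for illustration: the two-state environment, its rewards, γ = 0.9, the restriction to deterministic policies, and the fixed number of evaluation sweeps in place of a convergence test.

```python
GAMMA = 0.9
STATES = ("A", "B")
ACTIONS = ("stay", "go")
# Hypothetical environment: (state, action) -> (next_state, reward); "T" is terminal.
DYNAMICS = {
    ("A", "stay"): ("T", 1.0),
    ("A", "go"): ("B", 0.0),
    ("B", "stay"): ("T", 10.0),
    ("B", "go"): ("T", 10.0),
}


def evaluate(policy, sweeps=50):
    """Policy evaluation: estimate v_pi by repeated one-step backups."""
    values = {"A": 0.0, "B": 0.0, "T": 0.0}
    for _ in range(sweeps):
        for state in STATES:
            next_state, reward = DYNAMICS[(state, policy[state])]
            values[state] = reward + GAMMA * values[next_state]
    return values


def improve(values):
    """Policy improvement: act greedily with respect to the current values."""
    def q(state, action):
        next_state, reward = DYNAMICS[(state, action)]
        return reward + GAMMA * values[next_state]
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in STATES}


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


best_policy, best_values = policy_iteration({"A": "stay", "B": "stay"})
```

Starting from "always stay", one round of evaluation shows that B is worth 10, so the improved policy moves from A to B and the value of A rises from 1 to 9.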

In Sutton’s element-list: reward = the goal signal (given), value = prediction of long-run reward (estimated), policy = the behavior rule being optimized, and model (when present) = the agent’s model of the environment.

Source: Sutton & Barto, “Reinforcement Learning: An Introduction” (2nd ed., 2018) — Ch. 1 §1.3 (intuitive definition + “core of the agent,” verbatim/near-verbatim) and Ch. 3 §3.5 (formal π(a|s), verbatim). The optimal-policy / policy-iteration framing is from Ch. 3 §3.6 and Ch. 4 (paraphrased).

Q7: Can networks perceive time? Is time built into how a network understands the world, or just memorized from human descriptions of how long things take?

Sutton’s framework agrees with the intuition for today’s networks — but sharpens why, and locates the failure in the training setup rather than in networks as such.

Two different senses of “having” time:

Aspect	Human-data-trained network (LLM)	Sutton’s experiential agent
What time is	a topic in the data — descriptions of durations, dates, sequences written by humans	the structure of experience itself — the index of the stream
Where horizons come from	human prejudgement: raters and labels encode how long a task “should” take	the agent’s own discounting and goals over its lived stream
Continuity	“little or no information carries over from one episode to the next, precluding any adaptation over time”	“information is carried across the entire stream” over years

For an LLM, temporal knowledge is exactly what the question suggests: memorized human statements about time, learned from “special training data” with no temporal stream during learning. There is no “now,” and no before-and-after that the network itself lived through. On Sutton’s terms this is the general grounding failure applied to time: the network holds claims about temporal scale that it has no way to test against any stream of its own, so they can only be inherited claims, not self-verified knowledge.

In Sutton’s formalism, time is not perceived — it is constitutive:

A correction to the question’s framing: in Sutton’s agent, time is not something perceived through a sensor, like color. Experience is defined as three time series — “these three time series — observation, action, and reward — constitute the experience of the agent” (Alberta Plan) — so temporal order is the skeleton everything else hangs on, not one signal among others. The Alberta Plan also demands temporal uniformity: the same learning computation at every step, with no privileged training phase versus deployment phase.

Time then enters the agent’s understanding through three mechanisms, all grounded in its own stream:

TD learning — Sutton’s foundational method is literally learning from temporal succession: adjusting predictions using the difference between estimates at consecutive time steps. Temporal structure is the learning signal itself (“Learning to Predict by the Methods of Temporal Differences,” 1988).

Discounting / return — the horizon is set by γ in G_t = R_{t+1} + γR_{t+2} + …: a fact about the agent’s own objective over its own future, not a human’s evaluation of the task’s timescale.

Temporal abstraction (options) — multi-scale time is built out of the stream: “new methods for temporal abstraction will allow agents to reason, in terms of experience, over ever-longer time horizons” (Era of Experience; framework from Sutton, Precup & Singh 1999).
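In symbols, the first two mechanisms are compact. The one-step TD update below is the standard tabular form; the step size α and the numeric value of γ are illustrative assumptions, and reading 1/(1 − γ) as an "effective horizon" is a common rule of thumb rather than something the sources above state.

```latex
\begin{align*}
\delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \\
V(S_t) &\leftarrow V(S_t) + \alpha\,\delta_t, \\
\sum_{k=0}^{\infty} \gamma^{k} &= \frac{1}{1-\gamma}
  \quad\Rightarrow\quad \gamma = 0.99 \text{ gives a total weight of } 100 \text{ steps.}
\end{align*}
```

The TD error δ_t compares two of the agent's own estimates one step apart, so the learning signal is available at every step of the stream without waiting for a final outcome.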

Sutton’s own evidence that today’s networks are not natively temporal:

The strongest support for the intuition comes from Sutton’s recent empirical work: “Loss of Plasticity in Deep Continual Learning” (Dohare, …, Sutton, Nature 2024) shows that standard deep-learning methods progressively lose the ability to learn at all when trained continually on a stream — backpropagation-style training implicitly assumes a static dataset, not ongoing time. The Era of Experience likewise notes the era of human data produced RL methods “designed for short episodes of ungrounded, human interaction… not suitable for long streams of grounded, autonomous interaction.”

Sutton’s diagnosis splits the question’s claim in two:

Right, about current systems: their temporal “understanding” is mimicry of human-described timescales, and their architectures demonstrably degrade under genuinely temporal (continual) training.

Not a limit in principle: nothing about networks excludes time — the exclusion comes from training on static human data in short episodes. Put the network in a lifelong stream with TD-style learning, discounted objectives, and temporal abstraction, and time stops being memorized content and becomes the medium the agent learns in.

Source: Silver & Sutton, “Era of Experience” (2025) — Streams and RL Methods sections (quotes verbatim). Alberta Plan (arXiv 2208.11173) — three-time-series definition (verbatim) and temporal-uniformity principle (paraphrase). Sutton (1988), “Learning to Predict by the Methods of Temporal Differences” (paraphrase of the method). Sutton & Barto (2018), Ch. 3 (return/discounting). Sutton, Precup & Singh (1999) on options (paraphrase). Dohare et al., “Loss of Plasticity in Deep Continual Learning,” Nature 2024 (paraphrase of central finding; Sutton is senior author). The two-column table and the “constitutive vs. perceived” framing are synthesis, not Sutton’s verbatim claims.

Q8: What innovations are prerequisites for a network to learn experientially, with time as a constitutive element — and which frontier ML efforts would most accelerate this?

The prerequisites can be drawn almost entirely from Sutton’s own roadmap documents; the frontier efforts are partially grounded in what the Era of Experience paper itself cites, with additions beyond his corpus flagged as such.

The prerequisites, per Sutton’s writings:

The Era of Experience names the missing machinery explicitly — “the era of experience presents an opportunity to revisit and improve classic RL concepts” — and the Alberta Plan structures a 12-step research program around the same gaps. Consolidated:

Continual learning without loss of plasticity. The foundational blocker: Sutton’s Nature 2024 result shows standard backprop-trained networks progressively lose the ability to learn under continual training. An experiential agent learns at every step of a lifelong stream, so this must be solved first — his lab’s proposed fix is continual backpropagation, selectively re-initializing low-utility units to preserve plasticity (Dohare et al., Nature 2024).
Temporally uniform, streaming learning. No train/deploy split, no episodes — the same bounded computation at every time step, learning online from the stream as it arrives. An Alberta Plan design principle; the Era of Experience flags that current methods are “designed for short episodes of ungrounded, human interaction… not suitable for long streams of grounded, autonomous interaction.”
Value functions for long, incomplete streams. “It will revisit value functions and methods to estimate them from long streams with as yet incomplete sequences.” Relatedly, Sutton argues continuing (non-episodic) problems need the average-reward formulation rather than discounted-episodic bookkeeping — an Alberta Plan emphasis.
Real-world exploration. “Principled yet practical methods for real-world exploration that discover new behaviours that are radically different from human priors.” RLHF-era systems inherited human priors instead of exploring; experiential agents must generate their own novelty.
World models of grounded interaction + planning. A learned model that “predicts the consequences of the agent’s actions upon the world, including predicting reward,” continually corrected as predictions fail — the Dyna lineage (Sutton 1990).
Temporal abstraction. Options (Sutton, Precup & Singh 1999) so the agent can reason “over ever-longer time horizons” — without this, a lifelong stream is computationally unreachable at the 0.1-second grain.
Grounded, adaptable reward. The bi-level scheme from the Rewards section: a learned reward function combining raw environmental signals, tuned by sparse human goal-feedback.
Predictive knowledge representation. Knowledge stored as testable predictions of the stream — general value functions (GVFs) / Horde in Sutton’s earlier work — so the agent can verify and correct what it knows autonomously (Beyond Reward 2011; Alberta Plan).
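To make the average-reward item concrete, here is a minimal tabular sketch of differential TD(0): the agent tracks an estimate of its reward rate and learns values relative to it, with no discounting and no episode boundaries. The two-state cycle, its rewards, and both step sizes are assumptions for illustration; this is a textbook-style form, not an algorithm taken from the Alberta Plan.

```python
def differential_td(steps=4000, alpha=0.1, beta=0.01):
    """Differential TD(0) on a hypothetical two-state cycle 0 -> 1 -> 0 -> ..."""
    rewards = [0.0, 2.0]   # reward received on leaving state 0 or state 1
    values = [0.0, 0.0]    # differential value estimates
    avg_reward = 0.0       # running estimate of reward per step
    state = 0
    for _ in range(steps):
        next_state = 1 - state
        # TD error measured relative to the average reward, with no discount.
        td_error = rewards[state] - avg_reward + values[next_state] - values[state]
        values[state] += alpha * td_error
        avg_reward += beta * td_error
        state = next_state
    return avg_reward, values


avg_reward, values = differential_td()
```

On this stream the reward rate settles near 1 per step, and the value gap between the two states settles near −1, reflecting that state 0 is one step further from the reward of 2 than state 1 is.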

Notable frontier efforts mapped to those prerequisites:

Prerequisite	Effort	Sourcing
Continual plasticity	Continual backprop (Sutton’s lab, Alberta/Amii); his work at Keen Technologies with John Carmack on this agenda	Sutton’s own work
Streaming learning	“Streaming deep reinforcement learning finally works” (Elsayed, Vasan, Mahmood 2024)	cited in Era of Experience [11]
Self-generated experiential data at scale	AlphaProof — ~100K human proofs seeded, then 100M self-generated through interaction with a formal prover; DeepSeek-R1 — “we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies”	both cited in the paper (DeepSeek quote is the paper’s quotation)
Rich grounded action spaces	Computer-use agents: Anthropic computer use, Project Mariner, OpenAI Operator — “the same interface that humans use to operate a computer”	cited in the paper [3, 15, 24]
Grounded reward from execution	Code RL on execution feedback (CodeRL, self-debugging, RLEF) — “the agent actually runs code and observes what happens”	cited in the paper [17, 7, 12]
World models + planning	MuZero lineage (planning with a learned model); Dreamer-style model-based agents	MuZero cited in the paper [29]; Dreamer is an addition beyond Sutton’s corpus
Long-horizon agentic RL	RL with verifiable rewards driving reasoning models (the o-series / R1 wave); increasingly long-horizon agent training on real tasks	assessment beyond Sutton’s corpus — though the paper’s “Why Now?” section argues exactly this combination makes the transition “imminent”

The shape of the answer:

Sutton’s view, condensed: the prerequisites are not exotic new ideas but the classic RL concepts rebuilt for unbounded streams — plasticity, streaming value estimation, exploration, world models, options, grounded reward. The era of human data side-stepped each of them (RLHF replaced value functions with human judgment; human priors replaced exploration; human-term reasoning replaced world models), and the work now is putting them back at scale. The accelerants he and Silver themselves point to are the systems that already close an experiential loop in a digital environment — theorem provers, code execution, and computer-use interfaces — because each one gives the network a stream where time, action, and consequence are its own.

Source: Silver & Sutton, “Era of Experience” (2025) — RL Methods, Planning and Reasoning, Actions and Observations, Why Now? sections (verbatim where quoted; citations [3,7,11,12,15,17,24,29] are the paper’s own). Alberta Plan (arXiv 2208.11173) — temporal uniformity, average reward, 12-step program (paraphrase). Dohare et al., Nature 2024 (paraphrase). Sutton 1990 (Dyna); Sutton, Precup & Singh 1999 (options); Beyond Reward 2011 (GVF / predictive knowledge — paraphrase). Items marked as beyond Sutton’s corpus (Dreamer, the RLVR / reasoning-model wave) reflect an assessment as of early 2026, not his documented positions.

Q9: What should the design principles be for a network that predominantly learns experientially?

The Alberta Plan is a design document for exactly such an agent, and Sutton’s older essays supply the epistemic principles. Consolidated into a design charter:

The stream is primary. Design everything around the action–observation–reward time series, because it is all the agent has: “only experience is available to the agent, and the environment is known only as a source and sink for these signals.” Knowledge, state, and goals must all be definable in terms of the stream, “not in variables internal to the environment.”
The verification principle. “An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself” (Sutton’s essay “Verification, The Key to AI,” ~2001). Design consequence: store knowledge as predictions of the stream (general value functions), because predictions are the one knowledge form the agent can check autonomously by comparing them to what arrives.
Temporal uniformity — no phases. No train/deploy split, no episodes, no special “learning period”: the same computation runs at every time step, forever.
Real-time, bounded per-step compute. The agent must keep up with its stream — processing per step must be constant and affordable at the stream’s grain (~0.1s in Sutton’s framing). An algorithm whose cost grows with history length violates the design.
Plasticity is a design requirement, not an assumption. Sutton’s Nature 2024 result shows standard deep-learning training destroys the ability to keep learning; an experiential network must be built to preserve it (e.g., continual backpropagation’s utility-based reinitialization).
The world is bigger than the agent — approximate and track, never converge. The environment vastly exceeds the network’s capacity, so exact solutions and convergence are the wrong targets; the agent should continually track a changing best answer rather than converge to a fixed one.
General methods that scale; meta-learn what would otherwise be hand-engineered. The Bitter Lesson applied to architecture: do not build in human knowledge of the domain — build in the ability to learn representations, step-sizes, and features from the stream, so the design improves with compute.
The four-part agent. Sutton’s “common model of the intelligent agent”: perception (recursively constructing agent-state from the stream), reactive policy + value functions, a transition/world model, and planning with that model (Dyna). Reasoning is grounded because it runs through the learned model of consequences, “rather than reasoning solely in human terms.”
Reward is the goal; keep it grounded and adaptable. One scalar defines the objective (the reward hypothesis); ground it in environmental signals, with the bi-level scheme letting sparse human feedback steer which signals matter.

As a table:

Principle	Design consequence	Source
Stream is primary	all knowledge/state defined over the action–observation–reward series	Alberta Plan (verbatim)
Verification	knowledge = self-checkable predictions (GVFs)	“Verification” essay; Beyond Reward
Temporal uniformity	no train/deploy split; learn every step	Alberta Plan (paraphrase)
Real-time bound	constant per-step compute	Alberta Plan (paraphrase)
Plasticity by design	continual backprop or equivalent	Nature 2024
Big world	approximate + track, never converge	lab position (paraphrase)
Bitter Lesson	meta-learn, don’t hand-engineer	Bitter Lesson essay
Four-part agent	perception / policy+value / model / planning	common-model paper, Alberta Plan
Grounded reward	environmental signals + bi-level tuning	Era of Experience
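Several of these principles can be seen together in a very small streaming learner: knowledge held as a prediction about the stream, checked against what arrives at every step, with constant per-step compute and no stored history. The sketch below is an illustration under stated assumptions, not the Alberta Plan's agent: the three-step cyclic stream, the one-hot features, the cumulant, γ = 0.5, and the step size are all hypothetical, and the learner is plain linear TD(0) predicting a non-reward signal in the spirit of a general value function.

```python
class StreamingPredictor:
    """Linear TD(0) prediction of a discounted sum of a cumulant signal."""

    def __init__(self, num_features, gamma=0.5, step_size=0.1):
        self.weights = [0.0] * num_features
        self.gamma = gamma
        self.step_size = step_size

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, cumulant, next_features):
        """One step of learning; cost depends on feature count, not on history."""
        td_error = (cumulant + self.gamma * self.predict(next_features)
                    - self.predict(features))
        self.weights = [w + self.step_size * td_error * x
                        for w, x in zip(self.weights, features)]
        return td_error  # how wrong the prediction was: the self-check


def one_hot(index, size=3):
    return [1.0 if i == index else 0.0 for i in range(size)]


def run_stream(steps=5000):
    """Hypothetical endless stream cycling 0 -> 1 -> 2 -> 0; signal fires at 2."""
    learner = StreamingPredictor(num_features=3)
    state = 0
    for _ in range(steps):  # same computation at every step, no separate phases
        next_state = (state + 1) % 3
        cumulant = 1.0 if state == 2 else 0.0
        learner.update(one_hot(state), cumulant, one_hot(next_state))
        state = next_state
    return learner.weights


weights = run_stream()
```

For this stream the predictions settle near 2/7, 4/7, and 8/7 for the three positions in the cycle, which is the solution of v(s) = c + γ v(s′) around the loop; the learner reaches it using only its own prediction errors.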

The decision it drives:

The governing commitment is the conjunction of the first two principles (the stream is primary; the verification principle): build a network whose every piece of knowledge is a prediction about its own stream, verified and corrected at every time step within a fixed compute budget. The other principles (plasticity, uniformity, tracking, the four-part architecture) follow from that commitment. Today’s LLM design makes the opposite choice on each element: knowledge is human text, verification is by human raters, and the network is frozen after training.

Source: Alberta Plan (arXiv 2208.11173) — stream-grounding quotes verbatim; temporal uniformity, real-time constraint, four-part base agent paraphrased. “Verification, The Key to AI” (incompleteideas.net, ~2001) — principle quoted verbatim. Dohare et al., Nature 2024 (paraphrase). “The Bitter Lesson” (incompleteideas.net, 2019) — paraphrase. Sutton, “The Quest for a Common Model of the Intelligent Decision Maker” (2022) — paraphrase. Silver & Sutton, “Era of Experience” (2025) — “reasoning solely in human terms” verbatim. The big-world/tracking item is a Sutton-lab position paraphrased with moderate confidence.

Q10: Which systems already use this experiential loop in digital environments?

The Era of Experience paper catalogs these systems as the evidence base for its “Why Now?” argument. They fall into two generations, distinguished by the kind of digital environment:

Generation 1 — closed simulators (the “era of simulation”):

“RL methods equalled or exceeded human performance through self-play in board games such as backgammon, Go, chess, poker and Stratego; video games such as Atari, StarCraft II, Dota 2 and Gran Turismo; dextrous manipulation tasks such as Rubik’s cube; and resource management tasks such as data center cooling” (verbatim). So: TD-Gammon, AlphaGo/AlphaZero, DQN, AlphaStar, OpenAI Five, GT Sophy, Libratus/DeepStack, DeepNash.

These fully closed the loop — act, observe, get grounded reward, improve — but in “closed problems with singular, precisely defined rewards.” The paper’s verdict: they “did not leap the gap between simulation… to reality (open-ended problems with a plurality of seemingly ill-defined rewards).”

Generation 2 — open digital environments (the current wave):

System	Digital environment	Action → grounded consequence	Sourcing
AlphaProof	Lean formal proving system	propose proof step → prover accepts/rejects; “initially exposed to around a hundred thousand formal proofs… subsequently generated a hundred million more through continual interaction with a formal proving system” (verbatim)	cited in paper [20]
DeepSeek-R1	verifiable math/code problems	emit reasoning + answer → checked correct/incorrect; “we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies” (paper quoting DeepSeek)	cited in paper [10]
Execution-feedback code agents (CodeRL, self-debugging, RLEF)	code interpreter / test suite	write code → “the agent actually runs code and observes what happens” (verbatim)	cited in paper [17, 7, 12]
Computer-use agents (Anthropic computer use, Project Mariner, OpenAI Operator)	the OS/browser itself — “the same interface that humans use to operate a computer” (verbatim)	click/type → screen state changes	cited in paper [3, 15, 24]

What separates the generations:

The grounding signal’s scope. Generation 1’s environments had one built-in, perfectly defined reward (win/lose, score). Generation 2’s environments return consequences (proof verified, tests passed, page loaded) in open-ended task spaces — closer to the paper’s target of “rewards… flexibly connected to any of an abundance of grounded, real-world signals” (verbatim). The pattern across every working example: the environment contains an automatic verifier — a prover, a compiler, a test suite, a rendered screen — so the consequence arrives without a human prejudging it. That is also the current boundary: domains with no automatic verifier (open-ended dialogue, advice, research quality) do not yet have a working experiential loop, which is the gap the paper’s bi-level reward proposal is aimed at.
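
The verifier pattern can be sketched in a few lines. This is an illustration under stated assumptions, not how any system in the table is built: the candidate programs, the three-case test suite, and the names verify and learn are hypothetical, and an epsilon-greedy value learner with optimistic initial values stands in for the far larger policies those systems train. What it shows is the loop itself: the agent acts by submitting code, the environment executes it, and the reward is the fraction of tests that pass, with no human judging the answer.

```python
import random

# Hypothetical test suite for a function add(a, b): (arguments, expected result).
TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 4), 3)]

# Hypothetical action set: three candidate programs the agent can submit.
CANDIDATES = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a * b",
    "def add(a, b):\n    return a + b",
]


def verify(source):
    """Environment step: execute the candidate and return the fraction of
    tests passed. A real system would sandbox this; exec is used here only
    because the candidates are fixed strings defined above."""
    namespace = {}
    try:
        exec(source, namespace)
        passed = sum(1 for args, want in TESTS
                     if namespace["add"](*args) == want)
    except Exception:
        return 0.0
    return passed / len(TESTS)


def learn(steps=200, epsilon=0.1, step_size=0.5, seed=0):
    """Epsilon-greedy value learning over the candidates."""
    rng = random.Random(seed)
    # Optimistic initial values make the greedy choice try every candidate.
    values = [1.0] * len(CANDIDATES)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(CANDIDATES))
        else:
            action = max(range(len(CANDIDATES)), key=values.__getitem__)
        reward = verify(CANDIDATES[action])  # grounded consequence
        values[action] += step_size * (reward - values[action])
    return values
```

Calling learn() returns one value estimate per candidate. The a + b candidate ends with the highest estimate because it is the only one the test suite fully accepts.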

One flag: the four Generation-2 rows are the paper’s own examples (April 2025). Systems shipped after it — more capable computer-use agents, longer-horizon agentic RL — continue the same pattern but are beyond the paper’s corpus.

Source: Silver & Sutton, “Era of Experience” (2025) — Why Now?, Era of Experience, and Actions and Observations sections; all quotes verbatim; bracketed numbers are the paper’s own citations. The “automatic verifier” generalization is synthesis from the paper’s examples, not its phrasing.

Q11: A fully digital experiential loop seems limited to verifiable domains like coding and math. Is that understanding correct — and how can the scope of experiential learning be expanded?

Partially correct. The claim is right about the present, but the inference “digital loop → limited to code/math” conflates the channel an agent acts through with the domains its actions can reach, and the paper explicitly attacks the limitation through several mechanisms.

What is correct:

Today’s working loops cluster where an automatic verifier already exists (prover, compiler, test suite, game score). The paper concedes this is the current state: the era-of-simulation systems were confined to “closed problems with singular, precisely defined rewards,” and its current examples (AlphaProof, DeepSeek-R1, code agents) are all verifier-rich domains.

The correction — a digital loop does not mean digital-only consequences:

The Actions and Observations section is explicit that digital interfaces reach physical reality: “agents will also interact with the real world via digital interfaces. For example, a scientific agent could monitor environmental sensors, remotely operate a telescope, or control a robotic arm in a laboratory to autonomously conduct experiments” (verbatim). The action space is digital; the consequences, and with them the experience, are physical. A “fully digital loop” therefore does not restrict which domains the agent can learn about: the digital interface is only the channel that actions and observations travel through, and the other end of that channel can be a telescope, a lab, or a market as easily as a compiler.

The expansion mechanisms, per the paper:

Mechanism	How it expands scope	Sourcing
Grounded signals are abundant, not scarce	“the world abounds with quantities such as cost, error rates, hunger, productivity, health metrics, climate metrics, profit, sales, exam results, success, visits, yields, stocks, likes, income, pleasure/pain, economic indicators, accuracy, power, distance, speed, efficiency, or energy consumption” — plus “innumerable additional signals arising from the occurrence of specific events”	verbatim, Rewards
Humans-in-the-environment as consequence reporters	extends the loop to advice/assistance domains with no mechanical verifier: the user reports the cake’s taste, the fatigue, the headache — a measured consequence, not prejudgement	Rewards (established in the grounded-rewards exchange above)
Bi-level reward adaptation	turns ill-defined goals into optimizable ones: “improve my fitness” → a learned function of heart rate, sleep, steps; user satisfaction tunes the combination over time — “a small amount of human data may facilitate a large amount of autonomous learning”	verbatim, Rewards
Long streams	months/years-scale domains (health agent on wearables, education agent tracking progress, science agent running campaigns of experiments) where outcomes accumulate into rich grounded signal	Streams section, paraphrase
World models + planning	predicting consequences before acting cuts the cost of trials in expensive domains — plan in the model, verify against reality, correct the model	Planning and Reasoning, paraphrase
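
One way to picture the bi-level scheme in the table, as a sketch rather than the paper's method: the paper does not specify an update rule, so a least-mean-squares regression is swapped in here. The inner level computes a dense reward as a weighted sum of grounded signals on every step; the outer level uses occasional user satisfaction reports to adjust the weights. The signal names, the BiLevelReward class, the simulate helper, and the simulated user's hidden preferences are all hypothetical.

```python
import random
from dataclasses import dataclass, field

# Hypothetical grounded signals, assumed to be pre-normalized to [0, 1].
SIGNALS = ("resting_heart_rate", "sleep_hours", "daily_steps")


@dataclass
class BiLevelReward:
    step_size: float = 0.5
    weights: dict = field(
        default_factory=lambda: {name: 0.0 for name in SIGNALS})

    def reward(self, signals):
        """Inner level: dense reward from grounded signals, available every step."""
        return sum(self.weights[name] * signals[name] for name in SIGNALS)

    def adapt(self, signals, satisfaction):
        """Outer level: one least-mean-squares step toward the user's
        reported satisfaction (an assumed update rule)."""
        error = satisfaction - self.reward(signals)
        for name in SIGNALS:
            self.weights[name] += self.step_size * error * signals[name]
        return error


def simulate(steps=5000, feedback_every=10, seed=0):
    """Simulated user whose satisfaction depends on sleep and steps.
    These hidden preferences are invented for the example."""
    rng = random.Random(seed)
    hidden = {"resting_heart_rate": 0.0, "sleep_hours": 0.7, "daily_steps": 0.3}
    model = BiLevelReward()
    for t in range(1, steps + 1):
        signals = {name: rng.random() for name in SIGNALS}
        model.reward(signals)  # the dense signal an agent would optimize
        if t % feedback_every == 0:  # human feedback is sparse
            satisfaction = sum(hidden[n] * signals[n] for n in SIGNALS)
            model.adapt(signals, satisfaction)
    return model
```

In this simulation only one step in ten carries human feedback, yet the learned weights settle near the simulated user's hidden preferences. That mirrors the paper's point that a small amount of human data can steer a much larger amount of autonomous learning.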

The limit the paper itself names:

Physical loops are slow: “advancements relying on physical experience are inherently constrained by the time it takes to execute actions in the real world and observe their consequences. For example, the development of a new drug… still requires real-world trials that cannot be completed overnight” (verbatim, Consequences — framed there as a natural brake on self-improvement, but equally a constraint on scope expansion).

Condensed:

The limitation is real but mislocated: it is not that digital loops can only reach code and math — it is that those are the domains where the grounded signal arrives fast, cheap, and pre-verified. The expansion program is to engineer the same property elsewhere: route physical consequences through digital interfaces (sensors, lab robots), accept slower and noisier grounded signals (health metrics, sales, exam results), use humans-in-the-environment where no instrument exists, and let a learned bi-level reward make initially ill-defined goals optimizable. What determines whether a domain is reachable is how quickly its consequences arrive and how reliably they can be measured — not whether the domain is digital or physical.

Source: Silver & Sutton, “Era of Experience” (2025) — Actions and Observations, Rewards, Streams, Planning and Reasoning, Why Now?, and Consequences sections; quotes verbatim. The closing generalization about consequence speed and measurability is synthesis, not the paper’s phrasing.