Discussion
We tested several mechanisms from the current literature on
modelling individual variation in the form of Pavlovian conditioned
responses (ST vs GT) that emerge using a classical
autoshaping procedure, and the role of dopamine in both the
acquisition and expression of these CRs. Benefiting from a rich set
of data, we identified key mechanisms that are sufficient to account
for specific properties of the observed behaviours. The resulting
model relies on two major concepts: Dual learning systems and
factored representations. Figure 4 summarizes the role of each
mechanism in the model.
Conclusion
Here we have presented a model that accounts for variations in
the form of Pavlovian conditioned approach behaviour seen
during autoshaping in rats; that is, the development of a sign-tracking
vs goal-tracking CR. This work adds to an emerging set
of studies suggesting the presence and collaboration of multiple RL
systems in the brain. It questions the classical paradigm of state
representation and suggests that further investigation of factored
representations in RL models of Pavlovian and instrumental
conditioning experiments may be useful.
Methods
Modelling the autoshaping experiment
In classical reinforcement learning theory [1], tasks are
usually described as Markov Decision Processes (MDPs). As the
proposed model is based on RL algorithms, we use the MDP
formalism to computationally describe the Pavlovian autoshaping
procedure used in all simulations.
An MDP describes the interactions of an agent with its
environment and the rewards it might receive. An agent in a state s
can execute an action a, which results in a new state s′ and possibly
the delivery of some reward r. More precisely, an agent can be in any
of a finite set of states S, in which it can perform actions from a
finite set of discrete actions A, the consequences of which are
defined by a transition function T : S × A → P(S), where P(S) is
the probability distribution P(s′|s,a) of reaching state s′ doing
action a in state s. Additionally, the reward function R : S × A → ℝ
is the reward R(s,a) for doing action a in state s. Importantly,
MDPs should theoretically comply with the Markov property: the
probability of reaching state s′ should only depend on the last state
s and the last action a. An MDP is defined as episodic if it includes
at least one state which terminates the current episode.
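To make the formalism concrete, the following is a minimal sketch of an
episodic MDP container written in Python. It is purely illustrative: the
class, field and function names are ours and are not taken from the
model's actual implementation.

import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    # Finite sets of states and actions.
    states: List[State]
    actions: List[Action]
    # T[(s, a)] is the distribution over next states: {s': P(s'|s, a)}.
    T: Dict[Tuple[State, Action], Dict[State, float]]
    # R[(s, a)] is the reward obtained for doing action a in state s.
    R: Dict[Tuple[State, Action], float]
    # States that terminate the current episode (episodic MDP).
    terminal: List[State] = field(default_factory=list)

    def step(self, s: State, a: Action) -> Tuple[State, float]:
        # Sample s' from P(.|s, a) and return it with the reward R(s, a).
        dist = self.T[(s, a)]
        next_states, probs = zip(*dist.items())
        s_next = random.choices(next_states, weights=probs, k=1)[0]
        return s_next, self.R.get((s, a), 0.0)

The Markov property is built into this representation: the distribution
over s′ depends only on the current pair (s, a).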
Figure 1 shows the deterministic MDP used to simulate the
autoshaping procedure. Given the variable time schedule (30–150 s)
and the clear difference in behaviour observed during inter-trial
intervals, we can reasonably assume that each experimental trial
can be simulated as a finite-horizon episode.
The agent starts from an empty state (s0) where there is nothing
to do but explore. At some point the lever appears (s1) and the
agent must make a critical choice: It can either go to the lever (s2)
and engage with it (s5), go to the magazine (s4) and engage with it
(s7), or just keep exploring (s3, s6). At some point, the lever is
retracted and food is delivered. If the agent is far from the
magazine (s5, s6), it first needs to get closer. Once close (s7), it
consumes the food. It ends in an empty state (s0) which symbolizes
the start of the inter-trial interval (ITI): no food, no lever and an
empty but still present magazine.
The MDP in Figure 1 is common to all of the simulations and
independent of the reinforcement learning systems we use. STs
should favour the red path, while GTs should favour the shorter
blue path. All of the results rely mainly on the action taken at the
lever appearance (s1), when choosing to go to either the lever, the
magazine, or to explore. Exploring can be understood as going to
neither the lever nor the magazine.
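As an illustration, the deterministic structure described above can be
encoded as a transition table for the MDP container sketched earlier.
Only the transitions explicitly mentioned in the text are included,
the action names and the reward of 1.0 for food consumption are our
own choices, and the exact graph should be read from Figure 1.

# Sketch of the autoshaping MDP of Figure 1 (illustrative action names).
T = {
    ("s0", "explore"):         {"s1": 1.0},  # lever appears
    ("s1", "go_to_lever"):     {"s2": 1.0},  # sign-tracking (red) path
    ("s1", "go_to_magazine"):  {"s4": 1.0},  # goal-tracking (blue) path
    ("s1", "explore"):         {"s3": 1.0},
    ("s2", "engage_lever"):    {"s5": 1.0},
    ("s4", "engage_magazine"): {"s7": 1.0},
    ("s3", "explore"):         {"s6": 1.0},
    # Lever retracted, food delivered: agents far from the magazine approach it.
    ("s5", "go_to_magazine"):  {"s7": 1.0},
    ("s6", "go_to_magazine"):  {"s7": 1.0},
    ("s7", "consume_food"):    {"s0": 1.0},  # back to the empty ITI state
}
R = {key: 0.0 for key in T}
R[("s7", "consume_food")] = 1.0  # food is the only reward

autoshaping = MDP(
    states=[f"s{i}" for i in range(8)],
    actions=sorted({a for (_, a) in T}),
    T=T,
    R=R,
    terminal=["s0"],  # s0 doubles as the start and the end-of-trial (ITI) state
)

A simulated trial is then a finite-horizon episode that starts in s0 and
follows this table until the agent returns to s0 after consuming the
food; the critical choice occurs at s1.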
To fit with the requirements of the MDP framework, we
introduce two limitations in our description, which also simplify
our analyses. We assume that the agent engages with at most one
stimulus at a time, and we make no use of the precise timing of
the procedure (neither the ITI duration nor the CS duration) in
our simulations.