Discussion
We tested several mechanisms from the current literature on
modelling individual variation in the form of Pavlovian conditioned
responses (ST vs GT) that emerge using a classical
autoshaping procedure, and the role of dopamine in both the
acquisition and expression of these CRs. Benefiting from a rich set
of data, we identified key mechanisms that are sufficient to account
for specific properties of the observed behaviours. The resulting
model relies on two major concepts: Dual learning systems and
factored representations. Figure 4 summarizes the role of each
mechanism in the model.
Conclusion
Here we have presented a model that accounts for variations in
the form of Pavlovian conditioned approach behaviour seen
during autoshaping in rats; that is, the development of a sign-tracking
vs goal-tracking CR. This work adds to an emerging set
of studies suggesting the presence and collaboration of multiple RL
systems in the brain. It questions the classical paradigm of state
representation and suggests that further investigation of factored
representations in RL models of Pavlovian and instrumental
conditioning experiments may be useful.
Methods
Modelling the autoshaping experiment
In classical reinforcement learning theory [1], tasks are
usually described as Markov Decision Processes (MDPs). As the
proposed model is based on RL algorithms, we use the MDP
formalism to computationally describe the Pavlovian autoshaping
procedure used in all simulations.
An MDP describes the interactions of an agent with its
environment and the rewards it might receive. An agent in a state s
can execute an action a, which results in a new state s′ and possibly
the delivery of some reward r. More precisely, an agent can be in any
of a finite set of states S, in which it can perform actions from a
finite set of discrete actions A, the consequences of which are
defined by a transition function T : S × A → P(S), where P(S) is
the probability distribution P(s′|s,a) of reaching state s′ doing
action a in state s. Additionally, the reward function R : S × A → ℝ
is the reward R(s,a) for doing action a in state s. Importantly,
MDPs should theoretically comply with the Markov property: the
probability of reaching state s′ should only depend on the last state
s and the last action a. An MDP is defined as episodic if it includes
at least one state which terminates the current episode.
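To make the formalism concrete, the following is a minimal sketch of an
episodic MDP container written in Python. It is purely illustrative: the
class, field and function names are ours and are not taken from the
model's actual implementation.

import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    # Finite sets of states and actions.
    states: List[State]
    actions: List[Action]
    # T[(s, a)] is the distribution over next states: {s': P(s'|s, a)}.
    T: Dict[Tuple[State, Action], Dict[State, float]]
    # R[(s, a)] is the reward obtained for doing action a in state s.
    R: Dict[Tuple[State, Action], float]
    # States that terminate the current episode (episodic MDP).
    terminal: List[State] = field(default_factory=list)

    def step(self, s: State, a: Action) -> Tuple[State, float]:
        # Sample s' from P(.|s, a) and return it with the reward R(s, a).
        dist = self.T[(s, a)]
        next_states, probs = zip(*dist.items())
        s_next = random.choices(next_states, weights=probs, k=1)[0]
        return s_next, self.R.get((s, a), 0.0)

The Markov property is built into this representation: the distribution
over s′ depends only on the current pair (s, a).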
Figure 1 shows the deterministic MDP used to simulate the
autoshaping procedure. Given the variable time schedule (30–150 s)
and the clear difference in behaviour observed during inter-trial
intervals, we can reasonably assume that each experimental trial
can be simulated as a finite-horizon episode.
The agent starts from an empty state (s0) where there is nothing
to do but explore. At some point the lever appears (s1) and the
agent must make a critical choice: It can either go to the lever (s2)
and engage with it (s5), go to the magazine (s4) and engage with it
(s7), or just keep exploring (s3, s6). At some point, the lever is
retracted and food is delivered. If the agent is far from the
magazine (s5, s6), it first needs to get closer. Once close (s7), it
consumes the food. It ends in an empty state (s0) which symbolizes
the start of the inter-trial interval (ITI): no food, no lever and an
empty but still present magazine.
The MDP in Figure 1 is common to all of the simulations and
independent of the reinforcement learning systems we use. STs
should favour the red path, while GTs should favour the shorter
blue path. All of the results rely mainly on the action taken at the
lever appearance (s1), when choosing to go to either the lever, the
magazine, or to explore. Exploring can be understood as going to
neither the lever nor the magazine.
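As an illustration, the deterministic structure described above can be
encoded as a transition table for the MDP container sketched earlier.
Only the transitions explicitly mentioned in the text are included,
the action names and the reward of 1.0 for food consumption are our
own choices, and the exact graph should be read from Figure 1.

# Sketch of the autoshaping MDP of Figure 1 (illustrative action names).
T = {
    ("s0", "explore"):         {"s1": 1.0},  # lever appears
    ("s1", "go_to_lever"):     {"s2": 1.0},  # sign-tracking (red) path
    ("s1", "go_to_magazine"):  {"s4": 1.0},  # goal-tracking (blue) path
    ("s1", "explore"):         {"s3": 1.0},
    ("s2", "engage_lever"):    {"s5": 1.0},
    ("s4", "engage_magazine"): {"s7": 1.0},
    ("s3", "explore"):         {"s6": 1.0},
    # Lever retracted, food delivered: agents far from the magazine approach it.
    ("s5", "go_to_magazine"):  {"s7": 1.0},
    ("s6", "go_to_magazine"):  {"s7": 1.0},
    ("s7", "consume_food"):    {"s0": 1.0},  # back to the empty ITI state
}
R = {key: 0.0 for key in T}
R[("s7", "consume_food")] = 1.0  # food is the only reward

autoshaping = MDP(
    states=[f"s{i}" for i in range(8)],
    actions=sorted({a for (_, a) in T}),
    T=T,
    R=R,
    terminal=["s0"],  # s0 doubles as the start and the end-of-trial (ITI) state
)

A simulated trial is then a finite-horizon episode that starts in s0 and
follows this table until the agent returns to s0 after consuming the
food; the critical choice occurs at s1.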
To fit with the requirements of the MDP framework, we
introduce two limitations in our description, which also simplify
our analyses. We assume that the agent engages with at most one
stimulus at a time, and we make no use of the precise timing of
the procedure (neither the ITI duration nor the CS duration) in
our simulations.