Inter-observer reliability was measured for appropriate play and stereotypy. For each participant, 33% of sessions were
selected randomly from across phases for reliability coding. Two independent observers who were blind to the purpose of
the study and who had prior experience using a 10-s partial-interval procedure watched the videos separately. The data
collection sheets for both data collectors were then compared, and agreements and disagreements were noted for each
session. An agreement was defined as both data collectors recording the occurrence or nonoccurrence of appropriate play
and stereotypy during the same 10-s interval. Inter-observer reliability was then calculated by dividing the number of
agreements by the number of agreements plus disagreements (i.e., total number of intervals) and multiplying by 100
(Kazdin, 1982). Across participants, the mean inter-observer agreement for stereotypy was 91% (range 82–100%). The mean
inter-observer agreement for appropriate play was 88% (range 76–100%).
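
The interval-by-interval calculation described above can be illustrated with a minimal sketch. It assumes that each observer's record for one behavior in one session is a list of booleans, one per 10-s interval (True for occurrence, False for nonoccurrence), and that agreement is computed separately for each behavior. The function name and the sample records are hypothetical and are not data from the study.

```python
def interval_agreement(observer_a, observer_b):
    """Return the percentage of intervals on which two observers' records match.

    Each argument is a sequence of booleans, one per interval
    (True = behavior scored as occurring, False = not occurring).
    """
    if len(observer_a) != len(observer_b):
        raise ValueError("both records must cover the same number of intervals")
    if not observer_a:
        raise ValueError("at least one interval is required")

    # An agreement is an interval in which both observers scored the same thing,
    # whether that was an occurrence or a nonoccurrence.
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    disagreements = len(observer_a) - agreements

    # Agreements divided by agreements plus disagreements, multiplied by 100.
    return 100 * agreements / (agreements + disagreements)


# Hypothetical records for one behavior across ten 10-s intervals.
obs_a = [True, True, False, False, True, False, True, True, False, False]
obs_b = [True, False, False, False, True, False, True, True, False, True]

print(interval_agreement(obs_a, obs_b))  # 80.0 (8 of 10 intervals match)
```

Under these assumptions, a mean and range across sessions or participants would be obtained by applying the function to each coded session and summarizing the resulting percentages.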