Our experiments showed that all speech
segments used for acoustic model training must meet
certain quality requirements. In our experience,
approximately 50 percent of the generated audio segments
have to be discarded for one of the following reasons:
- the segment contains acoustic noise produced by
  objects and people in the environment around the
  speaker, e.g., doors closing, chairs moving, students
  talking;
- the lecturer mispronounces some words so badly that
  they are objectively invalid;
- the speaker’s pronunciation is clear, but the segmentation
  algorithm cuts off part of a spoken word so that it
  becomes invalid.
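
The third rejection reason, a word cut off at a segment boundary, lends itself to a simple automatic pre-check. The following is only a minimal sketch of one possible heuristic, not the procedure used in the experiments: it assumes that a cleanly segmented utterance begins and ends in near silence, and it flags a segment when the signal energy at either edge is high relative to the segment as a whole. The function names `rms` and `looks_cut_off`, the 20 ms edge window, and the 0.5 ratio threshold are all hypothetical choices made for illustration.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def looks_cut_off(samples, sample_rate, edge_ms=20, ratio=0.5):
    """Heuristic: True if the segment starts or ends at a high energy level.

    samples     -- sequence of amplitude values for one audio segment
    sample_rate -- samples per second
    edge_ms     -- length of the window inspected at each edge (assumed value)
    ratio       -- edge RMS above ratio * overall RMS counts as suspicious
                   (assumed value, would need tuning on real data)
    """
    edge = max(1, int(sample_rate * edge_ms / 1000))
    overall = rms(samples)
    if overall == 0.0:
        return False  # pure silence: nothing can have been cut off
    head = rms(samples[:edge])
    tail = rms(samples[-edge:])
    return head > ratio * overall or tail > ratio * overall
```

A segment flagged this way would still need manual review, since background noise at the boundary (the first rejection reason) raises the edge energy as well, and the check cannot detect mispronounced words at all.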