This work examines pre-trained audio representations for few-shot sound event detection. We specifically address the task of detecting novel acoustic sequences, i.e., sound events with a semantically meaningful temporal structure, without assuming access to non-target audio. We develop procedures and techniques for pre-training suitable representations so that they transfer to our few-shot training scenario. Our experiments evaluate the general-purpose utility of our pre-trained representations on AudioSet, and the utility of several proposed few-shot methods on tasks constructed from real-world acoustic sequences. We find that our pre-trained embeddings are suitable for the task at hand and enable multiple aspects of our few-shot framework.
