There are multiple relatively simple solutions in the frequency domain. An obvious approach is to use an FFT. It is worth drawing a 3D spectrogram of the samples: most of them look like white noise filling the whole frequency range.

A very crude but working solution is to use bandpass filters (for example the one implemented in sox(1)). Some instruments dominate specific frequency ranges; the kick drum, for example, is very strong at low frequencies (~50 Hz). For most of the other instruments it is not easy to select a frequency where they can be clearly distinguished. A relatively simple way to hunt for unique frequencies is to run all samples through bandpass filters at different frequencies, then plot the amplitude (after a low-pass filter) per frequency. This shows the 'footprint' of the different instruments in time at the given frequency, and there should be a frequency for each instrument where the amplitude at the moment of the stroke is much higher than for any other instrument. The accuracy of this method can be improved by using multiple frequencies to identify a sample. Unfortunately all of this requires a lot of tuning and tends to break if the relative volume of the instruments varies a lot.
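
The footprint idea can be sketched in a few lines of plain Python. This is only an illustration, not the filter sox(1) uses: the band-pass is a generic second-order (biquad) section with the widely used "Audio EQ Cookbook" coefficients, the envelope is a rectifier followed by a one-pole low-pass, and the names bandpass, envelope and footprint as well as the default Q and cutoff values are assumptions made up for this example.

```python
import math

def bandpass(samples, rate, center_hz, q=5.0):
    """Second-order band-pass (unity gain at center_hz). q is an assumed default."""
    w0 = 2.0 * math.pi * center_hz / rate
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha                      # b1 is zero for this filter type
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0                     # previous inputs and outputs
    for x in samples:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def envelope(samples, rate, cutoff_hz=20.0):
    """Amplitude envelope: rectify, then smooth with a one-pole low-pass."""
    k = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / rate)
    level = 0.0
    out = []
    for x in samples:
        level += k * (abs(x) - level)
        out.append(level)
    return out

def footprint(samples, rate, centers_hz):
    """Peak envelope of one instrument sample in each frequency band."""
    return {f: max(envelope(bandpass(samples, rate, f), rate))
            for f in centers_hz}
```

Calling footprint() on every instrument sample with the same list of center frequencies gives a small table of peak amplitudes; a band where one instrument's value is far above all the others is a candidate for identifying that instrument. Plotting the full output of envelope() instead of its maximum gives the amplitude-in-time picture described above.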

The above approach tries to detect strokes. This is not really necessary, as all input patterns are based on the same tempo, i.e. there is a constant minimal time between two strokes and any other timing is an integer multiple of it. This is easy to spot with a WAV editor (simple amplitude-over-time plot) or on a spectrogram. Once the timing is measured, it is enough to evaluate the song only at the strokes and decide which instruments are most likely to have caused the sound. For this, the bandpass filtering described above or a simple FFT can be used (after mapping each sample). An alternative approach is to generate the combined sound of all pairs (and triplets) of instruments and compare the strokes to those.
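
The last idea, comparing each stroke against every combination of instruments, might look roughly like the sketch below. Several things here are assumptions for the sake of the example: the comparison is done in the time domain with a plain squared-error distance (a spectrum-based distance would work the same way), the window is no longer than one time step so that tails of earlier strokes are ignored, volumes are taken as fixed, and the names mix and classify_strokes are invented.

```python
from itertools import combinations

def mix(templates, names, length):
    """Sum the first `length` samples of the named instrument templates."""
    out = [0.0] * length
    for name in names:
        for i, v in enumerate(templates[name][:length]):
            out[i] += v
    return out

def classify_strokes(track, templates, step, window, max_combo=3):
    """For each grid position, pick the instrument combination whose mixed
    sound is closest (least squared error) to the track segment there."""
    names = sorted(templates)
    candidates = [()]                           # empty tuple means silence
    for r in range(1, max_combo + 1):
        candidates.extend(combinations(names, r))
    mixes = {c: mix(templates, c, window) for c in candidates}

    result = []
    for start in range(0, len(track) - window + 1, step):
        seg = track[start:start + window]
        best = min(candidates,
                   key=lambda c: sum((s - m) ** 2
                                     for s, m in zip(seg, mixes[c])))
        result.append(best)
    return result
```

Here step is the measured minimal time between two strokes, in samples, and templates maps an instrument name to its sample data. The result is one tuple of instrument names per grid position.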

The task can be solved using other signal processing methods as well. The cross-correlation of two signals shows how similar they are at a given time shift. For the drums problem this is the dot product of a segment of the sound track with one of the instrument signals, and it is large if the segment contains the instrument signal exactly (if there is no match, the terms of the dot product have random signs and largely cancel; otherwise the terms are mostly positive). With this approach one calculates the dot product for each instrument at each time shift and finds the maxima in the resulting data. (Normalizing the signals may be necessary if the volume varies.)
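
A direct, unoptimized version of this in plain Python could look as follows. It is a sketch under assumptions: each dot product is divided by the norms of the segment and of the instrument signal (one possible way to handle varying volume), the threshold of 0.8 is arbitrary, and the function names are made up for the example. For real track lengths an FFT-based correlation would be much faster than this quadratic loop.

```python
import math

def normalized_xcorr(track, template):
    """Dot product of the template with the track at every time shift,
    scaled to the range [-1, 1] so that volume does not matter."""
    n = len(template)
    t_norm = math.sqrt(sum(v * v for v in template))
    scores = []
    for shift in range(len(track) - n + 1):
        seg = track[shift:shift + n]
        dot = sum(a * b for a, b in zip(seg, template))
        s_norm = math.sqrt(sum(a * a for a in seg))
        if s_norm > 0.0 and t_norm > 0.0:
            scores.append(dot / (s_norm * t_norm))
        else:
            scores.append(0.0)                  # silent segment: no match
    return scores

def find_hits(scores, threshold=0.8):
    """Time shifts where the score is a local maximum above the threshold."""
    hits = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else -1.0
        right = scores[i + 1] if i + 1 < len(scores) else -1.0
        if s >= threshold and s >= left and s > right:
            hits.append(i)
    return hits
```

Running normalized_xcorr() once per instrument and passing each result to find_hits() gives the stroke positions for that instrument. Note that per-segment normalization makes the score independent of volume, but it can also inflate the score of very quiet segments, so a minimum-energy check on the segment may be needed on real data.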