Random Forest Algorithm for Recognition of Bird Species using Audio Recordings
P. V. Lavanya Sudha 1, Dr. G. Lavanya Devi 2 and Naresh Nelaturi 3
1M.Tech, Department of Computer Science and System Engineering, Andhra University College
of Engineering, Visakhapatnam, India
2Assistant Professor, Department of Computer Science and System Engineering, Andhra
University College of Engineering, Visakhapatnam, India
3Research Scholar, Department of Computer Science and System Engineering, Andhra
University College of Engineering, Visakhapatnam, India
1(Email: [email protected])
2(Email: [email protected])
3(Email: [email protected])

Abstract: Birds are part of the whole ecosystem. Real-world audio data presents certain difficulties, such as multiple simultaneously vocalizing birds, other sources of non-bird sound (e.g., buzzing insects), and background noise such as wind, rain, and motor vehicles. This problem is formulated as a multi-label classification problem. The proposed method is based on 2D-supervised time-frequency segmentation; using the segment features as part of a multi-instance multi-label formulation, a random forest classifier is developed to predict the set of bird species present in a given ten-second audio recording. This method achieved an AUC of 0.87628.
Keywords: Multi-Label Classification, 2D-Supervised Time-Frequency Segmentation, Random Forest
1.Introduction:
Birds are part of the whole ecosystem, and interaction between humans and birds occurs in several scenarios. Birds are numerous and easier to detect than other animal species. The identification of bird species from their audio recordings is now used in several important applications, such as monitoring the quality of the environment and preventing bird-plane collisions near airports. Using recordings produced by birds, the identification task can be performed with signal processing techniques and machine learning algorithms. The classification of birds is a form of acoustic event classification (AEC), where the target events are bird calls and songs. Recordings of bird sounds tend to be noisy, as they are recorded in open environments, and it is common to have sounds from multiple birds or

other species such as insects. It is difficult to label such data manually due to noise, the similarity of certain bird sounds, and the potentially large number of possible bird species at a given location.
The major challenges faced in this dataset are:
1. Noise
2. Absence of any bird sounds
3. Multiple bird sounds
4. Overlapping bird sounds

2.Related Work:
In [1], the authors used an ensemble of classifier chains combined with a histogram-of-segments representation for multi-label classification of birdsong. The proposed method was compared with binary relevance and three multi-instance multi-label learning (MIML) algorithms from prior work (which focus more on structure in the sound and less on structure in the label sets). Experiments on two real-world birdsong datasets showed that the proposed method usually outperforms binary relevance (using the same features and base classifier), and is better in some cases and worse in others compared with the MIML algorithms. In [2], the authors proposed a MIML bag generator for audio, i.e., an algorithm which transforms an input audio signal into a bag-of-instances representation suitable for use with MIML classifiers, and used a 2D time-frequency segmentation of the audio signal, which can separate bird sounds that overlap in time. In [3], the authors explained the multi-instance multi-label learning (MIML) framework for supervised classification, where the objects to be classified are bags of instances associated with multiple labels. In [4], the authors explained the BoAW approach, which extracts audio concepts in an unsupervised fashion; this method has the advantage that it can be employed easily for a new set of audio concepts in multimedia videos without going through a laborious annotation effort.

3.Proposed Method:
Figure 1 shows the methodology used for the classification of the bird species recordings.

Figure 1: Block Diagram

1) Pre-Processing
The raw audio signal is converted into a spectrogram image by dividing it into frames and applying the FFT to each frame, using a window size of 512 samples, a Hamming window, and 75% overlap. The frequency profile of stationary noise (such as wind and streams) is estimated from low-energy frames; the spectrogram is then attenuated to suppress the background noise while preserving bird sound.
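As an illustrative sketch (not the authors' code), the framing/FFT step can be reproduced with SciPy; the synthetic tone below stands in for one of the dataset's WAV clips, and all names are our own:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic 10-second clip at 16 kHz stands in for a dataset WAV file.
rate = 16000
t = np.arange(10 * rate) / rate
signal = np.sin(2 * np.pi * 3000 * t)  # a 3 kHz tone as a stand-in bird call

# 512-sample Hamming window with 75% overlap (hop = 128 samples),
# matching the parameters described above.
freqs, times, sxx = spectrogram(signal, fs=rate, window="hamming",
                                nperseg=512, noverlap=384, mode="magnitude")

# sxx has 257 frequency bins spanning 0-8 kHz (half the sampling rate).
```

With a 16 kHz sampling rate, the top spectrogram row sits at the 8 kHz Nyquist frequency, consistent with the axis limits described in Section 4.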
2) Segmentation

Each spectrogram is divided into a collection of regions using a supervised time-frequency segmentation algorithm. We manually annotated 20 spectrograms as examples of correct segmentation by drawing over the areas corresponding to bird sound in red and rain drops in blue. Because there are a large number of pixels in each spectrogram, we subsampled 30% of red pixels as positive examples, 30% of blue pixels as negative examples, and 4% of uncolored pixels as negative examples. From the 20 annotated spectrograms, this sampling process yields 467,958 examples.
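The subsampling scheme can be sketched as follows; the annotation mask here is random stand-in data, since the real masks come from the 20 hand-annotated spectrograms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in annotation mask for one spectrogram:
# 0 = uncolored, 1 = red (bird sound), 2 = blue (rain drop).
mask = rng.integers(0, 3, size=(256, 1247))

def subsample(indices, fraction, rng):
    """Keep a random fraction of the given pixel indices."""
    n = int(len(indices) * fraction)
    return rng.permutation(indices)[:n]

red = np.flatnonzero(mask.ravel() == 1)
blue = np.flatnonzero(mask.ravel() == 2)
uncolored = np.flatnonzero(mask.ravel() == 0)

# 30% of red pixels become positives; 30% of blue and 4% of
# uncolored pixels become negatives, as described above.
positives = subsample(red, 0.30, rng)
negatives = np.concatenate([subsample(blue, 0.30, rng),
                            subsample(uncolored, 0.04, rng)])
```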


Each pixel is described by a feature vector with the following elements:
• The raw intensity of all pixels in a 17 × 17 box around the pixel (this gives a 17² = 289-d feature).
• The average intensity of all pixels in that box (1-d).
• The y-coordinate of the pixel, which corresponds to frequency (1-d).
• The raw intensity of all pixels in the same column as the pixel (256-d).

A Random Forest classifier is trained on the positive and negative examples. The trained Random Forest classifier is then applied to each pixel in every spectrogram, giving a probability that the pixel is bird sound. These probabilities can be noisy when individual pixels are viewed in isolation, so they are averaged over a neighborhood by applying a Gaussian blur to an image of the probabilities, with a kernel parameter σ = 3. The blurred probabilities are then compared to a threshold of 0.4. Pixels with probabilities above the threshold are considered bird sound, and pixels with probabilities below the threshold are considered background.
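A minimal sketch of this pipeline with scikit-learn and SciPy, using random stand-in data in place of the 547-d per-pixel vectors (289 + 1 + 1 + 256) and annotations described above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy per-pixel features and labels stand in for the 467,958 examples.
X = rng.normal(size=(2000, 547))   # 289 + 1 + 1 + 256 = 547-d
y = rng.integers(0, 2, size=2000)  # 1 = bird sound, 0 = background

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Per-pixel bird-sound probabilities, reshaped into a small "image".
probs = clf.predict_proba(rng.normal(size=(64 * 64, 547)))[:, 1].reshape(64, 64)

# Smooth the noisy probabilities with a Gaussian blur (sigma = 3),
# then threshold at 0.4 to obtain the binary bird-sound mask.
smoothed = gaussian_filter(probs, sigma=3)
bird_mask = smoothed > 0.4
```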

3) Feature Extraction

Each segment is associated with a 38-d feature vector which characterizes its shape, texture, and noise-robust profile statistics. The Euclidean distance is then used to measure the similarity between feature vectors.
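For instance, the Euclidean distance between two hypothetical 38-d segment feature vectors (random stand-ins here, where shape, texture, and profile statistics would appear in practice) is:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical 38-d segment feature vectors.
a, b = rng.normal(size=(2, 38))
dist = np.linalg.norm(a - b)  # Euclidean distance between two segments

# Pairwise distances over a whole bag of segments at once.
segments = rng.normal(size=(10, 38))
pairwise = np.linalg.norm(segments[:, None, :] - segments[None, :, :], axis=-1)
```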

4) Classification

Random forests have been widely used for multi-label classification. A Random Forest operates by constructing decision trees from the training examples. One popular algorithm is tree bagging, in which training repeatedly selects a bootstrap sample of the training set and fits a tree to it. After training, the label decision is made either by a majority of the votes or by a weighted combination of the individual trees.
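The paper does not name an implementation; as one possible sketch, scikit-learn's RandomForestClassifier accepts multi-label targets directly, with one binary output column per species (all data below is random stand-in data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy recording-level feature vectors and multi-label targets:
# each of the 19 columns marks one species' presence in a clip.
X = rng.normal(size=(200, 38))
Y = rng.integers(0, 2, size=(200, 19))

# Each tree is fit on a bootstrap sample of the training set;
# the forest aggregates per-tree votes for every label.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, Y)

pred = forest.predict(X[:5])  # one presence/absence row per recording
```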

4.Experiments and Results:
The proposed method is applied to an audio dataset collected in the H. J. Andrews (HJA) Long-Term Experimental Research Forest, in the Cascade mountain range of Oregon. The Oregon State University Bioacoustics group has collected over 10 TB of audio data in HJA using Songmeter audio recording devices. The birds dataset models the relationship between 645 ten-second audio recordings and bird species. A Songmeter has two omnidirectional microphones and records audio in WAV format to flash memory. A Songmeter can be left in the field for several weeks at a time before either its batteries run out or its memory is full. The dataset includes rain and wind, and represents a sample of audio recorded at 13 different locations during the summers of 2009 and 2010. The dataset consists of 645 ten-second audio recordings in uncompressed WAV format, and there are 19 species of bird in the dataset (Table 1). Each ten-second audio recording was paired with the set of bird species that were present. The WAV files carry relevant information about location, date, and time, where each of the 13 locations has a distinct location code. Because each clip was recorded in a natural setting, it may contain environmental noises such as wind or rain.
The spectrograms (Figure 2) are computed by dividing the WAV signal into overlapping frames and applying the FFT with a Hamming window. The FFT returns complex Fourier coefficients. To enhance contrast, the spectrogram is first normalized so that the maximum coefficient magnitude is 1, and the square root of the normalized magnitude is then taken as the pixel value. The spectrogram has time on the x-axis (from 0 to the duration of the sound) and frequency on the y-axis. The maximum frequency in the spectrogram is half the sampling frequency (16 kHz / 2 = 8 kHz).
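The contrast-enhancement step can be sketched as follows, using random complex coefficients in place of real FFT output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in complex Fourier coefficients for one spectrogram.
coeffs = rng.normal(size=(256, 1247)) + 1j * rng.normal(size=(256, 1247))

# Normalize so the maximum coefficient magnitude is 1, then take
# the square root of the normalized magnitude as the pixel value.
magnitude = np.abs(coeffs)
pixels = np.sqrt(magnitude / magnitude.max())
```

Taking the square root compresses the dynamic range, so faint bird calls remain visible next to loud ones.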

Table 1: 19 Bird Species in the Dataset

Figure 2: Spectrogram

The frequency profile of stationary noise (such as wind and streams) is estimated from low-energy frames; the spectrogram (Figure 3) is then attenuated to suppress the background noise while preserving bird sound.

Figure 3: Noise reduced Spectrograms
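The exact attenuation rule is not specified in the text; the sketch below uses simple spectral subtraction on stand-in data, with the 10% quiet-frame cutoff as our own assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in magnitude spectrogram (frequency bins x frames).
sxx = np.abs(rng.normal(size=(257, 1247)))

# Treat the quietest 10% of frames as stationary background noise
# (wind, streams) and average them into a per-bin noise profile.
energy = sxx.sum(axis=0)
quiet = sxx[:, energy <= np.quantile(energy, 0.10)]
noise_profile = quiet.mean(axis=1, keepdims=True)

# Attenuate each bin by its estimated noise level, clipping at zero,
# so bird sound above the noise floor is preserved.
denoised = np.clip(sxx - noise_profile, 0.0, None)
```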

These spectrograms (Figure 4) show the outlines of segments drawn on top of them. The segments are obtained automatically using the trained segmentation algorithm.

Figure 4: Segmented Spectrograms

Figure 5: Histogram of the number of birds in each recording


Code   Name
BRCR   Brown Creeper
PAWR   Pacific Wren
PSFL   Pacific-slope Flycatcher
RBNU   Red-breasted Nuthatch
DEJU   Dark-eyed Junco
OSFL   Olive-sided Flycatcher
HETH   Hermit Thrush
CBCH   Chestnut-backed Chickadee
VATH   Varied Thrush
HEWA   Hermit Warbler
SWTH   Swainson's Thrush
HAFL   Hammond's Flycatcher
WETA   Western Tanager
BHGB   Black-headed Grosbeak
GCKI   Golden-crowned Kinglet
WAVI   Warbling Vireo
MGWA   MacGillivray's Warbler
STJA   Steller's Jay
CONI   Common Nighthawk

5.Conclusion:
In the proposed method, robust features are extracted from the audio recordings using the spectrogram-based approach, mask descriptors are extracted from the labelled regions, and an ensemble of Random Forest classifiers is then applied to obtain better results.


References:
[1] F. Briggs, X. Z. Fern, and J. Irvine. Multi-label classifier chains for bird sound. arXiv:1304.5862, 2013.
[2] F. Briggs, B. Lakshminarayanan, L. Neal, X. Fern, R. Raich, S. J. K. Hadley, A. S. Hadley, and M. G. Betts. Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach. Journal of the Acoustical Society of America, 131:4640-4650, 2012.
[3] F. Briggs, X. Z. Fern, R. Raich, and Q. Lou. Instance annotation for multi-instance multi-label learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 2012.
[4] S. Pancoast and M. Akbacak. Bag-of-audio-words approach for multimedia event classification. In INTERSPEECH, 2012.