Friday, 17 August 2018

MFCC Audio File

Download the practice file


A. Pre-emphasis
The speech signal s(n) is sent through a high-pass filter:
s2(n) = s(n) - a*s(n-1)
where s2(n) is the output signal and the value of a is usually between 0.9 and 1.0. The z-transform of the filter is
H(z) = 1 - a*z^(-1)
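
The filter above can be sketched in a few lines. This is a minimal illustration, not the code used for the plots in this post: the value a = 0.95, the 8 kHz sample rate, and the two-tone test signal are all assumptions chosen only to make the effect visible.

```matlab
% Pre-emphasis sketch: s2(n) = s(n) - a*s(n-1)
a  = 0.95;                 % assumed value inside the usual 0.9-1.0 range
fs = 8000;                 % assumed sample rate for the synthetic signal
t  = (0:fs-1)'/fs;         % one second of sample times

% Synthetic test signal (assumption): strong 100 Hz tone, weak 3000 Hz tone
s = sin(2*pi*100*t) + 0.1*sin(2*pi*3000*t);

% FIR filter with transfer function H(z) = 1 - a*z^(-1)
s2 = filter([1 -a], 1, s);

% The same difference equation written out, taking s(n-1) = 0 before the first sample
s2direct = s - a*[0; s(1:end-1)];

% Magnitude response of the filter at the two tone frequencies
gain     = @(f) abs(1 - a*exp(-1j*2*pi*f/fs));
gainLow  = gain(100);      % gain applied to the low tone
gainHigh = gain(3000);     % gain applied to the high tone
```

Comparing gainLow with gainHigh shows the high-pass behaviour: the low tone is attenuated while the high tone is boosted.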

% Plot the whole utterance and mark one 512-sample frame inside it
figure;
waveFile='sunday.wav';
[y,fs]=audioread(waveFile);
y=y(:,1);                          % keep the first channel in case the file is stereo
nbits=8;
y=y*2^nbits/2;                     % scale from [-1, 1] to the 8-bit range
subplot(2,1,1);
time=(1:length(y))/fs;
plot(time, y); axis([min(time), max(time), -2^nbits/2, 2^nbits/2]);
xlabel('Time (seconds)'); ylabel('Amplitude'); title('Waveforms of "sunday"');

% Pick a frame starting at 0.606 s; round so the indices are integers
frameSize=512;
index1=round(0.606*fs);
index2=index1+frameSize-1;
line(time(index1)*[1, 1], 2^nbits/2*[-1 1], 'color', 'r');
line(time(index2)*[1, 1], 2^nbits/2*[-1 1], 'color', 'r');

% Zoom in on the selected frame
subplot(2,1,2);
time2=time(index1:index2);
y2=y(index1:index2);
plot(time2, y2, '.-'); axis([min(time2), max(time2), -2^nbits/2, 2^nbits/2]);
xlabel('Time (seconds)'); ylabel('Amplitude'); title('Waveforms of the voiced "ay" in "sunday"');



B. Frame blocking
The input speech signal is segmented into frames of 20~30 ms with an optional overlap of 1/3~1/2 of the frame size.
Usually the frame size (in sample points) is a power of two in order to facilitate the use of the FFT. If this is not the case, we need to zero-pad the frame to the nearest power-of-two length (for example, a 320-point frame is padded to 512 points). If the sample rate is 16 kHz and the frame size is 320 sample points, then the frame duration is 320/16000 = 0.02 sec = 20 ms. Additionally, if the overlap is 160 points, then the frame rate is 16000/(320-160) = 100 frames per second.
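
The arithmetic above can be checked with a short sketch. It uses the numbers from the text (16 kHz, 320-point frames, 160-point overlap); the random placeholder signal and the variable names are assumptions made for illustration only.

```matlab
% Frame blocking sketch with the numbers from the text
fs        = 16000;               % sample rate in Hz
frameSize = 320;                 % samples per frame (20 ms at 16 kHz)
overlap   = 160;                 % samples shared by neighbouring frames
hop       = frameSize - overlap; % frame advance in samples

y = randn(fs, 1);                % one second of placeholder signal (assumption)

% Number of complete frames that fit in the signal
numFrames = floor((length(y) - frameSize)/hop) + 1;

% Zero-pad each frame to the next power of two for the FFT
nfft   = 2^nextpow2(frameSize);
frames = zeros(nfft, numFrames); % one frame per column, padding left as zeros
for k = 1:numFrames
    start = (k-1)*hop + 1;
    frames(1:frameSize, k) = y(start:start+frameSize-1);
end

frameDuration = frameSize/fs;    % seconds per frame
frameRate     = fs/hop;          % frames per second
```

Storing one frame per column keeps the later steps simple, since windowing and the FFT can then be applied column by column.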



C. Hamming windowing

Each frame has to be multiplied by a Hamming window in order to keep the continuity of the first and the last points in the frame (to be detailed in the next step). If the signal in a frame is denoted by s(n), n = 0, ..., N-1, then the signal after Hamming windowing is s(n)*w(n), where w(n) is the Hamming window defined by:
w(n, a) = (1 - a) - a*cos(2*pi*n/(N-1)), 0 <= n <= N-1
Different values of a correspond to different window curves; a = 0.46 gives the standard Hamming window.
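
The window formula can be evaluated directly to see how a changes the curve. This is a minimal sketch: the frame length N = 512 and the three values of a are assumptions picked for illustration (0.46 is the standard Hamming value, and 0.5 yields the Hann window).

```matlab
% Generalized Hamming window w(n, a) = (1 - a) - a*cos(2*pi*n/(N-1))
N = 512;                         % assumed frame length
n = (0:N-1)';                    % sample index within the frame
aValues = [0.3 0.46 0.5];        % illustrative values of a (assumption)

W = zeros(N, numel(aValues));    % one window per column
for k = 1:numel(aValues)
    a = aValues(k);
    W(:, k) = (1 - a) - a*cos(2*pi*n/(N-1));
end

% Plot the three curves on one axis
figure;
plot(n, W); grid on;
xlabel('n'); ylabel('w(n, a)');
legend('a = 0.3', 'a = 0.46', 'a = 0.5');
title('Generalized Hamming windows');
```

The end points of each window equal 1 - 2a, so a larger a pulls the frame edges closer to zero.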



D. Fast Fourier Transform (FFT)

Spectral analysis shows that different timbres in speech signals correspond to different energy distributions over frequency. Therefore we usually perform an FFT to obtain the magnitude frequency response of each frame.
When we perform an FFT on a frame, we assume that the signal within the frame is periodic and continuous when wrapping around. If this is not the case, we can still perform the FFT, but the discontinuity between the frame's first and last points is likely to introduce undesirable effects in the frequency response. To deal with this problem, we have two strategies:
  1. Multiply each frame by a Hamming window to increase its continuity at the first and last points.
  2. Take a frame of variable size such that it always contains an integer number of fundamental periods of the speech signal.
The second strategy is difficult in practice, since identifying the fundamental period is not a trivial problem. Moreover, unvoiced sounds do not have a fundamental period at all. Consequently, we usually adopt the first strategy and multiply the frame by a Hamming window before performing the FFT. The following example shows the effect of multiplying by a Hamming window.


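% hamming() requires the Signal Processing Toolbox
fs=8000;
t=(1:512)'/fs;
f=306.396;
original=sin(2*pi*f*t)+0.2*randn(length(t),1);
windowed=original.*hamming(length(t));
figure;
subplot(2,1,1);     % separate axes so the second plot does not overwrite the first
plot(t, original); grid on; axis([-inf inf -1.5 1.5]); title('Original signal');
subplot(2,1,2);
plot(t, windowed); grid on; axis([-inf inf -1.5 1.5]); title('Windowed signal');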
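
The example above shows the windowing effect in the time domain. A sketch of the same comparison in the frequency domain follows. It reuses the same tone and frame length, but the noise term is omitted so the result is repeatable, and the Hamming window is written out from its formula so no toolbox is needed; both choices are assumptions of this sketch.

```matlab
% Magnitude spectrum of one frame with and without a Hamming window
fs = 8000;
N  = 512;
t  = (1:N)'/fs;
f  = 306.396;                          % not an integer number of periods per frame
original = sin(2*pi*f*t);              % noise omitted for repeatability (assumption)

n = (0:N-1)';
w = 0.54 - 0.46*cos(2*pi*n/(N-1));     % Hamming window from the formula in section C
windowed = original.*w;

% One-sided magnitude spectra in dB, each normalized to its own peak
freq = (0:N/2)'*fs/N;                  % frequency axis in Hz
magOriginal = abs(fft(original));
magWindowed = abs(fft(windowed));
dbOriginal  = 20*log10(magOriginal(1:N/2+1)/max(magOriginal));
dbWindowed  = 20*log10(magWindowed(1:N/2+1)/max(magWindowed));

figure;
plot(freq, dbOriginal, freq, dbWindowed); grid on;
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
legend('Original', 'Windowed');
title('Spectrum of one frame with and without windowing');
```

Away from the peak near 306 Hz, the windowed curve should sit below the unwindowed one, which is the reduction in leakage that the first strategy aims for.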

