Trying to implement vocal removal with SDL_Audio

hi all,

I’m trying to see if I can replicate a vocal removal effect I found and implemented in JavaScript with SDL audio
and am having some trouble (but I think it should be possible?).

Here’s the relevant JavaScript code I’m trying to implement with SDL_Audio:

var audioData = [];
var leftChan = buffer.getChannelData(0);
var rightChan = buffer.getChannelData(1);
				
for(var i=0;i<leftChan.length;i++){
	var currentSample = (leftChan[i] - rightChan[i])/2; //this is what causes the vocal removal
	audioData.push(currentSample);
}
						
audioToKaraoke = audioCtx.createBuffer(1, buffer.length, buffer.sampleRate); 
audioToKaraoke.getChannelData(0).set(audioData);

You can try it here: https://syncopika.github.io/misc/karaokeget.html

As far as I understand, Web Audio uses Float32Arrays to store audio data.
For example, when I load a particular WAV sample, I get these data for the first few entries in the left and right
channel buffer with the JavaScript code:

left: 0 right: 0
left:-0.00003051757 right:-0.00003051757
left: 0.000061037 right: 0.000061037
left:-0.000061035 right:-0.000061035
left: 0.000061037 right: 0.0000305185
left:-0.000061035 right: 0

In SDL after I load my WAV sample (the same one used to generate the values above) to an AudioSpec, I checked the properties and it shows that I have 2 channels, with the format being AUDIO_S16LSB (32784 was printed). When I printed the first few values of the audio data, I got:
0
0
0
0
255
255
255
255
2
0
2
0
254
255
254
255
2
0
1
0

Based on the documentation I read, since my sample is in stereo with 2 channels, each byte belongs to either the left or right channel. So I think the left channel’s bytes would be [0, 0, 255, 255, 2, 2, 254, 254, …], and the right would be [0, 0, 255, 255, 0, 0, 255, 255, …].

But I’m not sure how to proceed from here.

One thing I tried was using the LoadWAV function to get the Uint8 audio data, separate the data into channel buffers, create a float vector to store the result after doing some math on each left and right channel data, copy the bytes of the float vector data to a new Uint8 buffer, and then pass that new buffer to an AudioSpec with 1 channel, but that didn’t work.

Thanks for reading!

The data is in AUDIO_S16LSB in this case, so each sample is 2 bytes (low byte first) and the channels are interleaved: 2 bytes for the left channel, then the next 2 bytes for the right, the next 2 for the left, etc.
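
To make that layout concrete, here is a minimal C++ sketch that pairs up the bytes into signed 16-bit samples and groups them into left/right frames. It is plain C++ rather than SDL code, and the names `StereoFrame`, `readS16LE` and `decodeStereoS16LE` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stereo frame: a left sample followed by a right sample.
struct StereoFrame {
    int16_t left;
    int16_t right;
};

// Combine two bytes (low byte first) into one signed 16-bit sample.
int16_t readS16LE(const uint8_t* p) {
    return static_cast<int16_t>(static_cast<uint16_t>(p[0] | (p[1] << 8)));
}

// Split interleaved 16-bit little-endian stereo bytes into frames.
// Each frame is 4 bytes: 2 for the left sample, then 2 for the right.
std::vector<StereoFrame> decodeStereoS16LE(const std::vector<uint8_t>& bytes) {
    assert(bytes.size() % 4 == 0); // expect whole frames only
    std::vector<StereoFrame> frames;
    frames.reserve(bytes.size() / 4);
    for (std::size_t i = 0; i + 3 < bytes.size(); i += 4) {
        frames.push_back({readS16LE(&bytes[i]), readS16LE(&bytes[i + 2])});
    }
    return frames;
}
```

Fed the 20 bytes printed in the question, this gives the frames (0, 0), (-1, -1), (2, 2), (-2, -2), (2, 1). Scaled down to roughly the -1..1 range, those line up with the first five rows of the Web Audio values.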

If you want to play this sound out of the speakers, you need to open an audio device and feed it data. If you just want to fire off the sound as one big block of sound and let SDL figure out how to play it, you can do something like this:

https://hg.libsdl.org/SDL/file/39a92b19f99e/test/loopwavequeue.c

The other way is to give SDL a function that runs every time it needs more audio data (which might be more like Web Audio in this case?), which looks like this:

https://hg.libsdl.org/SDL/file/39a92b19f99e/test/loopwave.c

–ryan.

thanks for helping clear my understanding of how to interpret the audio data channel-wise! That definitely makes sense.

regarding the audio playing - sorry for not mentioning it but I have my audio callback function in place and am able to play back my WAV sample successfully. I’m just stuck on modifying the audio data properly. I think I’ll do some more experimenting.

One new question I have though: If I make an AudioSpec and set its format to AUDIO_F32, does that mean that it will know to interpret the audio data as float values, even though the audio data is passed as a Uint8 pointer?

If you open the device as needing AUDIO_F32, then the callback will expect float samples; it says Uint8 for convenience of pointer math, but it almost never actually wants data in Uint8 format…plan to cast that pointer to a different data type.
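
As a rough illustration of that cast, the sketch below has the same parameter shape as an SDL audio callback but uses only standard C++ types so it can be read on its own. `FloatSource` and `fillFloatStream` are hypothetical names, and it assumes the stream was requested as 32-bit float samples.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical playback state; the names are illustrative, not SDL's.
struct FloatSource {
    const float* samples;  // next sample to play
    std::size_t remaining; // number of samples left
};

// `len` is the size of `stream` in bytes, not in samples.
void fillFloatStream(void* userdata, uint8_t* stream, int len) {
    assert(len >= 0 && static_cast<std::size_t>(len) % sizeof(float) == 0);
    FloatSource* src = static_cast<FloatSource*>(userdata);

    // Treat the byte pointer as the float buffer it really is.
    float* out = reinterpret_cast<float*>(stream);
    std::size_t wanted = static_cast<std::size_t>(len) / sizeof(float);
    std::size_t n = wanted < src->remaining ? wanted : src->remaining;

    // Copy what we have, then pad the rest of the buffer with silence.
    std::memcpy(out, src->samples, n * sizeof(float));
    for (std::size_t i = n; i < wanted; ++i) {
        out[i] = 0.0f;
    }

    src->samples += n;
    src->remaining -= n;
}
```

The byte count is divided by `sizeof(float)` before indexing, which is the main thing that changes once the pointer is no longer treated as bytes.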

If you want your wav data to be floating point, you can use SDL_ConvertAudio() to change it to that format after loading it:

https://wiki.libsdl.org/SDL_ConvertAudio

–ryan.
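
For a feel of what a signed 16-bit to float conversion produces, here is a small standalone sketch. It assumes a plain divide-by-32768 scaling, which may not match SDL's or a browser's conversion exactly, and `s16ToFloat` and `convertS16ToF32` are hypothetical helper names, not SDL functions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed scaling: divide by 32768 so that -32768 maps to exactly -1.0f.
float s16ToFloat(int16_t sample) {
    float f = static_cast<float>(sample) / 32768.0f;
    assert(f >= -1.0f && f < 1.0f); // result always lands in [-1, 1)
    return f;
}

// Convert a whole buffer of 16-bit samples; interleaving is left untouched.
std::vector<float> convertS16ToF32(const std::vector<int16_t>& in) {
    std::vector<float> out;
    out.reserve(in.size());
    for (int16_t s : in) {
        out.push_back(s16ToFloat(s));
    }
    return out;
}
```

With this scaling a sample of -1 becomes about -0.0000305, the same magnitude as the second row of Web Audio values in the original post.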


thanks very much icculus! The SDL_ConvertAudio() function is awesome! I tried using it a couple of days ago and thought it didn’t work, but having another go with a better understanding made my problem embarrassingly simple. If anyone would like to know, here’s working code that helps remove vocals from 2-channel stereo WAV files in C++.

#include <iostream>
#include <string>
#include <vector>
#include <SDL.h>

// audio data struct that callback will use 
struct AudioData{
	Uint8* position;
	Uint32 length;
};


// define an audio callback that SDL_AudioSpec will use 
void audioCallback(void* userData, Uint8* stream, int length){
	
	AudioData* audio = (AudioData*)userData;
	float* streamF = (float *)stream;
	
	// SDL does not clear the stream buffer for us, so start from silence (0 is silence for AUDIO_F32)
	SDL_memset(stream, 0, length);
	
	if(audio->length == 0){
		return;
	}
	
	// length is the number of bytes SDL wants written to the stream buffer
	Uint32 len = (Uint32)length;
	
	if(len > audio->length){
		len = audio->length;
	}
	
	// copy len bytes of audio data starting at audio->position to the stream buffer
	SDL_memcpy(streamF, audio->position, len);
	
	audio->position += len;
	audio->length -= len;
}


int main(int argc, char **argv){
	
	// initialize SDL before doing anything else 
	if(SDL_Init(SDL_INIT_AUDIO) != 0){
		std::cout << "Error initializing SDL!" << std::endl;
		return 1;
	}
	
	// set up an AudioSpec to load in the file 
	SDL_AudioSpec wavSpec;
	Uint8* wavStart;
	Uint32 wavLength;
	std::string file = /* path to your wav file */
	std::cout << "the file is: " << file << std::endl;
	
	// load the wav file and some of its properties to the specified variables 
	if(SDL_LoadWAV(file.c_str(), &wavSpec, &wavStart, &wavLength) == NULL){
		std::cout << "couldn't load wav file" << std::endl;
		return 1;
	}
	
	// convert audio data to F32, using the format the wav file actually has 
	SDL_AudioCVT cvt;
	if(SDL_BuildAudioCVT(&cvt, wavSpec.format, wavSpec.channels, wavSpec.freq, AUDIO_F32, 2, wavSpec.freq) < 0){
		std::cout << "couldn't set up audio conversion" << std::endl;
		return 1;
	}
	cvt.len = wavLength;
	cvt.buf = (Uint8 *)SDL_malloc(cvt.len * cvt.len_mult);
	
	// copy current audio data to the buffer (dest, src, len)
	SDL_memcpy(cvt.buf, wavStart, wavLength); // wavLength is the total number of bytes the audio data takes up
	SDL_ConvertAudio(&cvt);
	
	// audio data is now in float form!
	float* newData = (float *)cvt.buf;

	std::vector<float> leftChannel;
	std::vector<float> rightChannel;
	
	// divide by 4 since cvt.len_cvt is total bytes of the buffer, and 4 bytes per float
	int floatBufLen = (int)cvt.len_cvt / 4;
	int count = 0; // if 0, left channel. 1 for right channel 
	for(int i = 0; i < floatBufLen; i++){
		if(count == 0){
			leftChannel.push_back(newData[i]);
			count++;
		}else{
			rightChannel.push_back(newData[i]);
			count--;
		}
	}
	
	// now eliminate the vocal by getting the diff between left and right and dividing by 2 
	std::vector<float> modifiedData;
	for(int j = 0; j < (int)leftChannel.size(); j++){
		float temp = (leftChannel[j] - rightChannel[j]) / 2.0f;
		modifiedData.push_back(temp);
	}
		
	// set up another SDL_AudioSpec with 1 channel to play the modified audio buffer of wavSpec
	SDL_AudioSpec karaokeAudio;
	SDL_zero(karaokeAudio); // clear the fields we don't set explicitly
	karaokeAudio.freq = wavSpec.freq;
	karaokeAudio.format = AUDIO_F32;
	karaokeAudio.channels = 1;
	karaokeAudio.samples = wavSpec.samples;
	karaokeAudio.callback = audioCallback;
	
	AudioData audio;
	audio.position = (Uint8*)modifiedData.data(); 
	audio.length = (Uint32)(modifiedData.size() * sizeof(float));
	
	karaokeAudio.userdata = &audio;
	
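	SDL_AudioDeviceID audioDevice;
	audioDevice = SDL_OpenAudioDevice(NULL, 0, &karaokeAudio, NULL, 0);
	if(audioDevice == 0){
		std::cout << "couldn't open audio device: " << SDL_GetError() << std::endl;
		return 1;
	}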
	
	// play 
	SDL_PauseAudioDevice(audioDevice, 0);
	
	while(audio.length > 0){
		SDL_Delay(1000); // set some delay so program doesn't immediately quit 
	}
	
	// done playing audio 
	SDL_CloseAudioDevice(audioDevice);
	SDL_free(cvt.buf); // buffer allocated for the conversion
	SDL_FreeWAV(wavStart);
	SDL_Quit();
	
	return 0;
}
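
The de-interleave and subtract steps above can also be folded into one pass over the interleaved float buffer. The sketch below is a variation on the code in this post, not a replacement for it; `removeCenter` is a hypothetical name, and it assumes interleaved stereo floats (left, right, left, right, ...). The subtraction cancels anything mixed identically into both channels, which is often where the lead vocal sits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Turn interleaved stereo floats into a mono buffer of (left - right) / 2.
// Anything identical in both channels cancels out to 0.
std::vector<float> removeCenter(const std::vector<float>& interleaved) {
    assert(interleaved.size() % 2 == 0); // expect whole left/right pairs
    std::vector<float> mono;
    mono.reserve(interleaved.size() / 2);
    for (std::size_t i = 0; i + 1 < interleaved.size(); i += 2) {
        mono.push_back((interleaved[i] - interleaved[i + 1]) / 2.0f);
    }
    return mono;
}
```

The result has half as many samples as the input, so the byte length passed to the playback callback would be `mono.size() * sizeof(float)`, the same as with `modifiedData` above.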