iOS Live Speech recognition with tailored language model

In the next few posts I’m going to dive deeper into Core ML and its companion frameworks, including image recognition, speech recognition, and sound analysis. This post shows how to create a simple iOS app that recognizes live commands using the built-in microphone and a custom language model. If you want to introduce some AI into your app, this could be an easy way to start and bring in some killer features.

tldr; full example available here

Set up project

First, create a new iOS app. Speech recognition with a custom language model is only available on iOS 17 and later, so set the minimum deployment target to iOS 17. If you target earlier versions, you can still use speech recognition, but without custom models.

Speech recognition requires the NSSpeechRecognitionUsageDescription key, so add it to the plist file. Since speech recognition is the main and only purpose of the application, let’s just disregard UX guidelines and request the permission at app launch in the default content view.

import SwiftUI
import Speech

struct ContentView: View {
    @State var recordButtonLabel = ""
    @State var recordButtonEnabled = false
    
    var body: some View {
        VStack {
            Button(action: {
                
            }, label: {
                Text(recordButtonLabel)
            })
            .disabled(!recordButtonEnabled)
        }
        .padding()
        .onAppear(){
            self.requestAuthorization()
        }
    }
    
    private func requestAuthorization() {
        SFSpeechRecognizer.requestAuthorization { authStatus in
            OperationQueue.main.addOperation {
                switch authStatus {
                case .authorized:
                    self.recordButtonEnabled = true
                    self.recordButtonLabel = "Start recording"
                    
                case .denied:
                    self.recordButtonEnabled = false
                    self.recordButtonLabel = "Speech permission denied"
                    
                    
                case .restricted:
                    self.recordButtonEnabled = false
                    self.recordButtonLabel = "Speech is restricted"
                    
                    
                default:
                    self.recordButtonEnabled = false
                    self.recordButtonLabel = "Unknown authorization state"
                }
            }
        }
    }
}
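If the usage description is missing from the plist, the system terminates the app as soon as it asks for the permission. As a small development-time safeguard, a sketch like the one below could list the keys that are still missing. Note that missingUsageKeys is a hypothetical helper, not part of the Speech framework, and the dictionary is a stand-in for Bundle.main.infoDictionary so the sketch runs outside an app. The second key is the microphone permission that the recording part needs later.

```swift
import Foundation

/// Returns the privacy keys from `required` that have no non-empty string
/// value in `info`. Hypothetical helper for illustration only.
func missingUsageKeys(in info: [String: Any], required: [String]) -> [String] {
    required.filter { key in
        guard let text = info[key] as? String else { return true }
        return text.trimmingCharacters(in: .whitespaces).isEmpty
    }
}

// Stand-in for Bundle.main.infoDictionary ?? [:] (assumption for the demo).
let sampleInfo: [String: Any] = [
    "NSSpeechRecognitionUsageDescription": "Used to recognize voice commands."
]
let missing = missingUsageKeys(
    in: sampleInfo,
    required: ["NSSpeechRecognitionUsageDescription", "NSMicrophoneUsageDescription"]
)
print(missing)
```

In the app you would call it once at launch and log or assert on the result before requesting authorization.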

Train the data

Fortunately, we don’t have to record every word hundreds of times to train the model 🙂 Apple provides SFCustomLanguageModelData, which makes it easy to create phrase counts, templates of phrase counts, or even custom pronunciations using graphemes and phonemes.

To generate the training data, let’s create a separate Command Line application and call it ModelGenerator.

import Foundation
import Speech


let data = SFCustomLanguageModelData(locale: Locale(identifier: "en_US"), identifier: "site.modista.speechRecognizer.SpeechRecognition", version: "1.0") {

    SFCustomLanguageModelData.PhraseCountsFromTemplates(classes: [
        "color": ["red", "blue", "green", "yellow"]
    ]) {
        SFCustomLanguageModelData.TemplatePhraseCountGenerator.Template(
            "Set background to <color>",
            count: 10_000
        )
    }
    SFCustomLanguageModelData.PhraseCountsFromTemplates(classes: [
        "element": ["fire", "sea"]
    ]) {
        SFCustomLanguageModelData.TemplatePhraseCountGenerator.Template(
            "I am <element>",
            count: 10_000
        )
    }
    
    SFCustomLanguageModelData.CustomPronunciation(grapheme: "Aw Aw R", phonemes: ["aU aU @r"])
    
    SFCustomLanguageModelData.PhraseCount(phrase: "See the sea", count: 100)
    
}

try await data.export(to: URL(filePath: "<path>/MLData.bin"))

We will build an app that sets the background based on a color (wow…). First, we use a template to teach the model the sentence “Set background to <color>”. The template expands into one phrase per color:
1. Set background to blue
2. Set background to red
and so on.
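To make the expansion concrete, here is a minimal sketch of what a template conceptually turns into. The expand function is purely illustrative and is not how the Speech framework implements templates; it only substitutes each class value for its placeholder.

```swift
import Foundation

/// Expands every "<name>" placeholder in `template` with each value of the
/// matching class. Illustrative only; not the framework's implementation.
func expand(template: String, classes: [String: [String]]) -> [String] {
    var phrases = [template]
    for (name, values) in classes {
        let placeholder = "<\(name)>"
        phrases = phrases.flatMap { phrase -> [String] in
            // Phrases without this placeholder pass through unchanged.
            guard phrase.contains(placeholder) else { return [phrase] }
            return values.map { phrase.replacingOccurrences(of: placeholder, with: $0) }
        }
    }
    return phrases
}

let phrases = expand(
    template: "Set background to <color>",
    classes: ["color": ["red", "blue", "green", "yellow"]]
)
phrases.forEach { print($0) }
```

With the four colors from ModelGenerator this yields four phrases, one per color.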

The second template teaches the model phrases built from an element. For “I am sea” the background is set to blue, and for “I am fire” it is set to orange. Resolving the “sea” element can be tricky. Speech recognition should correctly pick “sea” in a phrase like “sea of something”, but for “I am sea” it will more likely pick “I am see”. When several words share the same pronunciation, the speech recognizer tries to choose between them based on the context of the sentence. That’s where custom models come in: they shift the weight towards “I am sea” instead of “I am see”.

At the end there’s an example of a custom pronunciation and a plain phrase without templates for more static usage.

Update the last line to choose where to save the model, run the project, and add MLData.bin to the SpeechRecognition app.
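If you would rather not edit the source each time, one option is to take the destination from the command line. The sketch below assumes a hypothetical exportURL helper and falls back to the current directory; neither is part of the original ModelGenerator.

```swift
import Foundation

/// Picks the export location: the first command-line argument if given,
/// otherwise MLData.bin inside `fallbackDirectory`. Hypothetical helper.
func exportURL(arguments: [String], fallbackDirectory: URL) -> URL {
    if arguments.count > 1 {
        return URL(fileURLWithPath: arguments[1])
    }
    return fallbackDirectory.appendingPathComponent("MLData.bin")
}

let target = exportURL(
    arguments: CommandLine.arguments,
    fallbackDirectory: URL(fileURLWithPath: FileManager.default.currentDirectoryPath)
)
print("Model data will be written to \(target.path)")
// In ModelGenerator this would replace the hard-coded path:
// try await data.export(to: target)
```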

Import training model

Now that we have prepared the training data, we need to import it into the app.
First, we set up the configuration object.

// ... state variables
private var lmConfiguration: SFSpeechLanguageModel.Configuration {
    let outputDir = FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask).first!
    let dynamicLanguageModel = outputDir.appendingPathComponent("LM")
    let dynamicVocabulary = outputDir.appendingPathComponent("Vocab")
    return SFSpeechLanguageModel.Configuration(languageModel: dynamicLanguageModel, vocabulary: dynamicVocabulary)
}

Then, if the user grants authorization to use speech recognition, we read the file that was generated by ModelGenerator and prepare it with that configuration. The prepared language model and vocabulary end up at the two URLs from the configuration.


// ... requestAuthorization() ...
switch authStatus {
  case .authorized:
    self.setUpTrainingData()
// ... 
    
private func setUpTrainingData() {
    Task.detached {
        do {
            self.recordButtonLabel = "Initiating training model"
            let assetUrl = URL(fileURLWithPath: Bundle.main.path(forResource: "MLData", ofType: "bin")!)
            try await SFSpeechLanguageModel.prepareCustomLanguageModel(for: assetUrl,
                                                                       clientIdentifier: "site.modista.speechRecognizer.SpeechRecognition",
                                                                       configuration: self.lmConfiguration)
            
            self.recordButtonEnabled = true
            self.recordButtonLabel = "Start recording"
        } catch {
            NSLog("Failed to prepare custom LM: \(error.localizedDescription)")
        }
    }
}
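Note that the catch branch above only logs the error, so on failure the button stays disabled with the “Initiating training model” label. One possible way to make all outcomes visible is to model the preparation step as a small state type and derive the button from it. ModelState and its labels are assumptions for illustration, not part of the original app.

```swift
import Foundation

/// Hypothetical UI state for the model preparation step.
enum ModelState: Equatable {
    case preparing
    case ready
    case failed(String)

    /// Text shown on the record button.
    var buttonLabel: String {
        switch self {
        case .preparing: return "Initiating training model"
        case .ready: return "Start recording"
        case .failed(let reason): return "Model unavailable: \(reason)"
        }
    }

    /// Recording only makes sense once the model is ready.
    var buttonEnabled: Bool { self == .ready }
}

let states: [ModelState] = [.preparing, .ready, .failed("file not found")]
for state in states {
    print(state.buttonLabel, state.buttonEnabled)
}
```

In setUpTrainingData you would set the state to .preparing before the call, .ready after it, and .failed(error.localizedDescription) in the catch branch.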

Fire up microphone

To use the microphone we need to set up two classes, AVAudioEngine and AVAudioSession, and of course add one more permission key to the .plist file: “NSMicrophoneUsageDescription”.

AVAudioSession is responsible for communicating audio between the app and the system. It mediates access to audio hardware that is shared with other software on the phone, so starting the session can sometimes fail, for example if you try to launch it while talking on the phone.

AVAudioEngine is more of a low-level connector. This class allows direct access to the microphone input and forwards it through an audio processing chain.

Let’s write a method initiateRecording that will run when the record button is tapped.

let audioSession = AVAudioSession.sharedInstance()

do {
    try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
} catch {
    self.showAlert("Microphone is not available")
    return
}
let inputNode = audioEngine.inputNode

First we get the shared instance of AVAudioSession. Then we configure it for recording in measurement mode, which minimizes the system’s own processing of the input signal. .duckOthers is supposed to lower the volume of any other active sounds on the phone.

The setActive method activates the audio session. If the microphone is for some reason not available, we display a warning alert.

Finally we get the AVAudioInputNode, which represents the connection to the hardware audio input. That can be either the device’s built-in microphone or any active microphone connected to the device.

Speech recognition

Now we are ready to include speech recognition in the audio chain. Let’s add a new method that will be responsible for setting up speech recognition. First of all, we need to set up a speech recognition request.

@State private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?

//....
private func setupSpeechRecognition() {
    self.recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    guard let recognitionRequest = self.recognitionRequest else {
        
        return
    }
    recognitionRequest.shouldReportPartialResults = true
    recognitionRequest.requiresOnDeviceRecognition = true
    recognitionRequest.customizedLanguageModel = self.lmConfiguration
}

We want to get immediate results as soon as speech recognition finds some segments, so shouldReportPartialResults is set to true. requiresOnDeviceRecognition tells the framework not to send audio over the network. The final piece of configuration is setting the language model with the customized data from ModelGenerator.

SFSpeechRecognizer

Now let’s create an SFSpeechRecognizer, which is responsible for checking whether speech recognition is available and for initiating the recognition. To react to availability changes we need to implement the SFSpeechRecognizerDelegate protocol. Rather than doing that in the view, let’s introduce a view model that contains all the elements we are going to display in the UI: the record button, the background color, the last text received from speech recognition, and an optional alert in case anything goes wrong.

public class ContentViewModel: NSObject, SFSpeechRecognizerDelegate, ObservableObject{
    @Published var recordButtonLabel = ""
    @Published var recordButtonEnabled = false
    @Published var alertVisible = false
    @Published var alertText = ""
    @Published var backgroundColor: Color = .white
    @Published var lastText: String = ""
    
    
    let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    
    public override init() {
        super.init()
        // Without a delegate, availabilityDidChange is never called.
        speechRecognizer.delegate = self
    }
    
    public func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool) {
        if available {
            self.recordButtonLabel = "Start recording"
            self.recordButtonEnabled = true
        } else {
            self.recordButtonLabel = "Recognition not available"
            self.recordButtonEnabled = false
        }
    }
    public func markButtonActive() {
        self.recordButtonLabel = "Start recording"
        self.recordButtonEnabled = true
    }
}

SFSpeechRecognizerDelegate method speechRecognizer looks for any change in the availability of speech recognizer and changes the record button status.

Now we can update ContentView with action and alert.

    var body: some View {
        VStack {
            Text(viewModel.lastText)
            Button(action: {
                audioEngine.isRunning ? stopRecording() : initiateRecording()
            }, label: {
                Text(viewModel.recordButtonLabel)
            })
            .disabled(!viewModel.recordButtonEnabled)
        }
        .padding()
        .onAppear(){
            self.requestAuthorization()
            
        }
        .background(viewModel.backgroundColor)
        .alert(viewModel.alertText, isPresented: $viewModel.alertVisible, actions: {
                
            }
        )
    }

Setting up speech recognition task

We are now ready to write the initiateRecording method. It sets up the audio session and speech recognition, and finally starts a speech recognition task. Once everything is prepared, we take the input node’s output format, install a tap that feeds audio buffers into the request, and then start the audio engine.


    private func initiateRecording() {
        let audioSession = AVAudioSession.sharedInstance()
        
        do {
            try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
            try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        } catch {
            self.showAlert("Microphone is not available")
            return
        }
        let inputNode = audioEngine.inputNode
        
        setupSpeechRecognition()
        
        recognitionTask = viewModel.speechRecognizer.recognitionTask(with: recognitionRequest!) { result, error in
            var isFinal = false
            
            if let result = result {
                
                
                isFinal = result.isFinal
                
                if (result.bestTranscription.segments.count == 4 && result.bestTranscription.formattedString.contains("Set background to")){
                    switch result.bestTranscription.segments.last?.substring{
                    case "green":
                        viewModel.backgroundColor = .green
                    case "blue":
                        viewModel.backgroundColor = .blue
                    case "red":
                        viewModel.backgroundColor = .red
                    default:
                        print("Color not recognized")
                    }
                }
                if (result.bestTranscription.segments.count == 3 && result.bestTranscription.formattedString.contains("I am")){
                    switch result.bestTranscription.segments.last?.substring{
                    case "sea":
                        viewModel.backgroundColor = .blue
                    case "fire":
                        viewModel.backgroundColor = .orange
                    default:
                        print("Be whoever you want to be")
                    }
                }
                viewModel.lastText = result.bestTranscription.formattedString
            }
            
            if error != nil || isFinal {
                // Recognition failed or finished: tear down the audio chain.
                stopRecording()
            }
        }
        
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
                self.recognitionRequest?.append(buffer)
        }


        audioEngine.prepare()
        do {
            try audioEngine.start()
            self.viewModel.recordButtonLabel = "Recording. Tap to stop"
        } catch {
            print("Something's not right")
        }
        
    }

Inside the recognition task we receive a result with a transcript. Each transcript is partitioned into segments, and each recognized word is one segment. If there are 4 segments and the text contains “Set background to”, we check whether the last segment is one of the supported colors.
The same goes for the case with 3 segments where the text contains “I am”.
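The matching rules described above can also be pulled out of the closure into a plain function, which makes them easy to test without a microphone. This is a sketch under assumptions: VoiceCommand and parseCommand are hypothetical names, and the comparison is made case-insensitive, which the original check is not.

```swift
import Foundation

/// Commands the demo understands. Names are illustrative.
enum VoiceCommand: Equatable {
    case background(color: String)
    case element(name: String)
}

/// Maps recognized words (one per segment) to a command, ignoring case.
func parseCommand(words: [String]) -> VoiceCommand? {
    let lower = words.map { $0.lowercased() }
    let colors: Set<String> = ["red", "green", "blue"]
    let elements: Set<String> = ["sea", "fire"]

    // "Set background to <color>": exactly four segments.
    if lower.count == 4, Array(lower.prefix(3)) == ["set", "background", "to"],
       colors.contains(lower[3]) {
        return .background(color: lower[3])
    }
    // "I am <element>": exactly three segments.
    if lower.count == 3, Array(lower.prefix(2)) == ["i", "am"],
       elements.contains(lower[2]) {
        return .element(name: lower[2])
    }
    return nil
}

print(parseCommand(words: ["Set", "background", "to", "green"]) as Any)
print(parseCommand(words: ["I", "am", "sea"]) as Any)
print(parseCommand(words: ["Hello", "there"]) as Any)
```

In the app the words would come from result.bestTranscription.segments.map { $0.substring }, and the returned command would be mapped to a Color in the view model.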

The final method to add is stopRecording. It stops the audio engine, removes the tap from the audio chain, and frees resources.


    
    private func stopRecording() {
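        self.audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        
        // Tell the recognizer that no more audio is coming before releasing the request.
        self.recognitionRequest?.endAudio()
        self.recognitionRequest = nil
        self.recognitionTask = nil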
        
        self.viewModel.recordButtonEnabled = true
        self.viewModel.recordButtonLabel = "Start recording"
    }

Summary

In this post I showed how to set up speech recognition in iOS using live recording. Apple’s built-in speech services can recognize many spoken languages, but only one can be used at a time. They also provide a way to bring custom language models into an app.

Using speech recognition, you can also build your own Siri-like experience for users within your app. Although in the example we disallowed sending audio over the internet, it’s worth mentioning that using Apple’s web services may provide better accuracy.

