CY: Playing YouTube videos with Voice commands

I love music, and I am also lazy. So, amongst the many things I have always imagined asking the Genie for (once I get lucky with the lamp :)), this one has been a constant – playing good music as I lie cosily on the couch sipping my mojito. Well, till the Genie shows up, let’s try and make do with our CipherYogi. He might not be residing in a lamp or sporting a cool goatee (or maybe he does!), but he can surely do the job for us – minus the mojito though 🙁

So, now we teach CipherYogi to play YouTube videos on a Chromecast following a voice command.

import json
import time  # used by time.sleep() in CipherYogi()

import speech_recognition as sr
import pyttsx3
import pychromecast
from pychromecast.controllers.youtube import YouTubeController
from youtubesearchpython import SearchVideos

We start by importing the required dependencies. We will use speech_recognition to recognize speech input. Since I plan to use the microphone, I have also installed PyAudio; you can use other audio sources too. Further, I shall be using the Google recognizer, recognize_google. You have other options too, e.g. Bing, Sphinx, etc. You can check out the library documentation and make your choice. We will use pyttsx3 to give our Genie a voice :). We can simply pass a string and have it voiced out. You can configure a bunch of things, e.g. volume level, voice (male/female), etc. If interested, check out its documentation and examples. Next, I have imported pychromecast and YouTubeController to locate Chromecasts on my (Wi-Fi) network and to play and control YouTube videos. SearchVideos from youtubesearchpython will run the YouTube search, and json, as you might have guessed, will help us parse the search results to locate the requisite details. All right then! Let’s set sail!

def SpeakText(voiceinput): 
    engine = pyttsx3.init() 
    engine.say(voiceinput)  
    engine.runAndWait() 

To begin with, let’s start building the services that we intend to reuse. In the code above, I have made a super simple voice output service called SpeakText that takes a string as input and speaks it. Another approach could be to initialize the engine once and keep invoking it, but from what I gather from a few articles, this isn’t the best way to do it; you would rather reinitialize the engine every time you need voice output.

def PlayYoutube():    
    attempts = 0
    r_yt = sr.Recognizer()
    while(1):            
        try: 
            with sr.Microphone() as source2: 
                SpeakText("Which YouTube video do you want me to play")
                audio2 = r_yt.listen(source2)
                search_input = r_yt.recognize_google(audio2)
                search_input = search_input.lower()
                SpeakText("Great, playing now " + search_input)
                break                
        except Exception:  # any failure counts as an attempt, but Ctrl-C still works
            SpeakText("Sorry, I could not understand that")
            attempts = attempts + 1 
            if(attempts == 3):
                SpeakText("I am sorry. Please start over")
                return       
    
    my_device = "Family room TV"
    chromecasts = pychromecast.get_chromecasts()
    cast = next(cc for cc in list(chromecasts)[0] if cc.device.friendly_name == my_device)
    cast.wait()
    mc = cast.media_controller 

    search = SearchVideos(search_input, offset = 1, mode = "json", max_results = 3)
    results = json.loads(search.result())
    video_id = results['search_result'][0]['id']

    yt = YouTubeController()
    cast.register_handler(yt)
    yt.play_video(video_id)
    mc.block_until_active()
    mc.play()

Brilliant! So, now we will build the core: PlayYoutube. Once invoked, it initializes a counter to track the number of unsuccessful attempts; I allow up to 3 here. It then initializes a Recognizer object, which will process the microphone input. Inside an infinite loop, we open the microphone (using sr.Microphone()), prompt the user with SpeakText() to say which video he/she wants to play, and capture the reply with the Recognizer’s listen() function. Once the input audio has been captured, we use Google’s recognizer to convert it to text (search_input) and exit the loop to start processing the text, which names the video the user wants to play. Any exception in this process is counted as an unsuccessful attempt, and after the third one the function gives up and returns.
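
The attempt counter described above can be pulled out into a small helper. This is only a sketch of the pattern, not the code from PlayYoutube: ask_with_retries and fake_listen are hypothetical names, and the fake listener stands in for the microphone so the example runs without any audio hardware.

```python
def ask_with_retries(listen_once, max_attempts=3):
    """Call listen_once() until it succeeds or max_attempts failures occur.

    listen_once is a hypothetical callable that returns the recognized text
    or raises an exception when the speech could not be understood.
    Returns the lower-cased text, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return listen_once().lower()
        except Exception:
            # Count this as a failed attempt and try again.
            print(f"Attempt {attempt} of {max_attempts} failed")
    return None


# Stand-in for the microphone: fails twice, then "hears" a request.
responses = iter([ValueError("noise"), ValueError("noise"), "Lofi Beats"])


def fake_listen():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


result = ask_with_retries(fake_listen)
print(result)
```

Returning None instead of speaking an apology inside the helper leaves it to the caller to decide what to say when all attempts fail.
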
Once we have the text string of the video the user wants to play, we use get_chromecasts() to locate the Chromecast devices on the local network. As the indexing in the code suggests, the return value is not a flat list of devices: its first element holds the discovered Chromecast objects, so we iterate over that element and zero in on the one whose friendly name matches the device we want to cast to. We capture that Chromecast object in the variable cast. We then call wait() to make sure it is ready, and take its controller from .media_controller, to be used to control playback of the video. We will need it after we have located the video on YouTube.
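
One thing to watch in the device lookup: next() over a generator raises StopIteration when nothing matches, for example if the TV is switched off or the name is misspelled. Below is a minimal sketch of a more forgiving lookup. FakeCast is a made-up stand-in with only a friendly_name field, used so the example runs without a Chromecast on the network; real pychromecast objects expose the name differently depending on the library version.

```python
from collections import namedtuple

# Stand-in for a discovered device; real pychromecast objects carry more state.
FakeCast = namedtuple("FakeCast", ["friendly_name"])


def pick_device(devices, wanted_name):
    """Return the first device whose friendly name matches, else None."""
    return next((d for d in devices if d.friendly_name == wanted_name), None)


devices = [FakeCast("Bedroom speaker"), FakeCast("Family room TV")]
chosen = pick_device(devices, "Family room TV")
missing = pick_device(devices, "Garage TV")  # no such device, so this is None
```

With a None result in hand, the caller can voice a friendly "I could not find that device" message instead of crashing.
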
So, next we look up the video by passing search_input to SearchVideos. We capture the response in the variable search, which is a class object. I haven’t ventured into understanding more of this object’s attributes, except for what is required for my objective here (told you, I am lazy 🙂 ). The result() method is what I need: it gives me the search results as a string, which I parse using json.loads(). We then get a dictionary with the key search_result, whose value is a list of search results, each one a dictionary. These results have the following keys: dict_keys(['index', 'id', 'link', 'title', 'channel', 'duration', 'views', 'thumbnails', 'channelId']). In our case, we need 'id' for playing the video. Btw, in case you have YouTube content of your own and want to track how it’s faring, youtubesearchpython could be an excellent place to start.
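
To make the parsing step concrete, here is a small sketch that works on a hand-written JSON string shaped like the result described above. The sample data (ids and titles) is invented for illustration, and first_video_id is a hypothetical helper name; it also covers the case where the search comes back empty, which the indexing in PlayYoutube would not survive.

```python
import json

# Hypothetical sample shaped like the result described above; values are made up.
sample = json.dumps({
    "search_result": [
        {"index": 0, "id": "abc123XYZ_0", "title": "Example video one"},
        {"index": 1, "id": "def456UVW_1", "title": "Example video two"},
    ]
})


def first_video_id(raw_json):
    """Return the 'id' of the first search result, or None if there is none."""
    results = json.loads(raw_json).get("search_result", [])
    if not results:
        return None
    return results[0].get("id")


video_id = first_video_id(sample)
empty_id = first_video_id(json.dumps({"search_result": []}))
```
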
Okay! So the dish is now ready, we just need to plate it up. We create a YouTubeController object and pass it to the register_handler() function of the cast object. This lets us drive YouTube playback on the Chromecast. And so, we simply start the video using play_video() and stream it on the Chromecast using the play() method of media_controller, after block_until_active() has waited for the controller to become active.

def CipherYogi():
    # FaceinVideoStream() and ObjFinder() are not defined in this post;
    # they are assumed to be available from elsewhere in the project.
    SpeakText("Please look at camera for Authentication")
    AKSHAT_AUTH = FaceinVideoStream()
    time.sleep(5)
    if AKSHAT_AUTH:
        SpeakText("Akshat Authenticated")
        ObjFinder()
        SpeakText("Hello Akshat what do you want me to do")
        r = sr.Recognizer()
        while True:
            try:
                with sr.Microphone() as audio_source:
                    # Briefly sample background noise to calibrate the recognizer
                    r.adjust_for_ambient_noise(audio_source, duration=0.2)
                    input_audio = r.listen(audio_source)
                    AudioText = r.recognize_google(input_audio)
                    AudioText = AudioText.lower()
                    print("Hi Akshat, Did you say " + AudioText)
                    SpeakText(AudioText)
                    if "youtube" in AudioText:
                        PlayYoutube()
            except sr.RequestError as e:
                print("Could not request results; {0}".format(e))
            except sr.UnknownValueError:
                print("Could not understand the audio")

If you have followed it till here, this one must be super easy. All I do here is read the microphone input and check whether the user (which is me 🙂 ) has said “YouTube” in it. If yes, that becomes the trigger to call the PlayYoutube() function.
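
If CipherYogi learns more tricks later, the single "youtube" check could grow into a small lookup from trigger words to functions. The sketch below is one possible way to do that, not what the code above does; find_handler and play_youtube_stub are hypothetical names, and the stub stands in for PlayYoutube so the example runs without a microphone or Chromecast.

```python
def find_handler(spoken_text, handlers):
    """Return the handler for the first trigger word found in spoken_text, else None."""
    text = spoken_text.lower()
    for keyword, handler in handlers.items():
        if keyword in text:
            return handler
    return None


# Hypothetical handler; in the post, "youtube" would map to PlayYoutube.
def play_youtube_stub():
    return "playing youtube"


handlers = {"youtube": play_youtube_stub}

handler = find_handler("Please open YouTube", handlers)
outcome = handler() if handler else "no match"
```

Adding a new skill then means adding one entry to the handlers dictionary rather than another if branch.
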

So, here is time to… (Woo-Hoo!)

CipherYogi()
