Voice controlled Selenium automation using DialogFlow: Part II

In the first part of this tutorial, we saw how to set up the DialogFlow conversation model. In this part, we will see how to set up a Selenium server, code the web automation application that uses this server and, finally, plug the application into our DialogFlow conversation model.

For the purpose of this tutorial, let's configure the Selenium server on the target machine itself and, to make things simpler, let's make the target machine our own computer. You will see in a minute how this setup simplifies the process.

Selenium server setup

Download the files selenium-server-standalone-2.53.0.jar and chromedriver.exe to your machine and run the command below to start the Selenium server on your machine with chromedriver as the Selenium webdriver.

java -jar <PATH_TO_SELENIUM_SERVER_JAR>\selenium-server-standalone-2.53.0.jar -Dwebdriver.chrome.driver=<PATH_TO_CHROMEDRIVER>\chromedriver.exe

Now your machine is both the Selenium server and the target machine. If your machine sits on a LAN behind a router, you might want to forward a port on the router to the Selenium server you just started (it listens on port 4444 by default) so that the Selenium app running on Heroku can reach the server from outside the LAN.
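Before moving on, it can help to confirm that the server is actually reachable. The snippet below is a minimal sketch, assuming the server exposes the usual WebDriver status endpoint at /wd/hub/status on port 4444; hubStatusUrl and checkHub are hypothetical helper names, not part of Selenium.

```javascript
const http = require('http');

// Build the status URL of the Selenium server (hypothetical helper).
// Assumption: the server answers on /wd/hub/status, port 4444 by default.
function hubStatusUrl(host, port) {
  return 'http://' + host + ':' + (port || 4444) + '/wd/hub/status';
}

// Request the status endpoint and resolve with the HTTP status code.
function checkHub(host, port) {
  return new Promise(function (resolve, reject) {
    http.get(hubStatusUrl(host, port), function (res) {
      res.resume(); // discard the body, only the status code is needed
      resolve(res.statusCode);
    }).on('error', reject);
  });
}

// Example usage, with the Selenium server running:
// checkHub('localhost', 4444).then(console.log).catch(console.error);
```

Run it once from the machine itself with localhost and once from outside the LAN with your public address to verify the port forwarding.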

Selenium application logic and setup

Once the user initiates a conversation by saying “open google in chrome”, the open_url_in_browser intent will be triggered and a JSON request corresponding to this intent will be generated. The Selenium application will receive this request, retrieve the parameter values from it and use these values to initiate a chrome browser session and load the URL. Since we have started the Selenium server with chromedriver as the webdriver on our own machine, the browser session will open on our machine as well.

Once the URL is loaded, the user can initiate the next action by saying, for example, “search america”. The application will then parse the JSON request generated by DialogFlow for the operation intent, find the action to be “search” and the value to be “america”, and search for America in the opened browser.

You can see all these actions happening in real time on your machine, since you have configured the target machine to be your own machine. This makes it easy to understand and modify the Selenium script, since you don't need to maintain a separate machine to see all the action happening!

Now let's get our hands dirty with coding the application. I am using VSCode to create the application since it is easy and intuitive to use, but any text editor, even Notepad, will do for writing the script.
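To make the parsing step concrete, here is a small sketch of how the interesting values can be pulled out of such a request. The field names (queryResult.intent.displayName, parameters.op, parameters.keys) are the ones used by entry.js further down; the sample request is trimmed down and hypothetical, and parseRequest is an illustrative helper rather than part of the application.

```javascript
// Pull the fields the application cares about out of a DialogFlow request body.
function parseRequest(body) {
  const queryResult = (body && body.queryResult) || {};
  const parameters = queryResult.parameters || {};
  return {
    intent: queryResult.intent ? queryResult.intent.displayName : undefined,
    op: parameters.op,
    keys: parameters.keys
  };
}

// A trimmed-down, hypothetical request for the utterance "search america".
const sampleRequest = {
  queryResult: {
    queryText: 'search america',
    parameters: { op: 'search', keys: 'america' },
    intent: { displayName: 'operation' }
  }
};

console.log(parseRequest(sampleRequest));
```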

As the entry point to the application, create an index.js file with the below content.

var express = require('express');
var app = express();
var bodyParser = require('body-parser');
// configure body-parser for express
app.use(bodyParser.urlencoded({ extended: false }));
app.use(bodyParser.json());
var server = app.listen(process.env.PORT || 7000, function() {
    console.log("server has started");
});
// routes here
var routes = require('./routes.js');
routes(app);

We can see that an Express server is started on port 7000 (or on the port given in the PORT environment variable, which Heroku sets) and that the routes are declared in the file routes.js. The routes.js file is defined as below.

var entry = require('./entry.js');
module.exports = function(app) {
    // DialogFlow posts its fulfillment requests to '/'
    app.route('/').post(entry.dialogflowFulfillment);
    // simple GET response to check that the app is up
    app.get('/', function(request, response) {
        response.send('hellooo');
    });
}

In the routes file, we direct all POST requests to ‘/’ to the dialogflowFulfillment function defined in the entry.js file. The contents of the entry.js file are as below.

var webdriver = require('selenium-webdriver');
var By = webdriver.By;
var Capabilities = webdriver.Capabilities;
var Builder = webdriver.Builder;
var Key = webdriver.Key;
var until = webdriver.until;
// browser session shared between requests: created by open(), reused by search() and close()
var driver;

exports.dialogflowFulfillment = function(req, res) {
    // start a chrome session on the selenium server and load google
    async function open() {
        driver = new Builder().usingServer('http://***.***.***.***:4444/wd/hub').
        //   withCapabilities({'browserName': 'firefox'}).
        withCapabilities(Capabilities.chrome()).
        //    withCapabilities({'platform':'WIN8_1'}).
        build();
        await driver.get('http://www.google.com');
    }
    // type the keys into the google search box and wait for the results page
    async function search(keys) {
        const element = await driver.findElement(By.name('q'));
        await element.sendKeys(keys, Key.RETURN);
        await driver.wait(until.titleIs(keys + ' - Google Search'), 1000);
    }
    // end the browser session
    async function close() {
        await driver.quit();
    }
    // the browser actions run in the background, so log failures
    // instead of leaving a rejected promise unhandled
    function logError(err) {
        console.error(err);
    }

    // note: speech, displayText and source are DialogFlow v1 response fields;
    // an agent on API v2 expects fulfillmentText instead
    if (req.body.queryResult.intent.displayName == "open_url_in_browser") {
        open().catch(logError);
        return res.json({
            speech: 'Opening google in chrome!',
            displayText: 'Opening google in chrome!',
            source: 'na' });
    }
    if (req.body.queryResult.intent.displayName == "operation" && req.body.queryResult.parameters.op == "search") {
        var enterKeys = req.body.queryResult.parameters.keys;
        search(enterKeys).catch(logError);
        return res.json({
            speech: 'operating on google in chrome!',
            displayText: 'operating on google in chrome! ' + enterKeys,
            source: 'na' });
    }
    if (req.body.queryResult.intent.displayName == "close_browser") {
        close().catch(logError);
        return res.json({
            speech: 'Closing chrome!',
            displayText: 'Closing chrome!',
            source: 'na' });
    }
};

The masked portion (***.***.***.***) is where you define the address of your Selenium server.
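With the three files in place, you can try the webhook locally before involving DialogFlow at all. The script below is a rough sketch that posts a hand-made request to the Express server; it assumes the app was started with node index.js on port 7000, and buildRequest and sendIntent are hypothetical helper names.

```javascript
const http = require('http');

// Build the options and JSON payload for a POST to the local Express server.
function buildRequest(port, intentName, parameters) {
  const payload = JSON.stringify({
    queryResult: {
      intent: { displayName: intentName },
      parameters: parameters || {}
    }
  });
  return {
    payload: payload,
    options: {
      hostname: 'localhost',
      port: port,
      path: '/',
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(payload)
      }
    }
  };
}

// Send the request and resolve with the raw response body.
function sendIntent(port, intentName, parameters) {
  const request = buildRequest(port, intentName, parameters);
  return new Promise(function (resolve, reject) {
    const req = http.request(request.options, function (res) {
      let data = '';
      res.setEncoding('utf8');
      res.on('data', function (chunk) { data += chunk; });
      res.on('end', function () { resolve(data); });
    });
    req.on('error', reject);
    req.end(request.payload);
  });
}

// Example usage, with `node index.js` running in another terminal:
// sendIntent(7000, 'open_url_in_browser').then(console.log).catch(console.error);
```

Send the open_url_in_browser intent first, since the search and close handlers rely on the browser session it creates.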

We can see that there are three if statements, each checking the intent name in the JSON request received from DialogFlow through req.body.queryResult.intent.displayName. If the intent name is open_url_in_browser, the open() function is executed, which opens the chrome browser and loads the URL in it.

When req.body.queryResult.intent.displayName is operation and the browser operation to be performed, req.body.queryResult.parameters.op (the value of the “op” key), is search, the application uses the value of req.body.queryResult.parameters.keys to do a search in the browser.

And, last but not least, if the application identifies the request as the close_browser intent, the driver.quit() statement is executed.
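If you add more intents later, the chain of if statements can be swapped for a lookup table. This is an alternative sketch, not the structure used in entry.js: createDispatcher is a hypothetical helper, the reply texts are illustrative, and the browser actions are passed in so that the sketch runs without a Selenium server.

```javascript
// Map each intent name to a function that runs the action and returns the reply text.
function createDispatcher(actions) {
  const handlers = {
    open_url_in_browser: function () {
      actions.open();
      return 'Opening google in chrome!';
    },
    operation: function (parameters) {
      if (parameters.op !== 'search') {
        return null; // only the search operation is handled in this sketch
      }
      actions.search(parameters.keys);
      return 'operating on google in chrome! ' + parameters.keys;
    },
    close_browser: function () {
      actions.close();
      return 'Closing chrome!';
    }
  };

  return function dispatch(body) {
    const queryResult = body.queryResult;
    const name = queryResult.intent.displayName;
    if (!Object.prototype.hasOwnProperty.call(handlers, name)) {
      return null; // unknown intent
    }
    return handlers[name](queryResult.parameters || {});
  };
}

// Demo with stand-in actions that only log what would happen.
const demoDispatch = createDispatcher({
  open: function () { console.log('open() called'); },
  search: function (keys) { console.log('search() called with ' + keys); },
  close: function () { console.log('close() called'); }
});
console.log(demoDispatch({
  queryResult: {
    intent: { displayName: 'operation' },
    parameters: { op: 'search', keys: 'america' }
  }
}));
```

In the real application, the actions object would carry the open(), search() and close() functions that drive the browser.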

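Each handler should return a JSON object with the parameters speech, displayText and source. The speech parameter provides the voice output from DialogFlow for that handler’s execution. The displayText comes in handy where a UI is also involved, for example on a smart speaker device with a screen. Note that these are the response fields of the DialogFlow v1 webhook format, while req.body.queryResult belongs to the v2 request format; an agent on API v2 expects the reply text in a fulfillmentText field instead.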
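If your agent is on API v2, the reply can be shaped as in the sketch below. It assumes the v2 webhook response format, where fulfillmentText carries the reply and source is an optional label; buildV2Response is a hypothetical helper, not something the application above defines.

```javascript
// Build a webhook reply in the DialogFlow API v2 shape.
function buildV2Response(text) {
  return {
    fulfillmentText: text, // spoken or displayed to the user
    source: 'na'           // optional free-form label for the webhook
  };
}

console.log(JSON.stringify(buildV2Response('Opening google in chrome!')));
```

Inside a handler this would be used as res.json(buildV2Response('Opening google in chrome!')).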

Deployment

Once this project structure is set up, we need to deploy the application as an app in Heroku. For deploying a node.js application in Heroku, you can refer to the Heroku docs. Once the upload and build process in Heroku succeeds, we will get a URL like https://**.herokuapp.com for that application. This URL needs to be provided in the webhook URL field in the Fulfillment section of DialogFlow.

And... we are all set! Now go ahead to DialogFlow and try out the application by speaking or typing the query in the simulator on the right-hand side of the console page. Once you are done testing in the DialogFlow simulator, you can integrate the conversation model into different platforms, including Google Assistant.