How a Computer can Recognize an Image and Tell you what it Sees – [Desktop App]

49 lines of code. That’s all it takes to make a computer look at an image and tell you what it sees in a lifelike-sounding voice.

I don’t have to reinvent the wheel. Others came before me; they made amazing creations. I can stand on their shoulders and praise their work.

I don’t have to write a powerful computer-vision algorithm, or a speech synthesizer, from scratch to do something relevant with programming. I could, but that’s not how I want to spend my time, especially when there are many ready-made ones out there that you can use.

What would it be like if, every time you needed to write code, you had to program in assembly or at some other low level of abstraction? You’d have to write thousands of lines of code for the most basic output, a print statement for example.

Now this doesn’t mean that I’m against that. On the contrary, I am contextually driven. I like to program my own stuff from lower levels of abstraction when I need something very specific and custom-made that hasn’t already been done by others.

A few days ago I posted a video demonstration of a desktop application I coded in Python with tkinter (for the GUI). It takes an image URL as input, applies computer vision to it using IBM Watson, and speaks what it ‘sees’ using Ivona, which mimics natural speech. Here’s the demo:

In this post, I’m going to go through the code, explaining what I did.

Computer Vision + Speech Synthesis – Code Overview

I’ll explain the code by major blocks. First, the prerequisites:

– Watson Visual Recognition API – free tier available – get your API key
– Ivona – discontinued as a standalone service; merged into Amazon Polly
– pyvona – a Python wrapper for Ivona

My specs: Windows 10, 64-bit, Python 3.4.

  1. I begin with the imports. ‘keys’ is a file that holds my API credentials for Watson and Ivona, which I prefer to keep private.
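The post doesn’t show the file itself, but a ‘keys’ module can be as simple as a few assignments kept out of version control (the variable names below are placeholders, not necessarily the ones in the original code):

```python
# keys.py -- private credentials, kept out of version control.
# Variable names here are illustrative placeholders.
watson_key = "YOUR_WATSON_API_KEY"
ivona_access_key = "YOUR_IVONA_ACCESS_KEY"
ivona_secret_key = "YOUR_IVONA_SECRET_KEY"

# In the main script: import keys, then use keys.watson_key, etc.
```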

  2. Building a minimal GUI with tkinter: one label widget and one entry widget for the URL input. I’ll add a button later for the callback function.
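A minimal version of that GUI might look like this (the widget names and window title are my own; the post doesn’t reproduce its exact code here):

```python
import tkinter as tk

def build_gui():
    """Build a window with one label and one entry for the image URL."""
    root = tk.Tk()
    root.title("Image Recognizer")
    tk.Label(root, text="Paste an image URL:").pack(padx=10, pady=5)
    url_entry = tk.Entry(root, width=60)
    url_entry.pack(padx=10, pady=5)
    return root, url_entry

# To try it on its own:
#   root, url_entry = build_gui()
#   root.mainloop()
```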

  3. I create a Watson Visual Recognition instance using the ‘watson_key’ from the ‘keys’ file (replace it with your own API key). I also create an Ivona instance using my Ivona API keys.
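That instantiation can be sketched as follows. The constructor signatures follow the `watson-developer-cloud` and `pyvona` packages as they existed at the time and may differ in later SDK versions; the imports are deferred into the function so the sketch loads even without those packages installed:

```python
def make_services(watson_api_key, ivona_access_key, ivona_secret_key):
    """Create the Watson Visual Recognition and Ivona voice handles."""
    from watson_developer_cloud import VisualRecognitionV3
    import pyvona

    visual_recognition = VisualRecognitionV3('2016-05-20',
                                             api_key=watson_api_key)
    voice = pyvona.create_voice(ivona_access_key, ivona_secret_key)
    return visual_recognition, voice
```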

As of now, you can no longer register for Ivona, since it has been merged into Amazon Polly. Polly also does natural-sounding speech synthesis, and I’ll probably make a video about it in the future.

If you registered with Ivona in the past and still have your keys, you can keep using them, though I don’t know for how long.

  4. This is the main part of the app. Here’s what the callback function does:

– takes the image URL from the entry field and passes it to Watson’s image classification utility
– Watson returns, in JSON format, the image classes it recognized (e.g. tree, person, animal) in descending order of confidence; img1 holds the most confident of these results
– img2 uses PIL to open the image (this step is optional)
– the entry field is cleared so it can conveniently take the next image URL
– Ivona speaks the most confident result (img1)
– additionally, all recognized classes are printed to the console in descending order of confidence
– once Ivona finishes speaking, the image window (in this case Microsoft Photos) is closed.
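The response-parsing part of those steps can be sketched in pure Python. The JSON layout below follows Watson Visual Recognition v3 responses of that era, and the surrounding calls shown as comments (`visual_recognition.classify`, `voice.speak`, `url_entry`) are assumed names, not the post’s exact code:

```python
def top_class(watson_json):
    """Return the highest-confidence class name from a Watson
    Visual Recognition v3 'classify' response (img1 in the post)."""
    classes = watson_json["images"][0]["classifiers"][0]["classes"]
    return max(classes, key=lambda c: c["score"])["class"]

# Inside the callback, the flow would look roughly like:
#   result = visual_recognition.classify(images_url=url_entry.get())
#   img1 = top_class(result)
#   url_entry.delete(0, 'end')         # clear the entry for the next URL
#   for c in sorted(result["images"][0]["classifiers"][0]["classes"],
#                   key=lambda c: c["score"], reverse=True):
#       print(c["class"], c["score"])  # all classes, highest first
#   voice.speak("I see: " + img1)      # Ivona speaks the top result
```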

  5. To keep the app from freezing or becoming unresponsive, I use threading: the entire callback function runs inside a threading function.
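The post doesn’t show the wrapper itself, so here is one common way to sketch it: launch the callback on a daemon thread so the Tk event loop keeps processing GUI events while Watson and Ivona do their slow, network-bound work.

```python
import threading

def run_threaded(fn, *args, **kwargs):
    """Run fn on a daemon thread so the GUI stays responsive
    while the callback waits on Watson and Ivona."""
    t = threading.Thread(target=fn, args=args, kwargs=kwargs, daemon=True)
    t.start()
    return t
```

The button’s command then becomes something like `lambda: run_threaded(callback)`, where `callback` is the main function described above.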

  6. Then I create the button whose command calls the threading function, which, in turn, calls the main callback function.

  7. Finally, as with all tkinter apps, I run the mainloop. For convenience, the cursor is placed in the entry field as soon as the app starts, for immediate and quick pasting (CTRL+V) of the URL.
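A sketch of that wiring, using placeholder names (`callback` for the app’s main function and `run_threaded` for a helper that starts it on a background thread; neither is necessarily what the original code calls them):

```python
import tkinter as tk

def wire_up(root, url_entry, callback, run_threaded):
    """Add the button that fires the threaded callback, then put the
    cursor in the entry so a copied URL can be pasted right away."""
    tk.Button(root, text="Recognize and speak",
              command=lambda: run_threaded(callback)).pack(padx=10, pady=5)
    url_entry.focus_set()

# Then, as with all tkinter apps:
#   wire_up(root, url_entry, callback, run_threaded)
#   root.mainloop()
```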


This is probably one of the most basic, crude, and unpolished applications that can be built with this kind of technology. There are many features, some I can think of and surely more I can’t, that could be added to this desktop application.

For example, I could take a batch of images and organize them into categories or folders based on the classes they were recognized as.

Or I could use video input instead of images and set up an alert for every time an object, person, or action of interest appears in the stream. But that’s a different level of complexity.

Anyway, you can grab the entire code, in one piece, from my GitHub. Feel free to modify it as you wish. Linking back to the repository or to my blog is enough credit for me.

Get on my list of friends
More about my book Stress and Adaptation
More about my book Persistent Fat Loss
More about my book Ketone Power
More about my book Periodic Fasting
