Image recognition within the browser

In this article, I delve further into the realm of ML models operating directly within web browsers. The fascinating world of image recognition has consistently captivated my interest, particularly the concept of submitting an arbitrary image to a computer and witnessing its ability to identify the subject. For fans of the Silicon Valley TV show, who can forget Jian Yang’s innovative SeeFood startup app idea? Upon my initial encounter with the image-classification TensorFlow model, my thoughts immediately related back to the SeeFood app’s remarkable capability: feeding it an image and having the ML model discern the subject.
The How
Imagine an engineer needing to teach a robot to recognize subjects in pictures. There are many steps involved in having the computer make a prediction.
Inspecting Pictures
The robot starts by looking at lots of pictures. These pictures have labels, like “pizza” or “hotdog”, so the robot knows what’s in each one.
Recognizing Patterns
The robot has special glasses (convolutional layers) that help it find patterns, like shapes and colours, in different parts of the pictures.
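To make those “special glasses” concrete, here is a minimal sketch of a single convolutional layer using the TensorFlow.js layers API. The shapes and filter count are illustrative, not what any particular pre-trained model uses:

```typescript
import * as tf from "@tensorflow/tfjs";

// A single convolutional layer: the robot's "special glasses".
// Each of the 16 filters slides over the image looking for one
// kind of local pattern (an edge, a curve, a colour blob).
const conv = tf.layers.conv2d({
  inputShape: [224, 224, 3], // height, width, RGB channels
  filters: 16,               // 16 different pattern detectors
  kernelSize: 3,             // each detector scans 3x3 patches
  activation: "relu",
});

// Passing a batch of one image through it yields 16 "feature maps",
// one per filter, highlighting where that filter's pattern appears.
const image = tf.randomNormal([1, 224, 224, 3]);
const featureMaps = conv.apply(image) as tf.Tensor;
console.log(featureMaps.shape); // [1, 222, 222, 16]
```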
Remembering Important Stuff
It takes note of important things, like ramen being in a bowl or pizza usually being round, as it goes through a large collection of pictures.
Putting Pieces Together
The robot then thinks about all the important things it found and decides, “Hey, this combination of features looks like a hotdog!” (because it’s long, seems to have a split bun and something that looks like a sausage) or “Ah, this looks like a pizza!” (because it’s round and seems to have an outer circle that is like pizza crust).
Making Predictions
Now, the user presents the robot with a new picture without telling it what it is. The robot looks for the patterns it has seen before, remembers what it learned, and makes a guess by mapping known features within the picture.
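In code, that guess boils down to a forward pass followed by a top-k lookup over the class probabilities. A rough sketch, assuming a hypothetical `model` trained on three made-up labels:

```typescript
import * as tf from "@tensorflow/tfjs";

// Hypothetical: assume `model` was trained on these three classes,
// in this exact order.
const LABELS = ["hotdog", "pizza", "ramen"];

async function guess(model: tf.LayersModel, image: tf.Tensor3D) {
  // Add a batch dimension and run the forward pass: the robot maps
  // the features it finds to a probability for each label.
  const probs = model.predict(image.expandDims(0)) as tf.Tensor;
  const { values, indices } = tf.topk(probs, 1);
  const label = LABELS[(await indices.data())[0]];
  const confidence = (await values.data())[0];
  console.log(`I think this is ${label}! (${(confidence * 100).toFixed(1)}%)`);
}
```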
Getting Retrained and Self-Improving
Repeat this process with many pictures, and the robot gets better and better at figuring out what’s in them. It learns to recognize pizza, ramen bowls, pasta, and perhaps an endless variety of food classifications!
The above describes a CNN (Convolutional Neural Network), which is like a smart filter for images. It learns to recognize patterns and features, like shapes and colours, enabling it to understand and classify objects in an image input. Used in image recognition, it’s trained on labeled images to identify and categorize various visual elements, making it a key technology for tasks such as identifying objects in photos or videos. It understands and remembers the important things in pictures and uses them for image classification. This is the robot processing the input and feeding back, “I think this picture is a hotdog!” or “That one looks like a pizza slice!”
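Putting those pieces into code, a toy CNN of this shape takes only a few lines with TensorFlow.js. This is not the architecture the demo below uses, just an illustrative stack of convolution, pooling, and a softmax classifier:

```typescript
import * as tf from "@tensorflow/tfjs";

const NUM_CLASSES = 3; // e.g. hotdog, pizza, ramen (illustrative)

// Convolutions find local patterns, pooling keeps the strongest
// responses, and the final dense layer combines everything into
// one score per food class.
const model = tf.sequential({
  layers: [
    tf.layers.conv2d({
      inputShape: [224, 224, 3],
      filters: 16,
      kernelSize: 3,
      activation: "relu",
    }),
    tf.layers.maxPooling2d({ poolSize: 2 }),
    tf.layers.conv2d({ filters: 32, kernelSize: 3, activation: "relu" }),
    tf.layers.maxPooling2d({ poolSize: 2 }),
    tf.layers.flatten(),
    tf.layers.dense({ units: NUM_CLASSES, activation: "softmax" }),
  ],
});

// Trained on labeled pictures, the softmax output becomes the robot's
// "I think this picture is a hotdog!" confidence.
model.compile({ optimizer: "adam", loss: "categoricalCrossentropy" });
model.summary();
```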
Demo
This sounds really darn cool. How will an engineer build it?
Luckily there are great examples on the Web of how to build something like this. I tried it out myself and built a very simple React app that does exactly this, all within the browser, using a pre-trained image-classification model from TensorFlow, with no round-trips to any servers aside from loading the model files.
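The core of it is surprisingly small. A sketch of the essential calls, assuming the `@tensorflow-models/mobilenet` package that wraps the pre-trained model:

```typescript
import * as mobilenet from "@tensorflow-models/mobilenet";

async function classifyPhoto(img: HTMLImageElement) {
  // Fetching the model files is the only network round-trip;
  // every prediction afterwards runs locally in the browser.
  const model = await mobilenet.load();

  // Returns the top matches with confidence scores, e.g.
  // [{ className: "hotdog, hot dog, red hot", probability: 0.93 }, ...]
  return model.classify(img);
}
```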
So the application flow looks like the following:

The steps are:
- The user loads the application.
- The application loads the pre-trained MobileNet image-classification CNN model. MobileNet is interesting because it is designed to run its computations using only the power of a mobile device.
- The user supplies an image.
- The browser renders the uploaded image blob. Once rendered and loaded, the image data is sent for prediction against the loaded image-classification model.
- The results are returned with confidence levels and rendered as feedback to the user (see the component sketch below).
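Stitched together as a React component, the flow might look roughly like the following. Component and element names are illustrative, and a real app would load the model once up front rather than on every image:

```tsx
import { ChangeEvent, useRef, useState } from "react";
import * as mobilenet from "@tensorflow-models/mobilenet";

export function SeeFood() {
  const imgRef = useRef<HTMLImageElement>(null);
  const [result, setResult] = useState("");

  // The user supplies an image; render it from the uploaded blob.
  const onUpload = (e: ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0];
    if (file && imgRef.current) {
      imgRef.current.src = URL.createObjectURL(file);
    }
  };

  // Once the image has rendered and loaded, send it for prediction
  // and show the top result with its confidence level.
  const onImageLoaded = async () => {
    const model = await mobilenet.load(); // cache this in a real app
    const [top] = await model.classify(imgRef.current!);
    setResult(`${top.className} (${(top.probability * 100).toFixed(1)}%)`);
  };

  return (
    <div>
      <input type="file" accept="image/*" onChange={onUpload} />
      <img ref={imgRef} onLoad={onImageLoaded} alt="uploaded food" />
      <p>{result}</p>
    </div>
  );
}
```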
Codesandbox example
Live demo:
Following are some cool sample result screenshots captured from the demo SeeFood application. I uploaded a couple of food images I found on the web, and it was able to “confidently” predict the subject of each.


Parting thoughts
Image classification stands out as a clever means of user interaction with a computer interface. The seamless operation of these models in both browsers and on mobile devices opens up a realm of possibilities for user input. In this demonstration, I focused on utilizing static images for computer predictions. Imagine the engagement for end-users if this capability were extended to live video streams. Additionally, it’s worth noting that the showcased demo application employed a generically trained image-classification model from TensorFlow, not a specialized food-image Convolutional Neural Network (CNN) model trained on a dataset like Food101. With training tailored to such specific needs, the results are likely to be surprisingly accurate. This is just the initial iteration in recreating a mimicked SeeFood app. Future articles will delve into training these models and integrating them with TensorFlow.js.