Image recognition within the browser

Kevin Wong
4 min read · Dec 20, 2023


Image classification application — DALL·E 3 generated

In this article, I delve deeper into the realm of ML models operating directly within web browsers. The world of image recognition has always captivated me, particularly the idea of handing a computer an arbitrary image and watching it identify the subject. Fans of the Silicon Valley TV show will remember Jian Yang’s innovative SeeFood startup app idea. Upon my initial encounter with the image-classification TensorFlow model, my thoughts immediately went back to the SeeFood app’s remarkable capability: feed it an image and have the ML model discern the subject.

The How

Imagine an engineer needing to teach a robot to recognize subjects in pictures. There are many steps involved before the computer can make a prediction.

Inspecting Pictures

The robot starts by looking at lots of pictures. These pictures have labels, like “pizza” or “hotdog”, so the robot knows what’s in each one.
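
In code, those labels are typically turned into one-hot vectors that the network can score its guesses against. A minimal TensorFlow.js sketch, with a made-up label list and indices purely for illustration:

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical label list for a tiny food dataset.
const LABELS = ['pizza', 'hotdog', 'ramen'];

// Four example pictures, each tagged with the index of its label.
const labelIndices = tf.tensor1d([0, 1, 2, 0], 'int32');

// One-hot vectors the network can compare its guesses against:
// [[1,0,0], [0,1,0], [0,0,1], [1,0,0]]
const oneHotLabels = tf.oneHot(labelIndices, LABELS.length);
```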

Recognizing Patterns

The robot has special glasses (convolutional layers) that help it find patterns, like shapes and colours, in different parts of the pictures.
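
In TensorFlow.js terms, those special glasses map to a convolutional layer. A small sketch, where the 224x224 input and 16 filters are assumed sizes, not anything prescribed:

```ts
import * as tf from '@tensorflow/tfjs';

// Sixteen small 3x3 "glasses" (filters) slide across a 224x224 RGB image,
// each lighting up where it finds its own local pattern (an edge, a curve).
const conv = tf.layers.conv2d({
  inputShape: [224, 224, 3],
  filters: 16,
  kernelSize: 3,
  activation: 'relu',
});
```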

Remembering Important Stuff

It takes note of important things, like that ramen comes in a bowl or that pizza is usually round, as it works through a large collection of pictures.
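
One common mechanism for remembering the important stuff while shrinking the picture is pooling, which keeps only the strongest responses. A minimal TensorFlow.js sketch (pooling is just one such mechanism):

```ts
import * as tf from '@tensorflow/tfjs';

// Max pooling keeps only the strongest response in each 2x2 patch,
// shrinking the picture while retaining the important activations.
const pool = tf.layers.maxPooling2d({ poolSize: 2, strides: 2 });
```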

Putting Pieces Together

The robot then thinks about all the important things it found and decides, “Hey, this combination of features looks like a hotdog!” (because it’s long, seems to have a split bun and something that looks like a sausage) or “Ah, this looks like a pizza!” (because it’s round and seems to have an outer circle that is like pizza crust).
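
In network terms, this step is usually a flatten plus a dense softmax layer that weighs all the detected features and votes on a class. A sketch, assuming three classes for illustration:

```ts
import * as tf from '@tensorflow/tfjs';

// Flatten the 2D feature maps into one long vector, then let a dense
// softmax layer weigh all the detected features and vote on a class.
const flatten = tf.layers.flatten();
const decide = tf.layers.dense({ units: 3, activation: 'softmax' }); // pizza / hotdog / ramen
```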

Making Predictions

Now, the user presents the robot with a new picture without telling it what it is. The robot looks at previous patterns, remembers what it learned, and makes a guess by mapping known features within the picture.
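
In TensorFlow.js, that guess is a predict call on a preprocessed image tensor. A sketch, where the already-trained model and the loaded <img> element are assumptions:

```ts
import * as tf from '@tensorflow/tfjs';

// `model` is an already-trained tf.LayersModel and `img` a loaded <img>
// element; both are assumptions for this sketch.
async function guess(model: tf.LayersModel, img: HTMLImageElement): Promise<number> {
  const x = tf.tidy(() => {
    const pixels = tf.browser.fromPixels(img);                   // pixels -> tensor
    const resized = tf.image.resizeBilinear(pixels, [224, 224]); // match input size
    return resized.div(255).expandDims(0);                       // scale, add batch dim
  });
  const probs = model.predict(x) as tf.Tensor;
  const idx = probs.argMax(-1);                                  // index of the top class
  const best = (await idx.data())[0];
  tf.dispose([x, probs, idx]);                                   // free GPU memory
  return best;
}
```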

Getting Retrained and Self-Improving

Repeat this process with many pictures, and the robot gets better and better at figuring out what’s in them. It learns to recognize pizza, ramen, pasta, and perhaps an endless variety of foods!
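
That repetition is the training loop; in TensorFlow.js it is a model.fit call. A sketch, assuming model, xs (image tensors), and ys (one-hot labels) already exist and the model is compiled, as in the assembled sketch further below:

```ts
// `model`, `xs`, and `ys` are assumed to exist; the model is assumed
// to be compiled with a loss and an optimizer.
await model.fit(xs, ys, {
  epochs: 10,
  callbacks: {
    onEpochEnd: (epoch, logs) => console.log(`epoch ${epoch}: loss=${logs?.loss}`),
  },
});
```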

The model described above is a CNN (Convolutional Neural Network), which works like a smart filter for images. It learns to recognize patterns and features, like shapes and colours, enabling it to understand and classify objects in an image input. Used in image recognition, it is trained on labeled images to identify and categorize various visual elements, making it a key technology for tasks such as identifying objects in photos or videos. It understands and remembers the important things in pictures and uses them for classification. This is the robot processing and reporting back, “I think this picture is a hotdog!” or “That one looks like a pizza slice!”
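
To tie the steps together, here is a toy version of such a CNN in TensorFlow.js. The layer sizes and the three-class output are illustrative, not the architecture of any particular production model:

```ts
import * as tf from '@tensorflow/tfjs';

// A toy CNN assembling the pieces above: convolutions find patterns,
// pooling keeps the strong signals, and a softmax layer casts the vote.
const model = tf.sequential();
model.add(tf.layers.conv2d({ inputShape: [224, 224, 3], filters: 16, kernelSize: 3, activation: 'relu' }));
model.add(tf.layers.maxPooling2d({ poolSize: 2 }));
model.add(tf.layers.conv2d({ filters: 32, kernelSize: 3, activation: 'relu' }));
model.add(tf.layers.maxPooling2d({ poolSize: 2 }));
model.add(tf.layers.flatten());
model.add(tf.layers.dense({ units: 3, activation: 'softmax' })); // pizza / hotdog / ramen
model.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy', metrics: ['accuracy'] });
```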

Demo

This sounds really darn cool. So how would an engineer build one?

Luckily, there are great examples on the web showing how to build something like this. I tried it out myself and built a very simple React app that does exactly this, entirely within the browser, using a pre-trained image-classification model from TensorFlow, with no round trips to any servers aside from loading the model files.

So the application flow looks like the following:

The steps are:

  1. The user loads the application.
  2. The application loads the pre-trained image-classification CNN model. This model is interesting because MobileNet is designed to run using only the computing power of a mobile device.
  3. The user supplies an image.
  4. The browser renders the uploaded image blob. Once it has rendered and loaded, the image data is sent for prediction against the loaded image-classification model.
  5. The results are returned with confidence levels and rendered as feedback to the user (a sketch of this flow follows the list).
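
Here is a condensed sketch of steps 2 through 5. It assumes the @tensorflow-models/mobilenet package, and the element names and handler are illustrative:

```ts
import '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';

// Step 2: load the pre-trained MobileNet model once, at app start-up.
// This is the only network round trip; inference itself stays in the browser.
const modelPromise = mobilenet.load();

// Steps 3-5: render the uploaded blob into an <img>, then classify it.
async function handleUpload(file: File, imgEl: HTMLImageElement) {
  imgEl.src = URL.createObjectURL(file);
  await imgEl.decode();                          // wait until the image has loaded
  const model = await modelPromise;
  const predictions = await model.classify(imgEl);
  console.log(predictions);                      // [{ className, probability }, ...]
}
```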

CodeSandbox example

Live demo:

Following are some sample screenshots of results from the demo SeeFood application. I uploaded a couple of food images I found on the web, and it was able to “confidently” predict what each one contained.

Demo application using the image-classification model to predict the food in uploaded images

Parting thoughts

Image classification stands out as a clever means of user interaction with a computer interface. The seamless operation of these models in browsers and on mobile devices opens up a realm of possibilities for user input. In this demonstration, I focused on static images for the computer to predict. Imagine the engagement for end users if this capability were extended to live video streams. It is also worth noting that the showcased demo employed a generically trained image-classification model from TensorFlow, not a specialized food Convolutional Neural Network (CNN) trained on a dataset like Food101. With training tailored to such specific needs, the results would likely be surprisingly accurate. This is just the initial iteration in recreating a mimicked SeeFood app. Future articles will delve into training these models and integrating them with TensorFlow.js.
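
As a hedged sketch of that video idea, the same MobileNet model could classify webcam frames in a loop. The function and loop shape here are illustrative:

```ts
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';

// Classify live webcam frames instead of a static upload.
async function classifyStream(video: HTMLVideoElement) {
  const model = await mobilenet.load();
  const webcam = await tf.data.webcam(video);    // wires the <video> to the camera
  while (true) {
    const frame = await webcam.capture();        // one frame as a tf.Tensor3D
    const [top] = await model.classify(frame);   // top prediction for this frame
    console.log(`${top.className}: ${top.probability.toFixed(2)}`);
    frame.dispose();                             // free the frame's GPU memory
    await tf.nextFrame();                        // yield so the UI stays responsive
  }
}
```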

Written by Kevin Wong

Software Engineer and Technology Enthusiast based out of Vancouver, British Columbia. (https://www.linkedin.com/in/kevinkswong/)
