A robot manipulating objects while, say, working in a kitchen would benefit from understanding which items are made of the same materials. With this knowledge, the robot would know to exert a similar amount of force whether it was picking up a small piece of butter from a shadowy corner of the counter or an entire stick from inside a brightly lit refrigerator.
Identifying objects in a scene that are composed of the same material, known as material selection, is a particularly challenging task for machines because the appearance of the material can vary dramatically depending on the object’s shape or lighting conditions.
Scientists at MIT and Adobe Research have taken a step toward solving this challenge. They developed a technique that can identify all the pixels in an image that represent a given material, which is shown in a pixel selected by the user.
The method is accurate even when objects vary in shape and size, and the machine learning model they developed isn’t fooled by shadows or lighting conditions that can make the same material look different.
Although they trained their model using only “synthetic” data, which are generated by a computer that modifies 3D scenes to produce many varying images, the system works effectively on real indoor and outdoor scenes it has never seen before. The approach can also be used for videos: once the user identifies a pixel in the first frame, the model can identify objects made of the same material throughout the rest of the video.
In addition to applications in scene understanding for robotics, this method could be used for image editing or incorporated into computational systems that infer the parameters of materials in images. It could also be used for content-based web recommendation systems. (Perhaps a shopper is searching for clothing made from a particular type of fabric, for example.)
“Knowing what material you’re dealing with is often quite important. Although two objects may look similar, they can have different material properties. Our method can make it easier to pick out all the other pixels in an image that are made of the same material,” says Prafull Sharma, a graduate student in electrical engineering and computer science and lead author of a paper on the technique.
Sharma’s co-authors are Julien Philip and Michaël Gharbi, research scientists at Adobe Research; and senior authors William T. Freeman, the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Frédo Durand, a professor of electrical engineering and computer science and a member of CSAIL; and Valentin Deschaintre, a research scientist at Adobe Research. The research will be presented at the SIGGRAPH 2023 conference.
A new approach
Existing material selection methods struggle to accurately identify all pixels representing the same material. For example, some techniques focus on whole objects, but a single object can be composed of many materials, such as a chair with wooden arms and a leather seat. Other methods may use a predetermined set of materials, but these often have broad labels such as “wood,” even though there are thousands of varieties of wood.
Instead, Sharma and his colleagues developed a machine learning approach that dynamically evaluates all pixels in an image to determine the material similarities between a user-selected pixel and all other regions of the image. If an image contains a table and two chairs, and the chair legs and the tabletop are made of the same type of wood, their model can accurately identify those similar regions.
Before the researchers could develop an AI method to learn how to select similar materials, they had to overcome several hurdles. First, no existing dataset contained materials that were labeled finely enough to train their machine learning model. The researchers created their own synthetic dataset of indoor scenes, which included 50,000 images and more than 16,000 materials randomly applied to each object.
“We wanted a database where each individual type of material was listed independently,” says Sharma.
With the synthetic dataset in hand, they trained a machine learning model for the task of identifying similar materials in real images, but it failed. The researchers realized that distribution shift was to blame. This occurs when a model trained on synthetic data fails when tested on real-world data that can be very different from the training set.
To solve this problem, they built their model on top of a pre-trained computer vision model that had seen millions of real images. They drew on that model’s prior knowledge by leveraging the visual features it had already learned.
“In machine learning, when you use a neural network, it usually learns the representation and the process of solving the task together. We have disentangled this. The pre-trained model gives us the representation, and then our neural network just focuses on solving the task,” he says.
The researchers’ model transforms the generic, pre-trained visual features into material-specific features, and does so in a way that is robust to object shapes and varied lighting conditions.
The model can then compute a material similarity score for every pixel in the image. When a user clicks a pixel, the model figures out how close in appearance every other pixel is to the query. It produces a map where each pixel is ranked on a scale from 0 to 1 for similarity.
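The scoring step can be illustrated with a small sketch. It is not the researchers’ implementation: the article does not say how the score is computed, so cosine similarity rescaled to the 0-to-1 range is an assumption here, and the function names and the tiny two-dimensional feature vectors are hypothetical stand-ins for the model’s material-specific features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_map(features, query):
    """Score every pixel against the clicked pixel on a 0-to-1 scale.

    features: H x W grid (nested lists) of per-pixel feature vectors.
    query: (row, col) of the pixel the user clicked.
    """
    query_row, query_col = query
    query_feature = features[query_row][query_col]
    # Cosine similarity lies in [-1, 1]; rescale it to [0, 1] (assumed choice).
    return [
        [(cosine(feature, query_feature) + 1.0) / 2.0 for feature in row]
        for row in features
    ]


# Toy 2 x 2 "image" with hypothetical 2-D material features.
toy_features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [1.0, 0.0]],
]
toy_scores = similarity_map(toy_features, (0, 0))
```

In this toy grid, the clicked pixel and the bottom-right pixel share the same feature vector and so receive the top score, while the bottom-left pixel, whose features are unrelated, lands in the middle of the scale.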
“The user simply clicks on a single pixel, and then the model will automatically select all regions that have the same material,” he says.
Since the model provides a similarity score for each pixel, the user can fine-tune the results by setting a threshold, such as 90 percent similarity, and receive a map of the image with those regions highlighted. The method also works for cross-image selection: the user can select a pixel in one image and find the same material in a separate image.
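Thresholding a score map of this kind is simple to sketch. The example below assumes the scores are already available as a grid of values between 0 and 1; the function name `select_regions` and the toy scores are hypothetical and only illustrate how a stricter or looser threshold changes the selection.

```python
def select_regions(scores, threshold=0.9):
    """Return a binary mask marking pixels whose score meets the threshold."""
    return [[score >= threshold for score in row] for row in scores]


# Hypothetical similarity scores for a 2 x 3 image.
toy_scores = [
    [1.00, 0.95, 0.40],
    [0.92, 0.30, 0.10],
]

strict_mask = select_regions(toy_scores, threshold=0.9)   # 90 percent similarity
loose_mask = select_regions(toy_scores, threshold=0.35)   # more permissive
```

Lowering the threshold grows the highlighted region, which is the kind of control the article describes the user as having.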
In experiments, the researchers found that their model could predict regions of an image that contain the same material more accurately than other methods. When they measured how well the prediction compared to the ground truth (that is, the actual areas of the image made up of the same material), their model matched with about 92 percent accuracy.
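As an illustration of comparing a prediction with ground truth, the sketch below computes plain pixel-wise accuracy between two binary masks. The article does not define the metric behind the 92 percent figure, so this choice is an assumption, and the masks are made-up toy data rather than results from the paper.

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels where the predicted mask agrees with the ground truth."""
    pairs = [
        (p, t)
        for pred_row, truth_row in zip(predicted, truth)
        for p, t in zip(pred_row, truth_row)
    ]
    if not pairs:
        raise ValueError("masks must contain at least one pixel")
    return sum(p == t for p, t in pairs) / len(pairs)


# Hypothetical 2 x 3 masks: 1 marks "same material as the query pixel".
predicted_mask = [[1, 1, 0], [1, 0, 0]]
truth_mask = [[1, 1, 0], [0, 0, 0]]
toy_accuracy = pixel_accuracy(predicted_mask, truth_mask)
```

Here five of the six pixels agree, so the toy accuracy is five sixths.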
In the future, they want to improve the model so that it can better capture the fine details of objects in the image, which will increase the accuracy of their approach.
“Rich materials contribute to the functionality and beauty of the world we live in. But computer vision algorithms typically ignore materials, focusing heavily on objects instead. This work makes important contributions to image and video content recognition under a wide range of challenging conditions,” says Kavita Bala, dean of Cornell’s Bowers College of Computing and Information Science and professor of computer science, who was not involved in this work. “This technology can be very useful for both end consumers and designers. For example, a homeowner can visualize how expensive choices like reupholstering a couch or changing the carpet in a room might appear, and can be more confident in their design choices based on these insights.”