Richard Jacobs: Hello. This is Richard Jacobs with the Future Tech podcast. My guest is Adria Recasens of the MIT Computer Science and Artificial Intelligence Laboratory, and we will be talking about a new AI system that deals with speech and object recognition all at once. So, Adria, thanks for coming. How are you doing?
Adria Recasens: Good. How are you?
Richard Jacobs: Good, good. So tell me about this AI system. How do you combine speech and object recognition? What's the whole premise of the thing?
Adria Recasens: So basically this work started as part of an experiment to see if we could actually learn to match objects in images with the spoken word, which basically means the words that people use to describe an object. The idea is that if you have someone describing an image, for us as humans it's actually very easy to match the spoken words people use to the particular object they describe.
However, machines are not as good as us at doing this. So basically we wanted to see if we could train a machine learning system to do it.
The task is basically to match descriptions of images with the images themselves, and through that the system should be able to actually learn some of these relations. So basically we were asking ourselves, okay, if we give it half a million images with their corresponding descriptions, is the system going to learn what a person looks like, or what a ball looks like, or what a tree looks like? That's the basic premise of the work. We just use the raw audio signal. There is a lot of work in computer vision, and in this kind of technology generally, that uses text, but we wanted to tackle the problem purely from the raw audio perspective. So we used the raw audio signal, we used a large dataset of images, we had lots of people describe those images for us, and basically we trained a system to learn the correspondence between the images and the descriptions.
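To make the training idea concrete, here is a minimal sketch of the kind of audio-visual matching being described. It assumes a dual-encoder setup: an image encoder that produces a grid of local features, an audio encoder that produces a sequence of frame features, a dot-product "matchmap" between them, and a margin ranking loss against mismatched pairs. These specifics are illustrative assumptions, not necessarily the exact recipe used in the work discussed here.

```python
import numpy as np

def matchmap(image_feats, audio_feats):
    """Dot-product similarity between every image location and every audio frame.

    image_feats: (H, W, D) grid of local image embeddings.
    audio_feats: (T, D) sequence of audio-frame embeddings.
    Returns a (T, H, W) "matchmap" of similarities.
    """
    return np.einsum('td,hwd->thw', audio_feats, image_feats)

def pair_similarity(image_feats, audio_feats):
    """Collapse the matchmap to one score: best image location for each
    audio frame, averaged over time."""
    m = matchmap(image_feats, audio_feats)
    return m.max(axis=(1, 2)).mean()

def ranking_loss(sim_true, sim_wrong_image, sim_wrong_audio, margin=1.0):
    """Margin ranking loss: a true image/description pair should outscore
    pairs built with a mismatched image or a mismatched description."""
    return (max(0.0, margin - sim_true + sim_wrong_image) +
            max(0.0, margin - sim_true + sim_wrong_audio))
```

In a setup like this, the only supervision is whether an image and a spoken description belong together; no transcripts, labels, or bounding boxes are needed, which is the "raw signal" point above.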
Richard Jacobs: So I would look at a picture and I would say, hey Adria, you see that in the bottom left corner? That purple thing that's a little bit shiny? It looks like a square lying on its side or something.
I would say something like that, and the system would listen to me describe the object, and then it would label it and identify it.
Adria Recasens: Yeah. Basically the idea is that if you have a picture and you say something like, oh, there is a chair next to the table in the center of the image, the system should be able to tell you, okay, you are talking about this part of the image, about this chair. And it's actually learning what "chair" means, because basically the only way it has to do the matching between the image and the speech is to understand the relationship between the objects you're describing and how they look in the image.
We actually do some analysis on how many objects the system learns and how it learns them, so basically which kinds of objects are the best ones for the system to ground these spoken words in. But yeah, that's the basic use case: you describe some parts of the image, and the system learns to tell you, okay, you are actually talking about this part or that other part of the image.
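Continuing the sketch above (and reusing its matchmap helper), localization can be read off the same similarities: average them over the audio frames that cover a spoken word and keep the image locations that respond most strongly. The frame indices and threshold here are hypothetical values chosen for illustration.

```python
def localize(image_feats, audio_feats, t_start, t_end, keep_fraction=0.5):
    """Rough localization sketch: average the matchmap over the audio frames
    covering one spoken word, then keep image locations whose response is
    within keep_fraction of the peak. Returns a boolean (H, W) mask."""
    m = matchmap(image_feats, audio_feats)   # (T, H, W) similarities
    heat = m[t_start:t_end].mean(axis=0)     # (H, W) response for that word
    return heat >= keep_fraction * heat.max()
```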
Richard Jacobs: What would be the use of this? I could see it maybe if you're presenting to someone, you know, instead of drawing over an image you can describe things, but what would be the main use cases? Would it catalog, say, X-rays of patients and have doctors talk about them, or how would you use it?
Adria Recasens: So I think there are a lot of different uses. One use would be a faster way of annotation. In computer vision there is this problem where we have to annotate large datasets. Usually what people do is either draw bounding boxes on the image or describe it with text. But spoken annotation is actually way cheaper, because you're just asking someone, okay, can you describe it? They take ten seconds describing it and move on to the next image.
And then we showed that you are able to use this information to learn the content of the image that corresponds to the audio. So you can actually learn from it, which means you have a cheaper way to annotate at the same time.
This was more of an experiment in itself, to see whether we can actually learn these groundings between audio and image, and it seems to be working fairly well. And then you can also think about systems that assist humans.
So either in cars, or at home, or basically robots of all different kinds, where you describe a particular object to the robot just with your voice and the system is able to actually pick it up and say, okay, you are talking about this object and I have to do something with it. And this something could be, you say, okay, turn on the light next to the table in the living room, or it could be, okay, bring me the glass of water next to the chair and leave it here. So all these references, spatial references, in our speech can actually be captured by the system and used by, let's say, robots to help humans. So I think there are lots of different variations of applications that I think are cool and interesting.
Richard Jacobs: So would you train a system with your particular way of describing things, or would you want, you know, a pre-trained system that a new person can just start using and it would work?
Adria Recasens: Yeah, you can imagine different scenarios where you have a pre-trained system trained on a lot of general descriptions, or, if you want to go into a particular setting that is very specific, like you have a very particular way of describing things, then starting from a pre-trained system you can kind of retrain the system for your particular needs, with your particular way of describing things. So, you know, the idea is that the system is able to ground speech with whatever it sees in an image.
So you could feed it either data collected from a lot of people, or your own data if you want your system to keep learning about how you describe things. So there are lots of different settings that could be interesting here.
Richard Jacobs: So where is this being used right now? Or is it at an early stage of development?
Adria Recasens: So we started this project to understand the relation between human speech and the images that people are describing, and that's basically where the idea is at this point. We started with this question: okay, if you give a lot of images and their descriptions to this machine learning system, is it actually going to learn to detect these particular objects, because it learns to relate them to the spoken words that are being used? And we found that it does, so now we are actually exploring further.
There are different uses of the system, and potential extensions or improvements of the system, so that it would work better or work in different settings. I think there are a lot of explorations, like: would it work in different languages? Can you use this to learn languages? Can you ground different languages and relate them? So there are other things that could be done with the system. And yeah, there are a few public videos of this, and we hope to grow it, to keep learning more about how the system works, and then to improve it.
Richard Jacobs: Well, how would that work? How would it help you learn a language? Is it that if you don't speak the native language, you could describe things and it would still work and show the objects? I mean, how do you envision this working?
Adria Recasens: There are a couple of use cases. One interesting one is for languages that don't have a written form. There are a lot of languages in the world that you actually cannot write, because they don't have a written form, they are only spoken. This system would learn basically to relate whatever the person is saying in the audio waveform to the objects in the image. So you would be able to create some kind of dictionary between the spoken word, the phonetics, and whatever it refers to. That's one option. And the other one is that you can imagine that the way for the system to learn a language is that you describe things to it.
So in some sense the system is learning words in English: when we give it, in our case, half a million image descriptions, and it's people saying, oh, there is a girl next to a lighthouse, or there is a sheep in the field, it's learning to detect these concepts and to ground them in the image. So it's kind of learning the very basic notions of English. You could imagine that you somehow use this basic training with English, and then, if you want to specialize it to some other language, you could start describing things in that language and the system would start learning this different language.
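As a purely illustrative sketch of the "dictionary" idea, one could pair each spoken-word segment with the image regions it matches best, using embeddings like the ones in the earlier sketches. The names and shapes below are assumptions for illustration, not part of the system described in the interview.

```python
import numpy as np

def build_audio_visual_dictionary(word_embeddings, region_embeddings, top_k=3):
    """For each spoken-word segment embedding, find the image regions it
    matches best. Pairing words with their top regions gives a rough
    spoken-word-to-visual-concept 'dictionary' with no text involved.

    word_embeddings:   dict of {word_id: (D,) audio-segment embedding}
    region_embeddings: (N, D) array of image-region embeddings
    Returns {word_id: indices of the top_k best-matching regions}.
    """
    dictionary = {}
    for word_id, emb in word_embeddings.items():
        scores = region_embeddings @ emb              # dot-product similarity
        dictionary[word_id] = np.argsort(scores)[::-1][:top_k].tolist()
    return dictionary
```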
Richard Jacobs: Gotcha! What platforms would this be aimed at? Would this eventually be an app on a smartphone, or would it have to run on something else first?
Adria Recasens: So I think we're not thinking that far at this point. We are still in an exploration phase; it's a research project. We want to understand the system itself, we want to understand its weaknesses and its strong points, and we basically want to try to make it better, so that it understands more things in the image. There are a lot of nice open directions to go. It's good at detecting objects, but we are also looking at what else the system can learn: can it learn what red means, what blue means, what black means, what small means? I think that is a line we are interested in. But at this point we are not considering how to create an app. We have a demo, which is nice, but it's just for our own use, to see how the system works. So yeah, we are not discussing possible platforms at this point.
Richard Jacobs: What would be your ideal? You know, over the next few years, what would you love to see happen?
Adria Recasens: So I think I would love to see robots understanding speech, you know, interacting with humans through voice, kind of without needing any particular help. I would love to see them trained in real settings, where either you are at home and you refer to something and the robot acts on it, like it turns on the light or brings you something around the home, or I would like your car to do the same, basically understand whatever you're saying and interact with you. So among the potential uses, I think this speech interaction with machines is a very good start. We showed that the system is able to detect some particular objects very well, and in the follow-up we show that we are able to detect some words as well. Basically, I think language is very complex and this is just the beginning. I would love to see how we can actually improve our understanding of spoken language, and with that, actually impact how applications that interact with humans work, because if you have a better understanding of spoken language, you can basically interact better with humans.
Richard Jacobs: Okay! I guess the focus is not just on describing things in an image, but on training the system to the point where any spoken command or description is understood much more accurately, whether by a robot or a device. Let's say Siri and Alexa had this advanced recognition. I mean, they'd be a lot more powerful if you could speak to them well. They'd be a lot more powerful in what they could do for you, because they would understand far more.
Adria Recasens: Yeah! So I think part of the point we are trying to convey in this work is that from spoken language you can learn a lot, and of course the systems can get very good at this.
You can imagine, in a future where we improve the method or reduce the amount of data you need for learning, systems trained this way that, as you were mentioning, interact with humans and basically enable this possibility of grounding particular words in whatever is around them, or something like that. And I think it's also relevant in the sense that the particular language you use at home isn't necessarily the same as how other people use the language to refer to things at home. So basically you could specialize these devices to how a particular person talks, and I think that would be useful as well.
Richard Jacobs: Very good, so what’s the best way for people to learn more about this and your work and maybe to get in contact?
Adria Recasens: Okay! So, we have a website for the project where people can download the data if they want to train things with it. My email, of course, is there, and so are the other authors' emails. So if people are interested in this, if they want to chat about it or ask questions, we'll be more than happy to chat, to discuss, and to learn more.
Richard Jacobs: All right, very good, Adria. I appreciate you coming on the podcast. Thank you so much!
Adria Recasens: Okay! Thanks for inviting me.