
One of the hottest topics in tech news is AI (Artificial Intelligence). It seems like every day there’s a new AI-based text generator, face changer, or voice modulator. From ChatGPT to DALL-E 2, a new age of computer helpers is upon us, and it grows more common by the day. Our interactions with the digital world are starting to feel more like natural conversation than typed search terms.

For kids who grew up in the ’80s and ’90s, The Matrix was a groundbreaking movie that introduced many to the concept of an interconnected metaverse of machines. One of the movie’s more notable aspects is its special effects, which fundamentally changed how we think about visual recreations of our world.

The man responsible for those effects is John Gaeta. He won an Academy Award for his work on The Matrix, but he didn’t stop there. Since then, he has worked extensively in the worlds of virtual reality and AI. Recently, I sat down with John to discuss the future of gaming, artificial intelligence, and whether we’ll be living in the Matrix within the next 20 years.

JM: Starting out, can you give us a quick overview of AI?

John Gaeta: Well, first of all, I do assume you know who you’re talking to, right? <laugh> I work with world-class engineers and inventors. I love being in that crowd. They’re very forgiving with me insofar as I use naivete and ignorance, to some degree, to cut through what are sometimes perceived as limits. They like the relationship as well because they want someone to provoke and prompt them toward something to reach for.

My background begins with storytelling - with cinema and photography. The old paradigm. And then through that, I wound up falling sideways into hardcore digital effects. I’m, to some degree, closing this circle. Where I began purely sort of trying to envision and visualize words that described the sublime nature of AI and virtual reality and all things of that nature. Simulation as concepts, right? Hack the concepts visually ’til enough time has gone by where we’re all collectively kind of closing the circle now, and we’re starting to see the actual manifestation of some of these things that we unpacked in sci-fi storytelling years ago.

At the moment, deep mimicry is what it’s capable of doing. It’s not necessarily aware unless humans help it be aware through all manner of sensing and guidance to allow it to know what actually is happening. It’s a sort of illusion of general intelligence at the moment. That’s not my goal. My interest is not pursuing general intelligence. In fact, I’m afraid of it. I’m not sure yet what that is gonna be like, and whether it’s manageable, but I would describe what we’re doing as very narrowly focusing on using large language models. Natural language processing. We’re trying to narrow it down and focus it into the meaningful portrayal of characters. Characters that I would consider good characters.

How would you define a good character?

There are one-dimensional characters that let us down, and we say, “Okay, that was not interesting.” Then you have three-dimensional characters that seem to have layers of sophistication buried under the surface. Motives and things of that nature. Things that you would attribute to a brilliant performance of a great character by an actor. We aspire to narrowband our use of these large language models to try to understand how to do deep character. That’s our goal. To have fun and entertain. We’re not pursuing AI for management of systems and productivity, and the countless use cases that people are trying. We’re really trying to stay focused on how to create a really compelling story or gameplay experience that is genuinely engaging and can lift a few mediums in the meantime.

Would you mind telling us a bit about what you folks are doing at Inworld AI? I know the company focuses on better, deeper AI for NPCs in games that goes beyond being just a chatbot.

There are a lot of fascinating things that people probably don’t realize are possible. It’s not just a conversation. The characters can be made to understand their own history. They can have a disposition that you can tweak. You can make them introverts or extroverts. You can create, essentially, a formula of emotional balance.

There are all sorts of interesting knobs to play with. And the knobs are based on pattern recognition in large language models about how people, or characters, can be diverse, and react in a diversified way in different scenarios. You can tune the system to look for these attributes that the creator puts in there. We’re working on an experimental function that will enable us to, at runtime, create live feeds of information that can come into the character. So, to some degree, it’s like a state machine that can be updated.
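
As a rough illustration of those disposition “knobs” and runtime updates - the class names, fields, and methods below are invented for this article, not Inworld’s actual API - a character sketch in Python might look something like this:

```python
from dataclasses import dataclass, field


@dataclass
class Disposition:
    """Tunable 'knobs' that bias how the character reacts."""
    extraversion: float = 0.5   # 0.0 = strong introvert, 1.0 = strong extrovert
    aggression: float = 0.2     # how quickly the character escalates conflict
    optimism: float = 0.7       # emotional baseline the dialogue is steered toward


@dataclass
class Character:
    name: str
    backstory: str
    disposition: Disposition = field(default_factory=Disposition)
    live_facts: list[str] = field(default_factory=list)  # runtime updates, state-machine style

    def ingest_event(self, fact: str) -> None:
        """Push a live-feed item (e.g. the current score of a match) into the character's context."""
        self.live_facts.append(fact)

    def build_context(self) -> str:
        """Assemble what an underlying language model would be conditioned on."""
        knobs = (f"extraversion={self.disposition.extraversion}, "
                 f"aggression={self.disposition.aggression}, "
                 f"optimism={self.disposition.optimism}")
        return (f"{self.name}: {self.backstory}\n"
                f"Disposition: {knobs}\n"
                f"Known right now: {'; '.join(self.live_facts)}")


announcer = Character("Rix", "A veteran e-sports commentator with a dry sense of humor.")
announcer.ingest_event("Blue team just took the lead, 2-1, in overtime.")
print(announcer.build_context())
```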

So that characters can change as the world progresses?

A year ago, most of the common knowledge and the background knowledge that a character had about itself was curated. You can put in hundreds of thousands of entries. I mean, it’s not a small amount. You can put in quite a lot. But what’s most exciting is awareness at runtime. Let’s say it’s an AI character related to some sports event, and it can understand what’s happening from moment to moment. It could understand the state of a game at the moment of your interactions. And so, you can start to inject real-time awareness and context. These are all features we’re working on in earnest, which will make the characters very dynamic.

We’re talking about more than conversation. We’re, to some degree, talking about a very simple brain. What Inworld is doing is essentially orchestrating a cluster of different machine learning models that do different things. There’s one completely devoted to detecting emotional states through the language of what’s being spoken about. But you can add other inputs. You could, for example, add a sort of computer vision. The posture of a person standing before a camera could be interpreted as friendly or hostile, et cetera. You can put in a lot of different inputs. There are other kinds of environmental inputs. Think of a theme park that has all sorts of interesting sensors that could be drawn upon to determine what emotion might be happening.
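
To sketch what orchestrating several detectors might look like - the detectors below are crude stand-ins, not the actual models Inworld runs - you can imagine fusing signals from text, vision, and environment into one emotional read:

```python
from dataclasses import dataclass


@dataclass
class Signal:
    source: str       # "text", "vision", "environment"
    emotion: str      # e.g. "friendly", "hostile", "neutral"
    confidence: float


def text_emotion(utterance: str) -> Signal:
    """Toy stand-in for a model that reads emotion out of what is being said."""
    hostile = any(w in utterance.lower() for w in ("hate", "stupid", "fight"))
    return Signal("text", "hostile" if hostile else "friendly", 0.7)


def posture_emotion(posture: str) -> Signal:
    """Toy stand-in for a vision model judging body language in front of a camera."""
    return Signal("vision", "hostile" if posture == "squared_up" else "friendly", 0.5)


def fuse(signals: list[Signal]) -> str:
    """Weight each detector by its confidence and pick the dominant emotion."""
    scores: dict[str, float] = {}
    for s in signals:
        scores[s.emotion] = scores.get(s.emotion, 0.0) + s.confidence
    return max(scores, key=scores.get)


state = fuse([text_emotion("I hate waiting in this line"), posture_emotion("relaxed")])
print(state)  # the character would react to "hostile" vs "friendly" differently
```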

There are additional things worth noting, insofar as the character’s mind can do more than speak. It can also send signals that say how a character is feeling or what the character is thinking about right now. And with that signal, you can fire off any kind of blueprint. Let’s say you have a Batman game, and something occurs in the mind of Batman, if that’s your character - a sense that danger is near.

And that can basically fire off a whole new chain of events that a game designer could construct. Batman will run or find a bad guy or start fighting. Or start to talk to the bad guy while fighting. Or driving. You can start having this overlap of action and so much more. So, there are goals and motivations. There are hidden motives that could be embedded into the mind of the character. Like an improv actor. So, imagine an improv session like “Whose Line Is It Anyway.” They have secret motives, like they are trying to achieve X and you are trying to achieve Y. Neither person knows the other’s motivation, right? Then they’re basically left to try to improv their way through their hidden motive. You can bury things under the surface of these characters that can influence not only what they say, but how they behave.
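
The “signal fires off a blueprint” idea can be sketched as a simple callback registry - the signal names and handlers here are invented for illustration, not taken from any real game:

```python
from typing import Callable

# Registry mapping a mind-state signal to the designer-built actions it triggers.
blueprints: dict[str, list[Callable[[], None]]] = {}


def on_signal(name: str):
    """Register a chain of actions for a given mind-state signal."""
    def register(fn: Callable[[], None]):
        blueprints.setdefault(name, []).append(fn)
        return fn
    return register


def emit(name: str) -> None:
    """Fire every action a designer attached to this signal."""
    for action in blueprints.get(name, []):
        action()


@on_signal("danger_near")
def take_cover() -> None:
    print("Batman ducks behind cover and scans for the threat.")


@on_signal("danger_near")
def taunt_while_fighting() -> None:
    print("Batman keeps talking to the thug as the fight starts.")


# The character's "mind" (the language-model side) decides danger is near
# and emits the signal; the game logic reacts.
emit("danger_near")
```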

Your character databases are trained on massive amounts of examples, but at a certain point, the developers have to teach the AI how to view this info in a relative way. In your development suite, there are a lot of sliders to mess with to set a character’s disposition. That’s kind of the magic part of the whole process to me.

You can create a character with one sentence, literally. It’s that amazing. What’s genuinely incredible about the Inworld platform is that it offers a spectrum of control. If you are just a writer who knows nothing about programming, and that’s not what you do, you can write your way to a super compelling character very quickly. But if you are a game designer - you design procedural character-type stuff in other kinds of games - you are actually afforded the means to get really granular and code almost instruction-like goals, in a way that works like IF/THEN statements, right? Like, if this happens, then that happens.

It’s appealing to an unusual collection of skill types. And what it actually does, in my opinion, is that it brings creative and technology-minded people into a central place in a way that we haven’t quite seen before. You can use natural language, regular, everyday language, to produce a behavioral real-time animating character. And then, if you want to get really fancy, the character can have true interplay with a larger game - because a lot of the time you’ll be making populations of characters that have all sorts of roles and relationships to one another and to the users in the game. There’s really a wide lane for a lot of different skill sets to get in and experiment with.
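
Here is a small sketch of that spectrum of control, under my own assumptions rather than Inworld’s real authoring format: a writer stops at one sentence, while a designer layers IF/THEN-style rules on top:

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    condition: str   # e.g. "player asks about the war"
    action: str      # e.g. "grow quiet and change the subject"


@dataclass
class AuthoredCharacter:
    description: str                                   # the writer's one sentence
    rules: list[Rule] = field(default_factory=list)    # the designer's optional layer


# Writer-only authoring: one sentence is enough to stand up a character.
innkeeper = AuthoredCharacter("A cheerful innkeeper who gossips about every traveler.")

# Designer layering on granular, IF/THEN-style behavior.
innkeeper.rules.append(Rule("player asks about the war", "grow quiet and change the subject"))
innkeeper.rules.append(Rule("player buys a round", "reveal one rumor about the missing caravan"))

print(innkeeper.description, f"({len(innkeeper.rules)} behavioral rules)")
```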

With Inworld you have the ability to make characters with archetypes and overall general knowledge that can be shared between the NPCs in a game. With that in mind, how fast can you iterate on those characters and, say, make a village?

We’ve been learning a lot of interesting techniques that are not complicated to understand. These are the kinds of insights that, once people grok them, let them get very masterful quickly. This is one way to look at it. Let’s say you have a world, and in the world, there are all different types of groups and roles within groups. You could essentially craft an archetype of a character.

Something like, “you’re going to be a knight, you’re going to be a shopkeeper, you’re going to be a peasant.” Whatever. You can craft essentially a base-level archetype of what a common one of those kinds of characters might do in their role. Let’s say there’s gonna be a thousand knights that live in this kingdom and there’s the knowledge of the kingdom. Like what it is, how it works, et cetera. And all of the common knowledge about the kingdom and ongoing events. All of the other knights could learn about the events as time goes on.

You can really have fun. You can set the persona, name, memories, and background of this knight. Then you craft the persona on top of a pre-existing archetype. And you can be very efficient. Once you have common knowledge, you can distribute that across a population of characters. And then, whether it’s a creator, a team of writers, or users themselves, you can basically do the fun part of creating the persona and the disposition, which could be done in literally a sentence if you want it to be so simple.

So, there are methods to devise a society of AI characters. Think about a property like Star Wars - you know about Jedi, and the different classes of droids because we’ve just seen so many of these movies. You can essentially lay all that stuff out as archetypes, and then you just go in and start devising very specific personas and roles on top.
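
A rough sketch of the archetype idea - the kingdom, names, and roles below are invented for illustration - shows why iteration is fast: common knowledge lives on the archetype once, and each individual only adds a thin persona:

```python
from dataclasses import dataclass, field


@dataclass
class Archetype:
    role: str
    common_knowledge: list[str] = field(default_factory=list)  # shared by the whole population


@dataclass
class NPC:
    name: str
    persona: str             # the one-sentence "fun part"
    archetype: Archetype     # everything common is inherited by reference

    def knows(self) -> list[str]:
        return self.archetype.common_knowledge + [self.persona]


knight = Archetype("knight", ["The kingdom of Veldern is at war with the north.",
                              "Knights answer to Captain Mora."])

# Stamping out a village's worth of knights is cheap once the archetype exists.
garrison = [NPC(f"Knight {i}", persona, knight)
            for i, persona in enumerate(["A boastful duelist.", "A homesick recruit.", "A cynical veteran."])]

# Updating the shared world state reaches every knight at once.
knight.common_knowledge.append("The eastern bridge fell last night.")
print(garrison[1].knows())
```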

When a game has a lot of lore and history, like Skyrim from Bethesda, is there a way for you to train characters off of just that information?

There’s this fourth wall concept where, basically, you decide how narrow you want the character’s understanding of what is in and out of bounds to be. Most of the time, people don’t want their fictitious universe to have any kind of reality leak. You don’t want to think about the real world.

On top of that, there’s a suite of safety tools, where you can determine everything from behavior to just blocking out everything outside of your fictitious universe. Every day of the week, we speak to people who have world bibles, and that’s where they want their canon to be - the edges of the walls of the characters.

Right? So somebody in Skyrim shouldn’t know about linear algebra, for instance.

Unless you want that. With anything, there are going to be edge cases where you may decide that you want to do something like that.
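
As a toy illustration of that fourth wall - a plain keyword list standing in for the much richer canon tools described above - scoping a character’s knowledge might look like this:

```python
OUT_OF_CANON = {"linear algebra", "smartphone", "the internet"}  # example boundaries only


def within_fourth_wall(topic: str) -> bool:
    """Return True if a fantasy-setting character should engage with the topic at all."""
    return not any(banned in topic.lower() for banned in OUT_OF_CANON)


for topic in ("the Thieves Guild", "linear algebra homework"):
    print(topic, "->", "in bounds" if within_fourth_wall(topic) else "deflect politely")
```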

We were just talking about the safety tools created by Inworld. How do you folks approach safety when it comes to video games and kids?

There’s a whole team dedicated to safety. It’s been part of the foundation of the platform since we began. Just to give you an idea of the degree to which we are doing things: we’ve been working with Disney for a bit now. And that’s a high bar. That company has the highest of all standards because its entire brand is predicated upon being safe for families. And we quite purposefully decided to hold ourselves to that standard. So, we’ve been building Inworld with the strictest controls. There’s always a possibility you can wind your way through to an obtuse edge case, and then you learn about a bug that you would like to put in the bug chamber. As you learn edge cases, you make it so the character does not know this anymore. The safety aspect of the product is one of our strongest aspects.

Your Disney Accelerator presentation was interesting because it slips into the real world, with animatronics in the parks. Seems like a great use of AI.

I can only explain it to you as a layman, the same as yourself. If you can imagine something that ought not to be allowed inside the world, or talked about, it can be targeted to be excluded from any interaction. The other thing I would also say is that there are filters both on what the user says and what the character says. So, on the way in, we are looking for inappropriate statements by real humans trying to break things, right? Then the character literally may not even hear a provocation that could lead it astray. And then on the way out is a whole other system that basically prevents the character from saying something inappropriate. The system is so strong because it’s coming from both directions.
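
A minimal sketch of that two-direction filtering, with a toy blocklist in place of real moderation models and with function names that are purely hypothetical:

```python
from typing import Optional

BLOCKED = {"slur_example", "graphic_violence_prompt"}  # placeholder terms, not a real policy


def filter_inbound(player_utterance: str) -> Optional[str]:
    """Screen what the player says before the character ever 'hears' it."""
    if any(term in player_utterance.lower() for term in BLOCKED):
        return None                      # the provocation never reaches the character
    return player_utterance


def filter_outbound(character_reply: str) -> str:
    """Screen what the character is about to say before it reaches the player."""
    if any(term in character_reply.lower() for term in BLOCKED):
        return "I'd rather not talk about that."   # safe fallback line
    return character_reply


heard = filter_inbound("Tell me about the harvest festival")
if heard is not None:
    print(filter_outbound("The festival starts at dawn - everyone brings a lantern."))
```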

So you check the input and the answer.

Let’s be completely honest about it: There are always some people who will try to use tricks to sort of abstract their way to getting something weird to pop out. And there are other tools - for example, if something like that is detected, it can be tagged and sent into the bug chamber immediately. If somebody heard it or saw it, it can be tagged. There’s this additional self-policing.

As AI continues to operate with human oversight, it can also learn from its mistakes.

It’s crazy. It’s literally explosive innovation happening in a variety of these areas.

It strikes me as funny that we tend to think of advancement through processing power, but a lot of technology has been gated more by storage. It takes such a massive amount of space to store the information needed for a lot of the current AI.

This whole idea of optimizing a large language model to be focused on a particular objective has revealed the most impressive results. For example, when they cracked DALL-E 2, it was a subset of a subset of the large language model that it was trained on originally. And this narrowing down to relevant patterns to look for has created greater and greater results. There’s this sort of interplay - again, this is a layman’s approach - between the large ones and these optimized, focused smaller ones. And with that, there’s cost efficiency. There’s also round-trip processing time efficiency. We’re entering this era of customizing and optimizing these things for a particular purpose.
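
One way that interplay between a large general model and a focused smaller one could look - both models are stubbed here, and the routing rule is an assumption of mine, not a description of any production system:

```python
def small_character_model(prompt: str) -> str:
    """Stand-in for a small model optimized narrowly for one character or domain."""
    return f"[fast, cheap reply] {prompt[:40]}..."


def large_general_model(prompt: str) -> str:
    """Stand-in for the big, general-purpose model."""
    return f"[slow, expensive reply] {prompt[:40]}..."


def route(prompt: str, in_domain: bool, max_words_for_small: int = 200) -> str:
    """Prefer the focused model whenever the request fits its narrow purpose."""
    if in_domain and len(prompt.split()) <= max_words_for_small:
        return small_character_model(prompt)   # lower cost, faster round trip
    return large_general_model(prompt)


print(route("What do you sell today, shopkeeper?", in_domain=True))
print(route("Explain the plot of every book ever written.", in_domain=False))
```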

I keep getting this feeling that we’re wandering into Isaac Asimov’s territory with the tech we’re working on now. Obviously, AI still requires our input and instruction. It’s not self-aware.

No, but you can create an illusion. From the beginning of my career to today, it’s still really pure illusion, like doing visual effects, even making movies is pure illusion. Before that was magic on stages. It’s pure illusion. The suspension of disbelief, as they put it. If you think that you saw a woman get sawed in half, you truly believe it, then it potentially really happened for you. That’s the sort of human side of it. So the question is what is really required to get value out of things? The way we’re using these language-based models is to essentially create the illusion of a sort of thinking character. But it’s not necessarily thinking. It could be made aware by adding sensing and all that stuff. But this is still information input. It’s not cognizant in a way.

It’s not real discovery of its own accord.

There’s this kind of fantasy, to think that the metaverse is going to happen upon us, like a light switch. Who knows how long it will take? A decade-long crossfade, and one day we’ll just wake up and it’s just what we do. I’m not actually saying that we put on VR goggles, because that’s not necessary to have this virtual living space. That can be on the other side of the screens. It can manifest through IoT (Internet of Things) devices.

It can manifest in your autonomous transportation. All those things can, to some degree, talk to each other. And before you know it, you have the beginnings of this idea. All of those things kind of intertwine, so we don’t need to get really hung up on it, but as we go along, it’s so vast and complex just to think about all that could be known. An AI character could know all that is inside the internet, you know? Literally in a moment. So it is interesting to imagine populating game worlds and experiences or even home utility stuff with a companion or a guide. It could be an unconditional protector.

Unconditional meaning it doesn’t work for a company, it works for you. You know it is essentially having a digital sixth sense that you can use to peer into this vastness and make sense of it. So there’s a lot of really good things that can happen, a lot of problems that can be solved. Potentially a lot of good for human beings if used that way.

Like anything, it can be used for good or evil, but the idea of having a character respond to you instead of just running through a dialog tree is very fascinating.

Just to distill it back: what people think right now is that this natural language technology is for conversations alone, but it’s not. It is essentially a mind-state machine. A state of mind. And with this, you can use it to send signals to get characters to act and behave. But what is coming down the pike is that, just like real people, and real characters in a long story or game, we are not static. Things happen in our lives regularly that create growth inside of us.

The best of all characters have growth across the duration of your relationship with them. We want to enable stages of growth that can be bent by milestones of a sort. Relationships could change in the world, and so forth. It’s not all about talking about a subject and then I’m done - when in fact they will become capable of relationships and personal growth on their own.
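
A toy sketch of growth bent by milestones - the stage names and thresholds here are invented - might track relationship stages that dialogue and behavior can branch on:

```python
STAGES = ["stranger", "acquaintance", "ally", "confidant"]


class Companion:
    """A character whose relationship stage deepens as milestone events accumulate."""

    def __init__(self) -> None:
        self.milestones: list[str] = []

    def record_milestone(self, event: str) -> None:
        self.milestones.append(event)

    @property
    def stage(self) -> str:
        # One stage step per milestone, capped at the deepest stage.
        return STAGES[min(len(self.milestones), len(STAGES) - 1)]


guide = Companion()
guide.record_milestone("saved the player from the ambush")
guide.record_milestone("shared the story of their lost home")
print(guide.stage)  # "ally" - dialogue and behavior could branch on this
```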

It’s a crazy world that we live in. A lot of the stuff I imagined as a kid has become reality, and some of it turned out even better. When you were working on The Matrix movies, could you have imagined you’d be doing what you are doing now?

The thing about well-written - or, in the case of movies, well-executed - science fiction is that it literally opens your mind right then and there. That’s the power of great storytelling - it can open your mind, and once it’s open, you’re enabled to continue to explore and imagine on your own. And that’s a gift of some humans to others. Could I imagine it? Yes. Because at the time we were working on that, it felt sublime. Like there was something intuitive about the way it seemed to be portrayed. Because it was not glittery, it was more sublime, and that seemed right.

When we were working on those pictures, the way we tried to execute some of that stuff is we did some of the very beginning of volumetric human capture, which is now everywhere. Then we did the very beginning of image-based rendering. Going out and taking photographs and extracting three-dimensional objects you could run in a computer animation program. It’s like, “hey, there’s that building that really exists outside.” We were able to take pictures and make a version of this that seems similar. We were starting, at that time, to connect the dots. And you really think that downstream there’s going to be a whole new form of camera or capture. It’s going to be some way of ingesting reality and converting it into a digital asset.

And you could start to understand that at some point people will be able to make all of this and get inside it. Because we were halfway inside by just being inside working on it in a visual effects sense. You go from then to today, and a lot of my colleagues back then are inside of these super labs and these real-time companies manifesting the same concepts that we worked on. We’ve been pulled in the direction of the science fiction that we were all mingling in as younger people. It’s been really outrageous.

In the Matrix, there’s this thing called the Construct, the white void. The Construct has this ability to generate reality-based experiences instantly. The locations, the events, the people. We’re literally moving towards the Construct right now. That is my way to summarize, in concept form, what seems to be going on right now in generative everything. Make me a world in seven words. I do think that the Construct is actually going to appear in a literal sense, probably inside the decade.

NVIDIA has this Earth-2 thing that they’re putting together, which is like a twin of the real world. They want to use it for product creation and testing. That’s pretty close to the concept we were discussing.

NVIDIA’s a crazy company. We’re getting close to being able to speak and produce a world and the situations inside.

Oh, yeah. We’re oddly close to the Star Trek holodeck. Oddly close.

The Construct.

I remember a conference where you spoke about AI and game development. At one point you talked about how we’re at the beginning of this new world of AI and interconnection. We’ll be the ones to shape it, but we certainly won’t live long enough to experience it in full.

Every year that we live there’s so much change. There’s no dull moment, right? There’s constant growth and evolution. And it’s always exciting. It’s always interesting to see the beginning of things. But these last couple of years, going into the next couple, it’s kind of a mad time to be alive and paying attention. You know I don’t think the general population of the world realizes what’s coming and how very serious it’s going to be.

And how fast. Machine learning doesn’t have a lot of razzle-dazzle, but when it starts being in large circulation, people are going to be amazed.

Humans only care about the same things they’ve always cared about. They want to be compelled. They want to escape. Living vicariously through the roles and lives of others. That gives a lot of joy to human beings, and that’ll never change. And if, suddenly, we can start serving even deeper sorts of ways of doing that then everyone’s going to do it.

I think it extends to development as well. If you have the choice of creating a character that can interact with you based on the creation of your world, you’re going to do that.

It kinda turns you into a writer or creator. You don’t necessarily have to be a professional writer at all. You could just have an imagination. In a sense, you are incepting the character, but also, you’re directing the character. Sort of like what you would say to an actor as they’re about to do a scene. This is how I want you to approach this. There is a lot of opportunity to do that directing during world-building.

There are scientists who are humble enough to know that they need to listen to creators. Those of us who are in professional entertainment still are in awe and don’t even truly understand how a great writer might orchestrate the journey of a character across a great story. So you need to kind of understand a little bit about that. And if you can get insights like that, you can try to design a mechanism that can replicate that kind of progression. So we’re trying to learn right now, and a lot of what happens next is about being guided by great creators.

It’s a really interesting topic. I know a lot of people, writers included, are afraid of being replaced by machines.

I’m trying to stay narrowly focused on an area that has been sort of static for a long time. Creating these NPCs is laborious. People will get all sorts of wonder and interest in some of these toys, but you still can’t deny that the greatest actors of our time could never be surpassed. They’re just going to be that. So, there isn’t really a need to aim in that direction. What needs to happen is we need to create a deeper exposition.

Everyone always complains they don’t have enough time for exposition. They don’t have vehicles for exposition or world-building fanfiction. These tools are like a support system around the actual human creators. It might be interesting to see generated worlds, but I don’t imagine that they could ever be as interesting as what a human being could create. And going forward, I just don’t see how there would ever be a threat to that. I really don’t.

It’s like when you read a computer-generated article. It looks good at a glance, but at closer examination, you see some cracks. There’s something magical about the human imagination. It’s very unpredictable.

It’s human insight. And I’ll tell you, in the weirdest of ways, what I think it might do is revalue that which is taken for granted. XR (extended reality) makes you re-examine how immensely complex reality is. All you gotta do is spend time in something synthetic, and then you use the same eyeballs to sort of peer at nature in front of you, and you’re like, that is really, truly dumbfoundingly complex and beautiful. So, there is something to that, I think.

In closing, how do you, as a visionary, see the next 10 or 20 years from an AI perspective?

20 years is a good number. It was interesting how you got a little squishy on my 10-year prediction. Then I returned by saying, “oh, prototypes,” right? But 20 years is a serious amount of time. And at the same time, it’s not. Think about Walt Disney trying to give us the confidence to use our imaginations - to dream and wonder. I think about what potentially we could do with our imaginations through these amplifications.

It takes the smallest kernel of imagination to envision, experiment, and experience it. So I think within 20 years we really will be able to go from thought to creation whether it’s words or even thinking. And I think that’s going to basically do a couple of things. Sure, it’ll create a lot of problems, but it’ll also potentially solve a lot of problems, and it will literally shock all of us. In 20 years, I really think the Construct will happen.