By Scott Broock and Mike Seymour
Part two of several articles on the case for new cloud-based interactive media companies built on top of today’s game engine technology.
Author’s update: Epic Games announced today, March 12th, that it has acquired Cubic Motion, a leading facial capture company. This is a significant addition to last year’s acquisition of the facial rigging experts at 3Lateral. Both become key assets within Unreal Engine for real-time virtual production and interactive experiences, and cornerstones of the expressive, emotive characters at the convergence of games, social media and entertainment, the topic of this article.
– SB
Back to the Future
A common talking point these days is that video games are becoming the new social hangouts, or that video games are beginning to look like cinematic universes.
Neither of which is true. They always have been.
Experiences like Fortnite, Roblox and Minecraft are overnight successes 30 years in the making, with a lineage that can be traced back through creative building games such as Club Penguin and Habbo Hotel to the granddaddy of all social multiplayer survival games, DOOM.
A snapshot of the market without such context fails to capture the velocity with which gaming, computing and networking are now converging to create the ‘Metaverse’ of interactivity that Tim Sweeney, CEO of Epic Games, recently described as our next great leap in communication.
An accounting that skips over the history of modern gaming also gives the false impression that the threat to legacy media is still nascent and manageable, leading incumbents to conclude that they have time to buy their way into the market or control the pace of progress through their war chest of intellectual property.
Neither of which is true, either.
The reality is that social gaming is rapidly evolving towards something entirely new, beyond just gaming, entertainment or technology.
In part one of this series, we introduced Sweeney’s specific vision for the Metaverse as an answer to the shortcomings of legacy media models, the siloed economics of today’s game platforms and the walled gardens of IP that stifle innovation.
In this part two, we first look at the major shift in gaming that occurred with the release of DOOM and the community of ‘modders’ and gifted exhibitionists that it created. Next, we look at how phones and production tools are now blurring the line between you the player and you the actor. Finally, we look at the spaces for communication and interaction that will make the experience of controlling an emotive character in the Metaverse fun and entertaining.
DOOMED and Loving It
DOOM blasted onto the PC gaming scene in 1993 as the brainchild of John Carmack, John Romero and the team at id Software. Gamers were blown away by the fast-paced action of a first-person shooter that pushed the boundaries of what was possible on PCs of the day, with dynamic level design, peer-to-peer Deathmatch and team-based multiplayer modes.
Most importantly, DOOM championed the hacker ethos at the heart of gaming by encouraging a passionate community of players and developers to modify its levels. This broke with the user models of earlier games, which were locked down by their publishers.
As gameplay could be captured in-game and then replayed in real time, ‘players’ also became ‘spectators’ and ‘students’ who studied and shared the moves of legendary players for entertainment and competitive advantage, a trend that continues today as streamed gameplay and “let’s play” videos draw billions of views on YouTube and Twitch.
With id’s release of Quake and internet play in 1996, and Epic Games’ Unreal Engine in 1998, the stage was set for the explosion of esports and global social play that runs straight to today’s hyper-networked social experiences.
In some cases, an unintended use of the engine by players has led to a major supported feature that extends gaming culture to a new class of creators. Players discovered early on with DOOM and Quake that they could modify in-game cameras to reframe the default first-person point of view and record real-time animated stories. By the year 2000, game-engine-based movies had evolved into a popular new art form on the internet known as “Machinima”, defined as “filmmaking within a real-time 3D virtual environment.” Players had hacked their way into the role of movie makers, using game engines as found technology.
Some twenty years later, game engines have matured to the point where they no longer need to be hacked to produce stories as an adjunct to games. Narrative and virtual production tools are baked in as critical features for today’s creators and actors, with revolutionary implications for legacy media as well as for the rise of a new creative class within Sweeney’s Metaverse.
As with any production, these templates for stories need life to succeed. They need you as an actor.
You are the New Interface
As we advance further, controls for characters will need to evolve as well. The next logical step is a virtual mask that you control with your face and puppet with your body.
Facial capture tech is already familiar to many as Animojis and Memojis on iPhones and AR Emojis on Samsung phones. Held or suspended in front of your face, the phone’s camera and/or facial-tracking sensor mirrors your key facial features, which can be mapped to an animated character in real time.
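For a sense of how little code now stands between your face and a character, here is a minimal Swift sketch using Apple’s ARKit face tracking on the iPhone’s TrueDepth camera; `applyToRig` is a hypothetical placeholder for mapping the tracked coefficients onto an avatar, not a real ARKit API.

```swift
import ARKit

// A minimal sketch: read per-frame blend-shape coefficients from the
// TrueDepth camera and forward them to a character rig.
// `applyToRig` is a hypothetical placeholder, not an ARKit API.
class FacePuppet: NSObject, ARSessionDelegate {
    let session = ARSession()

    func start() {
        guard ARFaceTrackingConfiguration.isSupported else { return }
        session.delegate = self
        session.run(ARFaceTrackingConfiguration())
    }

    // ARKit delivers an updated ARFaceAnchor many times per second.
    func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
        for case let face as ARFaceAnchor in anchors {
            // Each coefficient runs 0.0–1.0, e.g. jawOpen, browInnerUp.
            let jawOpen = face.blendShapes[.jawOpen]?.floatValue ?? 0
            let smile = face.blendShapes[.mouthSmileLeft]?.floatValue ?? 0
            applyToRig(jawOpen: jawOpen, smileLeft: smile)
        }
    }

    func applyToRig(jawOpen: Float, smileLeft: Float) {
        // Drive your avatar's morph targets or facial bones here.
    }
}
```

ARKit exposes some fifty of these coefficients; the capture, solve and retarget pipeline scales up from there.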
The quality of the tracking and its translation to facial animation increases with purpose-built headgear and camera systems available from companies such as Faceware and Cubic Motion. We expect to hear more on this topic soon from Epic Games as they integrate the facial rigging expertise of 3Lateral, acquired last year.
In terms of body tracking, the latest versions of both ARKit for iOS and ARCore for Android provide AI-based human pose estimation that tracks your skeleton in real time through video. At the time of this writing, Apple has not announced its iPad Pro update or the features of the iPhone 12, but the expectation is that both will feature a “time-of-flight” sensor alongside their rear cameras. Similar to the front-facing depth sensor on the current generation of iPhones, this will allow for better isolation of subjects from their background, and therefore better skeletal tracking.
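A rough sketch of the same idea for the body, assuming ARKit 3’s body-tracking configuration: each frame, joint transforms are pulled from the detected skeleton and handed to a retargeting step. `retarget` is a hypothetical stand-in for driving an avatar rig.

```swift
import ARKit

// A minimal sketch of ARKit 3 body tracking: per frame, read joint
// transforms from the detected skeleton. `retarget` is a hypothetical
// stand-in for mapping the data onto an avatar, not an ARKit API.
class BodyPuppet: NSObject, ARSessionDelegate {
    let session = ARSession()

    func start() {
        guard ARBodyTrackingConfiguration.isSupported else { return }
        session.delegate = self
        session.run(ARBodyTrackingConfiguration())
    }

    func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
        for case let body as ARBodyAnchor in anchors {
            let skeleton = body.skeleton
            // Transform of a named joint relative to the body's root.
            if let leftHand = skeleton.modelTransform(for: .leftHand) {
                retarget(joint: "leftHand", transform: leftHand)
            }
        }
    }

    func retarget(joint: String, transform: simd_float4x4) {
        // Map the tracked joint onto your avatar's skeleton here.
    }
}
```

A dedicated suit replaces the camera with body-worn sensors, but the retargeting step is much the same.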
As with facial tracking, the quality of a full-body track scales with dedicated software and hardware, ranging from the relatively inexpensive suits made by Rokoko all the way to the premium-priced suits made by Xsens.
The importance of hands in communicating intent has also received a great deal of attention. From the optical hand tracking in the Oculus Quest to the specialized gloves from StretchSense, the finger sensors built into Valve’s Index controllers and the rumored VR controllers for the PlayStation 5, realistic gestures will be mapped to your character in real time.
The net result is that you can perform for friends, participate in live competitions and act in narratives through a directly articulated avatar for immersive, collaborative role-playing, perfect for the resurgence of choose-your-own-adventure stories and the group dynamics of Dungeons & Dragons.
In terms of distribution, such content can be streamed live with interactive elements through Twitch or Mixer extensions. In the case of cloud game streaming services like Google’s Stadia, it will also be possible to “jump into” live experiences as your character through “Crowd Play” and “State Share.”
In short, the new “inter-face” for the Metaverse will be you. But how real must your representation in a game or simulation be?
Self Control
As a new generation of visually oriented youth spends more of its time online, their digital selves will need the fidelity and expressiveness to convey richer and more subtle emotion. We value the rich non-verbal communication that comes from talking to someone and observing their expressions in real time.
Today, machine learning and facial simulation software can produce real-time emotional expressions that are orders of magnitude more convincing than those of the early days of Machinima, a critical development for deeper emotional engagement within the synthetic worlds we are now creating.
Nevertheless, photoreal humans are incredibly complex to produce, and it is difficult to capture the micro movements of muscle, fat and tissue that we subconsciously monitor for emotion and intent in faces and body language.
A ‘sort-of’ realistic digital human can be unsettling. That is the idea behind the Uncanny Valley theory, which holds that as digital humans become more realistic our affinity for them grows, then plunges when they are nearly lifelike but still subtly ‘off,’ and recovers only once they are realistic enough to climb out of the valley.
Interestingly, we are far more forgiving of an uncanny face when it is interactive. Games such as Mass Effect, The Last of Us and The Outer Worlds feature extensive dialogue and branching narrative trees with non-player characters. Even though these characters are not photoreal and do not move with the fluidity we expect of a real human, it takes only a few minutes of play to accept them as believable enough for suspension of disbelief.
Compare this with non-interactive digital humans, such as digital doubles in the movies. As we’ve seen over the past several decades, audiences do not tolerate approximate facial expressions and dislike the look of characters in films such as The Polar Express and Beowulf. Only at the highest end of filmmaking today, in movies such as The Avengers or Gemini Man, do we see a general suspension of disbelief for digital humans and humanoids. Yet for all the artistry, there is still a way to go before Thanos or a digital Will Smith passes for lifelike flesh and bone. Stylized characters, such as those in Fortnite or in the music videos from Riot Games, will likely set the standard for mass adoption and the transition up from Animojis.
Having a well-constructed, expressive character that you can control in an interactive environment is necessary, but not sufficient, to realize a successful Metaverse. You must also have an engaging reason to be there.
No Second Life
Why isn’t the Metaverse just a repeat of the now semi-dormant Second Life, the famed virtual world that was intended to be a mirror of our own, complete with a marketplace for virtual goods and characters?
The answer is that Second Life provided no barriers and no limits, leading to little in the way of direction. It was the Wild West, where the worst of the internet was on open display. Additionally, its characters were technically limited, falling into the Uncanny Valley with a thud. People seeking order and a safe community moved on in search of a better place.
Nevertheless, Second Life didn’t fail; it was just early. It created a reference point for subsequent models of large socialized play. Now the technology is mature enough and cheap enough to reach a large pool of natural talent raised and trained on Roblox and Minecraft.
Online spaces need only a critical mass of people engaged in meaningful activity to generate sustained interest. This is human nature. We like crowds. They reinforce our sense that what we are doing is the right thing and worthwhile.
Take Fortnite for example. The scope of the experience and the presence of others are established by the structure of the island and its inevitable storm. Every time you visit the island, you arrive fully aware of ninety-nine other people competing to eliminate you, and as opponents leave (die), the playable area of the island shrinks. As the number of players dwindles, you become even more aware of the competition around you.
This is similar to choosing a restaurant. If a restaurant looks good, you are even more likely to try it if it is busy. Fortnite feels like turning up to a great restaurant: it never feels unpopular, and you are always aware of others.
The problem with a world like Second Life is that it wanted to be a duplicate of the real one. Its creators thought it needed this mirroring of reality, down to the level of real shops and places, to be successful. But why?
By its very nature the Metaverse cannot be packaged and sold fully formed to users. They must participate in defining it and finding purpose in it. This is not to suggest that an open ‘anything-goes’ approach is warranted. It is important to learn from what has come before and acknowledge where it can be improved.
Characters and interactive spaces should carry clear ratings, with standards and practices that ensure COPPA compliance for age-appropriate programming. Facebook and Twitter, which monetize friendships, opinions and emotions as commodifiable assets, can be challenged by a family-friendly option that provides trustworthy entertainment.
In the end, the Tim Sweeney vision of a Metaverse recognizes that we as humans love interaction and crave contact in order to declare our identity and to demonstrate mastery of a skill to our tribe.
Everyone is Engaged
It is not unexpected that people are anxious to understand the next big thing, just as it is not unprecedented that they would try to interpret change through the lens of current dominant business models and societal norms. The internet was first dismissed as effectively books and magazines, but online. Television was initially thought of as radio shows with pictures, and radio and the telegraph as merely spoken print. Similarly, the Metaverse is now being framed as a simple extension of today’s gaming. This is a mistake.
As Marshall McLuhan observed, incumbents tend to confront changes in their media through the rear-view mirror, missing the opportunity to adapt before obsolescence.
Focusing on the content of any new medium alone largely ignores the changing role and expectations of the audience as its participants, particularly youth. What matters next in the coming medium of the Metaverse is no longer what we watch or post; it is the networked form of participation and identity that the medium will engender through interactivity, personal representation and a new form of spatial engagement.
Authors
Scott Broock is the former Executive Vice President of Strategy and Innovation for Illumination Entertainment, Global VR Evangelist at YouTube, and Vice President of VR Content and Deal Development at cinematic VR pioneer Jaunt. He has worked at the intersection of gaming, telecommunications, TV and film for almost two decades. Broock’s current venture, Totem Networks LLC, is funding and developing avatar-first interactive global experiences for streaming media, cloud gaming and location-based entertainment.
Mike Seymour is a lead researcher in the Motus Lab at The University of Sydney. His research explores the use of artificial intelligence and interactive photoreal faces in new forms of Human-Computer Interfaces (HCI). He is the co-founder of fxguide and fxphd, which chronicle and educate the film, television and gaming world on state-of-the-art visual effects, virtual production and digital humans. Seymour recently chaired SIGGRAPH Asia’s “Real-Time Live!” session, featuring the latest advances in motion capture, rendering and interactivity.