Question

I ask this academically: I want to pose an important question aloud and have the community try to answer it. Can we build a system that generates a scene to play out over a live anonymous group video chatroom, reads the text typed at it, and responds with a chatbot?

Live Internet video is often blurry and low-resolution; one cannot make out many details in the distant party's scene. Modern software tools can render scenes that look very real when not moving. Making them move realistically requires a large piece of simulation software.

Faces can be rendered at 24 frames per second by a cluster of 24 machines, each capable of 1 frame per second, by staggering the work so that each machine renders every 24th frame. The video would then lag 1 second behind the point where the decision was made as to which facial expression to generate. These facial expressions and their generation are a key problem; the skin-realism requirement has already been solved by the graphics community.
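As a back-of-the-envelope check of that pipelining claim, here is a small sketch. The 24-machine and 1-second figures come from the paragraph above; the round-robin scheduling is my assumption:

    # Pipelined render farm: throughput scales with machine count,
    # but per-frame latency stays at one machine's render time.
    RENDER_TIME = 1.0   # seconds one machine needs per frame
    MACHINES = 24       # machines in the cluster
    FPS = 24            # resulting output frame rate

    for frame in range(5):
        machine = frame % MACHINES         # round-robin assignment
        decided = frame / FPS              # when the expression was chosen
        ready = decided + RENDER_TIME      # when the render finishes
        print(f"frame {frame}: machine {machine}, "
              f"decided t={decided:.3f}s, ready t={ready:.3f}s")

    # Throughput: MACHINES / RENDER_TIME = 24 fps.
    # Latency: each frame appears RENDER_TIME = 1 s after its
    # expression was chosen, i.e. the 1-second lag described above.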

Facial expressions have been categorized by several researchers (Ekman's Facial Action Coding System, for example), and rendering them has been demonstrated in the modern computer-graphics literature. We can produce them if we know which ones are appropriate for a given situation.
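As a toy illustration of that selection step, suppose (my assumption; nothing here is from the question) that an upstream component has already labelled the situation with a coarse sentiment. The mapping to a renderable expression can then be a simple lookup:

    # Hypothetical mapping from a coarse sentiment label to an
    # expression name that the renderer knows how to display.
    EXPRESSION_FOR_SENTIMENT = {
        "positive": "smile",
        "negative": "frown",
        "surprised": "raised_eyebrows",
        "neutral": "relaxed",
    }

    def choose_expression(sentiment: str) -> str:
        """Fall back to a neutral face for unknown labels."""
        return EXPRESSION_FOR_SENTIMENT.get(sentiment, "relaxed")

    print(choose_expression("positive"))   # -> smile

The hard part, of course, is producing the sentiment label in the first place.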

Chatbots have been in use for decades. There now exist quite 'smart' chat programs that read what they are asked and reply in a sensible way. They have always done this with text, but text-to-speech software can speak in a human-ish voice, and speech-recognition software is getting better every year.

What I propose is that it should be quite straightforward to connect all of these disparate pieces of software and create a truly amazing Turing-test beater.
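As a minimal sketch of that glue, here is one possible wiring, assuming the Python packages speech_recognition and pyttsx3 (my choice of libraries; the question names none) and a trivial rule-based stand-in for the chatbot:

    import speech_recognition as sr   # speech -> text
    import pyttsx3                    # text -> speech

    def chatbot_reply(text: str) -> str:
        """Placeholder brain; swap in a real chatbot here."""
        words = text.split()
        if text.endswith("?"):
            return "That is a good question. What do you think?"
        return f"Interesting. Tell me more about {words[-1]}." if words else "Go on."

    recognizer = sr.Recognizer()
    speaker = pyttsx3.init()

    with sr.Microphone() as source:               # listen to the room
        audio = recognizer.listen(source)
    heard = recognizer.recognize_google(audio)    # recognize the speech

    reply = chatbot_reply(heard)                  # decide what to say

    speaker.say(reply)                            # say it back
    speaker.runAndWait()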

This program could enter a virtual space and display a realistic environment as if on a webcam, like the other participants. It could watch their facial expressions, listen to their speech, and read their text. It could then create a response and either type or speak it back to the group. Choosing what to respond with is a difficult problem that even most humans have not mastered, but with a lot of work we can get close.

The Turing Test is about proving that a communicator is a human, but 'proof' only in the sense that it is good enough to fool the human judges. If the judges are simply everyone, they are unlikely to apply a strict formal procedure; guessing or falling for a trick is good enough.

Do you think we can do this?

Is this plan flawed? Are there moral implications to tricking the average viewer in this way? Can we make millions of dollars by generating personal intelligent assistants?

Solution

There is already research going on in this area. Digital avatars have been used with some success. Some of the key points:

  • Modern PCs can render a convincing human face in real time, no problem: a mid-range graphics card and a good model are enough (see NVIDIA's Dawn demo, for example).

  • Current voice-generation software can read text fluently and pronounce it properly. It still sounds a bit monotonous, since the speaker has no emotions. (See this article).

  • There is research into making machines "feel". I say "feel" because it's basically just a little program with a couple of variables ("anger", "fear", "hunger", "boredom", "sadness", ...) and a complex set of rules that influence these variables; see the sketch after this list. (See the Wikipedia article for details).
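A toy version of that "variables plus rules" model, with illustrative variable names and rules of my own, might look like this:

    from dataclasses import dataclass

    @dataclass
    class EmotionState:
        anger: float = 0.0
        fear: float = 0.0
        boredom: float = 0.0
        sadness: float = 0.0

    def apply_event(state: EmotionState, event: str) -> None:
        """One possible rule set: events nudge the variables."""
        if event == "insulted":
            state.anger = min(1.0, state.anger + 0.3)
            state.sadness = min(1.0, state.sadness + 0.1)
        elif event == "ignored":
            state.boredom = min(1.0, state.boredom + 0.2)
        elif event == "complimented":
            state.anger = max(0.0, state.anger - 0.2)

    state = EmotionState()
    apply_event(state, "insulted")
    print(state)   # EmotionState(anger=0.3, fear=0.0, boredom=0.0, sadness=0.1)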

The main problem right now is that we don't know what emotions are. Are they just quantities of molecules floating in certain parts of the brain? If so, which molecules, and in which parts of the brain? Neuroscientists today try to infer mental states by looking at MRI images. To understand what this means, here is an analogy: it is like trying to guess what mankind is up to by looking at the distribution of light on Earth from the Moon with the naked eye.

So we don't understand what emotions are. The next hurdle is that emotions mean nothing without context. It's easy to write a program that feels "sad" by just setting the variable sadness to 1.0, but that would feel weird if there were no reason for it. So the program must be able to follow the conversation, build a mental image of it (what are the people talking about, and how do they feel right now?), and then adjust its own mental state according to the current rules of the respective group.
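To make that concrete, here is a toy continuation of the earlier sketch in which sadness changes only in response to something actually said in the conversation. The cue words and weights are my own assumptions:

    SAD_CUES = {"died", "lost", "failed", "goodbye"}
    HAPPY_CUES = {"thanks", "great", "congratulations", "won"}

    def update_sadness(sadness: float, utterance: str) -> float:
        """Nudge sadness only when the conversation gives a reason,
        rather than setting sadness = 1.0 out of nowhere."""
        words = {w.strip(".,!?") for w in utterance.lower().split()}
        if words & SAD_CUES:
            sadness += 0.4
        if words & HAPPY_CUES:
            sadness -= 0.4
        return min(1.0, max(0.0, sadness))

    sadness = 0.0
    for line in ["my dog died yesterday", "thanks, you are all great"]:
        sadness = update_sadness(sadness, line)
        print(f"{line!r} -> sadness = {sadness:.1f}")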

You know how it feels when you join a new group for the first time and try to get a grip on what is going on and how you should behave. That's a hard task for humans and even more so for a program.

There is an article, "Können wir eine Seele simulieren?" ("Can we simulate a soul?"; German only, but the output of Google Translate is pretty good).

OTHER TIPS

We can't pass the traditional text-based Turing test. Adding video on top is irrelevant.

I disagree with your question being posted here, but I feel it necessary to point out that you have severely misunderstood the point of the Turing test. It has nothing to do with looking like a human or sounding like one.

In fact, most proposed tests involve a time-delayed teletype terminal, so that as little information as possible is transferred beyond the actual communication under test.

I hate to burst your bubble, but the current generation of chatbots, and even the most advanced AIs in the lab, are nowhere near beating the Turing test. It becomes obvious very quickly that there is no real person there.

The big problem is not rendering appearance (visual or vocal); it is rendering intelligence and emotion.

What you suggest is the front end of a real-time Shrek. But what about the back end?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow