teaching machines

See Books Listen

My office of research sent me this invitation last semester:

ORSP is wondering if you would be interested in presenting at our Forum this spring.  This is an opportunity for faculty or staff to present on their ongoing research/scholarly/creative activity to a broad segment of the university community.  Presentations are 20 minutes to half an hour followed by discussion and are geared to an educated lay audience.

My office of research has been very nice to me, so I consented. What follows is a draft of my presentation.

See Books Listen
or ear-Books
or The Problems Computer Scientists Face

How much time do you suppose the average American spends reading each day? How about watching television? [Based on the answers, we university folks are either really optimistic (and out of touch), or we are very jaded (and out of touch). Perhaps the guesses are spot on, in which case we are well-educated.] The National Endowment for the Arts said in 2007 that we spend 7 minutes reading and 2 hours watching television each day. Is this bad? I’m hesitant to put such a label on it. Perhaps a more meaningful question would be, “Would I let my three boys be average?” No. I value the active process of reading, and I fear the passive enslavement of television. Bicycling and driving present the same dilemma to me. I want to be in control. Not the machine.

You may think it funny that a computer scientist is saying this. I should love the proliferation of technology, right? But I say it’s perfectly consistent. A computer scientist’s daily task is to make machines do things. When these roles reverse, I go from programmer to programmed. At the same time, I really do like technology, especially animation and 3-D graphics. I love visually exploring imagined worlds and seeing lifeforms that will never walk this Earth. I play video games and enjoy Pixar’s works.

How can we navigate the tension between these two worlds? Well, books and technology are not exclusive. Computers have been able to put words on the screen for a good long time now. Can’t we somehow mix the reading in with the visual stimuli of animation?

A couple of years ago my brother babysat my oldest son and brought an iPad storybook of one of Dr. Seuss’s works. I read it a bit with my son, but reading isn’t the best word to describe what we did. We mashed it. We made things dance. We made things fly. We lifted up rocks and found things hiding. I had a similar experience with this app called Four Seasons, which I’ll share with you. [demo] A lot of digital storybooks follow a similar pattern. The “bookiness” is minimized and the “gaminess” is elevated. Certainly these apps have their merits, but I’m not really reading. Technology can do better than this. At least, I think it can.

Most of our mobile devices support speech recognition. The microphone picks up the pressure waves that hit it and we can attempt to reconstruct the words that produced those waves. So last year, a computer science student named Jonathan, a graphic design student named Katie, and I (Chris) tried to build a book that listened to us read.

Here’s the result, a story that Katie wrote and animated:

The whole idea is that we have flashy stuff, but it doesn’t play until the words are read and recognized. I think this turned out pretty well for our first go, but arriving at it was a painful process. We first used Android’s speech recognition capabilities. The process behind speech recognition is this:

  1. Google engineers and linguists sat down and collected audio and text samples, forming patterns of how we arrange words and sounds, making the patterns searchable.
  2. Our phones sample the audio from their environment and ship off the results to Google’s data centers.
  3. Google gives us back some possible reconstructions as text.

That this can work even a little bit is a great score for computer science. The hitch we ran into is that this simply does not work for interactive reading. The round trip of sending out the audio and waiting for the results makes quick feedback difficult.

Timings to get different maximum numbers of results for the phrase “Binky the ball bounced away.”


Timings to get different maximum numbers of results for the phrase “As an interactive experience, Amnesia quickly runs into difficulties.”

The effect we were going for was like the little bouncing ball that followed the words on old sing-alongs:

With delays of up to four seconds, the bouncing ball isn’t feasible. Another problem is that the results tend to be a little wonky. Google makes no assumptions about what we are trying to say, apart from its statistical analyses of real speech and text. For example, when I tried to recognize the sentence “David Bowie has a pet dragon,” I got these results:

  1. David Bowie as a pet dragon
  2. David Bowie as a pet dragon (the same, no?)
  3. David Bowie as Dragan
  4. David Bowie as a pet dragon eye
  5. David Bowie in the pet dragon

The good news is that Jonathan discovered a library from Carnegie Mellon called PocketSphinx. It performs speech recognition directly on our mobile device, with no going out to the cloud and waiting for results. Despite the project looking like (and being) the work of a bunch of graduate students, we were able to find one little scrap of documentation so that we could get it up and running. One of the nice things about this library is that we can feed it a lexicon of words that actually occur in the story. It’ll only match those, instead of trying to match words from the entire English language.
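To give a feel for why a small lexicon helps, here is a minimal sketch in Python. It is not PocketSphinx's API and not our project code: PocketSphinx restricts the recognizer's search itself, while this toy only snaps an already-recognized word to the nearest story word after the fact. The names `STORY_TEXT`, `LEXICON`, and `snap_to_lexicon` are made up for illustration.

```python
import difflib

# Hypothetical lexicon: only the words that occur in the story's narration.
STORY_TEXT = "Binky the ball bounced away"
LEXICON = sorted({word.lower() for word in STORY_TEXT.split()})


def snap_to_lexicon(heard, lexicon=LEXICON, cutoff=0.6):
    """Return the story word closest to `heard`, or None if nothing is close.

    This is an after-the-fact approximation of a restricted vocabulary,
    not how a real recognizer applies its dictionary.
    """
    matches = difflib.get_close_matches(heard.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for guess in ["bounce", "Binky", "xylophone"]:
        print(guess, "->", snap_to_lexicon(guess))
```

With only five candidate words, a near miss like "bounce" has just one plausible home, and a word that resembles nothing in the story is rejected outright.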

The way Digger and Tell works is that the phone is doing many things at once. It’s playing the movie that Katie built independently of us. And it’s listening for the reader’s voice. When the voice is matched, the listener tells the movie player to go. A separate thread of execution is waiting for the next narration point—and it’ll tell the movie player to stop. Managing synchronization between these three threads is not perfect, and sometimes the movie player will charge on through the point it needs to stop. It is frustrating.
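The handoff between listener and movie player can be sketched in Python. This is a toy, not our Android code: frames are plain integers, the "listener" is simulated, and every name (`MoviePlayer`, `simulate_reading`, `stop_points`) is hypothetical. It also differs from what I described above in one deliberate way: the player pauses itself at each narration point instead of waiting for a separate thread to tell it to stop, which is one way to keep it from charging through.

```python
import threading


class MoviePlayer:
    """Toy stand-in for the movie player; a frame is just an integer."""

    def __init__(self, total_frames, stop_points):
        self.total_frames = total_frames
        # Frames where the movie must wait for the next line to be read.
        self.stop_points = {p for p in stop_points if 0 < p <= total_frames}
        self.frame = 0
        self.go = threading.Event()      # set by the listener on a voice match
        self.paused = threading.Event()  # set by the player when it pauses

    def run(self):
        while self.frame < self.total_frames:
            self.go.wait()               # block until the listener says go
            self.frame += 1
            if self.frame in self.stop_points:
                self.go.clear()          # pause before another frame can play
                self.paused.set()


def simulate_reading(player):
    """Pretend the reader's voice is matched at every pause; return pause frames."""
    paused_at = []
    worker = threading.Thread(target=player.run, daemon=True)
    worker.start()
    player.go.set()                      # first line recognized
    for _ in player.stop_points:
        player.paused.wait()
        paused_at.append(player.frame)
        player.paused.clear()
        player.go.set()                  # next line recognized
    worker.join()
    return paused_at


if __name__ == "__main__":
    print(simulate_reading(MoviePlayer(total_frames=6, stop_points=[2, 4])))
```

Because the same thread that advances the frame also checks for the stop point, there is no window in which a late "stop" message can arrive after the movie has already moved on.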

This year we decided to try out a real game engine called Unity 3D, which was built to do 3-D interaction and animation. It’s been a lot more fun working with this tool. I was able to put this little proof-of-concept together in a week’s time. It’s called The Unseen, and it’s a little story about hypocrisy and blindness. The trees were a bad idea; they really slow things down on my poor little phone.

As you can see, I couldn’t really get rid of the trees without changing the story.

This year we’ve also looked at supporting other languages, which works incredibly well thanks to the fact that mobile devices were designed from the get-go to change languages. The largest problem by far is getting good results from the speech recognition system. Even with PocketSphinx matching only words that occur in the story, we get back some terrible matches, especially if you sneeze or cough. Currently we are exploring modeling the narration not just as a collection of independent words, but as a miniature grammar, where word order is considered.
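Here is a rough Python sketch of what taking word order into account could look like. It is a simplification under my own assumptions, not our implementation: a real recognizer would apply the grammar while decoding the audio, whereas this toy filters words after they have been recognized. `NarrationTracker` is a hypothetical name.

```python
class NarrationTracker:
    """Accept recognized words only when they arrive in narration order."""

    def __init__(self, sentence):
        self.expected = sentence.lower().split()
        self.position = 0  # index of the next word we are waiting for

    def hear(self, word):
        """Advance and return True only if `word` is the next expected word."""
        if not self.done and word.lower() == self.expected[self.position]:
            self.position += 1
            return True
        return False       # a sneeze, a cough, or a word out of order

    @property
    def done(self):
        return self.position == len(self.expected)


if __name__ == "__main__":
    tracker = NarrationTracker("David Bowie has a pet dragon")
    for word in ["david", "bowie", "dragon", "has", "a", "achoo", "pet", "dragon"]:
        print(word, tracker.hear(word))
    print("finished:", tracker.done)
```

An early "dragon" and a stray "achoo" are both ignored, because neither is the word the story is waiting for at that moment.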

As we continue to fight this battle of trying to make books listen to us, we run full force into the problems that every computer scientist faces:

  • Our tools are in various states of brokenness and unsuitability, and it’s our own fault.
  • The world through sensors is noisy and slow to interpret.
  • We’ve seen freedom. The virtual has fewer limits than the physical, and our imaginations soar. But guess who’s in charge, when these limits disappear? Us.

Can we do it? Can we make books listen? Probably. I think we just need to read up on it some more.

