On reward functions.

From a 2003 article in The Guardian:

At the Institute for Marine Mammal Studies in Mississippi, Kelly the dolphin has built up quite a reputation. All the dolphins at the institute are trained to hold onto any litter that falls into their pools until they see a trainer, when they can trade the litter for fish. In this way, the dolphins help to keep their pools clean.
Kelly has taken this task one step further. When people drop paper into the water she hides it under a rock at the bottom of the pool. The next time a trainer passes, she goes down to the rock and tears off a piece of paper to give to the trainer. After a fish reward, she goes back down, tears off another piece of paper, gets another fish, and so on. This behaviour is interesting because it shows that Kelly has a sense of the future and delays gratification. She has realised that a big piece of paper gets the same reward as a small piece and so delivers only small pieces to keep the extra food coming. She has, in effect, trained the humans.
Her cunning has not stopped there. One day, when a gull flew into her pool, she grabbed it, waited for the trainers and then gave it to them. It was a large bird and so the trainers gave her lots of fish. This seemed to give Kelly a new idea. The next time she was fed, instead of eating the last fish, she took it to the bottom of the pool and hid it under the rock where she had been hiding the paper. When no trainers were present, she brought the fish to the surface and used it to lure the gulls, which she would catch to get even more fish. After mastering this lucrative strategy, she taught her calf, who taught other calves, and so gull-baiting has become a hot game among the dolphins.

This is a real-life example of what happens when humans fail to define a reward function properly. A great example in the RL context is given by OpenAI in a blog post, which shows that in a racing game the agent learns to drive in circles, repeatedly collecting small rewards, rather than going for the one-off reward for finishing the race, which is small compared with what the loop accumulates!
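To see why circling can beat finishing, here is a minimal sketch that compares the discounted return of two fixed behaviours in a toy setting. All the numbers and names (horizon, reward sizes, pickup period, discount factor) are made up for illustration and are not taken from the OpenAI post or from any real environment.

```python
def discounted_return(rewards, gamma):
    """Return the sum of gamma**t * r_t over a sequence of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical numbers, chosen only to illustrate the effect.
HORIZON = 100         # steps in one episode
FINISH_STEP = 20      # step at which a "race properly" behaviour finishes
FINISH_REWARD = 10.0  # one-off reward for finishing the race
PICKUP_REWARD = 1.0   # reward for each small pickup
PICKUP_PERIOD = 2     # a circling behaviour collects a pickup every 2 steps
GAMMA = 0.99          # discount factor


def finish_rewards():
    """Rewards for racing to the finish: nothing until the finish line."""
    rewards = [0.0] * HORIZON
    rewards[FINISH_STEP] = FINISH_REWARD
    return rewards


def loop_rewards():
    """Rewards for circling forever: a small pickup at regular intervals."""
    return [PICKUP_REWARD if t % PICKUP_PERIOD == 0 else 0.0
            for t in range(HORIZON)]


finish_return = discounted_return(finish_rewards(), GAMMA)
loop_return = discounted_return(loop_rewards(), GAMMA)

print(f"finish the race: {finish_return:.2f}")
print(f"circle for pickups: {loop_return:.2f}")
```

With these assumed numbers the circling behaviour collects the larger return, so an agent that maximises return has no incentive to finish. Kelly's paper-tearing is the same arithmetic: many small deliveries pay more than one big one.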

Written on January 11, 2018 +2134


Abhishek Naik

Ph.D.,
University of Alberta; Amii
