Google digests 11,000 novels to improve AI conversation

karenarchey · September 28, 2016, 4:12pm

Richard Lea reports for the Guardian that Google has fed its AI program, Google Brain, 11,000 novels in order to improve the quality of its conversation. Unfortunately, Google took those 11,000 novels without notifying their authors (many of the books were unpublished and freely available on the web), resulting in something of an ethical conundrum for authors. Read an excerpt of Lea’s article below, or the full piece via the Guardian.

When the writer Rebecca Forster first heard how Google was using her work, it felt like she was trapped in a science fiction novel.

“Is this any different than someone using one of my books to start a fire? I have no idea,” she says. “I have no idea what their objective is. Certainly it is not to bring me readers.”

After a 25-year writing career, during which she has published 29 novels ranging from contemporary romance to police procedurals, the first instalment of her Josie Bates series, Hostile Witness, has found a new reader: Google’s artificial intelligence.

“My imagination just didn’t go as far as it being used for something like this,” Forster says. “Perhaps that’s my failure.”

Forster’s thriller is just one of 11,000 novels that researchers including Oriol Vinyals and Andrew M Dai at Google Brain have been using to improve the technology giant’s conversational style. After feeding these books into a neural network, the system was able to generate fluent, natural-sounding sentences. According to a Google spokesman – who didn’t want to be named – products such as the Google app will be “much more useful if they can capture the nuance of language better”.

For the moment, the research is just a “proof of concept”, the spokesman continues via email, but these methods “could help Google understand and produce a broader, more nuanced range of text for any given task”.

“We could have used many different sets of data for this kind of training, and we have used many different ones for different research projects,” he adds. “But in this case, it was particularly useful to have language that frequently repeated the same ideas, so the model could learn many ways to say the same thing – the language, phrasing and grammar in fiction books tends to be much more varied and rich than in most nonfiction books.”

The only problem is that they didn’t ask. The Google paper [PDF] says that the novels used in this research were taken from “the Books Corpus”, citing a 2015 paper by Ryan Kiros and others [PDF] which describes how the authors “collected a corpus of 11,038 books from the web”, describing them as “free books written by [as] yet unpublished authors”. It’s a collection that has been used by other researchers working in artificial intelligence and which is currently available for download in its entirety from the University of Toronto.

Forster says that she “always appreciates an interesting use of words”, but while Hostile Witness is available to download for free, no one asked her permission to use her novel as raw material to train a computer.

“Perhaps I’m still thinking in the old way, that a reader will read my book – it didn’t even occur to me that a machine could read my book. What I found curious was that these were referred to as ‘free books written by as yet unpublished authors’ because my state is very different,” she says.

Like many of the novels in the Books Corpus collection, the edition of Hostile Witness used in the research was published on Smashwords and includes a copyright declaration that reserves “all rights”, specifies that the ebook is “licensed for your personal enjoyment only”, and offers the reader thanks for “respecting the hard work of this author”. While Forster says she’s no lawyer, the “spirit of this declaration is clear – you hope that your work would be respected by readers”.

“I take great pride in my craft, and perhaps it was chosen because of that. Which would be great. Or perhaps it was chosen because it was there, because it was free?”