Meta faces lawsuit from authors over alleged use of copyrighted books to train AI


Meta Platforms (NASDAQ:META) is facing a lawsuit from a group of authors alleging that the company used their works without consent to train its artificial-intelligence language model Llama, Reuters reported.

The new filing, made late on Monday, consolidates two suits brought against Meta by comedian Sarah Silverman, Pulitzer Prize winner Michael Chabon and other authors, the report added.

Last month, a California judge dismissed part of the Silverman lawsuit and signalled that he would give the authors permission to amend their claims.

The new complaint includes chat logs of a Meta-affiliated researcher discussing how the data was obtained on a Discord server, a potentially significant piece of evidence suggesting that Meta was aware that its use of the books may not be protected by U.S. copyright law.

In the chat quoted in the complaint, researcher Tim Dettmers describes his back-and-forth with Meta’s legal department over whether the use of the book files as training data would be “legally ok,” the report noted.

“At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons,” Dettmers wrote in 2021, referring to certain data Meta has acknowledged using to train the first version of AI model Llama, as per the complaint.

A month earlier, Dettmers had written that the company’s lawyers told him that the data could not be used, and that models could not be published if they were trained on that data, according to the complaint.

While Dettmers does not describe the lawyers’ concerns, his counterparts in the chat identify “books with active copyrights” as the most likely source of concern. They say training on the data should “fall under fair use,” a U.S. legal doctrine that protects certain unlicensed uses of copyrighted material, the report noted.

Meta introduced the first version of Llama, its large language model, or LLM, in February and published a list of datasets used for training, including “the Books3 section of ThePile.” The person who assembled that dataset has said elsewhere that it contained 196,640 books, the report added, citing the complaint.

In September, Microsoft (MSFT)-backed OpenAI, developer of ChatGPT, was sued in a New York federal court by several authors, including George R.R. Martin and John Grisham, over alleged copyright infringement. Firms filed a class-action suit on behalf of the Authors Guild and said that OpenAI “copied the plaintiffs’ works wholesale, without permission or consideration” and fed the works into the LLMs behind its services.

Generative AI services have taken the world by storm since the launch of ChatGPT last year. Companies worldwide are developing their own LLMs.

Meta Platforms’ (META) Emu Video, Emu Edit, AudioCraft, SeamlessM4T, and Llama 2; Alibaba’s (BABA) Tongyi Qianwen 2.0 and Tongyi Wanxiang; Baidu’s (BIDU) Ernie Bot; OpenAI’s text-to-image tool DALL·E 3; Alphabet (GOOG) (GOOGL) unit Google’s Bard; Samsung’s (OTCPK:SSNLF) Gauss; and Getty Images’ (GETY) model called Generative AI by Getty Images are among the many generative AI models being developed.
