Apple says it took a ‘responsible’ approach to training its Apple Intelligence models
Apple has published a technical paper detailing the models it developed for Apple Intelligence, the range of generative AI features coming to iOS, macOS, and iPadOS over the coming months.
In the paper, Apple pushes back against accusations that it took an ethically questionable approach to training some of its models, reiterating that it didn’t use private user data and instead drew on a combination of publicly available and licensed data for Apple Intelligence.
“(The) pre-training dataset consists of… data we have licensed from publishers, curated publicly available or open datasets, and publicly available information crawled by our web crawler, Applebot,” Apple writes in the paper. “Given our focus on protecting user privacy, we note that no private Apple user data is included in the data mix.”
In July, Proof News reported that Apple used a dataset called The Pile, which contains subtitles from hundreds of thousands of YouTube videos, to train a family of models designed for on-device processing. Many YouTube creators whose subtitles were swept up in The Pile weren’t aware of this and didn’t consent to it; Apple later issued a statement saying it didn’t intend to use those models to power any AI features in its products.
The technical paper, which offers a peek at the models Apple first unveiled at WWDC 2024 in June, collectively called Apple Foundation Models (AFM), emphasizes that the training data for the AFM models was acquired in a “responsible” way, or at least responsible by Apple’s definition.
The training data for the AFM models includes publicly available web data as well as licensed data from undisclosed publishers. According to The New York Times, Apple approached several publishers in late 2023, including NBC, Condé Nast, and IAC, about multi-year deals worth at least $50 million to train models on the publishers’ news archives. Apple’s AFM models were also trained on open source code hosted on GitHub, specifically Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go code.
Training models on code without permission, even open source code, is a point of contention among developers. Some open source codebases are unlicensed or don’t allow AI training in their terms of use, some developers argue. Apple, however, says it filtered the code by license to try to include only repositories with minimal usage restrictions, such as those released under MIT, ISC, or Apache licenses.
To boost the AFM models’ math skills, Apple specifically included math questions and answers from webpages, math forums, blogs, tutorials, and seminars in the training set, according to the paper. The company also tapped “high-quality, publicly available” datasets (which the paper doesn’t name) with “licenses that allow use to train… models,” filtered to remove sensitive information.
In total, the training dataset for the AFM models weighs in at about 6.3 trillion tokens. (Tokens are bite-sized pieces of data that are generally easier for generative AI models to ingest.) For comparison, that’s less than half the number of tokens, 15 trillion, that Meta used to train its flagship text-generating model, Llama 3.1 405B.
Apple sourced additional data, including human feedback and synthetic data, to fine-tune the AFM models and try to mitigate undesirable behaviors such as toxic output.
“Our models are designed to help users do everyday tasks on their Apple products in a way that is grounded in Apple’s core values and rooted in our responsible AI principles at every stage,” the company said.
There’s no smoking gun or shocking insight in the paper, and that’s by careful design. Papers like this one are rarely very revealing, owing to competitive pressures, but also because disclosing too much could land companies in legal trouble.
Some companies that train models by scraping public web data claim the practice is protected by the fair use doctrine. But that claim is hotly contested and the subject of a growing number of lawsuits.
Apple notes in the paper that it allows webmasters to block its crawler from scraping their data. But that puts individual creators in a bind. What’s an artist to do if, say, their portfolio is hosted on a site that refuses to block Apple’s data scraping?
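The opt-out Apple describes works at the site level, through the robots.txt file that crawlers check before scraping. As a minimal sketch, assuming Apple’s documented user agent tokens (Applebot for the crawler itself, and Applebot-Extended for controlling whether crawled content may be used to train its models), a site owner could add:

    # Opt out of Apple's AI training while staying visible to Applebot for search.
    # Applebot-Extended doesn't crawl; it only governs how already-crawled data is used.
    User-agent: Applebot-Extended
    Disallow: /

    # Or shut out Apple's crawler entirely, search indexing included:
    User-agent: Applebot
    Disallow: /

Either way, the directive covers the whole domain, which is exactly the rub: the choice belongs to whoever controls the site, not to the individual creators whose work happens to be hosted there.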
Courtroom battles will decide the fate of generative AI models and how they’re trained. For now, though, Apple is trying to position itself as an ethical player while staving off unwanted legal scrutiny.