Unlocking a Million Times More Data for AI (ifp.org)

williamtrask 1 day ago

hunterpayne 1 day ago

"Homomorphic encryption enables the aggregation of these distributed model pieces while they are encrypted, allowing for federated learning without centralizing data."

A bigger hand wave has never been done I think. Homomorphic encryption increases computational load several fold. And I'm not aware of anyone trying to use this (very interesting) technology for much of anything, let alone GPU ML algorithms.

Yoric 1 day ago

Doesn't Zama have an homomorphic machine learning product?

williamtrask 1 day ago

Yeah Zama's stuff is great.

williamtrask 1 day ago

(OP here) Homomorphic addition (e.g. aggregation) is very performant, including for the Federated Averaging algorithm used in Federated Learning. Not hand-waivey.

rafale 1 day ago

Didn't a hedge fund publish data encrypted with homomorphic encryption to run an open competition to see how can build the best trading AI. The encryption allow them to keep the sensitive data private.

destroycom 1 day ago

This doesn't seem like an article that was made with proper research or proper sincerity.

The claim is that there is one million times more data to feed to LLMs, citing a few articles. The articles estimate that there is 180-200 zettabytes (the number mentioned in TFA) of data total in the world, including all cloud services, including all personal computers, etc. The vast majority of that data are not useful to train LLMs at all, they will be movies, games, databases. There is a massive amount of duplication in that data. Only a tiny-tiny fraction will be something useful.

> Think of today’s AI like a giant blender where, once you put your data in, it gets mixed with everyone else’s, and you lose all control over it. This is why hospitals, banks, and research institutions often refuse to share their valuable data with AI companies, even when that data could advance critical AI capabilities.

This is not the reason, the reason is that this data is private. LLMs do not just learn from data, they can often reproduce it verbatim, you cannot give medical records or bank records of real people, that will put them at a very real risk.

Let alone that a lot of them will be, well-structured, yes, but completely useless information for LLM training. You will not get any improvement in the perceived "intellect" of a model by overfitting it with terabytes of tables with bank transaction records.

williamtrask 9 hours ago

"This is not the reason, the reason is that this data is private. LLMs do not just learn from data, they can often reproduce it verbatim, you cannot give medical records or bank records of real people, that will put them at a very real risk."

(OP) You make great points. I think we're actually more in agreement than might be obvious. Part of the reason you need to "give" data to an LLM is because of the way LLMs are constructed... which creates the privacy risk.

The principle of attribution-based control suggested in this article would break that principle, enabling each data owner to control which AI predictions they make more intelligent (as opposed to only controlling which IA models they help train).

So to your point... this is a very rigorous privacy protection. Another way to TLDR the article is "if we get really good at privacy... there's a LOT more data out there... so let's start really caring about privacy"

Anyway... I agree with everything in your comment. Just thought I'd drop by and try to lend clarity to how the article agrees with you (sounds like there's room for improvement on how to describe attribution-based control though).

k310 1 day ago

What that locked (private) data entails.

> What makes this vast private data uniquely valuable is its quality and real-world grounding. This data includes electronic health records, financial transactions, industrial sensor readings, proprietary research data, customer/population databases, supply chain information, and other structured, verified datasets that organizations use for operational decisions and to gain competitive advantages. Unlike web-scraped data, these datasets are continuously validated for accuracy because organizations depend on them, creating natural quality controls that make even a small fraction of this massive pool extraordinarily valuable for specialized AI applications.

Will there be a data exchange where one can buy and sell data, or even commododata markets, where one can hedge/speculate on futures?

Asking for a friend.

williamtrask 1 day ago

This is the magic :)

Normal_gaussian 1 day ago

Big data's no true scotsman problem:

> Despite what their name might suggest, so-called “large language models” (LLMs) are trained on relatively small datasets.1 2 3 For starters, all the aforementioned measurements are described in terms of terabytes (TBs), which is not typically a unit of measurement one uses when referring to “big data.” Big data is measured in petabytes (1,000 times larger than a terabyte), exabytes (1,000,000 times larger), and sometimes zettabytes (1,000,000,000 times larger).

williamtrask 1 day ago

(OP here) — with you on that analysis. This was in an effort to make the piece legible for a (primarily) non-technical, policy audience. Rigorous numbers are in other parts of the piece (and in the sources behind them).

eichin 1 day ago

The joke (10 years ago) was that "big data" means "doesn't fit on my Mac". Kind of still works...

pimlottc 1 day ago

Isn’t the basically true? The crux of big data is that requires different techniques since you can’t just processing it on one device

svieira 1 day ago

> What makes this vast private data uniquely valuable is its quality and real-world grounding.

This is a bold assumption. After Enron (financial transactions), Lehman Brothers (customer/population databases, financial transactions), Theranos (electronic health records), Nikola (proprietary research data), Juicero (I don't even know what this is), WeWork (umm ... everything), FTX (everything and we know they didn't mind lying to themselves) I'm pretty sure we can all say for certain that "real world grounding" isn't a guarantee with regards to anything where money or ego is involved.

Not to mention that at this point we're actively dealing with processes being run (improperly) by AI (see the lawsuits against Cigna and and United Health Care [1]), leading to self-training loops without revealing the "self" aspect of it.

[1]: https://www.afslaw.com/perspectives/health-care-counsel-blog...

williamtrask 1 day ago

(OP Here) This is a fair point. Internal datasets can be deceitful just as public ones can. That said, most propaganda lives in the public domain. :)

collingreen 1 day ago

I'll be surprised if public data is less accurate than private data on average. I've watched many people lie to themselves or others with data within an organization because they are often incentivized to do so.

Animats 1 day ago

Does vast amounts of lower and lower quality data help much? If you can train on the entire feeds of social media, you keep up on recent pop culture trends, but does it really make LLMs much smarter?

Recent progress on useful LLMs seems to involve slimming them down.[1] Does your customer-support LLM really need a petabyte of training data? Yes, now it can discuss everything from Kant to the latest Taylor Swift concert lineup. It probably just needs enough of that to make small talk, plus comprehensive data on your own products.

The future of business LLMs probably fits in a 1U server.

[1] https://mljourney.com/top-10-smallest-llm-to-run-locally/

williamtrask 1 day ago

I think this is the right question to ask. I think it depends on the task. For example, if you want to predict whether someone has cancer, then access to avast amounts of medical information would be important.

themafia 1 day ago

It's simple.

Pay them.

Otherwise why on Earth should I care about "contributing to AI?" It's just another commercial venture which is trying to get something of high value for no money. A protocol that doesn't involve royalty payments is a non starter.

williamtrask 1 day ago

(OP) 100% and this piece advocates for an enforcement mechanism for that kind of payment (attribution-based control)

runako 1 day ago

One would have to be a special kind of fool to expect honest payments from the very same organizations that are currently doing everything possible to avoid paying for the original training data they stole.

williamtrask 1 day ago

Fwiw - this post doesn't advocate for trust. It advocates for an enforcement mechanism (attribution-based control).

runako 1 day ago

Yes, but then you still have to trust the counterparty enough to bother investing in anything like this. And so far, the main counterparties are demonstrating that they are not very trustworthy when it comes to paying for training data.

Maybe that will change over time. But to hear OpenAI and Anthropic tell it, paying for training data will be the death knell of the industry[1].

1 - there are a number of statements to this effect on the record, for example: https://www.theguardian.com/technology/2024/jan/08/ai-tools-...

williamtrask 1 day ago

With you on this one. I do think ABC is a step in the right direction to improve things. <3

jerf 1 day ago

This document seems to treat "data" as a fungible commodity. Perhaps our use of the word encourages that. But it's not.

How valuable is 70 petabytes of temperature sensor readings to a commercial LLM? It is in fact negative. You don't want to be training the LLM on that data. You've only got so much room in those neurons and we don't need it consumed with trying to predict temperature data series.

We don't need "more data", we need "more data of the specific types we're training on". That is not so readily available.

Although it doesn't really matter anyhow. The ideas in the document are utterly impractical. Nobody is going to label the world's data with a super-complex permission scheme any more than the world created the Semantic Web by labeling the world's data with rich metadata and cross-linking. But especially since it would be of negative value to AI training anyhow.

williamtrask 1 day ago

(OP here) I agree with this in spirit, but also it's hard to imagine the world can be fully described with 200 terabytes of data. There's a lot more good stuff out there.

But to your point, a crucial question in AI right now is: how much quality data is still out there?

As far as the impracticality, it's a great point. I disagree and have spent about 10 years working in the area. But that can be a post for another day. I understand and appreciate the skepticism.

lxgr 1 day ago

> it's hard to imagine the world can be fully described with 200 terabytes of data

Why? Intelligence and compression might just be two sides of the same coin, and given that, I'd actually be very surprised if a future ASI couldn't make due with a fraction of that.

Just because current LLMs need tons of data doesn't mean that that's somehow an inherent requirement. Biological lifeforms seem to be able to train/develop general intelligence from much, much less.

williamtrask 1 day ago

Well, we're opining about a statement about the world. Is the universe only 200 terabytes of information?

"Biological lifeforms seem to be able to train/develop general intelligence from much, much less."

This statement is hard to defend. The brain takes in 125 MB / second, and lives for 80 years, taking in about 300+ petabytes over our lifetime.

But that's not the real kicker. It's pretty unfair to say that humans learn everything they know from birth -> death. A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.

lxgr 1 day ago

> A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.

That also seems several orders of magnitude off. Would you suspect that a human that only experiences life through H.264-compressing glasses, MP3-recompressing headphones etc. does not develop a coherent world model?

What about a human only experiencing a high fidelity 3D rendering of the world based on an accurate physics simulation?

The claim that humans need petabytes of data to develop their mind seems completely indefensible to me.

> A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.

Isn't that like saying that you only need the right data? In which case I'd completely agree :)

williamtrask 1 day ago

"The claim that humans need petabytes of data to develop their mind seems completely indefensible to me."

And yet every human you know is using petabytes of data to develop their mind. :)

ttfvjktesd 1 day ago

I think one important point is missing here: more data does not automatically lead to better LLMs. If you increase the amount of data tenfold, you might only achieve a slight improvement. We already see that simply adding more and more parameters for instance does not currently make models better. Instead, progress is coming from techniques like reasoning, grounding, post-training, and reinforcement learning, which are the main focus of improvement for state-of-the-art models in 2025.

williamtrask 1 day ago

(OP) the scaling laws / bitter lesson would disagree, but I tend to agree with you with some hedging.

If you get copies of the same data, it doesn't help. In a similar fashion, going from 100 TBs of data scraped from the internet to 200TBs of data scraped from the internet... does it tell you much more? Unclear.

But there are large categories of data which aren't represented at all in LLMs. Most of the world's data just isn't on the internet. AI for Health is perhaps the most obvious example.

CuriouslyC 1 day ago

More data isn't automatically better. You're trying to build the most accurate model of the "true" latent space (estimated from user preference/computational oracles) possible. More data can give you more coverage of the latent space, it can smooth out your estimate of it, and it can let you bake more knowledge in (TBH this is low value though, freshness is a problem). If you add more data that isn't covering a new part of the latent space the value quickly goes to zero as your redundancy increases. Also, you have to be careful when you add data that you aren't giving the model ineffective biases.

joe_the_user 1 day ago

the scaling laws / bitter lesson would disagree

I have to note that taking the "bitter lesson" position as a claim that more data will result in better LLMs is a wild misinterpretation (or perhaps a "telephone version) of the original bitter lesson article, which say only that general, scalable algorithms do better than knowledge-carrying, problem-specific algorithms. And the last I heard it was the "scaling hypothesis" that hardly had consensus among those in the field.

williamtrask 1 day ago

Agree with you on the nuance.

lordofgibbons 1 day ago

We don't have a data scarcity problem. Further refinement to the pretraining stage will continue to happen, but I don't expect the orders of magnitude of additional scaling to be required any longer. What's lacking is RL datasets and environments.

If any more scaling scaling does happen, it will happen in the mid-training (using agentic/reasoning outputs from previous model versions) and RL training stages.

williamtrask 1 day ago

I agree with you in a way - that it seems likely that new data will be incorproated in more inference-like ways. RAG is a little extreme... but i think there's going to be middle grounds betweeen full pre-training and RAG. Git-rebasin, MoE, etc.

horhay 1 day ago

Man, it's not like the wave of generative AI has showed us that these companies don't work with altruistic intentions and means.

joegibbs 1 day ago

I remember that a couple of years ago people were talking about how multimodal models would have skills bleed-over, so one that's trained on the same amount of text + a ton of video/image data would perform better on text responses. Did this end up holding up? Intuitively I would think that text packs much more meaning into the same amount of data than visuals do (a single 1000x1000px image would be about the same amount of data as a million characters), which would hamstring it.

supermatt 1 day ago

This entire article reads like some hand wavey nonsense, throwing pretty much every cutting edge AI buzzword around to solve a problem that doesnt exist.

All the top models are moving towards synthetic data - not because they want more data but because they want quality data that is structured to train utility.

Having zettabytes of “invisible” data is effectively pointless. You can’t train on it because there is so much of it, it’s way more expensive to train per byte because of homomorphic magic (if it’s even possible), and most importantly - it’s not quality training data!

williamtrask 1 day ago

This article is meant for a policy audience, so that does keep the technical depth pretty thin. It's rooted in more rigorous deep learning work. Happy to send your way if interested.

supermatt 1 day ago

Posting info on that “rigorous deep learning work” here would be more beneficial to all than just sending to me.

williamtrask 1 day ago

I'm relatively close to publishing my PhD thesis which is broadly a survey paper of what you're describing. Will share (almost done with revisions).

JackYoustra 1 day ago

I'm a bit worried - there could be idiosyncratic links that these models learn that causes deanonymization. Ideally you could just add a forget loss to prevent this... but how do you add a forget loss if you don't have all of the precise data necessary for such a term?

williamtrask 1 day ago

This is the right question. If full attribution-based control is achieved, then this would be impossible. And the ingredient you've suggested could be a useful way to help achieve it.

squigz 1 day ago

I hope the author considers the morality of advocating for health records and financial transactions - and probably every other bit of private data we might have - to be openly available to companies.

I have a better idea: let's just cut the middlemen out and send every bit of data every computer generates to OpenAI. Sorry, to be fair, they want this to be a government-led operation... I'm sure that'll be fine too.

williamtrask 1 day ago

The piece advocates for the opposite of this. Attrbution-based control keeps data holders in control.

catigula 1 day ago

If only laws and having to respect people and privacy didn't exist, then we could build our machine God and I could maybe (but probably not) live forever!

palmotea 1 day ago

> If only laws and having to respect people and privacy didn't exist, then we could build our machine God and [a small handful of billionaires could] live forever [while our useless bodies can be disposed of to make room for their vile creations]!

FTFY

aaroninsf 1 day ago

I literally laughed out loud when I got to the modest proposal.

williamtrask 1 day ago

(OP) YOLO

janice1999 1 day ago

Think-tank wants to enable companies get access to private medical and other personal data. "Solution" to privacy "problem" sounds like a blockchain pitch circa-2019. Wonderful.

01HNNWZ0MV43FF 1 day ago

Sounds good.

I am going to make a blank model, train it homomorphically to predict someone's name based on their butt cancer status, then prompt it to generate a list of people's names who have butt cancer, and blackmail them not to send it to their employers.

CrazyStat 1 day ago

I’m glad you specified you were going to train it homomorphically, for a minute there I was worried about the privacy implications.

williamtrask 1 day ago

(OP) fwiw I fully agree with the privacywashing you're describing here, and this piece is advocating for a more rigorous standard than input privacy (homomorphic encryption), which is insufficient to enable data owners to actually retain control over their data (but is a useful ingredient).

ajjahs 1 day ago

[dead]

techlatest_net 1 day ago

[dead]