Data Wars: Control The Data | Control The World
Exploring the growing fight between tech companies over who gets to use their data, and how this might change the way we teach and train artificial intelligence models.
Welcome to the latest edition of ‘The API Economy’ — thanks for being here.
To support The API Economy, join the 3,439 early, forward-thinking people who subscribe for monthly-ish insights on the API economy, Web3, AI, and more emerging trends.
The fight for data rights
In the cult classic ‘Dune’, Frank Herbert claims that “he who controls the spice controls the world.” But as it pertains to artificial intelligence, those who control the data control the AI.
In the world of 'Dune,' spice was a rare and precious resource that possessed extraordinary properties. Its scarcity and unique qualities made it the center of power and control, shaping the political and economic landscape of Dune’s universe.
Similarly, data has become the lifeblood of today’s AI frenzy.
Data permeates every aspect of our interconnected world, enabling individuals and businesses to make more informed decisions, predict trends, and identify opportunities for growth. From ChatGPT to your Netflix recommendations to the ads you see on social media, data powers the AI systems we interact with (and rely on) daily.
However, data isn’t easily obtained or replicated, especially when it comes to high-quality and diverse datasets. Intellectual property considerations, privacy concerns, and the cost of collection are just a few of the challenges that make data a scarce resource, and a competitive advantage for those who have access to it.
In fact, much like the control of spice in 'Dune,' the control of data has far-reaching consequences. Those who have access to vast amounts of data can shape markets, influence consumer behavior, and outpace the competition. They can train AI models more effectively, leading to better predictions, personalized experiences, and improved decision-making.
As companies work at breakneck speed to develop AI and gain a competitive advantage, the demand for data is skyrocketing. Not to mention the price.
Just recently, Elon Musk’s Twitter started charging for access to its previously free API. Instead of training their AI models for free on Twitter’s massive stores of user data (and ultimately profiting off those models), companies will have to cough up a pretty big chunk of change for access.
Reddit followed suit shortly after, announcing that it would charge companies for access to its API:
“The Reddit corpus of data is really valuable,” [Reddit co-founder and CEO Steve Huffman told The Times]. “More than any other place on the internet, Reddit is a home for authentic conversation. There’s a lot of stuff on the site that you’d only ever say in therapy, or AA, or never at all … But we don’t need to give all of that value to some of the largest companies in the world for free.”
As tech companies begin restricting access to their data, not everyone is happy about it. In fact, some say that the prices are “exorbitant.” One Reddit user put it this way:
“[Charging for access to Reddit’s API] strays into the territory of antitrust. … Reddit has a platform that only Reddit can provide access to. They allowed others to access it, build things using their protocols, and then when those others had built something of value, Reddit comes along and uses its market power (by virtue of its platform control) to hold those third parties to ransom.”
But users felt the effects of Reddit’s API changes, which were meant to counter AI ‘over-tapping’ of its data, almost immediately. In what’s been dubbed the ‘Reddit Blackout’, thousands of subreddits (Reddit discussion forums) have gone dark this week to protest the new API policy, which has raised costs for some third-party apps that access data on the site and heightened worries about content moderation & accessibility.
As we’re already starting to see, users, companies, and even countries stand to gain a lot, or lose a lot, from today’s data wars.
In this essay, we’ll discuss those data wars in depth, including the key players and the growing controversy around proper compensation, privacy, and security. Let’s dive in.
Data is the new oil
In many ways, it feels like the world has been turned upside down since ChatGPT made artificial intelligence mainstream last year. Now, as AI chatbots and the large language models (LLMs) that power them become ingrained in everyday life, the globe’s biggest tech companies stand to profit.
In case you need a refresher, AI can be thought of in essentially three layers: the user-friendly interface (for example, ChatGPT), the LLM that powers it (think GPT-4), and the data that LLM is trained on (books, movies, and pretty much everything on the internet).
To put it simply, AI wouldn’t be possible without data — and that makes data the most valuable resource in the world today.
So the question becomes, who already possesses ridiculous, unthinkable amounts of data? Who controls the new oil of the digital age?
The answer: It’s the companies that have been thinking about data since the dawn of the internet age, better known as FAANG (Facebook, Apple, Amazon, Netflix, and Google). These tech companies are sitting on treasure troves of data structured to build wildly profitable ad recommendation engines.
With AI taking off in the last few years, tech companies have their eyes on a better business model that includes using large sets of structured data to train AI models. But who owns the social media data that sits on the public internet?
Just weeks ago, Elon Musk claimed that Microsoft had trained its AI models on Twitter’s data without permission. When Musk argued that Microsoft needed to pay to access Twitter’s data, Microsoft refused and kicked Twitter off its advertising platform.
But Microsoft isn’t the only company losing free access to Twitter’s API:
“[In March 2023], Musk set Twitter’s API monthly access price at $42,000 for access to 50 million tweets, $125,000 for 100 million tweets, and $210,000 for the highest plan with 200 million tweets. Twitter discontinued free access to APIs by third parties and developer plugins in February, forcing some developers to suspend their Twitter integration projects.”
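To put those tiers in perspective, here is a rough back-of-the-envelope sketch in Python that converts each quoted plan into a price per 1,000 tweets. The prices and volumes are the ones quoted above; the tier labels and the function name are illustrative choices of mine, not anything Twitter publishes.

```python
# Monthly Twitter API tiers as quoted above: (price in USD, tweets included).
# The labels are illustrative, not official plan names.
TIERS = {
    "50M tweets": (42_000, 50_000_000),
    "100M tweets": (125_000, 100_000_000),
    "200M tweets": (210_000, 200_000_000),
}


def price_per_thousand(price_usd, tweets):
    """Return the cost in USD of 1,000 tweets at a given tier."""
    return price_usd / tweets * 1_000


for name, (price, tweets) in TIERS.items():
    print(f"{name}: ${price_per_thousand(price, tweets):.2f} per 1,000 tweets")
```

On these figures, the smallest plan works out to roughly $0.84 per 1,000 tweets, while the two larger plans come to about $1.25 and $1.05, so buying more volume does not obviously mean a lower unit price.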
These APIs represent the bridge for companies like Microsoft to interact with and access the large data sets companies like Twitter provide.
Twitter & Reddit have strategically opted to close the API bridge to many AI companies while best practices are put into place to determine how data providers should be compensated for LLM & AI training.
They see a future where API usage and data licensing agreements are the core of their business, as Reddit CEO Steve Huffman hinted in a recent interview with Jay Peters.
To understand the true value of data, let’s briefly go back to the early days of tech companies, their persistent path to user monetization, and how it led to the mountains of data these companies hold today.
The data gold rush
The world's biggest tech companies have been collecting data for over 20 years, ever since the early days of the commercial internet. In fact, the storage of data has become an integral part of their business models.
The use of data in advertising has been monumental in helping tech companies find a path to user monetization over the years. However, today the rapid growth of the AI market is presenting an entirely new (and profitable) opportunity for big tech — much like the dot-com boom of the 1990s.
As the internet became more accessible, the 90s welcomed a wave of new, internet-focused companies. Silicon Valley investors saw the internet as a transformative force that would revolutionize industries and generate significant profits, allowing start-ups to raise massive amounts of investment capital.
In the early days, companies like Facebook and Twitter recognized that providing free services to people around the world required alternative revenue streams, and advertising emerged as the solution. With massive user bases, extensive data, and targeted advertising capabilities, social media companies are able to create valuable advertising opportunities for businesses — and generate substantial revenue in the process.
For example, Google openly claims:
“Ads help fund our products. Because of advertising, we’re able to offer our products to users around the world free of charge, helping people find answers and get things done.”
Some of those companies became household names, achieving amazing growth and generating hundreds of billions in revenue (Amazon, Google, Yahoo, and eBay, to name a few). These pioneering companies seized the opportunities presented by the internet, and as a result, were able to transform and launch entire industries, from e-commerce to digital advertising.
The idea behind many of the largest tech companies has always been to offer free products to people around the world in exchange for data & attention.
To make advertising more profitable, capturing and analyzing user data is key. The more insights tech companies have into users' preferences, behaviors, and demographics, the better they can tailor ad experiences, leading to increased engagement — and of course, more money.
As a result, the demand for data has also increased, driving companies to invest heavily in data collection and analysis capabilities.
In ‘The Hidden Costs of Free Social Media’, Thomas Sowell, an American economist and senior fellow at Stanford University’s Hoover Institution, explains:
“While these platforms have provided us with unprecedented information and content, the nature of their business model has created consequences for society at large. Another such consequence is the large-scale elimination of our consumer privacy through data mining. Big Data is the fuel that artificial intelligence uses to run these algorithms.
This is how Facebook and YouTube know which content the user may be interested in, and it is how advertisements are “micro-targeted.”
Big Data has led to these companies’ possession of mass quantities of data, which they must store properly under relevant federal and state law. These processes can be infiltrated, as was the case in September 2018. Facebook announced that 50 million accounts were known to be affected while an additional 40 million accounts may have been affected. The problem of cybersecurity incidents, also known as breaches, has grown exponentially with the large-scale use of the internet and ‘cloud technologies.’”
This business model works because “the online economy is built in large part on the fact that consumers are willing to give away their data in exchange for products that are free and easy to use,” according to Walter Frick of HBR.
Simply put, consumers are willing to exchange their data because they believe the benefits of using these platforms outweigh potential privacy concerns.
So who really stands to profit? Well, to develop artificial intelligence, you need data — and that puts the world’s biggest tech companies (and their massive stores of user data) in the perfect position to benefit from AI’s popularity.
AI algorithms require vast amounts of high-quality data to learn and improve over time. The more data an AI company has, the more accurate and reliable its AI models are likely to be, which can give it an advantage in an increasingly competitive marketplace.
Access to large volumes of data can allow AI companies to create unique data sets for niche applications. These applications can then be sold to new, niche groups. And if an AI company can meet the needs of an underserved niche faster than its competitors, it will have the opportunity to win the lion’s share of revenue available from that segment.
The more data an AI company has, the more competitive it will be. As a result, it’s not just artificial intelligence that’s creating a new gold rush — it’s the data that fuels AI development. Whoever can access the most data will be able to create the best AI models, and as a result, make the most money from the artificial intelligence boom.
Power to the people
But does the platform even have the right to own the data? What about the people?
Today, data ownership and control predominantly reside with platforms and companies, like Meta. However, there’s another approach that we haven’t discussed yet: the idea that data rights should belong to the users themselves.
In 2019, will.i.am wrote in The Economist:
“Personal data needs to be regarded as a human right, just as access to water is a human right. The ability for people to own and control their data should be considered a central human value. The data itself should be treated like property and people should be fairly compensated for it.
There is no data freedom when the options of who to sign up with are limited, the data monarchs rake in billions and all one gets is a “free” account bursting with advertising, faux news, and lame “sponsored content”. The current arrangement feels lopsided, benefiting the data monarchs more than it benefits individuals and communities.”
— will.i.am
Four years later, it seems there’s a possible solution to this problem. Through emerging technologies like Web3 and decentralized social media, there’s a real chance for a future where individuals have greater control over and ownership of their data.
Decentralized platforms can give individuals the agency to decide how their data is collected, stored, and used, giving them more privacy and security. Features like powerful data encryption, transparent data transactions, peer-to-peer architecture, and smart contracts would give users the tools they need to determine the fate of their data and make truly informed decisions about data sharing.
Users could be properly compensated for their data and have a real option to sell it or keep it private. This would create a more personalized and consent-driven data ecosystem, one that recognizes the significant role individuals play in creating the vast datasets that fuel AI and digital platforms, and that properly compensates creators.
Elon Musk has already taken steps to move Twitter in this direction by sharing a portion of revenue with users, who can already earn USDC on the platform.
By shifting data ownership from a platform-centric model to a user-centric paradigm, we could move towards a future where data is seen as a valuable asset under the control of the rightful owners – the users themselves.
Data is the bedrock of AI
Imagine you're building a sculpture. You start with a giant block of marble and work to chisel it down. In doing so, you hope to create something beautiful. AI works in a similar way. It takes countless pieces of existing data and rearranges them into meaningful and seemingly new & useful outputs.
Because of this, the original data we have today is extraordinarily valuable. It's pure, untapped, and forms the bedrock of all AI could ever learn and produce. As we move into the future, the majority of data will either be AI-reformatted versions of today's data or new data that is increasingly difficult to identify as unique or authentic.
That means today's data sets are, arguably, the most precious resource for the future of AI. Those who control this data have a significant role in shaping AI's trajectory. They will determine how AI learns, evolves, and impacts our world.
It’s undeniable that we’re facing a pivotal moment in history. We need to decide how AI should access and use public data sets. We also need to define the legal precedents for how these data sets are credited, compensated, and accessed. It will be one of the most critical decisions we (or our lawmakers) will make in our lifetime.
In my opinion, people like Elon Musk & Steve Huffman are right that they own the data created on their platforms, and if AI companies want to train LLMs with Twitter or Reddit data, they should have to pay for that right.
However this war concludes, it will form the economic bedrock that shapes AI data licensing, a field with the potential to be the most valuable in all of tech.
Thanks for reading — until our next adventure.
—
Special thanks to my friend Haley Davidson for copy help & edits and Mama Schroeder for additional edits (any typos are on them 😊).
Disclaimer: The views in this essay are my own personal opinions and don’t necessarily represent the views of my employer, those mentioned in this article, or anyone other than myself.