The Data Tokenization Trap

How Monetizing AI Data Stifles Innovation


When we think of the phrase “artificial intelligence,” it’s only natural for us to associate it with one thing: data.

At the very heart of any AI model is a tremendous amount of data — this goes for both a traditional centralized model and its decentralized variant. Data serves as the lifeblood of AI. It fuels its learning algorithms, refines its predictions, and ultimately shapes the decisions it makes. Without vast, diverse, and high-quality datasets, AI models would be nothing more than hollow frameworks, incapable of the sophisticated reasoning we have come to expect. And in today’s ecosystems, blockchain technology has made possible a new framework for data utilization: tokenization.

The notion of tokenizing a model’s training data has emerged as a compelling narrative. Proponents argue that turning data into digital assets that can be bought, sold, and traded on decentralized platforms would democratize access to training data and ensure fair compensation for data providers.

However, as I will caution, beneath this seemingly beneficial framework lie potential threats to the very innovation it seeks to foster. Specifically, tokenization could raise new barriers to democratized AI development by increasing costs, fragmenting data pools, and shifting the focus from collaboration to commercialization. We will explore how the tokenization of AI training data, while promising in theory, may inadvertently stifle the open and collaborative environment we strive for.

The Case For Tokenization

Before expanding on this “tokenization trap,” let’s consider the case for tokenization in the first place. On the surface, tokenizing model training data offers several attractions:

1). Incentivization of data sharing

By turning datasets into digital assets represented by tokens on a blockchain, data owners gain the ability to monetize their data. These owners, whether individuals, companies, or institutions, then have a financial incentive to share their data, since users compensate them in return. The result is a potentially larger and more diverse pool of training data for AI developers, drawn from many different sources.

2). Improved transparency

The use of blockchain technology ensures that the provenance and ownership of tokenized data are transparent and verifiable, reducing the overall risk of data misuse or theft.

3). Fair compensation

Exchanging data in tokenized form creates a fairer system, as data contributors are compensated directly for what they provide. This is a major step toward addressing concerns about the exploitation of data by large tech companies. Instead of opaque corporations owning the assets needed for training, individual providers retain genuine control over their data: they specify who sees it and how it is used, and they receive equitable compensation in the process.

Consider a data tokenization protocol like Ocean. Ocean Protocol is a decentralized data exchange that allows providers to tokenize their datasets. Those datasets can then be sold or licensed on the Ocean marketplace, democratizing access to training data while ensuring that providers are fairly compensated. A further benefit is that greater diversity of data from different sources means AI models can learn to solve a wider variety of problems and serve a broader range of users. Decentralized data sharing also helps reduce the bias inherent in models trained on data from only one group of people.
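
To make the mechanics concrete, here is a minimal sketch of the data-token pattern that protocols like Ocean popularize: publishing a dataset mints a token, and holding that token gates access. This is not Ocean’s actual API; the DataToken class, prices, and addresses below are hypothetical stand-ins for what the real protocol handles on-chain.

```python
from dataclasses import dataclass, field

@dataclass
class DataToken:
    """Hypothetical stand-in for an on-chain data token (not Ocean's real API)."""
    dataset_id: str
    provider: str
    price: float                                  # price per access grant
    holders: dict = field(default_factory=dict)   # address -> access-token balance

    def buy_access(self, buyer: str, payment: float) -> bool:
        """Buyer pays the provider's asking price and receives one access token."""
        if payment < self.price:
            return False
        self.holders[buyer] = self.holders.get(buyer, 0) + 1
        return True

    def can_download(self, address: str) -> bool:
        """Access control: only token holders may retrieve the dataset."""
        return self.holders.get(address, 0) > 0


# Usage: a provider tokenizes a dataset and a research lab buys access.
scans = DataToken(dataset_id="chest-xrays-v1", provider="0xClinic", price=50.0)
scans.buy_access(buyer="0xResearchLab", payment=50.0)
assert scans.can_download("0xResearchLab")
assert not scans.can_download("0xStranger")
```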

A Barrier to Innovation

Having acknowledged the benefits of data tokenization, we must consider the opposing argument. While not inherently flawed, the tokenization model could inadvertently create significant barriers to innovation in the field, particularly for small players, open-source communities, and researchers. Here’s how:

1). Increased cost of accessing data

With a tokenized data model, ordinary market competition for training sets could drive up the cost of accessing and using them. This would have two primary effects:

Exclusion of small players:

As data becomes monetized, the cost of accessing high-quality training data could rise significantly. Small start-ups, independent researchers, and open-source developers may struggle to afford the data they need, which could stifle innovation. The best datasets would likely become exclusive to well-funded entities, reinforcing the dominance of large corporations and raising the barrier for new entrants.

Commercialization over collaboration:

The tokenization model may emphasize commercial transactions over collaborative sharing of data. The best way to create innovative models designed for everyone is through collaborative, open-source efforts where data is shared freely. Tokenization could shift the focus instead towards profit, reducing the willingness of data holders to participate in non-commercial, community-driven AI projects.

Consider Filecoin, for instance. The commercialization of data storage and retrieval on platforms like Filecoin can lead to increased costs, particularly when demand for storage spikes. Smaller projects or developers who rely on affordable storage might find themselves priced out, especially when competing with larger entities that can outbid them for storage resources. This commercialization of data access hinders smaller players’ ability to participate fully and drives a wedge between large, resource-rich organizations and smaller developers, eroding the collaborative spirit of decentralized platforms.
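
As a rough illustration of that pricing-out dynamic, consider a deliberately simplified toy model in which the market price of storage scales with demand. The linear pricing rule and all numbers are invented for illustration; this is not how Filecoin actually prices storage deals.

```python
# Toy model of a storage market in which price scales with demand.
# The linear pricing rule and all numbers are invented for illustration;
# this is not how Filecoin actually prices storage deals.

BASE_PRICE_PER_TB = 2.0        # hypothetical baseline price per TB per month
SMALL_DEV_BUDGET = 50.0        # hypothetical monthly budget of a small developer
STORAGE_NEEDED_TB = 20

def market_price(base: float, demand_ratio: float) -> float:
    """Price stays at the baseline until demand exceeds supply, then rises with it."""
    return base * max(1.0, demand_ratio)

for demand_ratio in (0.8, 1.5, 3.0):          # demand relative to available supply
    cost = market_price(BASE_PRICE_PER_TB, demand_ratio) * STORAGE_NEEDED_TB
    affordable = cost <= SMALL_DEV_BUDGET
    print(f"demand x{demand_ratio}: monthly cost = {cost:.0f}, affordable = {affordable}")

# At 3x demand the same 20 TB costs 120 per month, well past the small developer's
# budget, while a well-funded competitor simply absorbs the increase.
```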

2). Fragmentation of data pools

Tokenizing data on the blockchain could also inadvertently fragment it into silos. Under a tokenized model, each data owner controls access through their own tokens. Such fragmentation could make it difficult for AI developers to aggregate the diverse datasets necessary for training robust, generalizable models. Much like the effects of data commercialization, the lack of access to comprehensive datasets could lead to biased or less effective AI systems.

What’s more, this framework could reduce data interoperability. Different data owners might tokenize their datasets using different standards or platforms, leading to compatibility issues. Developers might struggle to integrate datasets from multiple sources, and the quality of the resulting models may suffer. Consider Big Data Protocol, which enables the tokenization and trading of data assets in the DeFi space. Under its framework, data is split into numerous tokens, each representing a fragment of the overall dataset. These tokens can then be traded independently, leaving different pieces of a potentially useful dataset in the hands of different parties. This not only fragments the data but also makes it harder for researchers and developers to assemble a complete set for meaningful model training.
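
The practical consequence is easy to see in a small sketch: if a dataset’s shards are represented by separately tradeable tokens, a developer who holds most, but not all, of them still cannot assemble a usable training set. The fragment names and owners below are hypothetical and do not reflect Big Data Protocol’s actual token design.

```python
# Hypothetical sketch of a dataset whose shards are represented by separately
# tradeable fragment tokens, each held by a different owner. Names and structure
# are invented for illustration, not Big Data Protocol's actual token design.

dataset_fragments = {
    "frag-001": {"owner": "0xFundA",   "rows": range(0, 100_000)},
    "frag-002": {"owner": "0xTraderB", "rows": range(100_000, 200_000)},
    "frag-003": {"owner": "0xDaoC",    "rows": range(200_000, 300_000)},
}

def can_assemble(holder: str, holdings: set) -> bool:
    """A developer can only train on the full dataset if they hold every fragment."""
    missing = set(dataset_fragments) - holdings
    if missing:
        print(f"{holder} is missing {sorted(missing)}; the training set is incomplete.")
        return False
    return True

# A researcher who managed to buy two of the three fragments still cannot
# assemble the complete dataset; the third sits with a speculator.
can_assemble("0xResearcher", {"frag-001", "frag-003"})
```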

3). Innovation stifled by market volatility

Regardless of how hard we try, we cannot change human nature. When we introduce tokenized assets into a market of actors, we invite speculation and volatility. Under speculation, data tokens are traded for profit rather than for their intended use, such as AI model training. The value of data fluctuates with market demand rather than intrinsic worth, and the resulting volatility creates an unstable environment for model developers, who may find that the data they need is either unaffordable or unavailable due to speculative trading.

When market conditions hinder access to critical data, unpredictability follows. And unpredictability discourages entities from investing in long-term research and developing innovative solutions.

Consider Big Data Protocol from earlier. The protocol minted a capped supply of 18,000 bALPHA tokens, which are used to unlock access to valuable datasets. Naturally, this scarcity and the hype around the mint drove demand for the tokens as speculative assets, not just for the data they unlocked. Particularly in the protocol’s early days, speculation drove the price of the tokens up aggressively, making it increasingly difficult for genuine users, those actually interested in the data, to afford them.
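
To see how speculative demand translates into higher access costs, here is a toy constant-product market maker (the x * y = k model used by many decentralized exchanges). The pool sizes and the trade below are invented for illustration; they are not bALPHA’s actual market history or mechanism.

```python
# Toy constant-product market (x * y = k) showing how speculative buying raises
# the price that genuine data users must pay for access tokens. Pool sizes and
# the trade are invented for illustration, not bALPHA's actual market history.

token_reserve = 18_000.0    # access tokens in the pool (mirrors the capped supply)
cash_reserve = 180_000.0    # quote currency in the pool
k = token_reserve * cash_reserve

def spot_price() -> float:
    """Current price of one access token in the quote currency."""
    return cash_reserve / token_reserve

def buy_tokens(cash_in: float) -> float:
    """Swap cash into the pool for tokens; returns the tokens received."""
    global token_reserve, cash_reserve
    cash_reserve += cash_in
    new_token_reserve = k / cash_reserve
    bought = token_reserve - new_token_reserve
    token_reserve = new_token_reserve
    return bought

print(f"price before speculation: {spot_price():.2f}")   # 10.00
buy_tokens(120_000.0)   # speculators pile in, chasing the capped supply
print(f"price after speculation:  {spot_price():.2f}")   # ~27.78
# A researcher who only wants the data now pays nearly three times the original
# price for the same access token.
```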

As a result, we see the inherent issue with this tokenized framework: smaller developers, researchers, and organizations that needed the data faced growing barriers to entry. Again, this undermines the vision of a democratic and collaborative AI data marketplace.

What Really Matters

If we are to achieve a truly democratic and innovative AI landscape, we must limit the pitfalls of data tokenization. If data becomes a commodity accessible only to those with significant resources, we end up in virtually no better a position than the current system of control by centralized entities. Smaller start-ups, academic institutions, and independent researchers should not have to outbid well-funded incumbents for the training data they need to innovate.

AI development truly thrives on collaboration. The underlying danger of tokenization is that it can shift the focus toward profit-driven data exchanges to the detriment of democratic ideals. We must therefore devise ways to maintain a balance between the monetization of data and the fostering of open collaboration.

In the context of data fragmentation, there must be efforts to promote interoperability between data pools and ensure that tokenization does not leave data isolated or inaccessible. To avoid any form of monopoly, we need mechanisms that encourage the sharing of data across different stakeholders. And given the ever-present reality of speculative markets, there must be safeguards to mitigate the impact of speculation, perhaps price-stability mechanisms or non-speculative data-access models.
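
One possible shape for such a non-speculative access model is sketched below: access rights are sold at a fixed price in a stable unit of account and are deliberately non-transferable, so they cannot be flipped on a secondary market. This is a hypothetical design, not a description of any existing protocol.

```python
# Hypothetical sketch of a non-speculative access model: access is sold at a
# fixed price in a stable unit of account and the resulting grant is bound to
# the buyer, so it cannot be resold. Not a description of any existing protocol.

from dataclasses import dataclass, field

@dataclass
class StableAccessPool:
    dataset_id: str
    stable_price: float                        # fixed price in a stable unit
    grants: set = field(default_factory=set)   # addresses holding access grants

    def purchase_access(self, buyer: str, payment: float) -> bool:
        """Grant access at the fixed price; the grant is bound to the buyer."""
        if payment < self.stable_price:
            return False
        self.grants.add(buyer)
        return True

    def transfer(self, sender: str, receiver: str) -> bool:
        """Deliberately disabled: access grants cannot be flipped on a market."""
        return False

pool = StableAccessPool(dataset_id="open-weather-v2", stable_price=25.0)
pool.purchase_access("0xUniversityLab", 25.0)
assert not pool.transfer("0xUniversityLab", "0xSpeculator")
```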

After all, as with the underlying architecture of blockchain technology itself, the continued growth and success of decentralized AI systems depends on prioritizing long-term sustainability over short-term gains.
