While AI has stolen the show, it’s always about the data
The Foundational Role of Data in AI's Evolution
Artificial Intelligence (AI) has dominated the conversation since ChatGPT's launch in late 2022, captivating imaginations with its boundless potential. Yet, as we advance in the AI revolution, it’s crucial to shift our focus back to the real star of the show: data. The role of data is indispensable, serving as the foundation upon which all AI capabilities are built and expanded.
As the largest data search engine, Nomad Data sees requests for data across continents, industries and use cases. The first wave of training data requests post-ChatGPT has been all about large corpora of text to train large language models (LLMs). This includes libraries of book text, news articles and broad web scrapes. We are increasingly seeing book publishers, media companies and website operators coming to market with data that wasn’t widely in demand until recently. The lucrative license fees generated have only encouraged more of these broad content libraries to emerge.
The Shift Towards Specialized Data in AI Development
However, the demand for generalized textual content is likely to be just the first act in the AI data surge. As early leaders in this space gain scale, it will become increasingly challenging for followers to afford the rapidly escalating training costs of these generalized models, which comprise both data licensing and compute costs. The world doesn’t need a hundred different broad language models. As long as there are a few models to keep pricing competitive, most AI consumers will be content building on top of existing models.
With the landscape for generalized models already narrowing, the demand we see on Nomad’s search platform is increasingly for highly specialized data. Building AI models to solve narrow problems requires very specific types of data.
Imagine trying to build a model that shows a person what they might look like after cosmetic surgery. To train this type of model you’d need a dataset of before-and-after photos across many different types of surgeries and patients. The only people likely to have these images at scale are surgeons.
If you want to build a model that proofreads tax returns, you’re going to need a large library of sensitive tax documents that are likely to live only in the systems of a government or a large accounting firm.
If you want to build software to detect issues in insurance claims, you’re going to need a large dataset of past claims.
Overcoming Challenges in Accessing and Utilizing Proprietary Data
The challenge in the next phase of AI development is that a lot of the data being asked for isn’t publicly available. This data tends to be buried in a company that has never even contemplated selling it, is unsure of the legal ramifications of doing so, and doesn’t have a sales force with the slightest clue of how to find this new type of customer. In short, it’s a market where buyers and sellers don’t know that each other exist and have no simple way to find one another, let alone consummate a transaction.
As the battle to build the large, generalized models settles, the next round of competition will be to build these far more specific models. The ability to source and lock up access to critical data will be the key determinant of what applications will ultimately be built and how effective they prove to be.
At Nomad Data, we are intensifying our vendor recruiting efforts to stay ahead of this problem. If we can’t meet a data search need within our vendor base of over 3,000 companies, we reach out directly to businesses likely to have the data in question. We've built software that uses unique data to scan tens of millions of companies, identifying potential targets for specific types of data. We then conduct broad outreach to these companies, guiding them through the process of monetizing their unique data assets.
Nomad Data has also started partnering with companies possessing unique data to help them build differentiated AI products that they can monetize. Using our Document Chat engine, we fine-tune models with highly proprietary data from partners to launch domain-specific chatbots that can address narrowly defined problems. While the initial drive toward custom chatbots was all about prompting, the bigger opportunity lies in using specialized data to fine-tune models.
As AI continues its march to pervasiveness, we are likely to see a majority of businesses globally become data vendors in some respect. Most data purchases will be between two companies that have never been involved in data transactions.
The evolving world of AI is fundamentally a story about data. While AI models and algorithms capture headlines, the data that powers them remains the critical factor for success. As the technology advances, the ability to source, test, manage, and utilize highly specialized data will define the future landscape of AI and its applications across industries. Nomad Data is at the forefront of this transformation, enabling companies to harness the power of their unique data assets and drive the next wave of AI innovation.