Training Data: 8 Best Ways to Locate Training Data for Your Next AI Project

Nomad Data
July 23, 2024
At Nomad Data we help you find the right dataset to address any business problem. Submit your free data request describing your use case and you'll be connected with data providers from our network of partners who can address your exact need.

Training data is rapidly becoming the most prized resource in Artificial Intelligence. While there have been many resource crunches on the way to developing modern AI systems, the need for high-quality, diverse training data is poised to be the biggest determining factor in what can and cannot be achieved with this powerful new technology. In this article, we will walk you through the best ways to locate the training data you’re looking for.

Why do we need training data?

Training data is the key ingredient in teaching a computer model how to complete a task. By giving a model large numbers of labeled examples, it can learn how to perform the task. For example, if you want to create a computer vision model that looks at a photo and tells you whether it contains a cat or a dog, you start by feeding it pictures of both, each paired with the desired output: cat or dog. This type of data is made up of two parts: the raw data (the image) and the label (cat or dog). Without the labels it's not possible to train the model, since the model cannot infer the correct answer from the image alone.
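
To make that structure concrete, here is a minimal sketch of what a labeled image dataset looks like in code. The file paths are hypothetical placeholders, and any image library (Pillow, for example) could handle the actual loading.

```python
# A labeled training example is simply a (raw data, label) pair.
# The file paths below are hypothetical placeholders.
labeled_examples = [
    ("images/001.jpg", "cat"),
    ("images/002.jpg", "dog"),
    ("images/003.jpg", "cat"),
]

for path, label in labeled_examples:
    # A real pipeline would load and preprocess the image here,
    # e.g. with Pillow: Image.open(path).resize((224, 224))
    print(f"train on {path} -> expected output: {label}")
```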

Labeling is a key challenge surrounding training data. In many cases it's easy to acquire the raw data but not the labels. For example, I can easily scrape pictures of people from the web, but if I then wanted to use that data to train a model to predict a person's age from their photo, I would be stuck. Finding labeled data at scale is a major barrier to generating training data.

What is the difference between training and testing data?

Training data is the data that is initially used to teach a model to perform a task whereas testing data is data used to verify whether the model has been trained correctly. The underlying data is ultimately the same. It will contain both the raw input along with the label or desired output. For small volumes of testing data, it may be simple enough for a human to verify the output, which may relax the need for labeled data.
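
In practice, a single pool of labeled data is usually split into the two sets. A minimal sketch, assuming scikit-learn is installed (the file paths and labels are placeholder values):

```python
from sklearn.model_selection import train_test_split

# Raw inputs and their labels; placeholder values for illustration.
images = ["images/001.jpg", "images/002.jpg", "images/003.jpg", "images/004.jpg"]
labels = ["cat", "dog", "cat", "dog"]

# Hold out 25% of the labeled examples to verify the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.25, random_state=42
)
```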

So, once you know what training data you’re looking for, how can you find it? Below we dive into the most effective approaches.

1. Use Case Search Platforms Match You with High-Quality Training Data

The easiest way to begin your search for training data is to use a “use case” search platform. These types of systems allow you to search across thousands or tens of thousands of curated datasets to find exactly what you want. These sites work to continually track which companies are selling data and the associated uses.

The advantage of a use case-based platform is that you don't need to know what the data you need is called or who is selling it. You come to the platform and simply describe either what you need the data to look like or what insights it needs to provide. The platform then sends your request to many companies that focus on data solving that type of problem. Each vendor first confirms that its data can do what's needed, and only then are you introduced. Because these systems don't publish any kind of list, companies sensitive about broadly advertising that they sell data are more comfortable participating, meaning the volume and uniqueness of data vendors on these platforms are significantly higher than on other systems. This is the best way to get high-quality training data for your AI and machine learning needs.

What is the best use case search platform for training data?

Nomad Data is currently the only use case-based platform. The platform has a variety of offerings, including a free tier that lets you confirm whether a certain type of data is being sold by anyone before you sign up for a paid account. As of publishing, Nomad Data provides access to nearly 3,500 companies globally selling data across 140 different verticals, including a significant amount of training data.

2. Traditional Data Marketplaces Have Limited Training Data

A traditional data marketplace is essentially a list of datasets or data vendors. These lists are often categorized by type of data and searchable by keyword.

Traditional marketplaces can work if you’re looking for standard data and know what that data is called or its category. You can navigate to a category and see all the data options that are available. Most are typically free to peruse as well.

However, list-based marketplaces can be very challenging to navigate if you don't know exactly what you're looking for. It's also very hard for a data provider to condense what may be a multi-terabyte dataset into a single-paragraph description without losing important details.

Another challenge of list-based marketplaces is that many companies selling data don't want to be listed, so these sites capture only a fraction of what's truly available. The final challenge is that most list-based marketplaces are run by companies promoting other goods and services. This bias tends to skew the listed datasets toward a particular industry (medical rather than financial data, say) or a specific delivery technology (a SQL table rather than a flat file).

Examples of Traditional Data Marketplaces:

AWS Data Exchange – A list of several hundred data vendors and their associated datasets. It primarily features datasets that can be delivered to your Amazon S3 storage bucket, with an emphasis on financial and marketing data.

Snowflake Data Marketplace - A comprehensive list of data providers offering live, ready-to-query data directly within the Snowflake platform. The data spans various industries, including healthcare, retail, and financial services.

Databricks Marketplace - A marketplace offering a variety of datasets, Python notebooks, and source code for data manipulation that can be seamlessly integrated into the Databricks platform. The focus is on providing data for analytics, machine learning, and data engineering tasks.

3. Reaching Out to Corporations Directly for Training Data

Much of the data that people need for training domain-specific models is currently being created inside companies that aren't data companies at all, but ones selling other goods or services that generate the data you might need as a byproduct of their core business. For example, if you need a dataset of employment agreements to train a model to analyze those agreements, you may want to reach out to a company that signs many of them and ask to license a version of the data with sensitive information removed.

This method, while it may seem intimidating, can be quite fruitful. Companies are increasingly interested in new sources of revenue for their less sensitive data. The downside is that this approach can be slow. The data owner may not yet have decided whether it is comfortable selling data and may not have the legal licensing agreements in place to do so. Over time this friction is likely to subside, as there is significant momentum globally around corporate data monetization initiatives.

In this area, synthetic data is gaining the spotlight as a way to license data while removing many of the privacy and compliance concerns around the data. For company data that is too sensitive to sell, the company can run an algorithm on the source data to create a version that has the same characteristics as the underlying data but no longer contains anything sensitive about customers, users, or other parties.
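
As a simple illustration, here is a naive sketch of the idea: fit per-column statistics on the real data and sample new rows from them. Production synthetic-data tools are far more sophisticated (among other things, they preserve cross-column correlations), and the column values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive source data: customer ages and account balances.
real_ages = np.array([23, 35, 41, 52, 29, 61])
real_balances = np.array([1200.0, 540.0, 8900.0, 300.0, 4700.0, 150.0])

# Sample synthetic rows that mimic each column's statistics without
# copying any real customer record.
synthetic_ages = rng.normal(real_ages.mean(), real_ages.std(), size=1000)
synthetic_balances = rng.lognormal(
    np.log(real_balances).mean(), np.log(real_balances).std(), size=1000
)
```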

Another popular approach to help in this area is tokenization, where a third party removes sensitive information from one or more companies’ datasets and replaces it with a random number or series of characters. While these tokens don’t mean anything on their own, they allow multiple datasets to still be linked to do certain types of analysis.
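
A minimal sketch of the idea, using a keyed hash so the same identifier always produces the same token and datasets from different companies can still be joined on it (the field names and key are hypothetical):

```python
import hashlib
import hmac

# Held only by the trusted third party, never shared with data buyers.
SECRET_KEY = b"hypothetical-secret-key"

def tokenize(value: str) -> str:
    # Deterministic keyed hash: the same input always yields the same
    # token, so tokenized datasets remain linkable without exposing
    # the underlying sensitive value.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Two companies' records about the same customer share a token.
record_a = {"customer": tokenize("jane.doe@example.com"), "purchases": 12}
record_b = {"customer": tokenize("jane.doe@example.com"), "claims": 2}
assert record_a["customer"] == record_b["customer"]
```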

4. Mechanical Turk to Create Training Datasets

Amazon Mechanical Turk (MTurk) is a crowdsourcing platform that enables businesses and developers to access a diverse, on-demand workforce to complete tasks that require human intelligence. The platform allows users to break down complex tasks into smaller, manageable units called Human Intelligence Tasks (HITs), which are then completed by human workers (Turkers) around the globe for a fee.

Amazon Mechanical Turk is a powerful tool for creating high-quality training datasets for AI and machine learning models. Here's how it can be utilized:

  • Data Labeling: One of the primary uses of MTurk is for labeling data. For instance, if you have a large collection of images, you can create HITs where Turkers are asked to identify and label objects within those images. This labeled data is essential for training computer vision models.
  • Data Collection: MTurk can be used to gather new data. For example, you can create surveys or tasks where workers provide textual responses, annotate documents, or transcribe audio files. This collected data can then be used to train natural language processing (NLP) and other models.

Mechanical Turk can be very useful in situations where you have the raw dataset but no labels. The downside to Turk is that it can require meaningful effort to set up and can be expensive depending on the labeling complexity and the volume of data you need to label.
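
To give a sense of the setup involved, here is a minimal sketch of publishing a single labeling HIT through the AWS SDK (boto3), pointed at MTurk's requester sandbox so no real workers are paid. The question XML, reward, and image URL are illustrative values only.

```python
import boto3

# The sandbox endpoint lets you test HITs without spending real money.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>label</QuestionIdentifier>
    <QuestionContent><Text>Is the animal at http://example.com/001.jpg a cat or a dog?</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

hit = mturk.create_hit(
    Title="Label an image as cat or dog",
    Description="Look at one image and type 'cat' or 'dog'.",
    Reward="0.05",                    # USD per assignment
    MaxAssignments=3,                 # 3 workers per image enables a majority vote
    LifetimeInSeconds=86400,          # HIT stays available for one day
    AssignmentDurationInSeconds=300,  # workers get 5 minutes per task
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```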

5. Custom Data Building Companies for Niche Training Data

As AI becomes more ubiquitous, so does the number of companies whose sole focus is to acquire and label very niche datasets. The downside is that these services can be a very expensive way to obtain a highly sought-after dataset. Think of these companies as consultants who are experts at locating, licensing, and then labeling data.

Scale AI - Scale AI provides high-quality training data for AI applications by combining machine learning and human intelligence. They offer services in image annotation, transcription, and 3D sensor fusion, among others.

Appen - Appen offers comprehensive data annotation services across text, image, video, and audio. They have a large global workforce and provide high-quality labeled data for training AI models.

Figure Eight (formerly CrowdFlower) - Figure Eight specializes in transforming unstructured data into high-quality AI training data. They provide services for data collection, labeling, and enrichment across various data types.

Lionbridge AI - Lionbridge AI offers a range of data annotation services, including image, video, text, and audio labeling. They focus on delivering high-quality data for training machine learning models.

6. Web Scraping to Extract Training Data

Web scraping is a method used to extract large amounts of data from websites. This technique can be especially useful when you need to gather data that isn't readily available through traditional data marketplaces or APIs. Web scraping involves writing code to access web pages, retrieve the necessary information, and store it in a structured format like a CSV or database.

Web scraping can provide valuable data for training AI models. For sentiment analysis of product reviews, websites like Amazon, Yelp, TripAdvisor, and IMDb are rich sources of labeled data. However, there are legal and ethical implications to this method, as well as the technical challenges associated with web scraping.
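
As an illustration, here is a minimal scraping sketch using the requests and BeautifulSoup libraries. The URL and CSS selectors are hypothetical since every site's markup differs, and you should confirm that a site's terms of service and robots.txt permit collection before scraping it.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical review page; real sites use different markup, and many
# prohibit scraping in their terms of service -- check before collecting.
url = "https://example.com/product/123/reviews"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for review in soup.select("div.review"):  # hypothetical CSS class
    text = review.select_one("p.body").get_text(strip=True)
    stars = review.select_one("span.rating").get_text(strip=True)
    rows.append((text, stars))  # the star rating doubles as a sentiment label

with open("reviews.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```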

7. Open Data Repositories for Public Training Data

Open data repositories offer publicly accessible databases containing a variety of datasets from government agencies, academic institutions, and non-profit organizations. These repositories are invaluable for obtaining training data without incurring costs and often include diverse datasets suitable for different AI projects.

The biggest advantage is the sheer volume of freely available data. The largest challenges are that the listed datasets are often stale and the metadata isn't well maintained. The source and integrity of much of the data can also be hard to verify, and there is little consistency in how well the data was labeled or maintained.

Examples of Open Data Repositories:

UCI Machine Learning Repository: Offers databases and datasets for machine learning research.

Kaggle: Provides a large repository of datasets contributed by the global data science community.

data.gov: U.S. government's open data site with high-quality datasets.

European Union Open Data Portal: Contains datasets from various EU institutions and bodies.
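
Many of these repositories serve datasets as plain CSV files, so getting started can take just a few lines. A sketch, assuming pandas is installed and using the classic Iris dataset from the UCI repository (verify the URL before relying on it, as repository layouts change):

```python
import pandas as pd

# The classic Iris dataset from the UCI Machine Learning Repository.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "species"]

df = pd.read_csv(url, header=None, names=columns)
print(df["species"].value_counts())  # the label column: labeled data, for free
```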

8. Build/Collect Training Data Yourself

Building your own dataset can ensure it meets your project's specific needs and can result in higher-quality data. This involves identifying data sources, collecting raw data, cleaning it, labeling it, and performing quality assurance. It is by far the most expensive and time-consuming option, typically involving hiring large numbers of people to manually create and label the data you need from scratch. Costs in this area have come down somewhat recently, as several technology companies now provide platforms that make it easier to hire large pools of smartphone-equipped workers to collect data in the wild and then label it for you.

A new option for building certain types of data has recently emerged: generative AI. Generative AI can produce text that is often indistinguishable from text written by a human. You can also create AI agents that each take on the persona of a person and then have them interact with one another. Imagine you need to source text messages between a real estate agent and a client who is looking for a home. You can set up two AI agents, one as the real estate agent and the other as the client. After seeding the first message, you can have the agents generate an entire conversation that reads remarkably like a real one. The downside of this approach is that you may not capture the randomness of an actual conversation, but the upside is that you can generate an almost unlimited amount of this text in seconds.
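
A minimal sketch of that two-agent setup, assuming the OpenAI Python SDK (any chat-capable LLM API would work; the model name, personas, and seed message are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

personas = {
    "agent": "You are a real estate agent texting a client. Reply with one short text message.",
    "client": "You are a home buyer texting your real estate agent. Reply with one short text message.",
}

# Seed the conversation with the first message.
transcript = [("client", "Hi! Any new 3-bedroom listings near downtown this week?")]

speaker = "agent"
for _ in range(6):  # generate six more turns
    # Each persona sees the transcript from its own point of view.
    messages = [{"role": "system", "content": personas[speaker]}]
    for who, text in transcript:
        role = "assistant" if who == speaker else "user"
        messages.append({"role": role, "content": text})

    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    transcript.append((speaker, reply.choices[0].message.content))
    speaker = "client" if speaker == "agent" else "agent"

for who, text in transcript:
    print(f"{who}: {text}")
```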

Conclusion

The quest for high-quality training data is critical for any AI project, underpinning the accuracy and efficacy of the resulting models. As outlined above, there are numerous avenues to explore when seeking out this data, each with its unique advantages and challenges.

As we are still in the early days of the AI revolution, there remains significant friction in locating training data. Over time we can expect friction to be reduced and companies to get more comfortable with the legalities around data monetization. Technology vendors and new marketplace models are likely to emerge to continue to reduce complexity for all market participants.

But if you are looking to get training data fast and without breaking the bank, you can get started for free with Nomad Data. You can sign up in seconds, describe the data you're looking for, and we will find it for you in less than 24 hours.

Learn More