Highlights:

  • Managing datasets presents a significant challenge because generative AI models exhibit biases stemming from prejudicial patterns within their training data, patterns that humans often struggle to identify.
  • DatologyAI aims to fix this by leveraging its expertise in data curation to help companies identify the right information to compile their datasets and present that data correctly.

Recently, DatologyAI, a startup focused on data curation that simplifies the creation of extensive training datasets for generative artificial intelligence models, announced the successful closure of its USD 11.65 million seed funding round.

Amplify Partners led the seed round, with participation from Radical Ventures, Conviction Capital, Outset Capital, and Quiet Capital. Angel investors also joined the round, including Google LLC Chief Scientist Jeff Dean, Meta Platforms Inc. Chief AI Scientist Yann LeCun, Quora Inc. founder and OpenAI board member Adam D'Angelo, Cohere Inc. co-founders Aidan Gomez and Ivan Zhang, and former Intel Corp. AI vice president Naveen Rao.

With an impressive roster of backers, DatologyAI is poised to tackle one of the most significant challenges in generative AI development. In a recent blog post, DatologyAI founder and CEO Ari Morcos explained that the startup offers tooling to automate the curation of datasets used to train large language models such as ChatGPT and Google's Gemini.

The process operates by identifying the crucial information within a dataset, tailored to the model’s application, and determining its significance. Additionally, it can propose methods for enhancing datasets with supplementary information and optimize batch processing or segmentation to streamline the model training process.
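The steps described above can be illustrated with a toy sketch. This is not DatologyAI's actual pipeline, whose internals are not public; it is a minimal, hypothetical illustration of the three ideas named in the paragraph: dropping exact duplicates, keeping only samples a relevance test accepts, and segmenting the result into training batches. The `relevant` predicate and `batch_size` parameter are assumptions for the example.

```python
from hashlib import sha256


def curate(samples, relevant, batch_size=2):
    """Toy curation pass: drop exact duplicates, keep only samples the
    application-specific relevance predicate accepts, then segment the
    curated data into fixed-size training batches."""
    seen, kept = set(), []
    for text in samples:
        digest = sha256(text.encode()).hexdigest()
        if digest in seen:  # exact-duplicate filtering
            continue
        seen.add(digest)
        if relevant(text):  # application-specific relevance check
            kept.append(text)
    return [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]


docs = ["cats purr", "cats purr", "dogs bark", "lorem ipsum", "birds sing"]
batches = curate(docs, relevant=lambda t: "lorem" not in t)
print(batches)  # [['cats purr', 'dogs bark'], ['birds sing']]
```

A production system would replace the hash check with near-duplicate detection and the predicate with a learned relevance score, but the shape of the pipeline is the same.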

Managing datasets presents a significant challenge because generative AI models exhibit biases stemming from prejudicial patterns within their training data, patterns that humans often struggle to identify. Training datasets are also immensely large, comprising varied data formats that may contain significant noise and extraneous information. According to a recent Deloitte Touche Tohmatsu Ltd. survey, 40% of companies cited data-related challenges, such as data preparation and cleaning, as significant hurdles in AI model development.

Morcos is perhaps the ideal candidate to address this challenge, given his more than five years at Meta's AI lab. During his tenure there, he specialized in developing neuroscience-inspired techniques to enhance the capabilities of the company's AI models, primarily through adjustments to their underlying training data.

According to Morcos, today’s generative AI models reflect the data used to train them, meaning that “models are what they eat.”

He emphasized that training AI on the appropriate data and executing it accurately can significantly enhance the model’s overall quality. This is because training datasets influence nearly every facet of the resulting model, encompassing its performance, overall size, and depth of domain knowledge. Employing a more efficient training dataset makes it feasible to substantially reduce training times and develop a more compact model, thereby economizing on time and computing expenses.

The latter point is pertinent because some companies invest millions of dollars in computing resources to train and operate their AI models. Some of these companies have amassed petabytes of data, so vast that it becomes daunting to determine where to begin. Consequently, it has become customary practice to simply select a random subset of the data, Morcos explained.

However, randomly selecting data presents challenges as it results in models being trained on redundant data, leading to slower training processes and increased costs. Furthermore, certain types of data might be misleading and detrimental to the model’s performance, while others may exhibit imbalance with “long tails,” potentially introducing biases into the resultant AI model.
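The long-tail problem can be made concrete with a small, hypothetical sketch (not DatologyAI's method): a uniform random sample from head-heavy data is dominated by the most common category, whereas capping each category at a fixed quota keeps rare, long-tail examples represented. The `label_of` function and `per_label` quota here are assumptions for the example.

```python
import random
from collections import Counter


def balanced_subset(samples, label_of, per_label, seed=0):
    """Toy stratified sampler: shuffle, then cap each label at
    `per_label` examples so head-heavy data does not drown out
    rare long-tail labels."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    counts, subset = Counter(), []
    for sample in shuffled:
        label = label_of(sample)
        if counts[label] < per_label:  # enforce per-label quota
            counts[label] += 1
            subset.append(sample)
    return subset


# 90% of this toy corpus is one category; "sw" is the long tail.
data = ["en"] * 90 + ["fr"] * 8 + ["sw"] * 2
subset = balanced_subset(data, label_of=lambda s: s, per_label=2)
print(Counter(subset))  # every label capped at 2, tail preserved
```

A purely random subset of six items from this corpus would usually contain no "sw" examples at all, which is exactly the imbalance the paragraph describes.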

Morcos stated, “The bottom line is: training on the wrong data leads to worse models, which are more expensive to train. And yet it remains standard practice.”

DatologyAI aims to fix this by leveraging its expertise in data curation to help companies identify the right information to compile their datasets and present that data correctly. It's particularly beneficial for companies dealing with petabytes of unlabelled data, which would otherwise require manual labeling.

Morcos explained, “Our vision at DatologyAI is to make the data side of AI easy, efficient, and automatic, reducing the barriers to model training and enabling everyone to make use of this transformative technology on their own data.”

The startup claims its data curation technology can handle petabytes of data in almost any format, be it text, video, images, audio, tabular, genomic, or geospatial, and can deploy the datasets it compiles on its AI training infrastructure. That sets it apart from existing data preparation tools, which are often more limited in scope and in the types of data they support.

Furthermore, DatologyAI's data curation can ensure that higher-quality samples are used and identify the most intricate concepts contained within each dataset. It can also flag types of data that may be detrimental, causing models to behave in ways their designers did not intend.

It's not the first startup to address the challenge of training data, and past attempts at automation haven't always yielded the desired results. For instance, the German nonprofit AI research group Large-scale Artificial Intelligence Open Network (LAION) recently had to remove one of its algorithmically curated datasets after images of child abuse were discovered within it.

For this reason, DatologyAI does not aim to fully automate every aspect of dataset curation but instead intends to assist data scientists by suggesting methods to refine their existing datasets. "[Our approach] leads to models that train dramatically faster while simultaneously increasing performance on downstream tasks," Morcos said.

DatologyAI stated that it is presently collaborating with a select group of customers to refine its data curation tools before launching its platform more widely later this year.