Snorkel AI Debuts Products for Model Training, Development

Snorkel AI is offering new capabilities to help companies curate and prep data for generative artificial intelligence. Founded in 2015, Snorkel AI has been developing software for data-centric AI. Its best-known product is Snorkel Flow, which helps enterprise clients build and deploy AI applications efficiently using programmatic labeling to automate the process of creating training data for AI models. Now Snorkel AI’s Foundation Model Data Platform is going beyond programmatic labeling with two new core solutions: Snorkel GenFlow for building generative AI applications and Snorkel Foundry for developing custom LLMs with proprietary data.
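To make the idea of programmatic labeling concrete: instead of having humans annotate each example, engineers write small heuristic "labeling functions" whose (possibly noisy, possibly abstaining) votes are combined into weak training labels. The sketch below is not Snorkel Flow itself, just a minimal illustration with hypothetical spam-detection labeling functions and a simple majority vote standing in for the learned aggregation models Snorkel describes.

```python
# Illustrative programmatic labeling: hypothetical labeling functions
# vote SPAM (1), NOT_SPAM (0), or ABSTAIN (-1) on each text example.
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_offer(text):
    # Promotional phrasing is a weak spam signal.
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_has_url(text):
    # Embedded links are another weak spam signal.
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_greeting(text):
    # Very short greetings are weak evidence of a legitimate message.
    return NOT_SPAM if len(text.split()) < 6 and "hi" in text.lower() else ABSTAIN

LFS = [lf_contains_offer, lf_has_url, lf_short_greeting]

def weak_label(text):
    """Majority vote over labeling-function outputs, ignoring abstentions."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

examples = [
    "Claim your FREE OFFER now at http://example.com",
    "Hi team, quick sync?",
]
weak_labels = [weak_label(t) for t in examples]
```

In practice, systems like Snorkel's replace the majority vote with a model that estimates each labeling function's accuracy, but the workflow is the same: code, not manual annotation, produces the training labels.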

“How you curate, sample, filter and clean data ends up having a tremendous impact on the resulting foundation model that you get out,” Snorkel AI CEO and co-founder Alex Ratner told VentureBeat, adding that “you can’t just dump in a random mix of garbage data, and expect these models to turn out well.”

Snorkel Foundry aims to help enterprises use their proprietary data as a differentiator in model training, adapting powerful but generic base models into domain-specific specialist models that can provide the basis of all predictive and generative AI applications.

“Today, everyone uses nearly the same models, algorithms, and approaches for training FMs and LLMs — but it’s the data that they train on at all stages which is the differentiator, and the secret sauce that AI-first companies are investing in and guarding most heavily,” Ratner said in an announcement.

VentureBeat writes that Snorkel Foundry can potentially mitigate the “hallucination” problem that plagues consumer-facing generative AI. “Hallucinations are just another kind of error that is a result of not training the model to do a specific task in the first place,” Ratner told VentureBeat, explaining that most models “are trained out of the box to say statistically plausible-sounding things given an input prompt.”

Snorkel Foundry helps create custom FMs and LLMs by programmatically sampling, filtering, cleaning and augmenting proprietary data for domain-specific pre-training.
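The curation pipeline described above can be sketched as a chain of simple stages. This is a hedged illustration only, not Snorkel Foundry's actual implementation: the cleaning, quality-filtering, and deduplication heuristics below are placeholder assumptions standing in for whatever programmatic operators an enterprise would apply to its proprietary corpus before domain-specific pre-training.

```python
import hashlib
import re

def clean(doc):
    """Normalize whitespace so near-identical documents compare equal."""
    return re.sub(r"\s+", " ", doc).strip()

def quality_ok(doc, min_words=5, max_symbol_ratio=0.3):
    """Heuristic quality filter: enough words, not dominated by symbols."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in doc if not ch.isalnum() and not ch.isspace())
    return symbols / max(len(doc), 1) <= max_symbol_ratio

def dedupe(docs):
    """Exact-match deduplication via content hashing, preserving order."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def curate(raw_docs):
    """Clean, filter, and deduplicate a raw corpus for pre-training."""
    docs = [clean(d) for d in raw_docs]
    docs = [d for d in docs if quality_ok(d)]
    return dedupe(docs)
```

Real pre-training pipelines use far more sophisticated operators (fuzzy deduplication, classifier-based quality scoring, domain sampling weights), but each is still a programmatic transformation of the data rather than per-example manual work.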

“After pre-training an LLM, a common step is to execute additional instruction tuning,” typically using an approach called RLHF, or reinforcement learning from human feedback, VentureBeat explains. But when the end goal is generative AI, that type of labeling “is not what’s needed.”

“GenFlow is about providing the right tooling and management capability to provide feedback to help filter out poor-quality data points in an effort to help generative AI generate an optimal output,” explains VentureBeat.
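One way to picture that feedback-driven filtering: collect reviewer ratings on prompt/response pairs and keep only the highly rated pairs as fine-tuning data. The record format, field names, and threshold below are hypothetical, a minimal sketch of the general pattern rather than GenFlow's actual tooling.

```python
# Hypothetical feedback records: (prompt, response, reviewer_score in [0, 1]).
feedback = [
    ("Summarize the report", "The report covers Q3 revenue and churn drivers.", 0.9),
    ("Summarize the report", "lol idk", 0.1),
    ("Draft an apology email", "Dear customer, we apologize for the delay...", 0.8),
]

def filter_for_tuning(records, threshold=0.5):
    """Keep only prompt/response pairs rated at or above the threshold."""
    return [(prompt, response)
            for prompt, response, score in records
            if score >= threshold]

# Low-quality outputs are dropped before any instruction tuning.
train_pairs = filter_for_tuning(feedback)
```

The point, as the article describes it, is that curating which generated outputs feed back into training is itself a data-development step, not a modeling step.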

“Better data almost always has a greater impact than fancier models or algorithms in AI — yet data development has been undersupported by AI formalisms and technology,” Ratner writes in an introduction to Snorkel’s Foundation Model Platform.

“For every model development step in the modern journey of building AI applications, there is a critical but often underappreciated data development step, where the data that actually informs the model is selected, labeled, cleaned, shaped and curated.”

Snorkel AI has collaborated with Microsoft Azure and Wayfair, and also lists Memorial Sloan Kettering and BNY Mellon among its clients.
