By Ian Cairns, CEO of Freeplay
Generative AI offers an incredible opportunity to create new and better software products for customers, but the way it works, the kinds of things it’s good at, and the demands it makes on product development teams are significantly different from traditional software. This is forcing businesses to radically rethink how they develop software when building AI applications.
In the past, most software development followed a predictable script. Someone wrote a spec, someone designed a user interface, and engineers wrote code and then tested it before it was deployed. Generally, things would continue working OK until a decision was made to change something. This process still works well enough for most types of applications.
Generative AI requires a paradigm shift, however. At Freeplay, where our platform allows teams to manage the end-to-end large language model (LLM) product development life cycle, we’re seeing that many of the assumptions that traditional software development rests on do not apply to building products around LLMs. These models are nondeterministic, which means they may produce different results even when given the same prompt. They can also fail in wide-ranging and unexpected ways. LLMs might be infamous for “hallucinating”—making up facts—but at times they also produce vague, incomplete, off-brand, poorly formatted, or simply uninteresting responses. Plus, models are constantly changing, and customers’ use of artificial intelligence systems can be surprising. For all these reasons, you can’t build and test an AI system up front, ship it, and simply move on to the next thing.
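To make the nondeterminism point concrete, here’s a quick sketch, assuming the OpenAI Python SDK (the model name is purely illustrative): the same prompt, sent twice with a nonzero sampling temperature, will often come back with two different drafts.

```python
# Illustration of nondeterminism: the same prompt, sampled twice,
# typically produces two different outputs at nonzero temperature.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "Write a one-sentence welcome email for a new customer."

for attempt in range(2):
    response = client.chat.completions.create(
        model="gpt-4o-mini",            # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,                # sampling on; outputs vary run to run
    )
    print(f"Attempt {attempt + 1}: {response.choices[0].message.content}")
```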
The businesses that are creating generative AI applications most effectively today are the ones that have built a system—including both tools and processes—to continuously learn about and optimize their use of LLMs. At a high level, there are a few key aspects of building with generative AI that are especially important for any technology leader to know about.
Evaluation: With generative AI, you have to think more deeply about the outcome you want and what “good” looks like for a product feature. Say you’re building an email draft generator. You want to know a lot more than whether it produces some text. You likely want to know whether it’s factually accurate, whether the tone and format are appropriate to the author, and whether the right names and greetings are included. These can be harder things to measure. In the context of AI products, each of those criteria is referred to as an evaluation—or “eval” for short—and can be conducted on both test and live data using a mix of code, human review, and even other LLMs. A custom panel of contextually relevant evaluations forms the backbone of analyzing AI products.
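For the email draft example, such a panel might mix cheap code-based checks with an LLM acting as a judge for subjective criteria. Here’s a minimal sketch, assuming the OpenAI Python SDK; the function names, criteria, and model choice are illustrative, not Freeplay’s product API.

```python
# A toy eval panel for an email draft generator: two code-based checks
# plus one LLM-as-judge check for tone.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def has_greeting_and_signoff(draft: str) -> bool:
    """Code-based eval: cheap, deterministic format check."""
    has_greeting = bool(re.match(r"^(hi|hello|dear)\b", draft.strip(), re.IGNORECASE))
    has_signoff = bool(re.search(r"\b(best|regards|thanks|sincerely)\b", draft, re.IGNORECASE))
    return has_greeting and has_signoff

def mentions_recipient(draft: str, recipient_name: str) -> bool:
    """Code-based eval: is the right name in the draft?"""
    return recipient_name.lower() in draft.lower()

def tone_is_appropriate(draft: str, desired_tone: str = "professional") -> bool:
    """LLM-as-judge eval: ask a second model to grade a subjective criterion."""
    judge = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{
            "role": "user",
            "content": f"Does this email read as {desired_tone}? Answer YES or NO only.\n\n{draft}",
        }],
        temperature=0,
    )
    return judge.choices[0].message.content.strip().upper().startswith("YES")

def run_eval_panel(draft: str, recipient_name: str) -> dict:
    """Run every evaluation and return one pass/fail result per criterion."""
    return {
        "format": has_greeting_and_signoff(draft),
        "recipient": mentions_recipient(draft, recipient_name),
        "tone": tone_is_appropriate(draft),
    }
```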
Data labeling and curation: Closely related to evaluation, you need people with sufficient domain expertise constantly looking at data. There’s no such thing as full automation when it comes to building great generative AI products. Trusted humans who are well-trained to understand what good looks like play a critical role in any AI feedback loop. Not only can they label data as part of an evaluation process, but they can also spot new issues that are not being tracked yet, as well as curate relevant examples into data sets that can be used for fine-tuning or testing.
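In practice, human-labeled examples often end up as simple structured records that reviewers fill in and that later feed test sets or fine-tuning data. Here’s a rough sketch; the field names and sample values are assumptions for illustration, not any specific product’s schema.

```python
# One possible shape for a human-labeled record, plus a helper to curate
# reviewed examples into a JSONL data set for testing or fine-tuning.
from dataclasses import dataclass, field
import json

@dataclass
class LabeledExample:
    prompt: str                       # input sent to the model
    completion: str                   # what the model produced
    recipient: str                    # who the email was addressed to
    reviewer: str                     # which human looked at it
    label: str                        # e.g. "good", "hallucination", "off-brand"
    notes: str = ""                   # free-form observations, including new issue types
    tags: list[str] = field(default_factory=list)

def curate_test_set(examples: list[LabeledExample], path: str) -> None:
    """Write reviewed examples as JSONL so they can feed testing or fine-tuning."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex.__dict__) + "\n")

# Hypothetical reviewed example, made up for illustration.
reviewed = [
    LabeledExample(
        prompt="Draft a follow-up email to Dana about Thursday's demo.",
        completion="Hi Dana, thanks for joining Thursday's demo...",
        recipient="Dana",
        reviewer="support-lead",
        label="good",
        tags=["follow-up", "demo"],
    ),
]
curate_test_set(reviewed, "email_drafts_test_set.jsonl")
```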
Testing: Testing AI products looks very different from traditional software testing. In the past, testing was simply about whether a feature performed the function it was intended to perform—for example, does a button work or not? Testing a generative AI product requires coming up with a representative list of all the possible types of interactions and edge cases that may occur for customers, and making sure each behaves reasonably. This is where those expert-curated data sets and a good panel of evaluations become essential. When testing any change to an LLM product, you likely want to run through hundreds or thousands of examples and complete your custom evaluations against each of them. Automation here is key.
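Putting the pieces together, an automated test run replays a curated data set through the prompt or model under test and scores every output against the eval panel. The sketch below reuses run_eval_panel() from the evaluation sketch above and the JSONL file from the curation sketch; generate_draft() is a hypothetical stand-in for whatever prompt or chain is being tested.

```python
# A minimal batch-testing harness: score every example in a curated data set
# and report a pass rate per evaluation criterion.
import json

def generate_draft(prompt: str) -> str:
    # Placeholder: call the prompt/model/chain under test here.
    raise NotImplementedError

def run_batch_test(dataset_path: str) -> dict:
    totals: dict[str, int] = {}
    count = 0
    with open(dataset_path) as f:
        for line in f:
            example = json.loads(line)
            draft = generate_draft(example["prompt"])
            scores = run_eval_panel(draft, example["recipient"])  # from the eval sketch
            for criterion, passed in scores.items():
                totals[criterion] = totals.get(criterion, 0) + int(passed)
            count += 1
    # Pass rate per criterion makes regressions visible before a change ships.
    return {criterion: passes / count for criterion, passes in totals.items()}

# Example: print(run_batch_test("email_drafts_test_set.jsonl"))
```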
Those are a few of the critical aspects of building successful generative AI products, and the teams that have systematized each of them are able to rapidly test new changes, monitor real-world customer use in production, and turn learnings into future optimizations. They are also able to report clear metrics to compliance teams and business owners, and they know, and can quantify, what AI is doing in their products.
These process changes are also leading to changes in job roles and responsibilities. Product engineers are transforming into “AI engineers” who know how to stitch these systems together. Product managers are becoming more involved as a result of their proximity to customer needs and domain challenges, and the more technical ones are getting hands-on with prompt engineering and model experimentation. Domain experts from product development teams are getting pulled into the software creation process, since they provide critical insights to evaluate and improve model outputs.
Generative AI will be a huge competitive advantage for companies, but only if they’re able to make the jump to operate successfully in these new ways. The folks who haven’t yet made that process change often find themselves stuck experimenting, unable to build the confidence they need to ship to production. As with any major platform shift, the businesses that succeed will be the ones that can rethink and adapt how they work and build software for a new era.