Software should be simple. We all know that the best software applications (and suites) are the ones so intuitive that no manual is needed: drop-down menus, application wizards, macros, auto-complete functions and easy-to-navigate help menus make modern software simpler to use than it was in the late eighties, when those conveniences were just crystallising. But now that AI workflows are driving how workplace tasks and roles are executed, can we say that things will become simpler, or is a new complexity, born of over-engineering, about to surface?
Leonardo da Vinci said, ‘simplicity is the ultimate sophistication.’ This certainly applies to AI workflows for visual media, where creating a simple user experience requires an immense engineering effort. This is the opinion of Yaron Vaxman, senior director of deep learning at media asset lifecycle automation company Cloudinary.
Though AI workflows all differ, they share one thing: between a simple click or prompt and the almost instantaneous output, complexity hides in plain sight. A single click to remove an image background, for example, seems effortless. But under the hood, specialised AI models, complex orchestration layers and scalable GPU servers are all working together to make it happen.
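To make the hidden complexity concrete, here is a minimal sketch of what one "remove background" click might trigger behind the scenes. Every function name here is a hypothetical stand-in for a specialised model or service, not Cloudinary's actual pipeline:

```python
# Hypothetical sketch: the work hidden behind one "remove background" click.
# Each function is a stand-in for a specialised model or backend service.

def moderate(image: bytes) -> bool:
    """Stand-in safety check; a real system would call a moderation model."""
    return image != b""

def segment_foreground(image: bytes) -> bytes:
    """Stand-in for a dedicated segmentation model producing an alpha mask."""
    return b"mask:" + image

def composite(image: bytes, mask: bytes) -> bytes:
    """Stand-in for compositing the masked foreground onto transparency."""
    # The mask would be applied here; this demo just tags the result.
    return b"cutout:" + image

def remove_background(image: bytes) -> bytes:
    """One 'simple' call = moderation + segmentation + compositing."""
    if not moderate(image):
        raise ValueError("image rejected by moderation")
    mask = segment_foreground(image)
    return composite(image, mask)

print(remove_background(b"photo").decode())  # the user sees only one step
```

The point of the sketch is the shape, not the detail: a single user-facing action fans out into several orchestrated steps, each owned by a different model.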
Single atomic tasks
Taking a closer look at the types of AI workflows for visual media (images and video) that Vaxman and team build at Cloudinary, each ready-to-use AI model performs a very specific, atomic task.
“One of these on its own isn’t production-ready or useful. But when we orchestrate multiple models together, we can build workflows that handle more complex operations, such as removing a background or extending an image to adjust its aspect ratio,” explained Vaxman. “We took this approach because it’s very hard to build one AI that can do everything well. Even the big LLMs today, which go beyond text into tasks like image editing, use specialised models behind the scenes.”
Let’s say we ask an LLM to remove a background from an image. It’s likely to be calling another tool, potentially through an MCP server, that relies on a dedicated background removal model. The point Vaxman wants to make is that accuracy comes from specialisation. Instead of one “all-knowing” AI, it’s much more effective to have expert models, each highly accurate at a specific task, and then orchestrate them together at scale.
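The "expert models plus orchestration" idea can be sketched as a simple task router. The task names and lambdas below are illustrative placeholders, not any real product's API:

```python
# Hypothetical sketch of specialisation + orchestration: each registered
# "expert" handles one atomic task, and a router dispatches by task name.

EXPERTS = {
    "remove_background": lambda img: f"cutout({img})",
    "extend_image":      lambda img: f"outpainted({img})",
    "upscale":           lambda img: f"upscaled({img})",
}

def dispatch(task: str, image: str) -> str:
    """Route a request to the one specialist trained for that task."""
    try:
        model = EXPERTS[task]
    except KeyError:
        raise ValueError(f"no specialist model for task {task!r}")
    return model(image)

# Chaining specialists builds a workflow no single model performs well alone,
# e.g. cut out the subject, then extend the canvas to a new aspect ratio.
result = dispatch("extend_image", dispatch("remove_background", "photo.jpg"))
print(result)  # outpainted(cutout(photo.jpg))
```

Note the design choice this mirrors: the router knows nothing about images, and each expert knows nothing about the workflow, which is what makes the pieces independently replaceable.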
Even though the lesson here is drawn from Cloudinary’s work in images and video, it stands for AI workflow deployments anywhere, from petrochemical plants to bespoke cake bakeries. However, this orchestration task for AI entities is not simple.
Moving AI to scale
“Each AI model in a workflow comes with its own capabilities and accuracy levels, and the workflow is only as strong as its weakest link. The need for mitigation at every link makes these workflows very difficult to engineer. Failures are common, so it’s never as simple as training a model and putting it straight into production,” said Vaxman. “Protections are also needed to ensure AI is used safely. For example, moderation models run alongside the workflows to detect unsafe or NSFW content.”
If something inappropriate is flagged, the system blocks it, for example, by returning a blurred or blacked-out result, with guidance for the user to adjust their parameters. The goal is to prevent abuse without breaking the flow, so the system stays both reliable and safe to use.
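The blocking behaviour described here can be sketched as a simple gate on a moderation score. The function names, threshold and placeholder filename are all illustrative assumptions, not Cloudinary's actual interface:

```python
# Hypothetical sketch of the safety gate: a moderation score decides whether
# to return the real output or a blurred placeholder plus user guidance.

NSFW_THRESHOLD = 0.8  # illustrative cut-off, not a real product setting

def moderation_score(image: str) -> float:
    """Stand-in for a moderation model; a fake keyword rule for the demo."""
    return 0.95 if "unsafe" in image else 0.1

def safe_result(image: str, output: str) -> dict:
    """Wrap a workflow output so flagged content never reaches the user."""
    if moderation_score(image) >= NSFW_THRESHOLD:
        return {
            "blocked": True,
            "result": "blurred_placeholder.png",
            "guidance": "Content was flagged; adjust your parameters and retry.",
        }
    return {"blocked": False, "result": output}

print(safe_result("holiday_photo", "edited.png"))
print(safe_result("unsafe_image", "edited.png"))
```

The key property, as the article notes, is that the flow never breaks: a flagged request still returns a well-formed response, just a safe one.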
Vaxman says that a company’s AI software system must be able to host these models across a cluster of servers and allow them to communicate efficiently. Building such a platform isn’t as simple as downloading an AI model from the web; it demands a full hosting infrastructure.
GPU power & payoff
In the case of visual media, he suggests that infrastructure requires more than just regular CPU servers. It needs GPU servers, which are (unsurprisingly) more expensive, harder to manage and more complex to scale.
“On top of that, your API layer must be able to allow these models sufficient time to run, yet still deliver the appearance of a fast, reliable, scalable single API call. That means supporting both large batches of images and large numbers of customers without bottlenecks,” said Vaxman. “The hosting platform must dynamically scale AI models with demand while efficiently sharing them across workflows. When usage spikes, it spins up new instances; when demand drops, it scales back to free GPU resources. The goal is cost efficiency, running the fewest GPU servers possible while still meeting demand. But with asynchronous workflows, the platform also has to coordinate communication seamlessly across servers.”
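The scaling policy Vaxman describes, spin up under load, scale back toward a floor when demand drops, can be sketched as a small sizing function. The instance limits and per-GPU throughput figure below are invented for illustration:

```python
# Hypothetical sketch of demand-driven GPU scaling: run the fewest instances
# that still cover the current job queue. All numbers are illustrative.

MIN_INSTANCES = 1       # keep a warm floor so latency stays low
MAX_INSTANCES = 8       # hard cost ceiling on the GPU fleet
JOBS_PER_INSTANCE = 10  # assumed throughput of one GPU instance

def desired_instances(queued_jobs: int) -> int:
    """Fewest instances that still cover the queue, within fleet limits."""
    needed = -(-queued_jobs // JOBS_PER_INSTANCE)  # ceiling division
    return max(MIN_INSTANCES, min(MAX_INSTANCES, needed))

for queue in (0, 25, 500):
    print(f"{queue} queued jobs -> {desired_instances(queue)} instance(s)")
```

A usage spike (500 queued jobs) pins the fleet at its ceiling of 8, while an idle period falls back to the single warm instance, which is the cost-efficiency trade-off the quote describes.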
This combination makes the backend engineering of such a platform highly complex.
Building smarter, not harder
Cloudinary built its hosting platform on top of AWS because it wanted high scale while keeping costs efficient for customers. But, advises Vaxman, this is a “huge effort” that “probably doesn’t make sense” for other use cases. He recommends starting with existing hosting platforms like RUN:AI, AWS SageMaker, or Hugging Face before building. Off-the-shelf solutions can take a team from zero to something usable very quickly. Only later should you consider custom optimisations.
“Also, remember that optimisation is critical. Not just for the hosting platform, but for the models themselves. Costs are high, and usage patterns are often bursty rather than steady. To keep API costs low, you need a system that can scale up and down quickly, but you also need to optimise the models so they run efficiently. You need a dedicated team focused on performance and optimisation if you want to operate at scale. That said, for staffing you should ideally have three types of teams in place: a strong backend team to manage the hosting platform, an engineering team to optimise workflows, and a research team to develop the AI models. It’s a significant investment in human resources,” said Vaxman.
Fortunately, there are lots of resources available today, including many commercially viable open source models. To help orchestrate complex workflows, Cloudinary’s own MediaFlows tool and platforms like n8n are popular among AI engineers. There are also lots of community-shared workflows online to handle everything from mundane tasks to highly complex processes.
Building AI workflows is certainly much more accessible today, but the work still requires careful attention to engineering detail. As da Vinci himself might say, true simplicity comes from complexity done well.