As cloud-based databases carry ever heavier workloads, efficient memory management is crucial to performance and cost savings. To address this, Chinese tech company Alibaba Cloud has introduced Eigen+.
Eigen+ is a cluster management system that enables memory oversubscription without compromising reliability or service quality.
Memory oversubscription is a technique in which more memory is allocated to workloads than is physically available, on the assumption that not all applications will use their maximum memory at the same time. Although this approach makes more efficient use of infrastructure, it also carries risks, such as Out of Memory (OOM) errors that lead to downtime and failure to meet service level objectives (SLOs).
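To make the idea concrete, here is a minimal Python sketch of one oversubscribed host. The instance names, sizes, and the two helper functions are illustrative assumptions, not part of Eigen+.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    allocated_gb: float   # memory promised to the instance
    peak_used_gb: float   # highest memory usage actually observed


def allocation_ratio(instances, physical_gb):
    """Allocated memory as a fraction of physical memory (above 1.0 means oversubscribed)."""
    return sum(i.allocated_gb for i in instances) / physical_gb


def worst_case_fits(instances, physical_gb):
    """True if the host survives even when every instance peaks at the same time."""
    return sum(i.peak_used_gb for i in instances) <= physical_gb


# Hypothetical host with 128 GB of physical memory.
host = [
    Instance("db-a", allocated_gb=64, peak_used_gb=30),
    Instance("db-b", allocated_gb=64, peak_used_gb=41),
    Instance("db-c", allocated_gb=32, peak_used_gb=20),
]
ratio = allocation_ratio(host, physical_gb=128)
print(f"allocation ratio: {ratio:.2f}")
print("safe at simultaneous peak:", worst_case_fits(host, physical_gb=128))
```

In this example the host has promised more memory than it physically has, yet it stays safe as long as observed peaks remain well below the allocations. The risk appears when those peaks grow or coincide.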
Better predictability
Traditional methods predict memory consumption based on historical data and apply bin packing algorithms to distribute workloads across servers optimally. However, under high loads these predictions are often inaccurate, which can result in OOM errors and missed SLOs. Alibaba Cloud takes a different approach with Eigen+. In line with the Pareto principle, a small proportion of database instances is responsible for the majority of OOM errors. By identifying these so-called transient instances and treating them separately, Eigen+ manages to increase predictability while safely increasing oversubscription.
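The Pareto observation can be illustrated with a short Python sketch that finds the smallest group of instances accounting for most OOM errors. The counts, the 80% share, and the function name are assumptions for illustration; Eigen+ itself identifies transient instances through classification rather than by ranking past errors.

```python
def smallest_group_covering(oom_counts, share=0.8):
    """Return the fewest instances that together account for `share` of all OOM errors."""
    total = sum(oom_counts.values())
    if total == 0:
        return []
    # Rank instances from most to fewest OOM errors.
    ranked = sorted(oom_counts.items(), key=lambda kv: kv[1], reverse=True)
    chosen, covered = [], 0
    for name, count in ranked:
        if covered >= share * total:
            break
        chosen.append(name)
        covered += count
    return chosen


# Hypothetical OOM error counts per database instance.
oom_counts = {"db-01": 0, "db-02": 14, "db-03": 1, "db-04": 0, "db-05": 0,
              "db-06": 9, "db-07": 0, "db-08": 1, "db-09": 0, "db-10": 0}
heavy = smallest_group_covering(oom_counts)
print(heavy, f"{len(heavy) / len(oom_counts):.0%} of instances")
```

With these made-up numbers, two of the ten instances cover more than 80% of the errors, which is the kind of skew that makes separate treatment worthwhile.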
Instead of using complex time series models, Eigen+ transforms the problem of prediction into classification. Machine learning is used to analyze database instances based on both runtime (such as memory usage and queries per second) and non-runtime characteristics (such as customer level and instance specifications). Based on this profile, the system determines whether an instance is stable enough for oversubscription. Transient instances, whose behavior is difficult to predict, are excluded from oversubscription to minimize risks.
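As a rough illustration of the classification idea, the Python sketch below builds an instance profile and labels it stable or transient. The threshold rule on the coefficient of variation is a deliberately simple stand-in for the trained machine learning model, and all field names and values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class InstanceProfile:
    name: str
    memory_usage: list    # runtime: memory utilization samples between 0 and 1
    qps: list             # runtime: queries-per-second samples
    customer_level: str   # non-runtime: hypothetical tier label
    spec_gb: int          # non-runtime: instance memory specification


def variability(samples):
    """Coefficient of variation: standard deviation relative to the mean."""
    m = mean(samples)
    return pstdev(samples) / m if m else 0.0


def classify(profile, max_cv=0.25):
    """Stand-in rule: an instance whose runtime signals fluctuate too much is 'transient'.

    A trained classifier would also use the non-runtime fields as features;
    this rule ignores them to keep the sketch short.
    """
    cv = max(variability(profile.memory_usage), variability(profile.qps))
    return "transient" if cv > max_cv else "stable"


steady = InstanceProfile("db-steady", [0.50, 0.52, 0.51, 0.49], [100, 110, 105, 95], "standard", 64)
bursty = InstanceProfile("db-bursty", [0.20, 0.85, 0.30, 0.90], [50, 900, 80, 1200], "standard", 64)

# Only instances labeled 'stable' would be considered for oversubscription.
for p in (steady, bursty):
    print(p.name, classify(p))
```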
Eigen+ employs a hybrid scheduler that considers this classification and simultaneously models the impact of memory usage on SLO compliance using logistic regression. This determines a maximum safe memory threshold per machine, ensuring that oversubscription does not compromise service reliability.
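How a logistic model can yield a memory threshold is sketched below in Python. The coefficients and the target violation probability are hypothetical, and the sketch assumes a single feature (memory utilization) with a positive slope. It shows the general inversion technique, not the model Eigen+ actually fits.

```python
import math


def violation_probability(mem_util, intercept, slope):
    """Logistic model: probability of an SLO violation at a given memory utilization."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * mem_util)))


def safe_threshold(target_prob, intercept, slope):
    """Invert the logistic model: the utilization at which risk equals target_prob."""
    logit = math.log(target_prob / (1.0 - target_prob))
    return (logit - intercept) / slope


# Hypothetical fitted coefficients for one machine; not taken from Eigen+.
INTERCEPT, SLOPE = -20.0, 18.0
threshold = safe_threshold(0.001, INTERCEPT, SLOPE)
print(f"maximum safe memory utilization: {threshold:.3f}")
```

A scheduler could then refuse any placement that would push a machine's expected utilization above this value, which ties the oversubscription level directly to an acceptable SLO risk.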
Impressive results
The results in the production environment are impressive. In an online MySQL cluster, the average memory allocation rose by more than 36 percentage points, from 75.67% to 111.88%, without any OOM errors being recorded. Service availability remained within the agreed SLO parameters. This success is made possible in part by fallback mechanisms such as live migration of database instances when memory thresholds are about to be exceeded.
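The reported change can be read in two ways, which the short Python calculation below separates. It uses only the two figures quoted above.

```python
before, after = 75.67, 111.88   # average memory allocation in percent, as reported

point_gain = after - before                  # change in percentage points
relative_gain = point_gain / before * 100    # change relative to the starting value

print(f"{point_gain:.2f} percentage points")
print(f"{relative_gain:.1f}% relative increase")
```

The gain is a little over 36 percentage points, which corresponds to a relative increase of roughly 48% over the starting allocation.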
Eigen+ demonstrates that memory management in cloud environments can be made robust by identifying and isolating the hard-to-predict instances, rather than relying on fragile predictions.