Red Hat has had a busy week wrapping up its annual Summit. During our visit to the event in Denver, the company unveiled InstructLab, a new initiative that aims to advance the development of open-source AI.
Fundamentally, InstructLab aims to advance the development of large language models by simplifying and strengthening their training. According to Red Hat, which collaborates closely with parent company IBM on InstructLab, companies struggle to adapt a trained LLM to their own needs.
Zooming in a little further on those obstacles, Red Hat sees that organizations typically fork an existing open model to add knowledge or skills, then rely on expensive, resource-intensive training methods. On top of that, improvements are difficult to make, something a community could contribute to via open source.
Also read: Red Hat optimizes OpenShift for hybrid AI and cloud
InstructLab is meant to address these limitations. It strengthens a large language model while requiring less human-generated data, and re-training the model takes fewer resources than before. The approach means an LLM can be continuously improved by anyone within a company who wants to contribute.
Red Hat’s larger goal is to help companies retrain models on a regular basis. Companies can also use InstructLab to train their own private LLMs that carry proprietary skills and knowledge.
How does InstructLab achieve this?
To make this possible, InstructLab consists of three components. The first is what Red Hat describes as taxonomy-driven data curation: a set of diverse training data curated by humans, which serves as examples of the new knowledge and skills the model should acquire.
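To make the idea concrete, here is a minimal sketch of what taxonomy-driven curation could look like: human-written seed examples organized in a tree of knowledge and skill branches, then flattened into one curated set. The structure and names are illustrative assumptions, not InstructLab's actual schema.

```python
# Hypothetical taxonomy: branches group human-curated seed examples by
# knowledge domain or skill. This mimics the idea, not the real format.
TAXONOMY = {
    "knowledge": {
        "finance": [
            {"question": "What is an invoice?",
             "answer": "A document requesting payment for goods or services."},
        ],
    },
    "skills": {
        "summarization": [
            {"question": "Summarize: 'Red Hat unveiled InstructLab at Summit.'",
             "answer": "Red Hat announced InstructLab."},
        ],
    },
}

def flatten_taxonomy(tree, path=()):
    """Walk the taxonomy tree and yield (branch_path, seed_example) pairs."""
    for name, node in tree.items():
        if isinstance(node, list):          # leaf: a list of seed examples
            for example in node:
                yield "/".join(path + (name,)), example
        else:                               # branch: recurse into subtree
            yield from flatten_taxonomy(node, path + (name,))

curated = list(flatten_taxonomy(TAXONOMY))
```

The tree layout matters because it tells contributors exactly where a new example belongs, which is what makes community-driven curation manageable.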
The second component of InstructLab is called large-scale synthetic data generation. Synthetic data, meaning data generated by AI, is not yet a widely known term. Here, a model generates new samples based on the curated training data drawn from real business situations. To safeguard quality, InstructLab adds an automated step that refines the generated responses, which should make the model's output considerably more reliable.
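The generation-plus-refinement step described above can be sketched as follows. `toy_generator` stands in for the teacher model and `quality_filter` for the automated refinement step; in a real pipeline both would be LLM calls, so everything here is an illustrative assumption.

```python
import random

# Human-curated seed examples, as produced by the curation step.
SEEDS = [
    {"question": "What is an invoice?",
     "answer": "A document requesting payment for goods or services."},
]

def toy_generator(seed_example, rng):
    """Stand-in for the teacher model: rephrase a seed into a new candidate."""
    templates = ["Could you explain: {}", "In your own words: {}"]
    return {"question": rng.choice(templates).format(seed_example["question"]),
            "answer": seed_example["answer"]}

def quality_filter(sample):
    """Automated refinement: keep only well-formed candidates.
    A real pipeline would score candidates with a critic model instead."""
    return bool(sample["answer"]) and len(sample["question"]) > 10

def generate_synthetic(seeds, n_per_seed=3, rng_seed=0):
    """Expand each seed into several candidates, then filter out weak ones."""
    rng = random.Random(rng_seed)
    candidates = [toy_generator(s, rng) for s in seeds for _ in range(n_per_seed)]
    return [c for c in candidates if quality_filter(c)]

synthetic = generate_synthetic(SEEDS)
```

The key design point is the two-stage shape: cheap mass generation first, then an automated gate, so quality control does not depend on humans reviewing every sample.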
Finally, Red Hat has made iterative, large-scale alignment tuning part of InstructLab. In this step, the model is re-trained on the synthetic data, refining both its knowledge and its skills. The two are independently important for an LLM: a text-focused model, for example, first needs the knowledge of what good text looks like before it can actually produce good text (the skill).
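A toy sketch of that phased, iterative loop: knowledge is tuned before skills in each round, mirroring the idea that the model needs facts before it can apply them. "Training" is reduced here to accumulating examples; a real implementation would run gradient updates per batch, so this is purely an assumed illustration.

```python
def tune_iteratively(synthetic_data, rounds=2):
    """Run several passes, tuning the knowledge phase before the skills phase."""
    model_state = {"knowledge": [], "skills": []}
    schedule = []
    for r in range(rounds):
        for phase in ("knowledge", "skills"):   # fixed ordering: facts first
            # Placeholder for a real fine-tuning step on this phase's data.
            model_state[phase].extend(synthetic_data.get(phase, []))
            schedule.append((r, phase))
    return model_state, schedule

state, schedule = tune_iteratively(
    {"knowledge": ["fact A"], "skills": ["summarize"]}, rounds=2)
```

Making the loop iterative is what enables the regular retraining Red Hat talks about: each round can fold in freshly generated synthetic data rather than requiring one monolithic training run.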
Future Steps
Looking a little further into InstructLab, the project illustrates the relationship between IBM and Red Hat well. It builds on IBM's Large-scale Alignment for chatBots (LAB) method, which was created to address scalability challenges in the LLM training phase. It also features an enhanced version of Granite, IBM's family of foundation models. As far as we are concerned, the deployment of these technologies shows that IBM is doing a good job of leveraging its Red Hat subsidiary to bring promising technologies to a wider community via the open-source route. That is where Red Hat's open-source background can help.
From the conversations we had about InstructLab during Red Hat Summit, it is clear that a community should play a central role in the further development of the AI project. After all, given the pace of AI developments, the features InstructLab needs may change in just a few months. Still, Red Hat specifically wanted to release the project now to address the challenges of LLMs.