Scientists at IBM Research have developed an artificial intelligence (AI) model that can generate diverse, creative and convincing captions for photographs. The scientists describe the model in a paper presented at the Conference on Computer Vision and Pattern Recognition.
Building the system meant solving known problems with automatic captioning systems, writes VentureBeat. Such systems often produce sentences that are syntactically correct but homogeneous, unnatural and semantically irrelevant.
IBM’s scientists addressed this problem with an ‘attention captioning model’, which lets the caption generator compose sentences from fragments of the scene in the photo it is observing. At each step of caption generation, the AI model can choose whether to use visual or textual cues from the previous step.
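As a rough illustration of that choice between visual and textual cues, the sketch below weights image regions with dot-product attention and then blends the result with a textual context vector. It is a minimal sketch, not IBM's implementation: the function names `attend` and `mix_context`, the fixed `gate` value and the toy vectors are all assumptions made for this example. In a trained model the gate would be learned rather than set by hand.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, region_features):
    """Dot-product attention: weight each image region by its similarity to the query."""
    scores = [sum(q * r for q, r in zip(query, region)) for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context

def mix_context(visual_context, text_context, gate):
    """Blend visual and textual cues; gate is in [0, 1] and 1.0 means purely visual."""
    return [gate * v + (1.0 - gate) * t for v, t in zip(visual_context, text_context)]

# Toy data (hypothetical): three image regions with two features each.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]   # stands in for the decoder state at the current step
text = [0.0, 1.0]    # stands in for the textual cue from the previous step

weights, visual = attend(query, regions)
mixed = mix_context(visual, text, gate=0.75)
print(weights, mixed)
```

A gate near 1.0 makes the next word depend mostly on what is in the image, while a gate near 0.0 makes it depend mostly on the words generated so far.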
To ensure that the generated captions do not sound too robotic, the team used a generative adversarial network (GAN) to train the model. A GAN is a two-part neural network consisting of a generator that produces samples and a discriminator that tries to distinguish generated samples from real-world ones. A co-attention discriminator scores how natural the sentences are, using a model that matches scenes at the pixel level with the generated words.
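The tug-of-war between the two parts of a GAN can be made concrete with the textbook adversarial losses. The sketch below assumes the discriminator outputs a probability between 0 and 1 that a caption is real; it uses the standard GAN objective with the common non-saturating generator loss, which is not necessarily the exact objective used in the IBM paper. The function names and the example scores are hypothetical.

```python
import math

def discriminator_loss(real_scores, fake_scores):
    """Low when real captions score near 1 and generated captions score near 0."""
    real_term = sum(math.log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return -(real_term + fake_term)

def generator_loss(fake_scores):
    """Low when the discriminator scores generated captions as real (near 1)."""
    return -sum(math.log(s) for s in fake_scores) / len(fake_scores)

# Hypothetical discriminator outputs for two real and two generated captions.
real_scores = [0.9, 0.8]
fake_scores = [0.2, 0.3]

print("discriminator loss:", discriminator_loss(real_scores, fake_scores))
print("generator loss:", generator_loss(fake_scores))
```

Training alternates between lowering the discriminator loss and lowering the generator loss, so the generator is pushed toward captions the discriminator cannot tell apart from human-written ones.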
Preventing bias
Another common problem in such systems is bias. For example, a model may overfit to a specific dataset, after which it cannot handle scenes in which familiar objects appear in contexts the model has not seen.
To guard against that, IBM Research built a diagnostic tool. The researchers proposed a test corpus of captioned images designed so that a model’s poor performance on it indicates overfitting to the dataset.
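One simple way to read such a diagnostic is to compare a model's average score on its usual test set with its average score on the diagnostic corpus. The sketch below assumes each caption has already been given a quality score between 0 and 1 by some metric; the function name `overfitting_gap` and all the numbers are invented for illustration and do not come from the paper.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def overfitting_gap(standard_scores, diagnostic_scores):
    """Average score on the usual test set minus average score on the diagnostic corpus."""
    return mean(standard_scores) - mean(diagnostic_scores)

# Hypothetical per-caption scores, for illustration only.
standard_scores = [0.8, 0.9]      # familiar objects in familiar contexts
diagnostic_scores = [0.4, 0.5]    # familiar objects in unfamiliar contexts

gap = overfitting_gap(standard_scores, diagnostic_scores)
print("gap:", gap)
```

A large positive gap would suggest that the model has learned the dataset's typical contexts rather than the objects themselves.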
The final model was evaluated by workers on Amazon’s Mechanical Turk. They had to indicate which captions were generated by the AI model and rate how well each caption described the corresponding image. The evaluators were shown both real and generated examples. The researchers state that their model had a “good” performance, and they believe it could be the beginning of powerful new computer vision systems.