Generative AI, like other areas of AI, is in constant flux, so you have to follow what is happening quite closely. I want to cover some new developments in this area from 2023 that I was not able to cover in the 8 articles of my Generative AI series.
OpenAI updates
OpenAI has released several updates to its products. The flash news item in November was the ousting of OpenAI’s CEO, Sam Altman, by the Board of Directors, followed by a few days of absolute chaos, after which Altman came back and some board members had to resign.
ChatGPT
OpenAI has optimised the user interface to make it more usable.
OpenAI has updated the Pro version (using GPT-4) to include browsing and analysis in real time.
OpenAI has introduced GPTs, which are “Custom versions of ChatGPT that combine instructions, extra knowledge and capabilities for a specific purpose.” according to OpenAI. Pro users are able to create their own models.
OpenAI has already deployed its own customized GPTs ahead of opening the feature to Pro users. There are versions for DALL-E, classic ChatGPT (with no extensions), Data Analysis (to analyse your data files), Game Time (an instructional system teaching board games or card games), and Math Mentor (which teaches math), among other models.
I think one of the advantages of these customized models will be to reduce the probability of “hallucinations”, that is, GPT making things up since it has no notion of “truth”. By introducing custom training (most likely reinforcement learning from human feedback), OpenAI can better ground the responses instead of relying solely on the word vectors previously used for the LLM.
GPT-4 Turbo is a new GPT-4-based LLM with a 128K-token context window (so in theory it can hold longer dialogues without losing track of earlier context) and a training cutoff date of April 2023. It also has vision-related capabilities such as image recognition. The released versions are labelled as previews.
Data Partnerships
OpenAI has announced that it wants to partner with other organisations in order to curate data in domains that are hard to reach through the Internet.
Q* and AGI
Information about a very different model called Q* has been leaked (according to a report from Reuters) and could have been the main reason for the OpenAI Board’s attempted coup against Sam Altman. The speculation is that Q* represents a step toward AGI, which, as defined by OpenAI, refers to autonomous systems that outperform humans in most “economically valuable tasks”.
Stability AI
Video Diffusion
Stable Video Diffusion is a new diffusion model for video generation from Stability AI. It comes in two variants that generate 14 or 25 frames, with a customizable frame rate between 3 and 30 frames per second.
SDXL Turbo
A new text-to-image model built on top of the SDXL model was released in late November. It is one of the fastest text-to-image models released so far.
Mistral AI
This small French company was started in August 2023 and got more than $100M in investment in the few months after its launch. They claim that their open-source model (Mistral 7B) outperforms the previous best open-source model (Llama 2) on many benchmarks. Their model has 7.3 billion parameters and is already fine-tuned for chat, but being open source, it can be fine-tuned for any other specific task.
It uses a sliding window attention (SWA) mechanism (see the paper on generating long sequences with sparse transformers and another one on long-document transformers). Since one of the limitations of transformer models is the context window size, this change reduces the memory requirements and improves the performance of the model.
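The idea behind sliding window attention can be shown with a minimal sketch: each token attends only to the previous few tokens instead of the whole prefix. This is an illustrative toy in plain Python, not Mistral's actual implementation, and the window size here is invented for demonstration.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Build a boolean attention mask for sliding window attention.

    mask[i][j] is True when token i may attend to token j, i.e. when
    j is not in the future (j <= i) and lies inside the window
    (i - j < window). Full causal attention is the special case
    window >= seq_len.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# With a window of 3, token 5 sees only tokens 3, 4 and 5,
# so attention cost grows with seq_len * window instead of seq_len**2.
mask = sliding_window_mask(seq_len=6, window=3)
```

Because each row of the mask has at most `window` True entries, the attention computation and its memory footprint scale linearly in sequence length for a fixed window, which is what makes longer contexts affordable.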
Some reviewers claimed that it was the best generative AI model produced, so we’ll see if the claims hold in the near future.
RealFill
This project is a collaboration between Google and Cornell University. Its specific purpose is to complete a photograph that has missing regions, using information from other photographs taken in the same place. This task (called generative in-painting) is thus very different from other image-generation tasks, where normal generation could largely modify the photograph.
This is how the project website describes the method: “Given a few reference images (up to five) and one target image that captures roughly the same scene (but in a different arrangement or appearance), we aim to fill missing regions of the target image with high-quality image content that is faithful to the originally captured scene. Note that for the sake of practical benefit, we focus particularly on the more challenging, unconstrained setting in which the target and reference images may have very different viewpoints, environmental conditions, camera apertures, image styles, or even moving objects.”
There is a detailed paper explaining the theory and results.
Grok
Elon Musk’s AI company xAI has announced a new generative AI model, Grok. Musk himself posted on X to declare that Grok has access to real-time information through X. According to the model card that can be reached through the xAI page, Grok is an autoregressive transformer-based model (just like the other big players’ models in this field). Its training data goes up to Q3 2023, and it has 63 billion parameters (twice that of GPT-3.5). xAI claims that Grok-1 is better than most LLMs, except for GPT-4. Beta access is now being granted in the U.S. I’ll be watching this carefully.
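“Autoregressive” simply means the model generates one token at a time, each conditioned on everything produced so far. Here is a toy sketch of that loop; the `next_token` function is a fake stand-in for a real transformer forward pass, and its tiny lookup table is invented purely for illustration.

```python
def next_token(context: list[str]) -> str:
    """Fake 'model': predicts the next token from the last one.

    A real autoregressive transformer would run a forward pass over
    the whole context; this lookup table just mimics that interface.
    """
    continuation = {"the": "sky", "sky": "is", "is": "blue"}
    return continuation.get(context[-1], "<eos>")

def generate(prompt: list[str], max_new_tokens: int = 10) -> list[str]:
    """Greedy autoregressive decoding: append one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # condition on the full prefix so far
        if tok == "<eos>":        # stop when the model signals the end
            break
        tokens.append(tok)
    return tokens
```

Calling `generate(["the"])` walks the chain until the fake model emits its end-of-sequence marker; all of the large models mentioned in this article (GPT-4, Grok, Claude, Mistral) share this same generation loop, differing in what computes `next_token`.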
Midjourney
Midjourney is one of the successful models, but one handicap it has is that it is used through the Discord application. Discord is a very popular tool, especially among developers, but it is a bit cumbersome for the regular user. Midjourney now has a separate website (in beta) that allows users to request tasks with text prompts (ChatGPT-style), making it easier and more intuitive to use.
Runway
Gen-2 of this video-generation software has improved a lot. The text-to-video and image-to-video options generate quite impressive videos with built-in style and camera motion options, although I could only use them in trial mode with short sequences. I’ve included an 8-second clip generated with Runway below.
Claude 2.1
Anthropic has released Claude 2.1, a new version of their Claude model. Claude 2.1 has a 200K-token context window and halves the hallucination rate compared to its predecessor.
Microsoft Orca 2
Microsoft is very much involved in LLMs, having a big stake in OpenAI and actively using GPT-4 in Bing. But it is also developing so-called Small Language Models (SLMs), models that are much smaller than LLMs but can be fine-tuned for special tasks. Microsoft has now released Orca 2, their latest SLM, which is rumoured to run Microsoft Copilots, Microsoft’s new AI assistants used for purposes like chat (Bing Chat) or supporting business systems (Microsoft Copilot for Microsoft 365), and so on.
The first version of Orca was able to match the performance of models like GPT-3. It uses several techniques that attempt to learn/copy the reasoning of a “teacher” model, typically a much larger LLM. The motivation is to avoid training and running huge LLMs, which is extremely costly.
Among the techniques used by Orca are instruction tuning, explanation tuning, and prompt erasing. Prompt erasing means withholding the detailed prompts given to the teacher model (the LLM) that explain how a task should be accomplished, thus forcing the student model (the SLM) to learn the underlying strategy itself. Microsoft calls Orca 2 a “cautious reasoner”.
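Prompt erasing is easiest to see as a data-preparation step. The sketch below is a hypothetical illustration, not Microsoft's actual pipeline: the function names, prompts, and example data are all invented. The teacher answers under a detailed system prompt spelling out the strategy; the student is trained on the same task and answer, but with that prompt replaced by a generic one, so it must internalise the strategy.

```python
# Illustrative "prompt erasing": strip the teacher's strategy prompt
# from a training example before giving it to the student model.

DETAILED_PROMPT = (
    "Break the problem into steps, solve each step, "
    "then combine the partial results into a final answer."
)
GENERIC_PROMPT = "You are a helpful assistant."

def erase_prompt(teacher_example: dict) -> dict:
    """Convert a teacher training example into a student one.

    The task and the reasoning trace are kept; only the detailed
    system prompt that revealed the strategy is withheld.
    """
    return {
        "system": GENERIC_PROMPT,                   # strategy withheld
        "user": teacher_example["user"],            # same task
        "assistant": teacher_example["assistant"],  # same reasoning trace
    }

teacher_example = {
    "system": DETAILED_PROMPT,
    "user": "What is 17 * 24?",
    "assistant": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
}
student_example = erase_prompt(teacher_example)
```

The student still sees the step-by-step answer it should imitate, but never the instructions that produced it, which is what pushes the smaller model to learn when and how to reason rather than merely follow a visible recipe.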