Google
Gemma-2
Google has released the family of LLMs announced at their Google I/O 2024 event to developers and scientists, in 9B-parameter and 27B-parameter versions. A 2.6B-parameter model will follow in the future. You can see the white paper about Gemma 2 here.
V2A
Google announced a new model that generates audio for a video. The model can generate an unlimited number of audio outputs for a video, and the user can guide it towards certain sounds with descriptive text prompts. It uses a diffusion model for the audio generation. It is not yet available to the public, but there are many good samples on the model’s web page. It is designed to pair with Google’s video-generation model Veo.
Here is how Google describes the process they use for Video-to-Audio.
Perplexity
Perplexity has updated their AI Search tool with new functionality. It now offers multi-step reasoning, approaching complex problems by planning, working through goals step by step, and synthesising in-depth answers. It also integrates the Wolfram|Alpha engine, so it can solve advanced mathematical questions with greater accuracy and speed.
Odyssey
A new AI company, Odyssey, plans to build models that Hollywood professionals could use. The two founders come from the self-driving car field, so they are no strangers to AI. Odyssey’s website explains that they use four separate models, each of which controls one aspect of visual storytelling: high-quality geometry, photorealistic materials, lighting, and controllable motion. All of these aspects are highly customisable, so the experience becomes more than just text-to-video.
There are some impressive clips on the site but the tool is not yet available. Something to keep an eye on.
Groq
Groq has improved their LLM inference platform (not to be confused with Elon Musk’s xAI-built model Grok) and lets users try out their chatbot with several models, ranging from Meta’s Llama 3 70B and 8B to Google’s Gemma 2 9B and Gemma 7B, Mixtral 8x7B, and OpenAI’s Whisper model, all also accessible through the Groq API. I tried the chatbot and found it to be quite responsive. However, it seems to support only text: when I asked it to generate an image reflecting the main themes of a certain book, it gave me a detailed - and quite meaningful - description of what such an image might look like, which I later realised using OpenAI’s DALL-E 3.
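As a rough sketch, Groq’s API follows the OpenAI-style chat-completions format. The snippet below only assembles the request payload locally; the model identifier `llama3-70b-8192` and the commented-out client call are assumptions based on Groq’s naming at the time of writing, so check the current model list before using them.

```python
# Minimal sketch of a Groq chat-completions request. The model name is
# illustrative and may change; consult Groq's model list.
def build_chat_request(prompt: str, model: str = "llama3-70b-8192") -> dict:
    """Assemble a request body for Groq's OpenAI-compatible chat API."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
    }

# The actual call would need the `groq` package and a GROQ_API_KEY, e.g.:
# from groq import Groq
# client = Groq()
# resp = client.chat.completions.create(**build_chat_request("Hello"))
# print(resp.choices[0].message.content)
```

Because the interface mirrors OpenAI’s, the same payload shape works across the hosted Llama, Gemma and Mixtral models by swapping the model name.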
SenseTime
China’s SenseTime (based in Hong Kong) has announced version 5.5 of their multi-modal model SenseNova, reported to beat OpenAI’s GPT-4o in some benchmarks. I checked their website, but the model is mostly found embedded in applications for specific vertical business sectors.
OpenAI
Board Membership
Both Apple and Microsoft have relinquished their observer seats on the OpenAI board due to the potential regulatory scrutiny that keeping such links might attract.
GPT-4o mini
OpenAI has announced the GPT-4o mini model. OpenAI is targeting developers and users who have been using GPT-3.5, emphasising that this new model, a smaller version of GPT-4o, is 60% cheaper than GPT-3.5 in the API (GPT-3.5 is free for chatbot users).
GPT-4o mini supports text and vision in the API, with support for text, image, video and audio inputs and outputs coming in the future. It has a 128K-token context window and a 16K-token output limit.
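To illustrate the text-plus-vision input mentioned above, here is a minimal sketch of a GPT-4o mini request in OpenAI’s chat-completions message format. The payload is built locally; the image URL is a placeholder, and the commented-out client call assumes the `openai` package and an `OPENAI_API_KEY`.

```python
# Sketch of a GPT-4o mini request combining text and an image input,
# following OpenAI's chat-completions content-list format.
def build_vision_request(question: str, image_url: str) -> dict:
    """Assemble a request body with one text part and one image part."""
    return {
        "model": "gpt-4o-mini",
        "max_tokens": 16_000,  # the model allows up to ~16K output tokens
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

# With the `openai` package and OPENAI_API_KEY set:
# from openai import OpenAI
# resp = OpenAI().chat.completions.create(**build_vision_request(
#     "What is in this picture?", "https://example.com/photo.jpg"))
# print(resp.choices[0].message.content)
```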
SearchGPT
OpenAI has announced a search engine based on their GPT-4 model. It is currently a prototype and is eventually supposed to be integrated into ChatGPT. There is a waitlist at the moment. OpenAI says that SearchGPT will present results with clear explanations and sources.
Meta
3DGen
Meta’s 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute. It supports physically-based rendering (PBR), necessary for 3D asset relighting in real-world applications. Additionally, 3DGen supports generative retexturing of previously generated (or artist-created) 3D shapes using additional textual inputs provided by the user.
I was not able to try it out, but this is something I would check in the future.
Llama 3.1
Meta announced Llama 3.1, billed as “the world’s largest open-source LLM”. The largest version has 405B parameters, alongside two other versions with 70B and 8B parameters. The context window has been increased to 128K tokens. It seems to beat OpenAI’s GPT-4o in all but a few benchmarks. It was trained on a massive cluster of 16,000 of Nvidia’s AI-specific H100 GPUs.
I tested Llama 3.1 through the Groq Playground. The 405B model was not available, so I had to use the 70B-parameter version. I gave Llama 3.1 the task of planning the development of an Apple Vision Pro application that would be a kind of multimodal treasure hunt. The initial paragraphs were okay, providing rather generic advice on the steps to build such an application, but then the model went off on a tangent and spewed out paragraphs such as the ones below:
I had to ask Llama about this. I laughed at the response and attempt at self-criticism:
As usual, proper prompting with timely intervention once more saved a doomed task with an LLM. This happened in the Groq playground. I then tried the same in Chat mode (the difference being that in the playground you use the API, whereas in Chat you use the chatbot itself). This was more successful: the hallucination, or reverie, pattern did not appear even on the first call.
Anthropic
Anthropic has updated its Claude 3.5 Sonnet model to enable users to share the artifacts they generate in a chat session with others.
Kling
The Chinese video generation model Kling is now available for free outside China as well.
Mistral
Mistral has announced their newest LLM, Mistral Large 2. According to Mistral, the new model “is significantly more capable in code generation, mathematics, and reasoning. It also provides a much stronger multilingual support, and advanced function calling capabilities.” With a 128K context window, it performs on par in coding and reasoning with the most successful current models, such as GPT-4o, Claude 3 Opus and Llama 3 405B.
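To give a feel for the function-calling capability Mistral highlights, below is a sketch of a tool definition in the JSON-schema style that Mistral’s API shares with OpenAI’s. The `get_weather` function is entirely hypothetical, purely for illustration; with the `mistralai` client you would pass such definitions in the `tools` parameter of a chat call.

```python
# Hypothetical tool definition for function calling, in the JSON-schema
# format used by Mistral's (and OpenAI's) chat APIs.
def weather_tool() -> dict:
    """Describe an illustrative get_weather function the model may call."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }

# In a chat request you would pass tools=[weather_tool()] alongside the
# messages; the model may then answer with a structured tool call (name
# plus JSON arguments) instead of plain text, which your code executes.
```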
Udio
Udio has released their new version 1.5, adding new features including better audio quality, improved creation workflows, stem downloads (separate instrument and vocal tracks), key guidance, and audio-to-audio generation.
When I tried video generation, it did not work on tracks created with older versions, but it worked perfectly well with newly created tracks. Below is an example, a track about a researcher trying to fathom the limits of artificial intelligence.
Apple
Apple Intelligence
Apple has released its iOS 18.1, iPadOS 18.1 and macOS 15.1 developer beta operating systems with Apple Intelligence, for qualified developers in the U.S. Other developers will get it later, and users will only get these features if they have a qualified device (an iPhone 15 Pro or better, an iPad with an Apple silicon chip, or a Mac with an Apple silicon chip), in a future update after the iPhone 16 is released, since the iPhone 16 will initially ship without these features. Some of the improvements in the current beta are described below (I am a developer, but not in the U.S., and as such will have to wait for a future developer release). I am assuming that these features already use an on-device model that Apple runs.
Writing Tools
Proofreading, rewriting and summarising text are actions that can be performed in most applications with editable text fields.
Siri
Siri has been greatly improved: it now has knowledge about Apple devices and software, so you can get much better help.
Mail
Any email can be summarised, and emails can be prioritised based on their content.
Messages
Messages can be answered more or less automatically, with Apple Intelligence analysing the content and providing suggested replies.
Photos
In addition to the existing metadata searches, Photos can now search photos using free-text prompts.
Transcription
In applications that record audio, a transcript can be generated automatically.
Web browsing
It is possible to summarise articles in Safari.