Google Gemini Arrives in Force: Strengthening the Search Moat, Empowering the Whole Product Family, and Upgrading Gemini 1.5 Pro to 2 Million Tokens

In mid-March, Google announced that Google I/O would open at 1:00 a.m. Beijing time on May 15. Just as the conference approached and should have dominated the conversation, its old rival OpenAI stepped in one day early and, in a keynote of just 27 minutes, unveiled the disruptive GPT-4o, pushing the AI race into its "Her era".
As Nvidia scientist Jim Fan put it in his assessment of GPT-4o, releasing it ahead of Google I/O was a shrewd move by OpenAI to buy itself more time.
Public-relations maneuvering aside, OpenAI's sudden push may also suggest that Google's Gemini has reached the voice-interaction stage. Shortly before Google I/O opened, Google's official account posted a video of voice interaction with Gemini: in the demo, Gemini not only recognizes scenes in real time through the phone's camera but also holds a fluent voice conversation.
Google's intent in releasing the demo is self-evident, though some commenters questioned whether the video was staged; after all, Google has a prior record on this front. In today's keynote, Pichai did not show a live test of the voice-interaction feature, but once again demonstrated GPT-4o-like capabilities through a recorded demo.
In a nearly two-hour keynote, Google CEO Sundar Pichai and a lineup of executives ran through:
- Gemini 1.5 Pro Updates
- Gemini 1.5 Flash
- Project Astra
- AI Overviews
- Veo and Imagen 3
- …
Click to watch the full replay: [Chinese-English] Google I/O 2024 Keynote (full version) | Gemini 1.5 Pro reshapes the search engine and upgrades to 2 million tokens!
Gemini 1.5 Pro: Expanding to 2 million tokens
With GPT-4o released in the early hours of yesterday morning, the industry has largely absorbed the shock of real-time conversation with a large model, which also means OpenAI has raised the bar for the whole field, and Google has to keep pace. As Google's "largest and most capable" AI model, Gemini has to carry that weight for the company.

In February this year, Google announced Gemini 1.5, whose Pro tier supports an ultra-long context of up to 1 million tokens, at the time a clear lead over contemporary large models. Today, Google pushed the context window further still: Pichai announced that Gemini 1.5 Pro's context window will be expanded to 2 million tokens, available to developers in private preview.

Pichai also announced that the improved Gemini 1.5 Pro is now available to all developers worldwide, and that the 1-million-token version is available directly to consumers in Gemini Advanced, which supports 35 languages.
Pichai added that Gemini 1.5 Pro has been strengthened over the past few months through algorithmic improvements, with significant gains in code generation, logical reasoning and planning, multi-turn dialogue, and audio and image understanding. In the Gemini API and AI Studio, Gemini 1.5 Pro can now reason over audio in addition to images and video, and its behavior can be steered through a feature called system instructions.
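For developers, steering the model with a system instruction is just a parameter on the API call. Below is a minimal sketch assuming the google-generativeai Python SDK; the API key, audio file name, and prompt are placeholders, and the exact model string may differ depending on rollout.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# A system instruction steers the model's behavior across the whole session.
model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",  # long-context Pro tier; exact string may vary
    system_instruction="You are a meticulous meeting assistant. "
                       "Answer only from the provided recording.",
)

# upload_file sends large media (audio/video) to the File API so it can be
# referenced in a prompt instead of inlining the raw bytes.
recording = genai.upload_file("meeting.mp3")  # placeholder file

response = model.generate_content(
    [recording, "Summarize the key decisions made in this meeting."]
)
print(response.text)
```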

Pichai then walked through Gemini's updates in Google Workspace: Gemini in Gmail, Docs, Drive, Slides, and Sheets is being upgraded to Gemini 1.5 Pro; the Gmail mobile app is gaining new features (email summarization, contextual smart replies, and Gmail Q&A); and "Help me write" adds support for multiple voices.
Gemini 1.5 Flash: 1 million tokens, ultra-long context, multi-modality
Just as everyone assumed the Gemini 1.5 update stopped there, DeepMind CEO Demis Hassabis took the stage with the first surprise of the day: Gemini 1.5 Flash.

The lightweight Gemini 1.5 Flash is a distilled version of Gemini 1.5 Pro, optimized for high-volume, high-frequency tasks at lower cost while keeping the breakthrough long context window. Like Gemini 1.5 Pro, Gemini 1.5 Flash is multimodal: it can analyze audio, video, and images as well as text.
Demis Hassabis said Gemini 1.5 Flash excels at tasks such as summarization, chat applications, image and video captioning, and extracting data from long documents and tables. That is because it was trained by distillation from Gemini 1.5 Pro, transferring the most essential knowledge and skills from the larger model into a smaller, more efficient one.
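Google has not published Flash's training recipe, but the generic distillation objective it alludes to is easy to illustrate: the small model learns to match the temperature-softened output distribution of the large one. A minimal PyTorch sketch, with toy tensors standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the student's and the teacher's
    temperature-smoothed output distributions (soft-label distillation)."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2

# Toy usage: a batch of 4 examples over a 10-token vocabulary.
teacher_logits = torch.randn(4, 10)                      # frozen larger model
student_logits = torch.randn(4, 10, requires_grad=True)  # smaller model in training
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```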

Demis Hassabis also shared updates on Gemma: Google announced Gemma 2, the next generation of its open models, which adopts a new architecture for breakthrough performance and efficiency and will arrive in new sizes at its official release in June.
Project Astra: Real-time, multimodal AI Agent
Among the leaks and speculation ahead of Google I/O, an AI assistant called Pixie drew high expectations: according to media reports, Google was expected to launch a new Gemini-powered Pixel assistant by that name, with multimodal capabilities and more personalized service drawn from on-device information such as Maps or Gmail.
However, Pixie did not materialize; in its place came Project Astra, with multimodal understanding and real-time conversational ability.

Demis Hassabis said Google has made encouraging progress in building AI systems that understand multimodal information, but shrinking response times to the point where real-time conversation is possible remains challenging. Over the past few years, the team has worked to improve how the model perceives, reasons, and converses, so that the cadence and quality of interaction feel more natural.
The team has now built prototype agents on top of Gemini that speed up information processing by continuously encoding video frames, merging video and voice input into a single event timeline, and caching that information for efficient recall.
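Google has not disclosed Project Astra's internals, but the pipeline described above (encode frames as they arrive, fold them and the speech stream into one timeline, answer from that cache) can be sketched in a few lines. Everything below is hypothetical and purely illustrative; none of the names correspond to a real Google API.

```python
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str        # "frame" or "speech"
    embedding: list  # encoded representation of the input
    timestamp: float = field(default_factory=time.time)

class EventTimeline:
    """Rolling cache that merges encoded video frames and speech snippets
    into one chronological context the agent can query cheaply."""
    def __init__(self, max_events: int = 512):
        self.events = deque(maxlen=max_events)

    def add(self, kind: str, embedding: list) -> None:
        self.events.append(Event(kind, embedding))

    def recent(self, seconds: float = 30.0) -> list:
        cutoff = time.time() - seconds
        return [e for e in self.events if e.timestamp >= cutoff]

def agent_step(timeline, encode_frame, encode_speech, respond, frame, audio=None):
    # Encode incoming media continuously instead of reprocessing raw video later.
    timeline.add("frame", encode_frame(frame))
    if audio is not None:
        timeline.add("speech", encode_speech(audio))
    # Answer from the cached timeline, which keeps response latency low.
    return respond(timeline.recent())
```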

At the same time, Google used its speech models to give the agent a wider range of intonation, allowing it to identify the context of use and respond quickly in conversation.
This inevitably recalls the new ChatGPT that OpenAI demonstrated early yesterday morning, which also converses in real time and adjusts its tone to the situation or the user's request. Unlike Google's video demonstration, ChatGPT was tested live on stage and fielded many of the questions most hotly discussed online. GPT-4o-based ChatGPT is now free for all users, though its audio and video features have yet to go live, reportedly over privacy concerns.
Veo and Imagen 3: Video + Image
Google also launched its latest video generation model Veo and high-quality text-to-image model Imagen 3.
Of the two, Veo is Google's most capable video generation model to date, seemingly positioned to go head-to-head with Sora.
Veo can generate 1080p videos longer than a minute in a wide range of cinematic and visual styles. Google says that, with a deep understanding of natural language and visual semantics, the videos it generates faithfully realize the user's creative intent, accurately capturing the tone of a prompt and rendering the details in longer prompts.
The footage Veo creates is also consistent and coherent, so the movement of people, animals, and objects across a shot looks more realistic.
Technically, Veo builds on Google's years of work on generative video models, including GQN, DVD-GAN, Imagen-Video, Phenaki, WALT, VideoPoet, and Lumiere, combining their architectures, scaling recipes, and other techniques to improve quality and output resolution.

Imagen 3, likewise, is Google's highest-quality text-to-image model. It better understands natural language and the intent behind a prompt, and can incorporate the small details in longer prompts; this higher-level understanding also helps the model handle a wide variety of styles.
AI Overviews: The era of big models in Google search
Twenty-five years ago, Google Search was created to help people make sense of the web's sprawling information, letting them search for answers to all kinds of questions. Today, Gemini is pushing Google Search to a new level, redefining how people acquire knowledge and get answers.
As Google put it at the conference: "Whatever's on your mind, whatever you need to accomplish, just ask, and Google will do the searching for you."

Google holds over a trillion pieces of real-time information about people, places, and things, and with its trusted quality systems it can surface the best content on the web. Adding Gemini unlocks new agent capabilities in Search and expands what Google Search can do.
The most eye-catching announcement is AI Overviews: "With AI Overviews, users no longer need to piece all the information together themselves after asking a question. Google Search gives you an overview of the information, with multiple viewpoints and links for deeper exploration."
Liz Reid, vice president of Google Search, said at the conference, “AI Overviews will be available to everyone in the United States starting today, and it is expected that by the end of this year, AI Overviews will serve more than 1 billion Google search users worldwide.”
"In fact, this is just the first step. We are enabling AI Overviews to tackle more complex questions. To make that possible, we are introducing multi-step reasoning in Google Search."

In simple terms, multi-step reasoning breaks a user's overall question down into its component parts and decides which ones to solve in which order; Google Search then reasons over each sub-question using the best available real-time information and its ranking signals.
For example, when a user asks about a place, Google Search answers from real-world information covering more than 250 million locations, along with their ratings, reviews, opening hours, and so on. Research that would take a user minutes or longer, Search can finish in seconds.
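Google did not show code for multi-step reasoning, but the idea (decompose the question, answer each sub-question from fresh ranked information, then compose one overview) can be mocked up in a toy script. The plan and the index below are hard-coded placeholders, purely for illustration, not Google's implementation.

```python
# Toy illustration of multi-step reasoning over a (mocked) real-time index.
MOCK_INDEX = {
    "well-rated yoga studios in Boston": ["Studio A (4.8 stars)", "Studio B (4.6 stars)"],
    "walking time from Beacon Hill to Studio A": ["about 12 minutes"],
}

def plan(question: str) -> list[str]:
    # In practice a model produces this ordered decomposition; it is hard-coded here.
    return [
        "well-rated yoga studios in Boston",
        "walking time from Beacon Hill to Studio A",
    ]

def search(sub_query: str) -> list[str]:
    # Stand-in for a retrieval call against real-time, ranked information.
    return MOCK_INDEX.get(sub_query, [])

def answer(question: str) -> str:
    findings = {q: search(q) for q in plan(question)}
    return "\n".join(f"{q}: {', '.join(hits) or 'no results'}"
                     for q, hits in findings.items())

print(answer("Find a well-rated yoga studio within walking distance of Beacon Hill"))
```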

Beyond basic information retrieval, Google Search can also handle advanced reasoning and logical planning, helping users organize dining, travel, parties, dates, workouts, and more, making everyday life a little easier.
Finally, for questions that are hard to express precisely in text or pictures, Google has an answer too: the ability to ask questions with video is launching soon, which means the Google Search interface will only grow more versatile.
Trillium: 4.7 times more computing performance per chip
According to Reuters, Nvidia holds about 80% of the AI data-center chip market, and most of the remaining 20% consists of various versions of Google's TPUs. Google does not sell the chips themselves, though; it rents access to them through its cloud platform.

Unveiling a new generation of TPU, an important business for the company, has become something of a Google I/O tradition. Today, Pichai introduced Trillium, Google's sixth-generation TPU, calling it the company's best-performing and most efficient TPU to date: per-chip compute performance is 4.7 times that of the previous-generation TPU v5e. Google also pledged to make Trillium available to cloud customers by the end of 2024.
Google achieved the boost in part by enlarging the chip's matrix multiply unit (MXU) and raising the overall clock speed, TechCrunch reported; it also doubled Trillium's memory bandwidth.
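To see how a larger MXU and a higher clock multiply into a per-chip speedup, here is a back-of-the-envelope calculation. The MXU dimensions and clock speeds below are hypothetical placeholders, not published v5e or Trillium specifications; they merely show how a figure around 4.7x could arise.

```python
def peak_matmul_flops(mxu_rows: int, mxu_cols: int, clock_hz: float) -> float:
    """Peak dense-matmul throughput of a systolic array:
    each MAC cell does one multiply and one add per cycle (2 FLOPs)."""
    return 2 * mxu_rows * mxu_cols * clock_hz

# Hypothetical numbers chosen only to illustrate the scaling, not real specs.
baseline = peak_matmul_flops(128, 128, clock_hz=0.94e9)  # smaller MXU, slower clock
upgraded = peak_matmul_flops(256, 256, clock_hz=1.10e9)  # 4x the MACs, ~17% faster clock

print(f"per-chip speedup ≈ {upgraded / baseline:.1f}x")  # ~4.7x with these numbers
```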
Pichai added that the company released its new Axion processor last month, Google's first custom Arm-based CPU, with industry-leading performance and energy efficiency.

Later, Pichai announced that Google is partnering with Nvidia and will roll out the Blackwell chip in 2025.
AI for Science: AlphaFold 3 may be open source
DeepMind co-founder and CEO Demis Hassabis said, "We founded DeepMind to explore whether computers can think like humans, and to build artificial general intelligence."

Looking back over past achievements, from RT-2, which translates vision and language into robot actions, to SIMA, a game-playing AI agent that follows natural-language instructions across many video game environments, to AlphaGeometry, which solves Olympiad-level geometry problems, and GNoME, which discovers new materials, Demis Hassabis said: "I have always believed that if we can build AGI responsibly, it will benefit humanity in incredible ways."

Hassabis also highlighted the recently launched AlphaFold 3, which predicts the structures and interactions of all of life's molecules (proteins, DNA, RNA, ligands, and more) with unprecedented accuracy, a major breakthrough in modeling many different kinds of molecular interactions and one that matters enormously for work such as accurately identifying drug targets.
When AlphaFold 3 was first released, Google had no plans to open-source the full code; it offered only a public AlphaFold Server interface to support non-commercial research, opening the door to researchers around the world.

However, less than a week after that release, a vice president of research at Google DeepMind abruptly announced: "We will release the AF3 model (including weights) within 6 months for academic use!" Google unveiled this open-source plan the day before the I/O conference. Whether it came under pressure from OpenAI or was meant to build momentum for the event, open-sourcing AlphaFold 3 carries far-reaching significance for the life and health sciences.
Going forward, HyperAI will continue to track Google's latest moves in AI for Science; interested readers can follow our official account for in-depth coverage.
Final Thoughts
With that, the two-day AI carnival draws to a close. But the contest between OpenAI and Google will not stop: where is GPT-5's performance ceiling? Can Gemini's ultra-long context be pushed even further? Will an OpenAI search engine threaten Google's position...
Nvidia scientist Jim Fan commented: "Google is getting one thing right: they are finally making a serious effort to integrate AI into the search box. I can feel the agents: planning, real-time browsing, and multimodal input, all from the landing page. Google's strongest moat is distribution. Gemini doesn't have to be the best model; it just has to be the most widely used model in the world."
Indeed, looking back over the whole keynote, my strongest impression is that in the era of large models, search may still be Google's greatest source of confidence.