Google DeepMind's Bold Moves in AI Safety and Alignment

Google DeepMind's AGI Safety & Alignment team has been making significant strides in AI safety research. Originally focused on addressing existential risks from AI systems, the team has since broadened its scope and now includes subteams dedicated to mechanistic interpretability, scalable oversight, and Frontier Safety. It has also grown substantially, expanding by 39% last year and 37% so far this year. Led by seasoned experts, the team operates within the broader AI Safety and Alignment organization at Google DeepMind. This article delves into their recent work, key motivations, and the importance of their initiatives.

The AGI Safety & Alignment team has concentrated on three main areas over the past 18 months: amplified oversight, frontier safety, and mechanistic interpretability. These areas are crucial for aligning models to prevent catastrophic risks and for evaluating whether models possess dangerous capabilities. Additionally, the team has been experimenting with new ideas to identify future areas of focus.

The Frontier Safety team aims to ensure safety from extreme harms by anticipating and evaluating powerful capabilities in frontier models. Their work has primarily centered around misuse threat models, but they are also exploring misalignment threat models. The recently published Frontier Safety Framework (FSF) is a cornerstone of their efforts. This framework follows a responsible capability scaling approach, similar to policies from other leading AI organizations. However, the FSF is uniquely tailored to Google's diverse frontier LLM deployments, affecting stakeholder engagement, policy implementation, and mitigation plans. The team is proud of leading the Google-wide strategy in this space, demonstrating that responsible capability scaling can work for large tech companies as well as small startups.

A key focus of the FSF is mapping critical capability levels (CCLs) to appropriate mitigations. This is high on the team's priority list as they iterate on future versions of the framework. Although the FSF stops short of specific commitments, reflecting how early the underlying science still is, the team emphasizes that the actual work being done is what truly matters. They have conducted and reported dangerous capability evaluations for Gemini 1.5, ruling out extreme risk with high confidence.

The team's paper on Evaluating Frontier Models for Dangerous Capabilities is the most comprehensive suite of evaluations published to date. These evaluations have informed the design of similar evaluations at other organizations. The team regularly runs and reports these evaluations on their frontier models, including Gemini 1.0, Gemini 1.5, and Gemma 2. Their work has set the bar for transparency around evaluations and FSF implementation, and they hope other labs will adopt a similar approach.
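To make the shape of this workflow concrete, here is a minimal, hypothetical sketch of an evaluation harness that scores a model on a battery of capability tasks and flags whether a pre-registered threshold is crossed. The `EvalTask` structure, `query_model` stub, and threshold logic are illustrative assumptions, not the team's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    prompt: str
    passes: Callable[[str], bool]  # returns True if the model's answer completes the task

def query_model(prompt: str) -> str:
    """Placeholder for a call to the frontier model under evaluation."""
    raise NotImplementedError

def run_capability_eval(tasks: list[EvalTask], threshold: float) -> dict:
    """Run every task, compute the success rate, and flag whether it crosses
    the pre-registered capability threshold (a stand-in for a CCL trigger)."""
    successes = sum(task.passes(query_model(task.prompt)) for task in tasks)
    rate = successes / len(tasks)
    return {"success_rate": rate, "capability_flagged": rate >= threshold}
```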

Mechanistic interpretability is another crucial part of the team's safety strategy. Recently, they have focused on sparse autoencoders (SAEs), releasing new architectures such as Gated SAEs and JumpReLU SAEs. These architectures significantly improve the Pareto frontier of reconstruction loss versus sparsity, and a blinded study of the architecture change found no degradation in interpretability.
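As a rough illustration of the JumpReLU idea (a sketch only, not the team's implementation), the forward pass below zeroes out feature activations that fall below a learned per-feature threshold; the two quantities traded off on that Pareto frontier are the reconstruction error and the number of active features. The shapes and variable names here are assumptions.

```python
import numpy as np

def jumprelu_sae_forward(x, W_enc, b_enc, W_dec, b_dec, theta):
    """x: (d_model,) residual-stream activation; W_enc: (d_model, d_sae);
    W_dec: (d_sae, d_model); theta: (d_sae,) learned per-feature thresholds."""
    pre = x @ W_enc + b_enc                   # feature pre-activations
    feats = np.where(pre > theta, pre, 0.0)   # JumpReLU: keep only values above threshold
    recon = feats @ W_dec + b_dec             # reconstruction of the input activation
    return feats, recon

def sae_metrics(x, feats, recon):
    """The two axes of the Pareto frontier: reconstruction loss and sparsity (L0)."""
    return float(np.sum((x - recon) ** 2)), int(np.count_nonzero(feats))
```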

The team also trained and released Gemma Scope, an open, comprehensive suite of SAEs for Gemma 2 2B and 9B. Gemma 2 is considered the sweet spot for academic and external mechanistic interpretability research, being large enough to show interesting high-level behaviors but small enough for academics to work with relatively easily. This release aims to make Gemma 2 the go-to model for ambitious interpretability research outside of industry labs.

Previous work in mechanistic interpretability includes analyzing multiple-choice capabilities in Chinchilla, attempting to reverse-engineer factual recall at the neuron level, developing the AtP* algorithm for localizing LLM behavior to individual components, and creating Tracr, a compiler that builds transformers with known ground-truth computations to serve as a laboratory for interpretability.

The team's amplified oversight work aims to provide supervision as close as possible to that of a human with complete understanding of an AI system's output. This is often referred to as "scalable oversight" within the community. On the theoretical side, the team has been working on debate protocols that enable polynomial-time verifiers to decide any problem in PSPACE given debates between optimal debaters. However, they acknowledge that real-world AI systems are not optimal, and the problem of obfuscated arguments remains a challenge. Their work on doubly-efficient debate provides a new protocol that enables a polynomial-time honest strategy to prove facts to an even more limited judge, even against an unbounded dishonest strategy.
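A schematic sketch of a single debate may help make the setup concrete: two models argue for opposing answers, and a weaker judge picks a winner from the transcript. This is an illustration of the general protocol under assumed `debater` and `judge` interfaces, not the team's code.

```python
def debate(question, answer_a, answer_b, debater, judge, num_rounds=3):
    """Two debaters argue for opposing answers; a weaker judge reads the
    transcript and returns "A" or "B". `debater` and `judge` are assumed
    callables wrapping model calls."""
    transcript = [f"Question: {question}",
                  f"A claims: {answer_a}",
                  f"B claims: {answer_b}"]
    for round_idx in range(num_rounds):
        for name, answer in (("A", answer_a), ("B", answer_b)):
            argument = debater(transcript, defend=answer)
            transcript.append(f"Round {round_idx + 1}, {name}: {argument}")
    return answer_a if judge(transcript) == "A" else answer_b
```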

Empirically, the team has run inference-only experiments with debate, challenging community expectations. They found that debate performs significantly worse on tasks with information asymmetry and that weak judge models with access to debates do not outperform those without debate. They are currently working on training LLM judges to be better proxies for human judges and plan to finetune debaters using the debate protocol.

A long-running stream of research in the team explores how understanding causal incentives can contribute to designing safe AI systems. Causality provides general tools for understanding agent behavior and offers explanations for their actions. The team has developed algorithms for discovering agents, which can help identify goal-directed agents and determine what they are optimizing for.

Their work has shown that agents capable of robustly adapting to distribution shifts must have learned a causal world model, suggesting that causal tools are likely to apply to any sufficiently powerful agent. This research continues to inform safety mitigations based on managing an agent's incentives and designing consistency checks for long-term agent behavior.
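As a toy illustration of the incentive-management idea (not the team's formal machinery; libraries such as pycid implement the actual criteria), one necessary condition for an agent to have an incentive to control a variable is a directed path from its decision through that variable to its utility. The diagram below is a hypothetical example for a content-recommendation agent.

```python
import networkx as nx

# A hypothetical causal influence diagram for a content-recommendation agent.
cid = nx.DiGraph([
    ("UserQuery", "Decision"),
    ("Decision", "ContentShown"),
    ("ContentShown", "UserClicks"),
    ("UserClicks", "Utility"),
])

def may_have_control_incentive(graph, decision, variable, utility):
    """Necessary (not sufficient) condition: the decision can influence the
    variable, and the variable in turn influences the agent's utility."""
    return nx.has_path(graph, decision, variable) and nx.has_path(graph, variable, utility)

print(may_have_control_incentive(cid, "Decision", "UserClicks", "Utility"))  # True
```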

In addition to their core research areas, the team has been exploring emerging topics to identify potential new long-term agendas. This has led to several papers on diverse topics, including challenges with unsupervised LLM knowledge discovery, explaining grokking through circuit efficiency, and power-seeking behavior in trained agents.

Their work on unsupervised LLM knowledge discovery aimed to rebut the intuition that there are only a few "truth-like" features in LLMs. Although they did not fully achieve this goal, they demonstrated the existence of many salient features with properties similar to truth-like features.
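For context, the kind of unsupervised probe in question optimizes a contrast-consistency objective over paired statement/negation activations, roughly as sketched below (an illustrative reconstruction of the CCS-style loss, not the paper's code); the team's point is that features scoring well on this objective need not correspond to truth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_style_loss(w, b, h_pos, h_neg):
    """h_pos / h_neg: hidden activations for a statement and its negation
    (arrays of shape (n_examples, d_model)); w, b: linear probe parameters."""
    p_pos = sigmoid(h_pos @ w + b)
    p_neg = sigmoid(h_neg @ w + b)
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # probe should be consistent under negation
    confidence = np.minimum(p_pos, p_neg) ** 2   # discourage the degenerate p ~= 0.5 solution
    return float(np.mean(consistency + confidence))
```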

The team's foray into the "science of deep learning" tackled the question of why a network's test performance improves dramatically upon continued training despite nearly perfect training performance. Their paper provided a compelling answer and validated it by predicting multiple novel phenomena. However, they have decided not to invest further in this area due to other more promising opportunities.

The team collaborates extensively with other teams at Google DeepMind and external researchers. Within Google, they work with Ethics and Responsibility teams to consider value alignment and issues around manipulation and persuasion in building artificial assistants. They have also explored detecting hallucinations in large language models using semantic entropy, a study published in Nature and covered in numerous media outlets.
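The semantic entropy idea can be sketched in a few lines (an illustrative simplification under assumed helpers; the published method uses a learned entailment model to decide whether two answers mean the same thing): sample several answers, cluster them by meaning, and measure the entropy over clusters; high entropy suggests the model is confabulating.

```python
import math

def semantic_entropy(answers, same_meaning):
    """answers: sampled answers to one question; same_meaning(a, b): assumed
    stand-in for a bidirectional-entailment check between two answers."""
    clusters = []                                   # greedy clustering by meaning
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)     # high entropy -> likely hallucination

# Toy usage with exact string matching as a (very crude) equivalence check:
# semantic_entropy(["Paris", "paris", "Lyon"], lambda a, b: a.lower() == b.lower())
```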

Many researchers on the team mentor scholars on various topics, including power-seeking, steering model behavior, and avoiding jailbreaks. The team also collaborates externally to produce benchmarks and evaluations of potentially unsafe behavior, contributing to OpenAI's public suite of dangerous capability evaluations and the BASALT evaluations for RL with fuzzily specified goals.

The team is currently revising their high-level approach to technical AGI safety. While their bets on frontier safety, interpretability, and amplified oversight are key aspects of this agenda, they recognize the need for a systematic way of addressing risk. They are mapping out a logical structure for technical misalignment risk and using it to prioritize their research. This includes addressing areas like adversarial training, uncertainty estimation, and monitoring, which are crucial for ensuring alignment under distribution shifts.

Google DeepMind's AGI Safety & Alignment team is making significant strides in ensuring the safety and alignment of AI systems. Their work spans frontier safety, mechanistic interpretability, amplified oversight, and causal alignment, with a focus on anticipating risks and developing comprehensive safety strategies. As they continue to refine their approach and explore new frontiers, the team's efforts will undoubtedly play a crucial role in shaping the future of AI safety. Stay tuned for more updates and insights from this pioneering team.