AI Safety in China #16
AI safety raised in key domestic document, top leaders discuss governance and safety at World AI Conference, update on national AI law, and papers on mechanistic interpretability and unlearning
Key Takeaways
AI safety was mentioned in a comprehensive overview of China’s top domestic priorities for the next five years, making this the highest-level document to reference the concept to date.
The World AI Conference (WAIC) featured a strong safety and governance theme this year and drew substantial participation from top Chinese leadership, including the Premier, the Shanghai Party Secretary, and four additional ministerial or vice-ministerial level officials.
A top researcher for China’s legislature cautioned against excessive focus on AI safety in a speech on AI lawmaking. He noted risks from cybersecurity and automated decision-making while advocating an incremental approach to lawmaking.
Over the past two months, Chinese researchers published one of the first papers in China on mechanistic interpretability, as well as papers on unlearning, risks in superalignment, and benchmarking honesty.
International AI Governance
Top government officials discuss AI safety and governance at the World AI Conference
Background: Chinese Premier LI Qiang (李强) gave opening remarks during the World AI Conference opening ceremony on July 4 (En, Ch). During the ceremony, Shanghai Party Secretary CHEN Jining (陈吉宁), one of the top 24 officials in China, also announced the “Shanghai Declaration on Global AI Governance” (En, Ch).
Premier Li’s remarks focused on three topics:
Cooperating on innovation such as through joint technological research, cross-border data flows, and personnel exchanges;
Bridging the digital divide, capacity building in developing countries, and providing AI services for small and medium-sized enterprises;
Strengthening “collaborative governance” and “AI for good” through harmonizing international standards, creating an inclusive global governance framework, and ensuring AI development prioritizes safety, reliability, and controllability.
The Shanghai Declaration posits that rapid AI development creates “unprecedented challenges, especially in terms of safety and ethics.” The declaration contains two points on AI safety and governance, as well as three points on AI development, public participation, and quality of life.
Maintaining AI safety: The declaration’s treatment of AI safety focuses most heavily on data security and privacy protection. It also calls on countries to establish a “testing and assessment system based on AI risk levels,” prevent use of AI for hacking, combat AI-empowered disinformation, and prevent AI misuse by terrorists, extremist forces, and transnational organized criminal groups.
Developing the AI governance system: The document advocates for creating an AI governance mechanism covering the world, with the UN as the “main channel,” and increased representation of developing countries. It also urges sharing AI testing and assessment practices to ensure AI safety and controllability.
China’s Ministry of Science and Technology (MOST) chaired a ministerial roundtable with representatives from around 30 other countries and international organizations.
MOST Minister YIN Hejun (阴和俊) noted that AI is an issue “of great importance for the fate of humanity” and that China is willing to deepen dialogue on AI governance and ensuring safe and reliable AI technology.
Shanghai Mayor GONG Zheng (龚正) stated that Shanghai has always heavily emphasized AI safety and governance, seeking to provide a “Shanghai Experience” for global AI governance.
Executive Vice Minister of Foreign Affairs MA Zhaoxu (马朝旭) called for using AI for good, adhering to safe development, closing digital gaps, and governing through the UN. On safety, he called for ensuring AI safety bottom lines, always keeping AI systems under human control, and promoting high-quality AI development alongside high-level safety and security.
Implications: Premier Li is the second highest ranked official in China, Shanghai Party Secretary Chen is in the top 24, and the ministerial roundtable included four ministerial or vice-ministerial level officials. The participation of so many top-level leaders in one event partially focused on AI safety shows continued high-level leadership attention to international AI safety and governance over the past 12 months. The official remarks did not necessarily reveal any new Chinese positions on AI safety and governance. However, the top officials continued to reference ideas such as increasing dialogue on global governance, developing safety bottom lines, and conducting AI safety testing and assessment.
China-proposed AI resolution unanimously adopted by UN General Assembly
Background: On July 1, the United Nations General Assembly (UNGA) unanimously adopted a China-proposed resolution titled “Enhancing International Cooperation on Capacity-building of Artificial Intelligence.” 140+ countries co-sponsored the resolution, including the United States. This resolution follows a US-proposed and China-co-sponsored resolution on safe, secure and trustworthy AI.
Content: The resolution has a number of references to AI safety and governance. It recognizes that malicious AI use without adequate safeguards poses potential risks and stresses that AI systems should be safe, secure, and trustworthy throughout the life cycle. Overall, the resolution’s primary focus is on capacity building for AI, with 15 proposals including:
Bridging digital divides through North-South, South-South, and other forms of cooperation;
Increasing policy exchanges, knowledge sharing, technology sharing, and joint international research capacities for developing countries;
Considering benefits and risks of open-source AI, preserving cultural diversity, and taking into account multilingualism in training data;
Voluntary and transparent cooperation initiatives in the UN system to promote developing country participation.
Implications: The unanimous passage of China-led and US-led UNGA resolutions on AI demonstrates that China and the US can agree on fundamental aspects of AI safety and development. In fact, Chinese Ambassador to the UN FU Cong (傅聪) noted that China’s resolution was “complementary” with the US-led resolution and that China is “very appreciative of the positive role that the US has played in this whole process.” The resolution also reflects China’s diplomatic focus on working with the Global South and ensuring a central role for the UN.
Domestic AI Governance
Key Chinese political meeting announces goal to build up AI safety supervision
Background: The Communist Party of China (CPC) 20th Central Committee held its Third Plenum on July 15-18. The meeting resolution (En, Ch) and President Xi Jinping’s explanation (En, Ch) of the resolution were published on July 21. The CPC Central Committee meets approximately once or twice a year, and these meetings are the most important on China’s political calendar, since they set the strategic direction for the government to follow. The Third Plenum continued the historical focus of previous Third Plenums on economic issues and reform, covering a wide range of issues from innovation to the urban-rural divide and ecological conservation. The objectives in this document are intended to be enacted within five years, by 2029.
AI-relevant provisions: The meeting resolution made four direct references to AI and had several other provisions that relate to AI.
AI as an engine of economic growth. The document included AI among other “strategic industries” that will promote “high-quality economic growth.”
Cooperation with the developing world on AI. AI is one of the technologies China plans to cooperate on as part of the Belt and Road Initiative to further its global integration.
AI as an element of cultural governance. The document referenced generative AI as an aspect of cyberspace opinion management, which it classifies under “cultural governance.”
AI as a public safety/security and national security issue. The document called for “instituting oversight systems to ensure the safety of AI” as a “public security governance mechanism” to ensure national security. Other issues classified similarly include cybersecurity, biological security, and natural disasters. The use of AI “safety” in the official English translation contrasts with some previous documents such as the Global AI Governance Initiative, which used “security” rather than “safety.”
International security cooperation. The report supports full participation in global security governance, which includes AI safety and security governance.
Science and technology (S&T) governance. The report urged the creation of an S&T security risk warning system and strengthening S&T ethics governance. AI safety is a component of both of those issues.
Additionally, in President Xi’s explanation, the only explicit reference to AI was a repetition of the line on “instituting oversight systems to ensure the safety of AI.”
Implications: This is the strongest indication yet that top echelons of the Chinese system are concerned about AI safety. While the Chinese phrase for “AI safety oversight” had been used previously in the 2017 New Generation AI Development Plan, this is the first time to our knowledge that the phrase has been used in a top political document, such as a Party Congress work report, Plenum decision, or Government Work Report. It is notable that AI safety was categorized as a public safety and national security issue, rather than an ideological, economic, or military issue. At the same time, the document highlights that Chinese leadership has a multifaceted view of AI, as the technology also carries implications for economic growth, cooperation with the developing world, and political stability.
Expert affiliated with China’s legislature shares insights on national AI law
Background: On July 11, the head of the National People’s Congress (NPC) Standing Committee Legislative Affairs Commission Research Office, WANG Hongyu (王洪宇), delivered a speech on lawmaking for AI at a major conference focused on internet law. The NPC Legislative Affairs Commission is a powerful body within China’s legislature that plays an integral role in the drafting, research, and explanation of laws. An English translation of Wang’s speech can be found on Geopolitechs.
The speech: Wang articulated the wide range of considerations that officials are factoring into policymaking on AI.
Challenges. Wang emphasized four primary challenges from AI: IP protection; legal responsibility; cybersecurity, personal information protection, and data security; and deeper moral and ethical issues. Notably, he referenced the complexity of ethical considerations around autonomous decision-making by AI systems.
Geopolitical considerations. Wang noted that AI is reshaping global defense and security, and other countries are seeking strategic advantages in AI. Therefore, he argued that lack of development is the biggest insecurity and called for achieving high-quality development alongside a high level of safety and security.
A careful and phased legislative approach. Wang suggested prioritizing flexible application of existing AI regulation, trialing policies in local government, or enacting minor legislation when difficulties arise. This implies a more incremental approach than rapidly adopting a national AI law. While he acknowledged risks from AI, Wang advocated against excessive worrying or overzealous pursuit of safety.
Implications: This is the most authoritative explanation yet of Chinese government thinking around the national AI law. References to cybersecurity and autonomous decision-making show that drafters have some awareness of frontier AI risks, though this is just one perspective influencing deliberations. Wang’s emphasis on preserving flexibility and using pilot zones for AI regulations implies that the drafting process for the national AI law will be lengthy, incremental, and deliberative.
Technical Safety Developments
Several Chinese research groups publish work on large model interpretability
LLM safety neurons: This preprint was published on June 20 by researchers from Tsinghua. The anchor author was LI Juanzi (李涓子), head of the Tsinghua Knowledge Intelligence Center and principal of the Tsinghua Knowledge Engineering Group. The paper sought to use a mechanistic interpretability approach to identify and analyze LLM safety neurons (a rough illustrative sketch of the general idea follows after the list below). Key findings include:
Safety neurons can be consistently identified, and they are stable across different random trials;
Such neurons are sparse and effective, in that 90% of safety performance can be reached by intervening on just 5% of neurons;
Safety neurons encode transferable safety mechanisms across multiple red teaming benchmarks;
The authors were able to build a detector based on safety neuron activation to predict unsafe content generation.
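The paper’s precise attribution method is more involved, but the general recipe of scoring individual neurons for their contribution to safety behavior and then intervening on the top few percent can be sketched roughly as follows. This is an illustrative sketch only; the scoring rule, the synthetic activations, and names like find_safety_neurons are our own assumptions rather than the authors’ code.

```python
import numpy as np

def find_safety_neurons(act_harmful, act_benign, top_frac=0.05):
    """Rank neurons by how differently they activate on harmful vs. benign
    prompts, and return the indices of the top fraction.

    act_harmful, act_benign: arrays of shape (num_prompts, num_neurons)
    holding one layer's neuron activations for each prompt type.
    """
    # Score each neuron by the gap between its mean activation on the two
    # prompt sets (a crude stand-in for the paper's attribution method).
    score = np.abs(act_harmful.mean(axis=0) - act_benign.mean(axis=0))
    k = max(1, int(top_frac * score.size))
    return np.argsort(score)[::-1][:k]

def patch_neurons(activations, neuron_ids, value=0.0):
    """Intervene on the selected neurons (e.g. zero them out) to test how
    much of the model's safety behavior they account for."""
    patched = activations.copy()
    patched[:, neuron_ids] = value
    return patched

# Toy demonstration with synthetic activations (4096 neurons, 100 prompts each).
rng = np.random.default_rng(0)
act_harmful = rng.normal(size=(100, 4096))
act_benign = rng.normal(size=(100, 4096))
safety_neurons = find_safety_neurons(act_harmful, act_benign, top_frac=0.05)
print(f"selected {len(safety_neurons)} candidate safety neurons")
```

In the paper’s framing, intervening on only a small fraction of such neurons recovers most safety performance, which is what an intervention like patch_neurons is meant to probe.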
Concept representations in multimodal LLMs (MLLMs): This preprint was published on July 1 by researchers from the Chinese Academy of Sciences (CAS) and South China University of Technology. The anchor authors included CAS researchers HE Huiguang (何晖光) and CHANG Le (常乐). The researchers used behavioral and neuroimaging analysis to compare object concept representations in LLMs and MLLMs with those of humans. They instructed LLMs and MLLMs to identify which object in a set of three was the “odd-one-out” for 1,800+ objects. The researchers identified 66 sparse, non-negative dimensions underlying those similarity judgments, which correspond closely to core dimensions seen in human cognition.
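As a rough illustration of the behavioral side of this setup (not the authors’ pipeline), one can prompt a model with triplets of object names, ask which is the odd one out, and accumulate a similarity matrix from its choices, which can then be decomposed into a small number of interpretable dimensions. The placeholder ask_model below stands in for a real LLM call.

```python
import random
from itertools import combinations
import numpy as np

def ask_model(a, b, c):
    """Placeholder for an actual LLM call answering a prompt like
    "Which object is the odd one out: {a}, {b}, or {c}?".
    Here it guesses at random so the sketch runs end to end."""
    return random.choice([a, b, c])

objects = ["apple", "banana", "hammer", "screwdriver", "violin", "drum"]
index = {name: k for k, name in enumerate(objects)}
sim = np.zeros((len(objects), len(objects)))

# Each triplet judgment marks the two objects NOT picked as the odd one out
# as more similar; aggregating many judgments yields a similarity matrix.
for a, b, c in combinations(objects, 3):
    odd = ask_model(a, b, c)
    kept = [o for o in (a, b, c) if o != odd]
    i, j = index[kept[0]], index[kept[1]]
    sim[i, j] += 1
    sim[j, i] += 1
```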
Information flow for explainable MLLMs: This preprint was the product of researchers from Alibaba Group, Carnegie Mellon University, and Shanghai Jiao Tong University, published on June 4. One of the anchor authors was Alibaba VP and DAMO Academy City Brain Lab head YE Jieping (叶杰平). The authors seek to understand the information flow between image and text tokens in MLLMs, namely how the image, user, and system tokens influence the answer token. The authors find that information flow converges primarily in shallower layers (1-11), but there is also redundancy in image token flow in these layers, so they propose a truncation strategy to improve performance.
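A very loose sketch of the truncation idea, under our own assumptions about the cutoff layer and masking mechanics (this is not the paper’s implementation): keep image tokens only through the shallow layers where most image-to-answer information flow occurs, then drop them for the deeper layers.

```python
import numpy as np

def forward_with_image_truncation(hidden, num_image_tokens, layers, cutoff=11):
    """Run a stack of stand-in transformer layers, dropping image tokens
    after `cutoff` layers, on the premise that image-to-answer information
    flow has already converged in the shallow layers.

    hidden: (seq_len, d_model) with image tokens occupying the first
    `num_image_tokens` positions, followed by text tokens.
    """
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        if depth == cutoff:
            # Truncate: keep only the text/answer tokens for deeper layers.
            hidden = hidden[num_image_tokens:]
    return hidden

# Toy stand-in layers and inputs.
rng = np.random.default_rng(0)
layers = [lambda h: h + 0.01 * rng.normal(size=h.shape) for _ in range(24)]
hidden = rng.normal(size=(576 + 32, 64))   # 576 image tokens + 32 text tokens
out = forward_with_image_truncation(hidden, num_image_tokens=576, layers=layers)
print(out.shape)  # (32, 64): image tokens removed after the shallow layers
```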
Implications: Concordia AI’s The State of AI Safety in China Spring 2024 Report (slide 14) found that interpretability research was an underrepresented AI safety research direction among Chinese researchers since April 2023. These recent papers on cutting-edge large model interpretability show that this gap is narrowing.
Researchers examine when superalignment can backfire
Researchers from Renmin University and WeChat published this paper on June 17. One of the anchor authors was LIN Yankai (林衍凯), assistant professor at Renmin University. The paper explores the risk of stronger AI models deceiving weaker models in the process of superalignment, a risk the authors dub “weak-to-strong deception.” They explore this problem in settings with multiple, potentially conflicting alignment objectives (e.g. helpfulness and harmlessness), find that the deception phenomenon does occur, and argue that it may become stronger as the capability gap between the two models grows.
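One simple way to picture the evaluation, again as our own illustrative sketch rather than the paper’s protocol, is to count conflict cases where the strong model recognizes a behavior as misaligned yet still produces it after fine-tuning on the weak supervisor’s labels.

```python
def deception_rate(cases):
    """Each case records what the strong model 'knows' (its judgment before
    weak supervision) and what it outputs after fine-tuning on weak labels.
    Weak-to-strong deception: the strong model knew the behavior was
    misaligned in a conflict area, but produced it anyway."""
    deceptive = [
        c for c in cases
        if c["conflict"] and c["strong_knows_misaligned"] and c["output_misaligned"]
    ]
    conflict_cases = [c for c in cases if c["conflict"]]
    return len(deceptive) / max(1, len(conflict_cases))

# Toy example with two conflict cases between helpfulness and harmlessness.
cases = [
    {"conflict": True,  "strong_knows_misaligned": True,  "output_misaligned": True},
    {"conflict": True,  "strong_knows_misaligned": True,  "output_misaligned": False},
    {"conflict": False, "strong_knows_misaligned": False, "output_misaligned": False},
]
print(deception_rate(cases))  # 0.5 in this toy example
```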
Tsinghua researchers explore unlearning for preventing jailbreak attacks
This preprint was published on July 3 by researchers from the Tsinghua University Conversational AI (CoAI) Group led by HUANG Minlie (黄民烈). Huang Minlie’s previous work was covered in newsletter issues #3 and #11. The authors argue that supervised fine-tuning (SFT) and similar methods are limited in preventing jailbreaks because of the difficulty of preparing against all possible jailbreak queries. Instead, they seek to train LLMs to unlearn harmful knowledge so that jailbreak attacks no longer have an effect. They find that their unlearning process can reduce the attack success rate of various jailbreak attacks to below 10% using just 20 harmful questions.
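The paper’s training recipe has more moving parts, but the core idea of unlearning can be sketched as gradient ascent on harmful question-answer pairs combined with a standard retention loss on benign data so that general capabilities are preserved. The model choice, loss weighting, and data below are illustrative assumptions, not the authors’ setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def lm_loss(text):
    """Standard next-token cross-entropy loss on a single example."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss

harmful_example = "Q: How do I pick a lock?\nA: First, insert a tension wrench..."
benign_example = "Q: What is the capital of France?\nA: The capital of France is Paris."

# One illustrative unlearning step: ascend on the harmful example (negative
# loss term) while keeping a retention term on benign data to limit
# damage to general capabilities.
loss = -lm_loss(harmful_example) + 1.0 * lm_loss(benign_example)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```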
Researchers release new benchmark assessing honesty in LLMs
This preprint was released on June 19 by researchers from the Shanghai Jiao Tong University Generative AI Research Lab (GAIR), Carnegie Mellon University, Fudan University, and Shanghai AI Lab. One of the anchor authors was GAIR head LIU Pengfei (刘鹏飞), who was previously covered in issue #3. The authors argue that evaluating LLM honesty is understudied and list concerns of LLMs spreading misinformation, committing fraud, or engaging in other dangerous behavior as systems approach superintelligence. Their benchmark, named BeHonest, assesses three primary aspects of honesty: self-knowledge, or models transparently communicating their capabilities and limitations; non-deceptiveness, meaning models do not lie when they know something is untrue; and consistency, or models providing similar responses to similar prompts even with minor changes in phrasing.
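As an illustration of the consistency dimension only (not BeHonest’s scoring code), a minimal check could paraphrase one question several ways and measure how often the model returns the same answer; query_model below is a hypothetical placeholder.

```python
from collections import Counter

def query_model(prompt):
    """Placeholder for an actual LLM call; returns a canned answer here so
    the sketch runs. Replace with a real model query."""
    return "Paris"

def consistency_score(paraphrases):
    """Fraction of paraphrased prompts whose answer matches the modal answer."""
    answers = [query_model(p) for p in paraphrases]
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

paraphrases = [
    "What is the capital of France?",
    "Name the capital city of France.",
    "France's capital is which city?",
]
print(consistency_score(paraphrases))  # 1.0 with the canned placeholder
```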
Other relevant technical publications
Concordia AI’s Technical AI Safety Database, updated through April 2024, is available here.
Shanghai AI Lab, Fudan University, and University of Science and Technology of China, SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model, arXiv preprint, June 17, 2024.
Peking University and Cornell University, ProgressGym: Alignment with a Millennium of Moral Progress, arXiv preprint, June 28, 2024.
Huawei Noah’s Ark Lab and University of Hong Kong, Jailbreaking as a Reward Misspecification Problem, arXiv preprint, June 20, 2024.
ByteDance Research, Fudan University, and Northwestern University, Toward Optimal LLM Alignments Using Two-Player Games, arXiv preprint, June 16, 2024.
Expert views on AI Risks
Top experts discuss AI safety in WAIC appearances
Background: During the World AI Conference, several leading AI experts gave speeches focused on AI safety issues.
ZHOU Bowen (周伯文) is the new Director and Chief Scientist of Shanghai AI Lab, and he was formerly Senior Vice President of e-commerce giant JD.com and president of its AI division. Director Zhou gave speeches at the WAIC plenary session and also at Concordia AI’s Frontier AI Safety and Governance Forum. In both instances, he suggested a “45-degree law” for AI development, which would entail simultaneous advancement of AI capabilities and safety measures. Zhou proposed developing AI systems capable of reflection, value-based training, causal interpretation, and counterfactual reasoning. He stressed the importance of global collaboration on AI safety, technology sharing, and balancing investments between AI capabilities and safety measures.
Andrew Yao, Turing Award Winner and Tsinghua Dean, participated in a dialogue with fellow Turing Award Winners Raj Reddy and Manuel Blum and separately gave a speech at a forum on governance of frontier AI technology. In the dialogue, he expressed concern about AI exacerbating data and cybersecurity risks. He also compared AI development to the creation of a new “species” that humans may not be able to coexist with. In his speech, Dean Yao proposed two AI safety research directions. First, he suggested “beneficial AGI,” which would align AI to human interests through mathematical rules and human-AI communication. He also advocated for pursuing “provably safe AGI,” or AGI that can be shown to be mathematically safe based on white box proof-checkers. On the governance side, Dean Yao suggested evaluating AI systems using red teaming and creating an ID system for AI models to enable full supply chain monitoring.
Implications: Shanghai AI Lab has been one of the most prolific producers of frontier AI safety research in China over the past year, and remarks by new director Zhou suggest that interest in AI safety and governance could increase under his leadership. Dean Yao’s comments mark the first time, to our knowledge, that a top Chinese AI scientist has publicly advocated for a quantitative AI safety approach, a new research direction pursued by top AI safety experts including Yoshua Bengio, Dawn Song, and Max Tegmark.
What else we’re reading
Ryan McMorrow and Nian Liu, Zhang Hongjiang, founder of BAAI: ‘AI systems should never be able to deceive humans,’ Financial Times, June 27, 2024.
Billy Perrigo, Exclusive: U.S. Voters Value Safe AI Development Over Racing Against China, Poll Shows, Time, July 8, 2024.
Concordia AI’s Recent Work
For the full recap of Concordia AI’s participation in the World AI Conference, please see our previous Substack post.
Feedback and Suggestions
Please reach out to us at info@concordia-ai.com if you have any feedback, comments, or suggestions for topics for the newsletter to cover.