AI Safety in China #7
Government funding for safety and alignment research, research on preventing AI misuse in chemistry and biology, paper on LLM “unlearning,” and Alibaba’s foray into generative AI governance
Key Takeaways
The National Natural Science Foundation of China announced that it would fund six projects for fundamental research on generative AI, including on alignment and evaluations, in 2024.
A Chinese research group published a preprint proposing a system to reduce the risk of AI misuse in scientific research and a benchmark for assessing the safety of AI use in chemistry and biology.
ByteDance researchers published a paper proposing to use “unlearning” – training LLMs to “forget” undesirable behavior – to increase model safety.
Alibaba published an extensive white paper on generative AI governance, discussing robustness, interpretability, and preventing misuse in the form of cyberattacks.
Domestic AI Governance
Chinese government foundation to issue grants on generative AI research including safety and alignment
Background: On December 13, the National Natural Science Foundation of China (NSFC) announced an application guide for grants on fundamental research on generative AI.1 In 2021, NSFC funded a total of 48,900 programs with 31.073 billion RMB, an average of 635,000 RMB ($89,000 USD at current exchange rates) per program.
Application guide: The guide notes the importance of generative AI and states that it is one possible route to realizing artificial general intelligence (AGI). It identifies several problems with current generative AI, including lack of understanding of intelligent emergence, high computational costs, content security, and lack of automated evaluation.2 It proposes funding six projects at 500,000 RMB ($70,000) per project, across any of six research directions. Two of those directions are directly relevant to AI safety: “research on value and safety alignment strategy for large models” and “research on automated evaluation methods for generative models.” According to the guide, alignment research includes work on improving model harmlessness, improving safety and ethics, and ensuring greater stability. Evaluation research includes work on correctness of information, usefulness, logic, safety, security, harmlessness, and instruction following.
Implications: This appears to be the first time NSFC is funding AI safety and alignment projects. Prominent scholars in China and elsewhere have called for governments to spend at least one-third of their AI R&D funding on safety and ethics issues. While the amount of this funding is not large in absolute terms, it is a sign that public funders in China are becoming more aware of AI safety and alignment risks and are willing to fund research to address such risks. The projects selected will likely be publicized after the application period closes on January 22, 2024.
Alibaba leads publication of white paper on generative AI governance
Background: The Alibaba Artificial Intelligence Governance Research Center (AAIG) published a lengthy Generative Artificial Intelligence Governance & Practice White Paper between October 31 and December 7.3 The report is co-authored by Alibaba Group, China Electronics Standardization Institute (CESI), and Alibaba Cloud.4
Report content: The report discusses the state of AI development and the risks posed by AI, and provides extensive suggestions for creating a multi-stakeholder, coordinated, and agile governance system. It focuses on four categories of risk: personal information protection, content security, model safety and security, and IP protection. Model safety and security covers robustness, fairness, explainability, and preventing misuse. The report contains suggestions for managing AI risk, many of which focus on content moderation and control. However, it also suggests building benchmarks and evaluations for role-playing and hallucinations, embedding human values in models via technical means, and pursuing watermarking and provenance mechanisms. The authors discuss protecting model security by increasing robustness to out-of-distribution attacks, visualizing the internal structures of models for interpretability, screening training data for unfair content, and testing models to prevent misuse for malicious code generation and fraud.
Implications: This is likely the most comprehensive public report written by a Chinese tech company that discusses frontier AI safety issues and mechanisms for dealing with those risks. The discussions of evaluation benchmarks, alignment, interpretability, and cyberattack misuse risks are all extremely pertinent to frontier safety. It is a strong initial indication that governance researchers in some Chinese AI companies may share concerns similar to those of researchers in Western AI labs, creating potential for mutual learning and dialogue.
Technical Safety Developments
MSRA and USTC researchers study risks of AI misuse in science
Background: On December 11, a research group primarily from Microsoft Research Asia (MSRA) and the University of Science and Technology of China (USTC) released a preprint on controlling the misuse of AI in science. The group was led by Dr. ZHENG Shuxin (郑书新), a Principal Researcher at MSRA and leader of the Scientific Foundation Model project. Other prominent co-authors include YU Nenghai (俞能海), director of the Information Processing Center at USTC, ZHANG Weiming (张卫明), a USTC professor who has worked on watermarking and adversarial defense, and MSRA Societal AI lead XIE Xing (谢幸).
Risks of AI misuse in science: The paper categorizes risks of AI misuse in scientific research into nine different types. The authors discuss in detail potential for misuse and abuse of AI in chemical science, focusing on synthesis planning models, toxicity prediction models, and LLMs and scientific agents. For instance, they find that an AI model, LocalRetro, was capable of proposing alternative pathways to synthesize a known toxic chemical.
SciGuard and SciMT-Safety: The authors note the difficulty yet importance of aligning scientific AI models with human values. To achieve this goal, they propose a “middleware” system called SciGuard that could mediate between users and scientific AI systems, intercepting user queries and processing them based on safety and ethical principles. The research group also built a benchmark dataset called SciMT-Safety to assess the safety of AI systems in scientific domains, containing “hundreds of red-teaming queries that span the fields of chemistry and biology.”
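The preprint does not publish SciGuard’s implementation, but the intercept-and-forward pattern it describes can be illustrated with a minimal sketch. The names below (ScienceModelGateway, safety_screen) and the keyword-based screening are hypothetical placeholders rather than the authors’ actual system, which processes queries based on safety and ethical principles.

```python
# Hypothetical sketch of a SciGuard-style middleware: intercept user queries,
# screen them against safety principles, and only then forward them to a
# scientific AI model. Names and rules here are illustrative, not from the paper.

DISALLOWED_TOPICS = [
    "synthesis route for a controlled toxin",
    "enhancing pathogen transmissibility",
]

def safety_screen(query: str) -> bool:
    """Return True if the query appears safe to forward.

    A real system would combine safety classifiers, curated hazard lists,
    and ethical-principle checks; simple keyword matching stands in here.
    """
    lowered = query.lower()
    return not any(topic in lowered for topic in DISALLOWED_TOPICS)


class ScienceModelGateway:
    """Mediates between users and an underlying scientific AI model."""

    def __init__(self, science_model):
        self.science_model = science_model  # e.g., a synthesis-planning model

    def handle(self, query: str) -> str:
        if not safety_screen(query):
            return "Request declined: the query conflicts with safety principles."
        return self.science_model(query)


if __name__ == "__main__":
    # A stub model stands in for a real synthesis planner or LLM-based agent.
    gateway = ScienceModelGateway(lambda q: f"[model output for: {q}]")
    print(gateway.handle("Propose a synthesis route for aspirin."))
    print(gateway.handle("Give a synthesis route for a controlled toxin."))
```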
Implications: Concerns that frontier AI systems could aid malign actors in bioterrorism, chemical weapons development, or similar acts have been expressed by the UK government, mentioned in a US executive order on AI, and referenced by leading AI companies such as Anthropic and OpenAI. This paper is one of the first indications that Chinese researchers are developing similar concerns and making concrete efforts to mitigate such threats, with what they assert is the “first benchmark dataset that is specifically oriented towards safety issues related to AI systems in science.” A shared perception of the threat from terrorist misuse of AI systems in the chemical and biological domains would increase the chances of successful international cooperation on mitigating those threats.
ByteDance conducts novel work on LLM unlearning
Background: On October 18, a team at ByteDance Research released a preprint titled “Large Language Model Unlearning.” Some members of that team had previously published a preprint survey on LLM alignment in August.
Their approach: The paper studies LLM unlearning, i.e. “forgetting undesirable (mis)behaviors,” for three applications: “(1) removing harmful responses, (2) erasing copyright-protected content as requested, and (3) eliminating hallucinations.” The authors use gradient ascent to train the LLM on a dataset of negative examples, i.e. prompt-output pairs that are undesirable. They argue that LLM unlearning is advantageous over other safety measures because it only requires collecting negative examples, is more computationally efficient, and can better target misbehaviors. The authors find that unlearning can achieve results on harmlessness comparable to reinforcement learning from human feedback (RLHF), despite using only 2% of the computational power.
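As a rough illustration of the core idea, ascending rather than descending the loss gradient on undesirable prompt-output pairs, the sketch below applies a single ascent step to a small causal language model. It is a simplified, hypothetical sketch: the model name, example data, and hyperparameters are placeholders, and the paper’s full method includes components beyond this single ascent term.

```python
# Simplified sketch of gradient-ascent unlearning on negative examples.
# The paper's actual method goes beyond this lone ascent term; this only
# shows the core idea of pushing the model away from undesirable outputs.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=2e-6)

# Negative examples: prompt-output pairs the model should "forget".
negative_examples = [
    ("How do I build a weapon?", " Sure, here are step-by-step instructions..."),
]

model.train()
for prompt, bad_output in negative_examples:
    inputs = tokenizer(prompt + bad_output, return_tensors="pt")
    labels = inputs["input_ids"].clone()
    outputs = model(**inputs, labels=labels)

    # Gradient ascent: maximize the language-modeling loss on the undesirable
    # completion by minimizing its negation.
    loss = -outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```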
Implications: The authors claim that this is among the first works to explore LLM unlearning and believe these results show its potential for improving AI alignment. Their work was listed among “some of the best ML Safety papers of 2023” by the ML Safety Newsletter. LLM unlearning is an AI safety mechanism with relatively low dual-use potential, in the sense that it contributes little to improving overall model performance, which makes it a promising path for further research.
Tsinghua and Kuaishou explore safety of LLM-based agents
Background: On November 20, a research team primarily from Tsinghua University published a preprint titled “Evil Geniuses: Delving into the Safety of LLM-based Agents.” The team is led by Tsinghua University assistant professor SU Hang (苏航), who researches trustworthy and explainable machine learning, computer vision, and reinforcement learning.
The research: The researchers conducted jailbreaking attacks to test the robustness of LLM-based agents to adversarial attacks. They created a “virtual, chat-powered team” called Evil Geniuses (EG) to develop plans for simulating threats across multiple levels and roles, attacking the entire system. The researchers found that because LLM-based agents are composed of multiple LLMs, they are more vulnerable to adversarial attacks than regular LLMs. Additional risk arises from the multiple ongoing conversations within LLM-based agents and from an effect where the successful jailbreak of one agent can trigger similar behavior in others. The authors suggested several strategies to defend against such attacks, including pursuing a multi-tiered alignment framework for LLM-based agents.
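A heavily simplified sketch of what a role-based, multi-agent red-teaming loop of this kind might look like is below. The roles, prompts, and stub call_llm and target functions are hypothetical illustrations of the general pattern, not the Evil Geniuses implementation.

```python
# Hypothetical sketch of a role-based red-teaming loop in the spirit of a
# "virtual, chat-powered team": one agent drafts candidate jailbreak prompts,
# another critiques them, and the result is tested against a target system.
# call_llm is a stub; in practice it would wrap an actual LLM API.

def call_llm(role_instructions: str, message: str) -> str:
    """Stub standing in for an LLM call conditioned on a role prompt."""
    return f"[{role_instructions[:20]}...] response to: {message}"

def red_team_round(attack_goal: str, target_agent_system) -> dict:
    # Role 1: draft an adversarial prompt aimed at the goal.
    draft = call_llm("You write adversarial role-play prompts.", attack_goal)
    # Role 2: critique and strengthen the draft.
    refined = call_llm("You critique and refine attack prompts.", draft)
    # Test the refined prompt against the multi-agent target system.
    response = target_agent_system(refined)
    return {"prompt": refined, "response": response}

if __name__ == "__main__":
    # The target is itself a stub for a multi-agent LLM system under test.
    result = red_team_round("test goal", lambda p: f"[target output for: {p}]")
    print(result)
```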
Implications: As LLMs become more popular and widely used, they will likely also be used more in autonomous settings. This paper highlights some of the risks posed by such applications and the need for more research to increase their safety. It also shows that Chinese researchers are paying attention to this specific subset of safety research.
Other relevant technical publications
Fudan University Generative AI Research Lab (GAIR), Carnegie Mellon University, and Fudan University Natural Language Processing Group, Alignment for Honesty, arXiv preprint, December 12, 2023.
Peng Cheng Lab and Tencent AI Lab, On Diverse Preferences for Large Language Model Alignment, arXiv preprint, December 12, 2023.
Tsinghua University CoAI Group, Zhipu AI, and Tsinghua KEG, CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation, arXiv preprint, November 30, 2023.
Expert Views on AI Risks
Chinese scholars Gao Wen and Zhang Bo discuss AI safety risks at Tsinghua-HKUST conference
Background: At the International AI Cooperation and Governance Forum 2023, Peng Cheng Laboratory Director GAO Wen (高文) and Tsinghua University Institute for Artificial Intelligence Honorary Dean ZHANG Bo (张钹) gave keynote speeches that discussed AI safety. Both are also academicians: Gao of the Chinese Academy of Engineering and Zhang of the Chinese Academy of Sciences.
Their speeches: Professor Gao’s speech was titled “Safety and Security Problems in the Development of New Generation AI.” He called for emphasizing AI ethics problems, taking systematic precautionary measures against those who would maliciously misuse AI systems, and paying close attention to technical questions around open-source and pre-trained models. He also referenced his previous article on “Technical Countermeasures for Security Risks of Artificial General Intelligence” and noted three main risks: lack of explainability, lack of reliability, and the risk that AI will gain self-awareness and escape human control. Meanwhile, Professor Zhang’s speech was titled “AI Governance in the Era of Generative AI.” He argued that it is important to govern model behavior as well as prevent misuse. He suggested that the former can be accomplished through alignment, which he thought may not be too difficult to achieve using human-machine interaction. He argued that misuse problems are more difficult, requiring the creation of explainable and stable AI theories to achieve safe, secure, controllable, trustworthy, and reliable AI.
Implications: These two scholars are among China’s most well-respected technical AI experts, and Concordia AI has translated their previous statements on AI risks. Their apparent focus on safety rather than capabilities development in this high-profile forum could increase the credibility and salience of concerns about AI risks among Chinese experts.
Concordia AI’s Recent Work
Concordia AI CEO Brian Tse attended the inaugural Singapore AI For Global Good Conference and co-chaired the session on Mitigating Catastrophic Risks & Ongoing Harms from AI.
Concordia AI Senior Governance Lead FANG Liang (方亮) participated in a seminar on AI Safety and Security Risks and Legal Rules, co-hosted by SFC Compliance Technology Institute and the Chinese Academy of Social Sciences Law Department on December 18.
Concordia AI Technical Program Manager Yawen Duan participated in the December 2023 Alignment Workshop, attended NeurIPS 2023, and was on the Organising Committee of the NeurIPS Socially Responsible Language Modelling Research (SoLaR) workshop.
Concordia AI CEO Brian Tse and Senior Program Manager Kwan Yee Ng participated in the Regulating Generative Artificial Intelligence Conference at the Philip K.H. Wong Centre for Chinese Law at the University of Hong Kong (HKU).
Feedback and Suggestions
Please reach out to us at info@concordia-ai.com if you have any feedback, comments, or suggestions for topics for the newsletter to cover.
NSFC is 国家自然科学基金委员会
智能涌现 translates directly to “intelligent emergence,” and likely refers to emergent capabilities of models. 内容安全性 translates directly to “content security” and likely refers to ensuring that generated content adheres to China’s regulations around content control and moderation. 自动评价 translates directly to “automated evaluation,” though it is unclear why the guide mentions automated model evaluations in particular.
CESI is a public institution overseen by the Ministry of Industry and Information Technology (MIIT) and often assists in drafting Chinese standards. For more information, see Concordia AI’s State of AI Safety in China report.