AI Safety in China #21

China-UK intergovernmental dialogue, capacity building, legislative agendas, AI security standards, alignment faking, AI and cybersecurity

May 30, 2025

China and the UK held their first intergovernmental dialogue on AI, discussing development and governance.
China continues emphasizing AI capacity building in developing countries through a second meeting of the UN “Group of Friends” it co-established last year.
Legislative plans suggest no rush to pass an AI Law—but AI-related legislation remains on the agenda.
Final versions of China’s first national standards on AI security have been released, continuing the focus on current harms, rather than emerging frontier risks, of last year’s drafts.
Chinese researchers have published technical papers on AI deception and “unlearning.”

International AI Governance

China and UK hold first intergovernmental dialogue on AI

Background: On May 20, China and the UK held their first intergovernmental dialogue on AI in Beijing (En, Cn). The Chinese delegation was led by the SUN Xiaobao (孙晓波), Director-General of the Ministry of Foreign Affairs (MOFA) Department of Arms Control, alongside officials from the National Development and Reform Commission (NDRC), Ministry of Science and Technology (MOST), Ministry of Industry and Information Technology (MIIT), and Cyberspace Administration of China (CAC). The UK delegation was headed by Chris Jones, Director of the International Science and Technology Department at the UK Foreign, Commonwealth and Development Office, and included officials from the UK Department for Science, Innovation and Technology, Cabinet Office, Department for Business and Trade, and British Embassy in China.

What was discussed: No official UK readout has been released yet, and the official Chinese readout contained limited details on the discussions, noting that both sides support:

Promoting the “healthy, safe, and orderly development” of AI;
Advancing the Global Digital Compact adopted at the 2024 UN Summit of the Future, with an emphasis on supporting capacity building in developing countries;
Continuing exchanges, mutual learning, and practical cooperation.

Several days before the dialogue, Chinese Ambassador to the UK ZHENG Zeguang (郑泽光) separately spoke at the 2025 Sino-UK Entrepreneur Forum, emphasizing the urgency of international cooperation on AI development and governance and the need for risk testing and assessments. However, Zheng cautioned that “some in the UK” risk “overstretching the concept of national security” in ways that could hinder bilateral scientific and technological collaboration.

Implications: The initiation of a formal dialogue signals growing mutual interest in AI governance and safety cooperation. As progress in the China-US intergovernmental dialogue may have stalled, dialogue with the UK may prove an important window for maintaining communication between China and major Western countries. As the UK government has frequently shown interest in frontier AI safety, it is possible that such concerns may have been on the agenda, though this remains unconfirmed due to the lack of detail in the Chinese summary and the absence of a UK readout. Ambassador Zheng’s remarks suggest China’s openness to cooperation with the UK on AI safety and governance, but also highlight the difficulties posed by geopolitics.

China pushes forward UN Initiatives on AI Capacity Building

UN “Group of Friends” holds second meeting: On May 6, the “Group of Friends” on AI capacity building convened its second meeting at UN headquarters in New York, with representatives from over 70 countries and international organizations attending. The Group was initiated by China and Zambia in 2024. Chinese Ambassador to the UN FU Cong (傅聪) emphasized China’s commitment to inclusive AI governance, stating, “China not only leads with initiatives, but also with actions.” Fu revealed that China and Zambia recently surveyed UN member states and international organizations on their expectations for the Group, with the feedback informing plans to launch regular policy exchanges, knowledge sharing, and cooperation. An official from the China Association for Science and Technology emphasized China’s contribution to global AI capacity through open source AI.

Second AI capacity building workshop: From May 12-17, 37 representatives from 35 developing countries gathered at Tsinghua University in Beijing for a capacity-building workshop, hosted by MOFA. A similar workshop was held last September in Shanghai. At the opening ceremony, Vice Minister of Foreign Affairs MA Zhaoxu (马朝旭) described China as an “active advocate, promoter, and pioneer” for global AI capacity building.

Implications: These efforts reflect China’s push to position itself as a leader in AI capacity building in the Global South under the UN framework—following its July 2024 UN resolution, September “Action Plan,” and December launch of the Group of Friends. These initiatives contrast with what China views as US-led tech protectionism. Workshops like the one held in Beijing this month appear to be an important component of these initiatives. However, more specific implementation details remain sparse.

Domestic AI Governance

Legislative plans do not prioritize an AI Law—but it remains on the agenda

Background: In May, the State Council and National People’s Congress (NPC) released their 2025 legislative plans. The State Council’s plan refers to “advancing legislative work for the healthy development of AI,” while the NPC’s plan echoes last year’s, calling for “legislative projects” on AI’s healthy development.

The State Council’s 2025 plan appears to mark a step back from the 2023 and 2024 language, which stated that a draft “AI Law” was being prepared for submission to the NPC Standing Committee. That announcement triggered academic efforts, with several scholars publishing draft “model laws.”

But the shift may be more nuanced: In two essays in People’s Daily and Legal Daily, ZHANG Linghan (张凌寒)—an influential AI policy expert and lead-author of an expert group-proposed AI Law draft—argues that the approach is evolving, not stalling. She claims that legislative plans are not rigid blueprints, with the new phrasing in the plans signalling a turn toward a more integrated, system-wide strategy rather than a rush to enact a standalone AI Law. Zhang writes that AI requires adaptive updates across the system, not just a single new law. She notes that this could entail:

Embedding AI into ongoing legislative work (e.g. revising the Road Traffic Safety Law to address assisted driving);
Advancing “small, fast, and targeted” regulations;
And eventually, building toward comprehensive legislation.

Implication: The AI Law remains on the table—but it is not treated as an urgent, one-off priority. Policymakers appears to be prioritizing agile, needs-based regulation first, with broader legal transformation unfolding over time. Far from abandoning the idea of AI legislation, the new plans likely reflect a desire to take a measured approach in crafting a wider set of laws and regulations to govern AI.

First national standards on generative AI security finalized

Background: On April 25, the Standardization Administration of China (SAC) released the final versions of three national standards for generative AI security, following public draft versions issued about a year earlier. The final standards retain most core provisions, while refining structure, adding detail, and introducing new sections on emerging practices such as synthetic training data and on-device AI.

The three standards cover:

Basic security requirements: This standard sets out overarching requirements, identifying 31 security risks across five categories: violation of core socialist values, discrimination, commercial misconduct, infringement of legitimate rights and interests, and domain-specific risks. It requires providers to test for these risks across the model lifecycle—through data filtering, fine-tuning, prompt monitoring, and output filtering. It includes detailed quantitative benchmarks for these evaluations. Compared to last year’s draft, the final version refines some requirements—such as mandating multilingual testing if the service supports multiple languages—and adds a new section requiring on-device (“edge”) models to meet equivalent moderation standards.
Pre-training and fine-tuning data security specifications: This document provides more detailed guidance for filtering training data for the same 31 risks. The final version includes new provisions on evaluating synthetic training data for hallucinations.
Data annotation security specifications: This standard focuses on data annotation processes like RLHF, distinguishing between “functional annotations” (enhancing capability) and “security annotations” (reducing risk). The final version introduces a new threshold—requiring fine-tuning on at least 200 data points per risk—but lowers the total required share of security annotations from 30% to 3%.

Implications: Taking effect in November 2025, all three are “recommended” national standards, intended to guide companies in conducting security self-assessments. In practice, however, they are treated as de facto requirements: generative AI service providers must submit documentation—including keyword libraries and evaluation datasets—during algorithm registration. Compared to their draft versions, they include refinements rather than major shifts. The requirements for “edge” AI may become relevant for some open-source models deployed on-premise, even though it remains unclear how such requirements could be enforced in practice.

The standards focus on regulating risks present in today’s deployed models, rather than emerging or future frontier risks. While an early preparatory technical document for the “Basic Security Requirements” had included a reference to frontier risks (deception, self-replication, ability to create malware, biological or chemical weapons), this reference was removed from the draft national standard and has not resurfaced in the final version.

National security white paper highlights AI safety/security

Background: On May 12, the State Council Information Office released the country’s first white paper on national security, including multiple references to AI and AI safety/security (Cn, En abstract).

Content on AI risks:

The white paper argues the world is at a critical crossroads due to intensifying geopolitical conflict, a backlash against globalization, surging non-traditional security threats (e.g. climate change, terrorism, pandemics), and “double-edged sword” effects of emerging technologies—where AI is explicitly named alongside quantum and biotech.
AI is included among the 20+ domains within China’s broad “big security” (大安全) framework.
The paper calls for building AI safety oversight and evaluation systems as well as accelerating AI-related legislation, emphasizing agile governance, tiered management, and rapid and effective response.
The paper opposes overstretching and over-using the concept of national security, and it identifies AI capacity building as a key area for international collaboration.

Implications: As noted in our previous two newsletters, AI risks are increasingly woven into China’s national security planning. While this white paper provides more visibility into how AI fits within the broader security landscape, it stops short of naming specific AI risk types. It continues calls in previous documents for AI safety oversight and emergency response. Interestingly, although it flags risks of both AI and biotech, the document does not explore potential intersections between the two domains.

Technical Safety Developments

This section highlights a few noteworthy papers on LLM deception and “unlearning” dangerous knowledge from the past month.

Deception

Mitigating Deceptive Alignment via Self-Monitoring: Anchor authors on this paper include Peking University professor YANG Yaodong (杨耀东) and Hong Kong University of Science and Technology Provost GUO Yike (郭毅可); Yang has been one of the most active Chinese AI safety researchers over the past few years. The paper introduces a method to reduce deceptive behavior in LLMs by embedding a self-monitoring mechanism into chain-of-thought (CoT) reasoning during reinforcement learning (RL) training. As the model thinks, it also evaluates its own reasoning for signs of deception, and this self-assessment is used as a reward signal during RL. The approach seeks to reward not just safe outputs, but honest reasoning itself. The authors also introduce a benchmark covering five deception types (sycophancy, strategic deception, honesty evasion, alignment faking, and sandbagging) and show their method reduces deceptive behavior by 43.8%.
Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems: This preprint from Fudan University explores “evaluation faking”—the phenomenon where models alter their behavior to appear safer when under evaluation. The authors designed honeypot scenarios—situations subtly or explicitly suggesting an evaluation context—and used CoT monitoring to detect signs that models recognized they were being evaluated. They find that more capable models are more prone to this behavior. The project was led by Fudan University School of Computer Science Dean YANG Min (杨珉), whose work we also covered in Newsletter #6.

Unlearning

Safety Alignment via Constrained Knowledge Unlearning: This preprint was written by researchers from multiple Chinese institutions, anchored by ZHANG Min (张民), Dean of the Institute of Computing and Intelligence at Harbin Institute of Technology (Shenzhen). The paper proposes a new method for unlearning unsafe content in language models. Their approach, Constrained Knowledge Unlearning (CKU), first identifies and locks neurons tied to useful knowledge in place. Using gradient ascent, the model is trained to become worse at producing unsafe outputs using a curated dataset of harmful prompts and responses. The authors claim CKU improves safety by 3-4x with only a ~0.15% drop in overall utility.
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs: This preprint from Hong Kong Polytechnic University and Carnegie Mellon University finds that many current methods for "unlearning" harmful knowledge in LLMs do not truly erase it—they just suppress it. The knowledge can quickly resurface with minimal retraining. The authors introduce a representation-level evaluation framework to look beyond surface metrics such as accuracy and distinguish between “reversible” and “irreversible” unlearning. Therefore, they claim they can diagnose whether unlearning has genuinely altered the model’s internal representations.

Other relevant technical publications

Fudan University, Shanghai Innovation Institute, OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation, arXiv preprint, April 18, 2025.
University of Science and Technology of China, A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents, arXiv preprint, April 20, 2025.
Zhejiang University, South China University of Technology, NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models, arXiv preprint, April 29, 2025.
National University of Singapore, Southeast University, et al., Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation, arXiv preprint, May 22, 2025.
Fudan University, Shanghai Innovation Institute, ReasoningShield: Content Safety Detection over Reasoning Traces of Large Reasoning Models, arXiv preprint, May 22, 2025.
Beijing Key Laboratory of Safe AI and Superalignment et al., Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society, arXiv preprint, May 23, 2025.

Expert views on AI Risks

Local official flags AI-driven cybersecurity risks

Background: On March 10, an official from the Cybersecurity Management Division of the Xi’an Municipal State Secrets Protection Bureau, WANG Ying (王鹰), published an essay on the Bureau’s official WeChat account outlining the cybersecurity risks posed by AI.

Content: Wang notes that while LLMs enhance vulnerability detection—as demonstrated by Google’s discovery of a long-standing OpenSSL flaw in November 2024—they also introduce new threats:

Shift in offense-defense balance: AI enables attackers to launch faster, large-scale automated exploits, while defenders remain constrained by manual oversight.
Lowered technical barriers: Generative AI allows non-experts to create sophisticated attacks using simple natural language.
Cyber-espionage: AI makes it easier and cheaper to conduct stealthy, long-term intrusions into sensitive systems, such as APT (Advanced Persistent Threat) attacks.

Wang advocates for “fighting AI with AI”—building proactive, AI-driven defense systems, automating vulnerability management, and enforcing dynamic, auditable access control.

Implications: This is one of the clearest public statements by a Chinese official on how AI may lower the threshold for executing advanced cyberattacks, signalling growing awareness and concern. However, as a local-level official, Wang’s remarks provide only a limited signal of the views of China’s cybersecurity establishment.

Legal scholars discuss balancing open-source development and security

Background: On May 23, ZHANG Linghan (张凌寒) and HE Jiaxin (何佳欣) from the China University of Political Science and Law published an essay proposing a legal framework to balance innovation and accountability in open-source AI. Zhang is the lead author of the “scholar suggestions draft” AI Law and a member of the UN High-Level Advisory Body on AI.

Content: The authors argue liability exemptions for open-source AI can be justified, citing benefits such as:

Public good value: Open-source models promote broad access and innovation.
Transparency and risk governance: Open communities can crowdsource audits, flag misuse, and manage risks.
Strategic value: Open models support technological independence amid intensifying global competition.

At the same time, they warn that open-source models—if fine-tuned or stripped of safeguards—can become “breeding grounds” for misuse. To address this, they propose strict eligibility criteria for liability exemptions, including:

Non-profit use only: Any monetization may disqualify a model.
High threshold for openness: Releasing weights alone is insufficient; avoid “open-washing.”
No malicious intent: Models must not include backdoors, deceptive behavior, or toxic outputs.
Responsible release: Use open licenses, include IP and citation rules, disclose data sources and processing methods, publish model cards and risk assessments, implement safeguards (e.g., monitoring and misuse prevention), and patch vulnerabilities in response to user feedback.

If these conditions are met, the authors argue, developers could be exempted from certain regulatory obligations—such as algorithm registration—because meaningful transparency would already be achieved through public disclosure.

Implications: With many leading open-source models like DeepSeek and Qwen emerging from China, the country’s regulatory approach will be especially consequential. While many legislative proposals—including Zhang’s own “scholar suggestions draft” AI Law—have taken a positive and supportive stance on open-source AI, this essay reflects a nuanced awareness of misuse risks while seeking to also harness the benefits of openness. Still, these debates remain at an early stage, and some proposals (e.g., monitoring usage or patching models) may prove difficult or currently impossible to implement in practice without further technical advances.

What else we’re reading

Angela Zhang, China isn’t trying to win the AI race, Financial Times, April 24, 2025.
Noelle Camp and Michael Bachman, Challenges and Opportunities for
US-China Collaboration on Artificial Intelligence Governance, Sandia National Laboratories, April 2025.

Concordia AI’s Recent Work

Concordia AI CEO Brian Tse and Head of International AI Governance Kwan Yee Ng participated in the Singapore Conference on AI (SCAI), where they contributed to The Singapore Consensus on Global AI Safety Research Priorities. Bringing together over 100 global experts, the Consensus represents a major step in identifying technical approaches to mitigating risks from advanced general-purpose AI systems. The drafting expert committee includes prominent Chinese scholars XUE Lan (薛澜) and ZHANG Ya-Qin (张亚勤).
We contributed an analysis of AI safety signals in China’s April Politburo study session on AI for the Stanford University DigiChina Forum.
Concordia’s AI Safety Research Manager Yawen Duan co-authored a paper on Bare Minimum Mitigations for Autonomous AI Development.

Feedback and Suggestions

Please reach out to us at info@concordia-ai.com if you have any feedback, comments, or suggestions for topics for the newsletter to cover.

AI Safety in China