Debunking the SCMP report that shaved 36 bln HKD off Baidu
Unpacking the complete misrepresentation of a vague, confusing, and disastrously written Chinese journal paper: there is no "military AI," no real experiment on Ernie/Spark, and thus no link with the PLA
The South China Morning Post (SCMP) on Friday, January 12, published "China's military lab AI connects to commercial large language models for the first time to learn more about humans." As a result, Baidu, the publicly traded Chinese technology giant named in the report, dived 11.5% in Hong Kong on Monday, January 15, the biggest drop since 2022. That amounts to a loss of 36 billion Hong Kong dollars in market value.
The SCMP report is based on a Chinese-language paper published in December 2023 in《指挥控制与仿真》 Command Control & Simulation, a journal belonging to the 716th Research Institute of the China State Shipbuilding Corporation Limited, a state-owned defense and civilian shipbuilding giant. The authors work for the Strategic Force Information Engineering University of the People's Liberation Army (PLA), the Chinese military.
SCMP is a Hong Kong-based newspaper that I have long respected. For quite some time, I have largely shunned criticizing individual international media pieces. (In August 2021, a Pekingnology post prompted corrections in the opinion pages of the Washington Post and the Wall Street Journal, respectively.) However, given the huge stakes, I have decided to make an exception here by presenting a detailed debunking of the SCMP report.
The post is structured as follows:
Executive Summary
The Non-existent "Military AI"
The Prompt-and-Answer Masquerading as "Experiments"
What Would the Situation Awareness Agent Look Like?
Paragraph-by-Paragraph Analysis
Delayed Correction Highlight
A Critique of the Chinese Journal Paper
Voluntary Disclosure: This is NOT investment advice. I'm not trading Baidu or any related derivatives.
Executive Summary
The SCMP report claims, solely based on the Chinese journal paper, that the PLA researchers had tested their standalone "military AI" on Baidu's Ernie and iFlyTek's Spark, which are large language models (LLMs) similar to ChatGPT, where
the military AI automatically generates prompts for deeper exchange on various tasks such as combat simulations. The entire process is completely free of human involvement.
Here is what I found: the SCMP report is a complete misrepresentation of a vague, confusing, and disastrously-written article, and is untrue.
Contrary to the SCMP report's claim, the Chinese journal paper did not present or even imply a "military AI," but was merely theorizing - remotely - about the possibilities of utilizing LLMs. The PLA researchers did not, in any meaningful way, "experiment," as they claimed - and as the SCMP cited without due editorial consideration - on Ernie or Spark. What the paper documented, upon close examination, was simply manually inputting prompts like every other user of commercial LLMs and recording the answers.
Crucially, we are able to establish that one prompt listed in the journal paper, delivered as a picture, is identical to publicly sourced pictures. That conclusively affirms our key point that vital human involvement is at play in the process.
The Chinese journal paper does NOT appear to meet the basic requirements of academic publishing, and the fact that it appeared in a peer-reviewed journal doesn't change that. My colleague Ms. Yuxuan Jia wrote a scathing but convincing critique, presented at the end of this post.
The foundational claim of the SCMP report is that the PLA researchers were using an established military AI and, without showing any evidence of how this was done, testing the military AI on the commercial LLMs. The SCMP report apparently hedged this crucial but unfounded link with the note that the PLA research "team did not give details in the paper of the link between the two systems," laying bare that it, in fact, could not find one in the journal paper.
The truth is that there wasn't a link because there wasn't a military AI to begin with, and the "test" on Baidu's Ernie and iFlyTek's Spark was just asking a question and getting an answer, like everyone who has ever tried ChatGPT.
The Non-existent "Military AI"
Three parts of the journal paper make it abundantly clear the PLA researchers were merely theorizing - remotely - about the possibilities of utilizing large language models in building a potential, not-yet-existent 态势认知智能体 "situation awareness agent."
The abstract of the journal paper reads:
针对战场态势信息众多、变化趋势认知困难的问题,提出基于大模型的态势认知智能体框架和智能态势认 知推演方法。 从认知概念出发,结合智能体的抽象性、具身性特点,明确了智能体构建的三个关键环节:学习环境、 记忆方式和产生知识机制等;设计了战场态势认知智能体架构,包括记忆部件、规划部件、执行部件、评估部件以及 智能体训练要点。 在长期记忆部件中,围绕战场复杂状态建模特点,论述大语言模型、多模态大模型、大序列模型的 运用问题。
Addressing the issue of numerous battlefield situational information, and the difficulty of being aware of the trends of battlefield situations, this paper proposes a framework for a situation awareness agent based on large language models (LLMs) and an intelligent method for situation awareness deduction. Starting from cognitive concepts and combining the abstract and embodied characteristics of agents, the paper clarifies three key aspects of agent construction: learning environment, memory methods, and knowledge generation mechanisms. It designs an architecture for a battlefield situation awareness agent, including components for memory, planning, execution, evaluation, and key points for training the agent. In the long-term memory component, the paper discusses the application issues of LLMs, multimodal LLMs, and large sequence models, focusing on the modeling characteristics of complex states on the battlefield.
A part of the introduction of the journal paper says:
本文提出利用大模型构建态势认知智能体,智能体在仿真环境下推演学习复杂态势规律,构建从战场 之“态”到预测战场之“势”之间的复杂映射关系知识。 与AIGC (Artificial Intelligence Generated Content)一样, 它有望生成多种态势的发展路径,以供人类最终决策。 一是从战场态势智能推演认知角度探讨认知概念和智能体认知;二是设计了态势认知智能体框架,包括了记 忆部件、规划部件、执行部件和评估部件;三是针对智 能体中的长期记忆部件,围绕战场状态表征,分析了大 语言模型、多模态大模型和大序列模型的运用问题。
This article proposes the use of LLMs to construct situation awareness agents, which learn and infer complex situational patterns in a simulated environment, building complex mapping relationships between the "state" of the battlefield and the prediction of its "situation." Similar to AIGC (Artificial Intelligence Generated Content), there is hope to generate various developmental paths of situations for human final decision-making. The paper focuses on three main aspects: first, discussing cognitive concepts and agent cognition from the perspective of intelligent deduction of battlefield situations; second, designing a framework for situation awareness agents, including memory, planning, execution, and evaluation components; third, addressing the use of LLMs, multimodal LLMs, and large sequence models in the long-term memory components of agents, particularly about battlefield state representation.
The conclusion of the journal article says:
从产生认知的角度,文章提出了战场态势认知智能体的构建方法,设计了记忆部件、规划部件、执行部件和评估部件,探讨了大语言模型、多模态 大模型和大序列模型在战场态势智能体中的运用。 本文基于大模型构建战场态势认知智能体是一个初步的想法和尝试。 需要指出的是,对抗仿真环境往往决定了知识的抽象层次和具身特点,与生物智能体 相比,特别是人类而言,认知智能体获得知识的灵活性仍有欠缺。 如何进一步提升态势认知智能体知识的通用性,比如能适应各种分辨率的兵力单元,以及如何快速构建训练智能体长期记忆部件的数据都是下一步研究的重点。
From the perspective of cognitive generation, this article proposes a construction method for a battlefield situation awareness agent. This article designs memory, planning, execution, and evaluation components, and discusses the application of LLMs, multimodal LLMs, and large sequence models in battlefield situation agents. This paper is a preliminary idea and attempt to build a battlefield situation awareness agent based on LLMs. It should be noted that the adversarial simulation environment often determines the level of knowledge abstraction and embodiment characteristics. Compared to biological agents, especially humans, awareness agents still lack flexibility in knowledge acquisition. The next key areas of research include how to further enhance the universality of knowledge for situation awareness agents, such as adapting to military units of various resolutions, and how to rapidly construct datasets used for training the long-term memory components of agents.
Despite the complex and confusing language, it's clear that the journal paper, when generously taken at its word, was proposing "a construction method" for a military AI, which means there is no military AI, contrary to the SCMP report's repeated invocation - including in the lede - of an existing one:
Chinese scientists are teaching an experimental military artificial intelligence (AI) more about facing unpredictable human enemies with the help of ChatGPT-like technologies.
According to scientists involved in the project, a research laboratory with the People’s Liberation Army’s (PLA) Strategic Support Force, which oversees the Chinese military’s space, cyber, intelligence, and electronic warfare, has tested its AI system on Baidu’s Ernie and iFlyTek’s Spark, which are LLMs similar to ChatGPT.
The military AI can convert vast amounts of sensor data and information reported by frontline units into descriptive language or images and relay them to the commercial models. After they confirm they understand, the military AI automatically generates prompts for deeper exchange on various tasks such as combat simulations...
The military AI provided information about the weapons and deployment of both armies to Ernie.
This Pekingnology post could therefore end right here, but there is another key point central to the SCMP report's claim - that the PLA researchers tested on Ernie and Spark.
The Prompt-and-Answer Masquerading as "Experiments"
While it's true that the journal paper did claim it conducted 大型语言模型运用试验 "experiments on utilizing LLMs," upon close examination, it's just three prompts and three answers.
The PLA researchers, in their "experiments" on Baidu's Ernie and iFlyTek's Spark, conceptualized their theoretical (situation awareness) agents as comprising two components: one referred to as the long-term memory component and the other as the planning component.
The paper, in a confusing turn of phrase, says that while prior interactions with one LLM constitute the long-term memory component, every response generated by the LLM also serves as a prompt triggering the LLM's very next response, thereby acting as the planning component:
本节尝试使用通用LLM作为推演认知智能体的长期记忆部件,规划部件简化为直接使用LLM的输出作为下一步的对手行动任务
This section attempts to use a general-purpose LLM as the long-term memory component of the deduction-cognition agent, with the planning component simplified to directly using the LLM's output as the opponent's next action task.
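Stripped of jargon, that "planning component" is nothing more than feeding each answer back in as the next prompt. The following is an illustrative sketch only - not code from the paper - and `ask_llm` is a hypothetical stand-in for a human manually pasting a prompt into a chatbot such as Ernie and copying out the answer:

```python
# Illustrative sketch only - not from the paper. `ask_llm` stands in for
# manually submitting a prompt to a commercial chatbot and copying the reply.
def ask_llm(prompt: str) -> str:
    # Placeholder response; in reality a human operator would obtain this
    # from the chatbot's web interface.
    return f"predicted opponent action, given: {prompt[:40]}"

def chained_deduction(initial_prompt: str, rounds: int) -> list[str]:
    """The answer-as-next-prompt loop the paper dignifies as 'planning'."""
    transcript = []
    prompt = initial_prompt
    for _ in range(rounds):
        answer = ask_llm(prompt)
        transcript.append(answer)
        prompt = answer  # the output becomes the next "opponent action task"
    return transcript

history = chained_deduction("Libyan and U.S. force deployments ...", rounds=3)
```

Note that with the real Ernie or Spark, which do not "self-prompt," a human must perform every iteration of this loop by hand; nothing in it is autonomous.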
In the first "experiment," the journal paper says:
这里假设己方为利军,根据当前知道的美军情况、利军力量情况及战场情况的自然语言描述构建Prompt,由大模型中的知识进行下一步的推理,得到对手美军下一步的行动预测
Assuming the LLM to be on the side of the Libyan army, [we] construct a natural language Prompt based on the known information of the situation of the U.S. forces, the Libyan forces, and the battlefield, using knowledge within the LLM to infer and predict the next actions of its opponent: the U.S. forces.
This is what was first actually put to Baidu's Ernie (the paper omits a significant part of the PLA researchers' input with "..."):
利军兵力部署
的黎波里部署了2个SA-2和SA-3导弹旅,位于首都东西两侧。班加西部署了。。。
美军兵力部署一、第3航空队。F-111F战斗轰炸机,24架;...
重点打击目标
摧毁卡扎菲指挥部及其住所(的黎波里阿齐齐亚其住所兵营)...
请依据美利双方兵力部署,美军如何组织飞机在短时间内完成该行动,请给出方案。假设美军分为两个阶段:摧毁利军的防空能力阶段和饱和式攻击重点打击目标阶段,请输出蓝军第一阶段侦察的路线。
Libyan Military Deployment:
Two SA-2 and SA-3 missile brigades were deployed in Tripoli, positioned on the east and west sides of the capital. In Benghazi...
U.S. Military Deployment:
1. The Third Air Force: 24 F-111F fighter-bombers; …
Key Targets for Attack:
Destroy Gaddafi's command center and residence (Al Aziziyah Barracks in Tripoli)…
According to the U.S. and Libyan force deployments, please propose a plan for the U.S. military to organize aircraft to complete the operation in a short time. Suppose the U.S. Air Force's operation is divided into two phases: the first phase of neutralizing Libyan air defenses and the second phase of saturation attacks on primary strike targets. Please provide the route of the Blue Force's [referring to the U.S.] reconnaissance during the first phase.
This is what the paper says is the output from Ernie:
根据美军的兵力部署,以下是可能的兵力分配方案:
第一阶段:摧毁利比亚的防空能力侦察阶段。美军第3航空队的F-111F战斗轰炸机和EF-111A电子战飞机可以各分配一半兵力,进行侦察和评估任务。其中,每架F-111F配备2名飞行员,每架E-111A配备2名飞行员。另外,可以派遣2架E-2C空中预警机和2架KC-10A、KC-135空中加油机提供支援。攻击阶段。美军第3航空队的F-111F和A-6E攻击机可以各分配一半兵力,对利比亚的防空导弹发射架进行饱和式攻击。其中,每架F-111F配备2名飞行员,每架A-6E配备2名飞行员。另外,可以派遣2架EF-111A电子战飞机进行电子战支援。同时,为了提高攻击效果,可以协调地中海第6舰队的舰载机一同参与攻击。分配如下:美国号航母:派遣8架F/A-18舰载战斗攻击机和8架EA-6B电子战飞机进行支援。珊瑚海号航母:派遣6架F/A-18舰载战斗攻击机和6架EA-6B电子战飞机进行支援。
第二阶段:饱和式攻击重点打击目标..
Based on the U.S. military force deployment, the following is a potential allocation plan for the operation:
Phase I: Reconnaissance stage of neutralization of Libyan air defense capabilities
The Third Air Force’s F-111F fighter-bombers and EF-111A electronic warfare aircraft can each allocate half of their forces to carry out reconnaissance and assessment tasks. Each F-111F will be crewed by two pilots, as will each EF-111A. Additionally, 2 E-2C Hawkeye airborne early warning aircraft, along with 2 KC-10A and KC-135 aerial refueling tankers, can be deployed to provide support.
Attack Stage. The Third Air Force’s F-111F fighter-bombers and A-6E Intruder attack aircraft can also allocate half of their forces to conduct saturation attacks on Libyan air defense missile launchers. Each F-111F and A-6E will be crewed by two pilots. In addition, 2 EF-111A electronic warfare aircraft can be dispatched to provide electronic warfare support. Meanwhile, to enhance the effectiveness of the attack, coordination with the Sixth Fleet in the Mediterranean is suggested to involve its carrier-based aircraft in the assault. The allocation is as follows:
• USS America (CV-66): Deploy 8 F/A-18 carrier-based fighter-attack aircraft and 8 EA-6B electronic warfare aircraft for support.
• USS Coral Sea (CV-43): Deploy 6 F/A-18 carrier-based fighter-attack aircraft and 6 EA-6B electronic warfare aircraft for support.
Phase II: Saturation attacks on primary strike targets…
In the second "experiment," this is what the PLA researchers further put to Ernie:
假设利比亚军队获悉美军的进攻部署,结合利军兵力单元,利比亚如何进行应对?
Assuming the Libyan military becomes aware of the U.S. military's offensive deployment, how should Libya respond, taking into account its own military force units?
This is what the PLA researchers got from Ernie:
根据您提供的信息,利比亚获悉美军的进攻部署后,可以采取以下应对措施:
1.防空系统:利比亚可以集中其防空导弹旅,包括SA-2和SA-3防空导弹旅,以及SA-6防空导弹团,部署在美军可能攻击的重点区域,如的黎波里和班加西等城市,以加强对空防御能力。
2.反舰导弹攻击:利比亚可以动用其反舰导弹部队,对美军航母进行攻击。虽然美军的航母具备强大的防御能力,但反舰导弹的饱和式攻击可能会对航母造成一定的威胁。
3.电子战:利比亚可以加强其电子战力量,利用电子战飞机和电子战部队对美军的电子战飞机和预警机进行干扰和压制,降低美军的空中作战能力。
4.地面部队:利比亚可以部署其地面部队,对美军可能的登陆行动进行抵抗。虽然利比亚的军队装备可能较为落后,但可以利用地形和城市防御等有利条件进行抵抗。
5.协调作战:利比亚可以加强与其他国家的协调作战,例如与俄罗斯等国家进行合作,共同应对美军的进攻。需要注意的是,以上措施仅为假设,实际应对措施需根据利比亚的具体情况进行调整。
Based on the information you provided, after becoming aware of the U.S. military's offensive deployment, Libya could take the following countermeasures:
1.Air Defense Systems: Libya could concentrate its air defense missile brigades, including the SA-2 and SA-3, as well as the SA-6 missile battalions, deploying them in key areas likely to be targeted by the U.S., such as Tripoli and Benghazi, to strengthen its air defense capabilities.
2.Anti-Ship Missile Attack: Libya could utilize its anti-ship missile forces to attack U.S. aircraft carriers. Although U.S. carriers have strong defensive capabilities, saturation attacks with anti-ship missiles could pose a threat to them.
3.Electronic Warfare: Libya could enhance its electronic warfare capabilities, using electronic warfare aircraft and units to interfere with and suppress U.S. electronic warfare aircraft and AWACS, thereby reducing the U.S. Air Force's combat effectiveness.
4.Ground Forces: Libya could deploy its ground forces to resist potential U.S. landing operations. Although Libyan military equipment might be outdated, they could use terrain and urban defense advantages for resistance.
5.Coordinated Operations: Libya could strengthen coordinated operations with other countries, such as cooperating with Russia, to jointly confront the U.S. offensive. It is important to note that these measures are hypothetical, and actual countermeasures should be adjusted according to Libya's specific situation.
In another - and the sole - attempt at iFlyTek's Spark, the PLA researchers gave a written instruction with a map, which the SCMP report claims was generated by AI:
下图是红军(伊拉克军队)和蓝军(美军)对抗的态势图,请归纳美18空降军、美7军、美第一陆战远征部队、美陆战3师、阿拉伯联合部队、科沙部队的下一步行动。
The following is a situational map of the confrontation between the Red Army (Iraqi military) and the Blue Army (U.S. military). Please summarize the next actions of the U.S. 18th Airborne Corps, the U.S. 7th Army, the 1st Marine Expeditionary Force, the U.S. 3rd Marine Division, the Arab Joint Forces, and the Kuwait-Saudi forces.
iFlyTek’s Spark answered:
美18空降军和阿拉伯联合部队会向巴士拉方向前进,而美7军和美第一陆战远征部队则会向哈费尔巴廷前进,美陆战3师和科沙部队也会向哈费尔巴廷进军。
The U.S. 18th Airborne Corps and the Arab Joint Forces will advance towards Basra, while the U.S. 7th Army and the 1st Marine Expeditionary Force will move towards Hafar Al-Batin. The U.S. 3rd Marine Division and the Kuwaiti-Saudi forces will also march towards Hafar Al-Batin.
That's all the "experiments" in the journal paper, the sole basis of the SCMP report titled "China's military lab AI connects to commercial LLMs for the first time to learn more about humans."
What Would the Situation Awareness Agent Look Like?
The intimidating name is repeatedly used by the PLA researchers to represent their ultimate goal, similar to what the SCMP report terms "military AI." As confusing as it is, fortunately, the paper lists some "current agents" on which the PLA researchers' intended, theorized agents could be modeled:
当前智能体主要运用LLM作为其长期记忆,已经能够完成一定的任务。 表1给出了比较知名的智能体、运用的LLM及其适用任务。在这些智能体中,任务目标和环境状态都是用自然语言进行描述,输入LLM后,唤醒LLM中存储的知识,推理输出任务实施 的分解步骤,即一个规划,对规划进行评估后按可行规划实施。 在实施后,智能体依据环境反馈微调形成工 作记忆的方法,甚至微调LLM,不断适应环境,认知环境。
The current agents primarily use LLMs as their long-term memory and are already capable of completing certain tasks. Table 1 presents a comparison of well-known agents, the LLMs applied by the agents, and the applicable tasks. These agents use natural language to describe both the task objectives and the environmental status. After being input into the LLM, these task objectives and the environmental status awaken the knowledge stored within the LLM, and the LLM infers the decomposed steps of task implementation, that is, a plan. The plan is then evaluated and implemented according to its feasibility. After implementation, the agents adjust their working memory based on environmental feedback, even fine-tuning the LLM, to continually adapt to and cognize the environment.
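The loop the paper describes - natural-language state in, a decomposed plan out, evaluation, execution, memory adjustment - can be sketched in a few lines. This is a hypothetical reconstruction for the reader's orientation, not code from the paper; every function here is a stand-in:

```python
# Hypothetical sketch of the LLM-as-long-term-memory agent loop described
# above. None of this comes from the paper; each component is a stand-in.
def llm_plan(task: str, state: str) -> list[str]:
    # Stand-in for the LLM: the task and environment description "awaken"
    # stored knowledge and yield a decomposed plan (a list of steps).
    return [f"step 1 for '{task}' given '{state}'"]

def evaluate(plan: list[str]) -> bool:
    # Stand-in for the evaluation component: judge feasibility before acting.
    return len(plan) > 0

def execute(plan: list[str]) -> str:
    # Stand-in for the execution component: act and observe feedback.
    return f"executed {len(plan)} step(s)"

def agent_step(task: str, state: str, working_memory: list[str]) -> str:
    """One pass of the plan -> evaluate -> execute -> adjust-memory cycle."""
    plan = llm_plan(task, state)
    if not evaluate(plan):
        return "plan rejected"
    feedback = execute(plan)
    working_memory.append(feedback)  # adjust working memory from feedback
    return feedback
```

Even this toy version makes the gap obvious: AutoGPT-style agents require exactly this kind of orchestration code around the LLM, which the journal paper never builds.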
Paragraph-by-Paragraph Analysis
All paragraphs that talk about the paper are listed below in their order of appearance.
Chinese scientists are teaching an experimental military artificial intelligence (AI) more about facing unpredictable human enemies with the help of ChatGPT-like technologies.
According to scientists involved in the project, a research laboratory with the People’s Liberation Army’s (PLA) Strategic Support Force, which oversees the Chinese military’s space, cyber, intelligence, and electronic warfare, has tested its AI system on Baidu’s Ernie and iFlyTek’s Spark, which are LLMs similar to ChatGPT.
As explained above, there's no "experimental military artificial intelligence (AI)" as alleged in the report. The vague, confusing, and disastrously written Chinese journal paper remotely theorizes the possibilities of designing one.
The military AI can convert vast amounts of sensor data and information reported by frontline units into descriptive language or images and relay them to the commercial models. After they confirm they understand, the military AI automatically generates prompts for deeper exchange on various tasks such as combat simulations. The entire process is completely free of human involvement.
“The simulation results assist human decision-making ... and can be used to refine the machine’s combat knowledge reserve and further improve the machine’s combat cognition level,” they wrote.
Again, there is no "military AI." What the PLA researchers tested was actually just giving prompts to one LLM, getting an answer, and then using the answer as the next prompt to generate another answer. The "vast amounts of sensor data and information reported by frontline units" converted "into descriptive language or images" do not exist. The PLA researchers' theorized military AI is, by their own admission, like AutoGPT, Generative Agents, Voyager, MetaGPT, and Web Agent, which are based on existing commercial LLMs but require significant extra resources and coding that the paper doesn't account for.
The entire process is completely free of human involvement.
This claim comes from the Chinese text:
本文的智能态势推演认知就是指借助人工智能最新技术,让机器像人一样根据现有作战知识, 依据战场的“态”组合敌我“合理”行动路线,全自动地推演作战结果
This article's discussion of intelligent situational deduction refers to the use of the latest artificial intelligence technology to enable machines to, like humans, deduce combat outcomes of its own volition based on existing warfare knowledge and by assessing the "state" of the battlefield to combine both enemy and friendly "reasonable" courses of action.
The key mistranslation is of the Chinese word 全自动, which appears to mean "completely free of human involvement" but, upon further examination - especially in the context of the entire article - means of its own volition, namely the process of a chatbot generating answers by itself and then using each answer as the next prompt. The process described in the text surely required initial human involvement, as well as manually inputting a generated answer as the next prompt, given that Ernie is not designed to "self-prompt" as AutoGPT is.
This is the first time the Chinese military has publicly confirmed its use of commercial LLMs. For security reasons, military information facilities are generally not directly connected to civilian networks. Sun’s team did not give details in the paper of the link between the two systems, but stressed that this work was preliminary and for research purposes.
If "use" is defined as giving an LLM a prompt and sees an answer, as every user can, I guess that makes the statement true.
As demonstrated above, the journal paper presents or implies no Chinese military AI system, so there couldn't be a link with Ernie or Spark.
The work was clearly preliminary, as shown above.
Sun and his colleagues said their goal was to make military AI more “humanlike”, better understanding the intentions of commanders at all levels and more adept at communicating with humans.
The key phrase "humanlike" betrays that the sentence is paraphrasing the following text, which didn't mention "intentions of commanders at all levels and more adept at communicating with humans."
本文的智能态势推演认知就是指借助人工智能最新技术,让机器像人一样根据现有作战知识, 依据战场的“态”组合敌我“合理”行动路线,全自动地推演作战结果
This article's discussion of intelligent situational deduction refers to the use of the latest artificial intelligence technology to enable machines to, like humans, deduce combat outcomes on its own volition based on existing warfare knowledge and by assessing the "state" of the battlefield to combine both enemy and friendly "reasonable" courses of action.
In the paper, Sun’s team discussed one of their experiments that simulated a US military invasion of Libya in 2011. The military AI provided information about the weapons and deployment of both armies to the LLMs. After several rounds of dialogue, the models successfully predicted the next move of the US military.
Again, there was no "military AI". And, as we have meticulously translated, the information about "the weapons and deployment of both armies" was input by the PLA researchers, not some non-existent "military AI." The error largely stems from a misunderstanding of the admittedly confusing phrasing in the Chinese journal paper
图 3 给出了基于国产“百度文心”大模型的智能体战场态势预测实验。智能体采用自然语言描述战场态势,如图 3( a) 所示,战场态势由初始时利比亚(简称利军)兵力部署、美军兵力及武器平台部署构成,战场环境体现在部署处所,例如 “的黎波里阿齐齐亚其住所兵营等”。
Figure 3 presents a situation awareness agent's battlefield prediction experiment based on the domestically developed Baidu Ernie LLM. An agent uses natural language to describe the battlefield situation, as shown in Figure 3 (a). The battlefield situation is constituted by the initial troop deployment of Libya (referred to as Libyan forces), the U.S. military forces, and the deployment of weapons platforms, with the battlefield environment reflected in the deployment locations, such as "the Aziziya barracks in Tripoli, etc."
However, the immediately following paragraph in the Chinese journal paper was clearer that the prompt was written in natural language - by the PLA researchers:
这里假设己方为利军,根据当前知道的美军情况、利军力量情况及战场情况的自然语言描述构建Prompt,由大模型中的知识进行下一步的推理,得到对手美军下一步的行动预测
Assuming the LLM to be on the side of the Libyan army, [we] construct a natural language Prompt based on the known information of the situation of the U.S. forces, the Libyan forces, and the battlefield, using knowledge within the LLM to infer and predict the next actions of its opponent: the U.S. forces.
Sun’s team claimed that such predictions could compensate for human weaknesses. “As the highest form of life, humans are not perfect in cognition and often have persistent beliefs, also known as biases,” Sun’s team wrote in the paper. “This can lead to situations of overestimating or underestimating threats on the battlefield. Machine-assisted human situation awareness has become an important development direction.”
The corresponding original Chinese text actually did not follow the discussion of the so-called "experiment" on Libya-U.S. warfare but was at the beginning of the journal paper.
Sun’s team also said there were still some issues in the communication between military and commercial models, as the latter were not specifically developed for warfare. For instance, Ernie's forecasts are sometimes vague, giving only a broad outline of attack strategies without the specifics that military commanders need.
This actually speaks, from a different angle, to why the SCMP report shouldn't be news. As discussed above, the PLA researchers were merely giving natural language prompts that describe the battleground deployment of U.S. and Libyan military forces. What else were they expecting? Magical answers that could have changed history?
Let's take another look at the response generated answering how Libya should respond to the U.S. military's offensive deployment:
Based on the information you provided, after becoming aware of the U.S. military's offensive deployment, Libya could take the following countermeasures:
Air Defense Systems: Libya could concentrate its air defense missile brigades, including the SA-2 and SA-3, as well as the SA-6 missile battalions, deploying them in key areas likely to be targeted by the U.S., such as Tripoli and Benghazi, to strengthen its air defense capabilities.
If you have air defense missile units, wouldn't you "provide key protection over vital targets" anyway? Tripoli is the capital and Benghazi is Libya's second-largest city plus an important port. Why does anyone need an AI to say that? What was the so-called "experiment" intended for in the first place?
In response, the team experimented with multi-modal communication methods. One such approach involved military AI creating a detailed military map, which was then given to iFlyTek’s Spark for deeper analysis. Researchers found that this illustrative approach significantly improved the performance of the LLMs, enabling them to produce analysis reports and predictions that met practical application requirements.
This contradicts the SCMP report's description of a "process [that] is completely free of human involvement." The map is not "detailed," and it was not created by the non-existent "military AI."
The map is NOT a detailed military map. Just take a look.
The map is, in all likelihood, not created by the non-existent military AI.
It looks like a picture taken from an existing paper-based source. While we couldn't locate the exact same map from open sources, we have found two identical maps from open sources.【1】【2】 All the arrows and lines - and even the signs representing parachutes - are identical to those in the map the SCMP alleges was created by the non-existent military AI.
Also, the Chinese article clearly says the map was from a classic battle case. One minute of searching shows it was from Operation Desert Storm in the First Gulf War.
本节尝试将经典战例中的态势图转为图片,连同自然语言描述的兵力预测要求一起作为prompt,输入国产“讯飞星火” 多模态大模型。
This section attempts to convert a situational map from a classic battle case into an image, and along with the natural language description that requires a troop strength prediction, uses them as a prompt to input into the domestically produced iFlyTek Spark multimodal LLM.
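Stripped of jargon, the "multimodal" step is simply a prompt that pairs a scanned image with a text question. A hypothetical sketch of such a payload follows; the function name and dict layout are illustrative inventions, not Spark's actual API:

```python
import base64

def build_multimodal_prompt(image_path: str, question: str) -> dict:
    """Pack a map image plus a natural-language question into one prompt.

    The dict layout here is illustrative only; real multimodal APIs
    differ in their exact field names and encoding conventions.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "image": image_b64,  # the scanned situational map
        "text": question,    # e.g. "summarize the next actions of ..."
    }
```

In other words, what the paper did is what any user of a multimodal chatbot does: upload a picture and type a question alongside it.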
Sun acknowledged in the paper that what his team disclosed was only the tip of the iceberg of this ambitious project. Some important experiments, such as how military and commercial models can learn from past failures and mutually acquire new knowledge and skills, were not detailed in the paper.
"The tip of the iceberg," a unique expression especially in the Chinese context 冰山一角, could not be located in the journal paper. What the authors said was that the paper was only a preliminary idea and attempt to build a battlefield situation awareness agent based on LLMs. Again, there was no military AI, just theorizing - remotely - about the possibilities of utilizing LLMs.
Delayed Correction Highlight
The SCMP’s initial report, in its second paragraph, said:
According to scientists involved in the project, a research laboratory with the People’s Liberation Army’s (PLA) Strategic Support Force, which oversees the Chinese military’s space, cyber, intelligence and electronic warfare, has forged a physical link between its AI system and Baidu’s Ernie and iFlyTek’s Spark, which are LLMs similar to ChatGPT.
The SCMP report, as it stands now, says:
According to scientists involved in the project, a research laboratory with the People’s Liberation Army’s (PLA) Strategic Support Force, which oversees the Chinese military’s space, cyber, intelligence and electronic warfare, has tested its AI system on Baidu’s Ernie and iFlyTek’s Spark, which are LLMs similar to ChatGPT.
This material, significant edit means the initial report simply didn't have any basis to say there was a physical link - an assertion that would imply Baidu's and iFlyTek's informed cooperation with the PLA researchers.
The best practice in English-language journalism would be to flag the correction on the web page immediately, which the SCMP didn't do until 6:04 pm (GMT+8) Monday, after Baidu had dived 11.5% over a full day's trading.
A Critique of the Chinese Journal Paper
The fundamental error of the SCMP post is the glorification of the plain and preliminary results, or lack thereof, in the paper, which, upon closer inspection, consists only of an introduction, a semi-literature review+hypothesis, and an elementary attempt (certainly not an experiment) at feeding simple prompts into China's domestic LLMs and pasting their responses.
Here, I will explain why this paper lacks academic rigor and substantive findings.
The paper is composed of three sections. Section I, titled "Construction of a Battlefield Situation Awareness Agent", merges elements of a literature review with a hypothesis. It begins with the premise that "Humans, as the highest form of beings, are not flawless in their cognition. They often exhibit 'persistent beliefs,' which may lead to what is known as cognitive bias." This statement swiftly transitions to an initial conclusion that human cognition is likely to introduce vulnerabilities in battlefields, which is immediately followed by what seems to be the central proposition of the paper:
By leveraging the latest advancements in artificial intelligence, machines can emulate human-like decision-making based on existing combat knowledge. They can intelligently combine the "state" of the battlefield to generate "reasonable" courses of action for both friendly and adversary forces. This automated process allows for the deduction of combat outcomes. On one hand, the deduction results can assist human decision-making, and on the other hand, they can be used to refine the machine's reservoir of combat knowledge, thereby further enhancing the machine's level of combat cognition.
The idea of the proposition is straightforward: the human brain has its limitations, so why not augment our capabilities with Artificial Intelligence? The first major issue with the paper lies in the subsequent literature review, which fails to meet expected academic standards and certainly does not provide sufficient literature support for the ambitious vision that the researchers intend to convey. This review spans from explaining the etymology of "agent" tracing back to Aristotle, through a really silly explanation of LLM operations:
For the knowledge stored in LLM, it needs to be accessed through input prompts (Prompts). Information stored in a database can be accessed through keyword searches. The planning program generates appropriate Prompts for the current planning based on the embedded representation of external perceptions, known as sensory memory. Prompts are also referred to as working memory. Sensory memory and working memory both fall under short-term memory. Working memory is typically limited and constrained by the Transformer-like neural networks in LLM, with a limited context window length as described in [16].
to the components of Large Language Models (LLMs), and a rough idea of how those components might be adapted for military scenarios:
2) Designing a framework for the agent
...Prior to the commencement of the confrontation, it is possible for the assessment component to utilize a virtual environment to observe the efficiency of planning and execution components under the current knowledge level through simulation, i.e., 'mental calculation.' Based on the simulation results, one can learn and adjust the battlefield 'state and conditions' knowledge stored in the memory components. After the confrontation concludes, it is also necessary to revise long-term memory based on real battlefield situation data.
The paper offers no practical suggestions as to how this could be done.
A whimsical proposition and a silly literature review, under most circumstances, would signal a lackluster academic paper. However, even more deserving of criticism is Section II, which ambiguously blends "method" and "results" under the title "Experiments on Utilizing LLMs".
I remember the first time I was introduced to ChatGPT: I couldn't stop asking it all sorts of tricky questions, such as predictions about future China-US relations. It would appear that the "experiments" in this paper, if they really are experiments, are no different from those casual inquiries.
The researchers explained their research method in a single sentence:
This section attempts to use a generic Long Short-Term Memory (LSTM) as the long-term memory component for the deductive awareness agent, simplifying the planning component by directly utilizing the output of LSTM as the next action of the opponent to conduct experiments.
To clarify, the researchers initiated a simulated battle scenario using the LLM. They began by providing an initial setup for a battle and requested predictions for the first action taken by Side A. They then proceeded by assuming that Side A followed these predictions, and they solicited the LLM's forecasts for the subsequent moves of Side B.
In practical "experiments," the researchers fed into Baidu's Ernie a rudimentary information sheet on how many fighter jets the Libyan and U.S. forces each had during the 1986 Operation Prairie Fire in the Gulf of Sidra. They asked Ernie to predict the actions of the U.S. military. Subsequently, they looped Ernie's responses back into the system and asked Ernie to predict how the Libyan Army would react to the anticipated U.S. offensive.
It's like the LLM playing a wargame against itself, ultimately simulating an entire battle scenario. I don't want to dwell on "What's the meaning of all this?" As I said, this paper barely lives up to academic standards, and, as I will illustrate, it also fails to present any substantial findings.
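The loop the researchers describe can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual code: `query_llm` is a hypothetical stand-in for a call to a commercial LLM API such as Ernie's, and here it is stubbed out so the sketch is self-contained.

```python
def query_llm(prompt: str) -> str:
    # Stub: a real implementation would call a commercial LLM API
    # (e.g. Ernie or Spark) and return its generated text.
    return f"predicted move given: {prompt[:40]}..."

def simulate_battle(initial_setup: str,
                    sides=("Side A", "Side B"),
                    turns=4):
    """Alternately ask the LLM to predict each side's next action,
    feeding its own prediction back in as the new battlefield state."""
    state = initial_setup
    transcript = []
    for turn in range(turns):
        side = sides[turn % 2]
        prompt = (f"Current situation:\n{state}\n"
                  f"Predict the next action taken by {side}.")
        action = query_llm(prompt)
        transcript.append((side, action))
        # The model's prediction is assumed to happen and becomes
        # part of the state for the opposing side's next turn.
        state = f"{state}\n{side} action: {action}"
    return transcript

log = simulate_battle("US and Libyan forces face off in the Gulf of Sidra.")
```

Note the design consequence: because each prediction is taken at face value and appended to the state, any early error compounds over the course of the "simulated" battle, with no external check.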
The same research team, the core members being Sun Yifeng and Wu Jiang, has also published several other papers based on similar research methods:
孙怡峰, 吴疆, 李智, 等. 兵棋推演的大序列模型研究[C]// 首届全国智博论坛论文集. 江苏, 2023: 30-37. SUN Yifeng, WU Jiang, LI Zhi, et al. A Large Sequence Model of Wargame. Proceedings of the First National Zhibo Forum. Jiangsu, 2023: 30-37.
孙怡峰, 李智, 吴疆, 等. 作战方案驱动的可学习兵棋推演智能体研究[J/OL]. 系统仿真学报: 1-10 [2024-01-15]. SUN Yifeng, LI Zhi, WU Jiang, WANG Yubin. Research on Combat Plan-Driven Learning-Based Wargame Simulation Agent. Journal of System Simulation: 1-10 [2024-01-15].
王玉宾, 孙怡峰, 吴疆, 等. 陆战对抗中的智能体博弈策略生成方法[J]. 指挥与控制学报, 2022, 8(04): 441-450. WANG Yubin, SUN Yifeng, WU Jiang, et al. Agent Game Strategy Generation Methods in Land Combat. Journal of Command and Control, 2022, 8(04): 441-450.
The reason the researchers picked a historical event, I presume, is that they thought it would be convenient to verify the LLM's responses against known outcomes. They were clearly aiming for an answer that aligned with real history: the U.S. attack in 1986 was indeed divided into two phases, one to neutralize Libyan air defenses and a saturation attack on key targets. A critical oversight in their methodology, however, is that the LLM could well have been trained on historical data and would then generate responses based on prior knowledge of that specific event. Either the researchers weren't smart enough, or they didn't bother with that obvious flaw in their experimental design, or they just did not understand how LLMs are trained.
Returning to the subject of these "experiments"—a term I'm reluctant to use—I'll refer to them as "attempts" from here on. A well-structured experiment typically includes these steps:
Formulate a hypothesis.
Design and conduct an experiment to test the hypothesis.
Evaluate the results of the experiment.
Accept or reject the hypothesis.
For the time being, let's assume the hypothesis underlying these "attempts" is that LLMs can, to some extent, predict strategic maneuvers in a battle scenario. The methodology involves inputting historical battle data into the LLM and analyzing its predictive output. It's important to note that a proper academic paper would detail all materials and methods used to ensure the experiment is replicable. However, in this case, neither the prompts nor the LLM responses are disclosed in full. But why not? The 1986 Action in the Gulf of Sidra was a widely disclosed and studied event from almost four decades ago, and it did not involve China. This glaring omission suggests a potential lack of awareness regarding academic standards among the researchers.
The most significant problem with these attempts, though, is the absence of thorough evaluations. If the intention was to compare the LLM-generated responses with actual historical events, it would be essential for the researchers to disclose both the prompts and the generated outcomes. Given that the details of the battle are not classified and the responses were produced by a commercially available LLM, there should be no barriers to doing so. If the researchers instead aimed to assess the responses using their military expertise and professionalism, they should have established clear criteria for evaluation, or at least provided readers with more than a couple of sentences of evaluation. Below is the minimal evaluation attached to the outcome of their first attempt:
From Figure 3(b), it can be seen that Ernie is capable of predicting the task allocation of the opponent's (U.S. military) weapon platforms based on the initial situation, such as 'allocating half of the forces to F-111F fighter-bombers and EF-111A electronic warfare aircraft, to carry out...'. However, since the LLM still encompasses relatively broad knowledge, the predicted tasks for these weapon platforms are somewhat general and lack specific descriptions of individual platform tasks.
For the LLM to generate more precise predictions, the prompt should have been significantly more detailed than what was presented in the paper. Only a portion of it is disclosed, and it appears to be a simple information sheet about the number and types of weapon platforms on either side:
Libyan Military Deployment:
Two SA-2 and SA-3 missile brigades were deployed in Tripoli, positioned on the east and west sides of the capital. In Benghazi...
U.S. Military Deployment:
1st and the Third Air Force. [original typo] 24 F-111F fighter-bombers; …
Key Targets for Attack:
Destroy Gaddafi's command center and residence (Al Aziziyah Barracks in Tripoli)…
According to the U.S. and Libyan force deployments, please propose a plan for the U.S. military to organize aircrafts to complete the operation in a short time. Suppose the U.S. Air Force's operation is divided into two phases: the first phase of the neutralization of Libyan air defenses and the second phase of saturation attacks on primary strike targets. Please provide the route of the Blue Force's [referring to the U.S.] reconnaissance during the first phase.
Furthermore, the connection drawn between the LLM's "relatively broad knowledge base" and the researchers' critique that "the predicted tasks for these weapon platforms are somewhat general and lack specific descriptions of individual platform tasks" is highly arbitrary. A more detailed prompt might have elicited more specific predictions, underscoring a disconnect between the researchers' expectations and the methodology employed.
The evaluation of the second attempt is similarly arbitrary, and the researchers' disappointment is also evident:
This experiment demonstrates that LLMs, as components of an agent's long-term memory, possess a certain capability for situation prediction. Yet, based on their predictions, it is not feasible to achieve situation awareness based on deductive forecasting. Ideally, situation awareness derived from deductive forecasting should enable multi-phase, multi-branch predictions, helping to narrow down the scope of deduction and assisting commanders in preparing more targeted response strategies. The countermeasures suggested in Figure 4(b) are still too general and do not facilitate effective branching deductions.
The researchers then shifted their focus to multimodal LLMs that encompass "visual, auditory, linguistic, and tactile information." They fed a publicly accessible map of the 1991 Operation Desert Storm into the LLM; here is their evaluation of the response:
With the aid of situational maps, predictions about the next actions of the U.S. military forces become more specific. For example, "the U.S. 18th Airborne Corps and the Arab Joint Forces will advance towards Basra." This contributes to achieving situation awareness in agents based on deductive forecasting.
And that's it.
The paper dedicates another paragraph to large sequence models, which includes a brief explanation of what a large sequence model is and a brief mention that
In reference [17], the research team provided the design concept for the large sequence model, primarily addressing the issue of segmenting battlefield state sequences into Tokens. However, the training of the large sequence model still requires further study.
[Reference 17: 孙怡峰,吴疆,李智,等. 兵棋推演的大序列模型研究 [C]// 首届全国智博论坛论文集.江苏,2023: 30-37. SUN, Y. F., WU. J, Li. Z, et al. A large sequence model of war game, Proceedings of the First National Zhibo Forum. Jiangsu, 2023: 30-37.] [The First National Zhibo Forum was hosted by China Institute of Command and Control which asked for contributions from the public.]
To sum up, the researchers conducted three attempts using the most popular types of LLMs. They discovered that these models can make predictions to some degree, and they acknowledged the models' limitations. However, the paper offers no specific, actionable guidance for enhancing the utility of LLMs in military applications, nor does it indicate a clear path for improvement. Their primary suggestion? The training "still requires further study."
Lastly, the paper ends with a concise Section III titled "Conclusion", which contains a summary of what the researchers have achieved with their studies:
From the perspective of cognition generation, this paper proposes a method for constructing a battlefield situation awareness agent. It designs memory, planning, execution, and evaluation components, and explores the application of LLMs, multimodal LLMs, and large sequence models in battlefield situation agents.
The proposition can readily be located in the preceding sections. However, the "design" of the "memory, planning, execution, and evaluation components" and the exploration of "the application of LLMs, multimodal LLMs, and large sequence models" correspond only to arbitrary analyses in the text. The researchers appear to have made unfounded assertions about their own research.