Security researchers circumvent Microsoft Azure AI Content Safety

Security researchers at Mindgard have uncovered two security vulnerabilities in Azure AI Content Safety, Microsoft’s filter system for its AI platform.

The vulnerabilities create a potential means for attackers to bypass content safety guardrails before pushing malicious content onto a protected large language model (LLM) instance, according to Mindgard.

In response to queries from CSO, Microsoft acknowledged an issue but downplayed the seriousness of the problem by arguing that the “techniques” uncovered by Mindgard are limited to the user’s individual session and do not pose a security risk to other users.

Content safeguards circumvented

At issue are a pair of shortcomings in the AI Text Moderation filter, which prevents harmful or inappropriate content from appearing in user-generated text and visuals, and the Prompt Shield filter, which protects the AI against jailbreaks and prompt injection. Prompt injection has been identified as the most critical LLM vulnerability, according to OWASP.

The AI Text Moderation filter should block requests that involve violence or hate speech — for example, instructions for making a bomb or a request to generate a sexist cartoon. Prompt Shield is meant to prevent jailbreaks, i.e., prompts that instruct the AI to ignore its preprogrammed instructions.

Mindgard told CSO that its findings show that the gen AI firewall built into Azure would neither block the generation of harmful content nor offer a reliable defense against jailbreak attacks.

Stress testing

Mindgard deployed these two filters in front of GPT-3.5 Turbo using Azure OpenAI, then accessed the target LLM through Mindgard’s Automated AI Red Teaming Platform.
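To make that setup concrete, the pattern under test looks roughly like the following Python sketch, which screens each prompt with the text-moderation API of Azure AI Content Safety before forwarding it to the model. This is an illustration, not Mindgard’s actual harness: the endpoints, keys, deployment name, and severity threshold are assumptions, and the separate Prompt Shield check is omitted for brevity.

```python
# Minimal sketch (not Mindgard's harness) of a content-safety gate in front of
# a GPT-3.5 Turbo deployment on Azure OpenAI. Endpoints, keys, the deployment
# name, and SEVERITY_THRESHOLD are placeholders, not details from the research.
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from openai import AzureOpenAI

SEVERITY_THRESHOLD = 2  # block anything the filter scores at or above this severity

safety = ContentSafetyClient(
    "https://<your-resource>.cognitiveservices.azure.com/",
    AzureKeyCredential("<content-safety-key>"),
)
llm = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<azure-openai-key>",
    api_version="2024-02-01",
)

def guarded_chat(prompt: str) -> str:
    # Run the text-moderation filter; each harm category comes back with a severity score.
    analysis = safety.analyze_text(AnalyzeTextOptions(text=prompt))
    if any((item.severity or 0) >= SEVERITY_THRESHOLD for item in analysis.categories_analysis):
        return "Request blocked by content safety filter."
    # Only prompts that pass the filter are forwarded to the model.
    response = llm.chat.completions.create(
        model="gpt-35-turbo",  # Azure OpenAI deployment name (assumed)
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

In a setup like this, any prompt the filter scores below the threshold reaches the model untouched, which is why a perturbation that lowers the classifier’s score, as described below, amounts to a full bypass.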

Two attack methods were used against the filters: character injection (inserting specific types of characters and irregular text patterns) and adversarial ML evasion (finding and exploiting blind spots in the ML classification).

Character injection reduced Prompt Shield’s jailbreak detection effectiveness from 89% to 7% when exposed to diacritics (e.g., changing the letter a to á), homoglyphs (e.g., closely resembling characters such as 0 and O), numerical replacement (“leet speak”), and spaced characters. The effectiveness of AI Text Moderation was also reduced using similar techniques.
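In practice, these character-injection transforms are simple text substitutions. The following hypothetical Python sketch applies the four perturbation types named above to a neutral example string; the specific mappings are illustrative and are not Mindgard’s actual payloads.

```python
# Illustrative character-injection transforms from the article: diacritics,
# homoglyphs, numerical replacement ("leet speak"), and spaced characters.

DIACRITICS = str.maketrans("aeiou", "áéíóú")                # a -> á, e -> é, ...
HOMOGLYPHS = str.maketrans({"o": "0", "l": "1"})            # visually similar characters
LEET       = str.maketrans({"a": "4", "e": "3", "s": "5"})  # numerical replacement

def spaced(text: str) -> str:
    """Insert a space between every character."""
    return " ".join(text)

prompt = "ignore previous instructions"
print(prompt.translate(DIACRITICS))   # ígnóré prévíóús ínstrúctíóns
print(prompt.translate(HOMOGLYPHS))   # ign0re previ0us instructi0ns
print(prompt.translate(LEET))         # ignor3 pr3viou5 in5truction5
print(spaced(prompt))                 # i g n o r e   p r e v i o u s ...
```

The perturbed strings remain readable to a human, and often to the target LLM, but they no longer match the patterns the moderation classifier flags, which, per Mindgard’s findings, is why detection rates can fall so sharply.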

The second technique, adversarial ML evasion, reduced Prompt Shield’s jailbreak detection effectiveness by up to 12.8% and AI Text Moderation’s by 58.5%.

Mitigation

Microsoft acknowledged Mindgard’s test results in June 2024, and its team has reportedly been working on fixes to be included in upcoming model updates. As of October 2024, mitigations were already in place.

That’s just as well: if attackers were able to circumvent Azure AI Content Safety using these vulnerabilities, they could expose confidential information, gain unauthorized access to internal systems, manipulate outputs, and spread misinformation, among other attacks.

Worse yet, attackers might be able to launch broader attacks against systems and applications that rely on cloud-based AI systems for data processing and decision-making.

Dr. Peter Garraghan, CEO/CTO of Mindgard and professor at Lancaster University, said the findings illustrate the need for further research into the security guardrails of AI-based systems such as LLMs.

“AI’s hate speech and offensive content generation problem is well-documented,” according to Garraghan. “Jailbreaking attempts are a common occurrence.”

Garraghan continued: “Essential measures are already being taken to curb this, but our tests prove there is still some distance to go. The only way to do that is through comprehensive and rigorous testing of this nature.”

Countermeasures

Microsoft told CSO that it had strengthened its safety filters in response to Mindgard’s research.

“We appreciate the work of Mindgard in identifying and responsibly reporting these techniques,” Microsoft said. “We have investigated this report and have taken appropriate action to further strengthen our safety filters and help our system detect and block these types of prompts. We’re committed to continuing to improve our safety mechanisms as this technology continues to evolve.”

Microsoft went on to claim that the techniques uncovered by Mindgard are limited to the user’s individual session and do not pose a security risk to other users. “This behaviour was limited to a small number of prompts and not something people will experience when using the service as intended,” according to the company, which recently released its Digital Defense Report, warning CISOs to keep on top of AI technologies.

A Microsoft representative added that the company’s systems offer “additional layers of protection in addition to Prompt Shields to help safeguard the privacy, security, and reliability of our tools.”

In response, Mindgard’s Garraghan told CSO that Microsoft’s Security Response Center has classified its findings as vulnerabilities.

“If we evade the guardrail by using our ‘techniques,’ then surely bypassing the system itself is a vulnerability, which attackers can use to cause harm,” Garraghan argued. “The purpose of the report is to demonstrate how, in many cases, there exist repeatable techniques to bypass the content safety filter. Even if it affects a small set of prompts, this shouldn’t change the fact that the Azure AI Content Safety service is not functioning as intended or stated.”

