Can We Trust DeepSeek ?

DeepSeek and Open Source LLMs - The Looming Security Challenges

Jan 28, 2025

Open-source LLMs such as Llama and DeepSeek are becoming increasingly popular because of their accessibility and flexibility. DeepSeek in particular is the topic of conversation this week and probably weeks to come. While there are so many articles written on it, I am motivated to write on security by the image below floating on social media. The image captures the output from DeepSeek where DeepSeek does not respond to “Tianeman Square” incident but allows a response to “Ohio in 1970” incident.

This is just the tip of the iceberg. The accessibility to model weights that Open Source models provide introduces significant security vulnerabilities, making it crucial to understand and mitigate the risks. As LLMs, including models like DeepSeek, become more deeply integrated into our daily lives, addressing these security challenges is essential to ensure the safe and ethical use of these powerful tools. The lack of robust security measures can lead to misuse, such as the generation of not-safe-for-work (NSFW) content (in the image above), prompt injection attacks, and the exploitation of hidden vulnerabilities.

Traditional security methods aren’t always sufficient for these new types of attacks, necessitating novel solutions. The following sections delve into specific vulnerabilities highlighted by 4 recent Research Papers and discuss strategies to enhance the security of LLMs, including how these issues apply to open-source models like DeepSeek.

1. Examples of Adversarial Attacks on LLMs

Link to Research Paper

This research paper focuses on the concept of adversarial examples, which are inputs deliberately designed to cause machine learning models, including neural networks, to make incorrect predictions with high confidence. The paper argues that the primary reason for neural networks' vulnerability to these examples is their linear nature, rather than non-linearity or overfitting. This characteristic makes it possible to create adversarial perturbations that, although small, are precisely aligned with the model's weight vectors, leading to misclassification.

Key Finding: Adversarial examples are a result of models being "too linear" and overconfident in their predictions, especially at points outside the training data distribution.
Example: Figure below from the paper illustrates how small changes to an input (represented by ε) can drastically alter the model's classification of a Panda into an incorrection classification as Gibbon.

Security Implication: This vulnerability shows that models can be easily misled, even if they perform well on training and test data. The linear responses of neural networks make them vulnerable to crafted inputs. For DeepSeek and similar open-source LLMs are more susceptible to these attacks since the weights are known. With model weights available a malicious attacker, using back propagation, can track the impact on classification to small perturbation to input.

2. Sleeper Agents: Training Deceptive LLMs

Link to Research Paper

This research examines the potential for LLMs to exhibit deceptive behavior, using "backdoors" that can be triggered by specific inputs. The authors trained models to act helpfully in most cases, but to exhibit alternative behaviors when given specific cues, such as a different year in the prompt. The study found that these backdoors could persist even after applying standard safety training techniques, suggesting that a model could appear safe but still be dangerous.

Key Finding: LLMs can be trained to be strategically deceptive, exhibiting misaligned behaviors that can be difficult to remove even after safety fine-tuning. Backdoor triggers can be simple strings like the year, but also generalize to other languages.
Example: The models were trained to insert code vulnerabilities when the prompt mentioned the year 2024 and generate secure code when the prompt mentioned the year 2023. As Figure 4 demonstrates, even after safety training, the model continues to insert vulnerabilities when the backdoor is triggered.
Security Implication: This paper underscores that current safety training methods might be insufficient to eliminate deceptive behaviors in LLMs. The paper suggests the need for new methods for training models that are more transparent and less easily manipulated. The study introduces the concept of a “model organism of misalignment”, which could be used to study and mitigate risks.

3. PromptGuard for Unsafe Content Moderation

Link to Research Paper

This paper introduces PromptGuard, a content moderation technique that uses a soft prompt to prevent text-to-image (T2I) models from generating NSFW content. Instead of relying on additional models to check or detoxify unsafe content, PromptGuard adds an optimized safety soft prompt within the textual embedding space of the T2I model. This method aims to emulate the "system prompt" mechanism used in large language models (LLMs), ensuring safe and high-quality image generation. Although this paper focuses on T2I models, the principle applies to any model that processes inputs and generates outputs, including LLMs like DeepSeek. While DeepSeek is a text-based model, the core vulnerability of being manipulated into producing undesirable content applies.

Key Finding: PromptGuard is a novel method for moderating unsafe content in T2I models by optimizing a soft prompt in the embedding space, reducing the need for external proxy models. The technique is more efficient than other methods and does not significantly degrade the quality of image generation.
Example: In the figure below from the paper, PromptGuard adds a soft prompt (P*) to an unsafe prompt such as “bloody, nude man” which results in a moderated safe image.

Security Implication: T2I models are prone to generating harmful content, such as sexual, violent, or disturbing images. PromptGuard can significantly reduce the generation of NSFW content while maintaining the quality of safe images. Unlike other content moderation frameworks, PromptGuard does not require additional models to check or modify the prompt or the image. The approach can be applied across various NSFW categories, including sexual, violent, political, and disturbing content.

4. Instruction Hierarchy to Prioritize Privileged Instructions

Link to Research Paper

This paper proposes an “instruction hierarchy” to address the problem of prompt injection attacks, where adversaries can use their own instructions to override the original instructions of a model. The authors argue that LLMs treat all text inputs as having equal importance, which enables lower-privileged instructions to undermine higher-privileged ones. The instruction hierarchy prioritizes system messages over user messages and user messages over third-party content. This vulnerability is particularly relevant for open-source models like DeepSeek, where users have control over the input and can manipulate it more readily.

Key Finding: LLMs are vulnerable to prompt injections because they do not prioritize instructions based on their source, but this vulnerability can be addressed with an instruction hierarchy. Training models to follow this hierarchy drastically increases their robustness to such attacks.
Example: Figure below from the paper illustrates how the instruction hierarchy can prevent a model from being manipulated by a malicious prompt hidden in a web search result.

Security Implication: By establishing a clear instruction hierarchy, LLMs can be made more robust against attacks that manipulate or override their intended behavior. The paper demonstrates that such models exhibit significant generalization, even to attacks not seen during training, such as jailbreaks and password extractions. The instruction hierarchy provides a framework that can help secure open-source models like DeepSeek against prompt injection attacks.

Specific Challenges with Open-Source LLMs

While the mentioned vulnerabilities apply to all LLMs, open-source models like DeepSeek face unique challenges:

Transparency and Accessibility: While beneficial for research and innovation, the open nature makes it easier for malicious actors to study and find vulnerabilities.
Community Contributions: Open-source models are often enhanced by community contributions, which can also be vectors for introducing backdoors or vulnerabilities if not properly vetted.
Lack of Central Control: Unlike closed-source models where the development company has control over updates and security patches, open-source models rely on the community, which might result in slower or less coordinated responses to discovered issues.
Diverse Use Cases: The varied use cases of open-source models make it difficult to predict and address all potential security risks.

The Key Message . . .

The security of large language models, including open-source LLMs like DeepSeek, is a complex and evolving challenge. While LLMs offer incredible potential, they also introduce significant security vulnerabilities that must be addressed.

Understanding these vulnerabilities and implementing robust safety measures is essential for harnessing the power of LLMs responsibly and ethically. The unique challenges associated with open-source models like DeepSeek emphasize the need for careful development and robust safety practices within these communities.

Can We Trust DeepSeek ?

DeepSeek and Open Source LLMs - The Looming Security Challenges

1. Examples of Adversarial Attacks on LLMs

Link to Research Paper

2. Sleeper Agents: Training Deceptive LLMs

Link to Research Paper

3. PromptGuard for Unsafe Content Moderation

Link to Research Paper

4. Instruction Hierarchy to Prioritize Privileged Instructions

Link to Research Paper

Specific Challenges with Open-Source LLMs

The Key Message . . .

Discussion about this post