Can Watermarks Survive Translation?

On the Cross-lingual Consistency of Text Watermark for Large Language Models

1Shanghai Jiao Tong University, 2Tencent AI Lab, 3Tsinghua University

Abstract

Text watermarking technology aims to tag and identify content produced by large language models (LLMs) to prevent misuse. However, can watermarks survive translation?

We complete a closed-loop study on this question:
(1) Evaluation: We define the concept of cross-lingual consistency in text watermarking, which assesses whether a text watermark remains effective after the watermarked text is translated into another language. Unfortunately, we find that current watermarking methods struggle to maintain their watermark strength across languages.

(2) Attack: We propose the cross-lingual watermark removal attack (CWRA). By wrapping the query to the LLM in a pivot language, it efficiently removes watermarks (AUC: 0.95 -> 0.67) without degrading text quality.

(3) Defense: We analyze two key factors that contribute to the cross-lingual consistency in text watermarking and propose a potential defense method. Despite its limitations, it increases the AUC from 0.67 to 0.88 under CWRA.

Evaluation:
Cross-lingual Consistency of Watermark


Fig 1: Watermark strength score (detection score) after translation.


We test three text watermarking methods: KGW [1], UW [2], and SIR [3]. As Fig 1 shows, watermark strength drops significantly after translation; in other words, current watermarking methods lack cross-lingual consistency.
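
To make the evaluation concrete, the consistency check can be sketched as follows. Here, `detect` and `translate` are assumed placeholders for a watermark detector and a machine translation system; neither name comes from the paper's codebase:

def strength_drop(watermarked_text: str, tgt_lang: str, detect, translate) -> float:
    """Watermark strength before vs. after translating the text.

    detect(text) -> float             watermark strength score (e.g., z-score for KGW)
    translate(text, tgt_lang) -> str  machine translation (assumed placeholder)
    """
    before = detect(watermarked_text)
    after = detect(translate(watermarked_text, tgt_lang))
    return before - after  # near zero for a cross-lingually consistent watermark

A large positive drop, as observed in Fig 1, indicates that the watermark does not survive translation.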

Attack:
Cross-lingual Watermark Removal Attack (CWRA)

Method


Fig 2: The pipeline of CWRA. We use Chinese as the pivot language in this example.


An attacker who wants to remove the watermark from an LLM-generated response typically does not want to change the language of that response. With this constraint in mind, we introduce the Cross-lingual Watermark Removal Attack (CWRA), shown in Fig 2 and sketched below. It wraps the query to the LLM in a pivot language and then translates the response back into the original language.
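
The whole attack is three calls. In this sketch, `query_llm` and `translate` are assumed placeholders for the watermarked LLM and a machine translation system; neither name comes from the paper:

def cwra(query: str, query_llm, translate,
         pivot_lang: str = "zh", orig_lang: str = "en") -> str:
    """Cross-lingual Watermark Removal Attack (sketch).

    query_llm(prompt) -> str          watermarked LLM (assumed placeholder)
    translate(text, src, tgt) -> str  machine translation (assumed placeholder)
    """
    pivot_query = translate(query, src=orig_lang, tgt=pivot_lang)    # 1. wrap the query in the pivot language
    pivot_response = query_llm(pivot_query)                          # 2. watermark is embedded in pivot-language text
    return translate(pivot_response, src=pivot_lang, tgt=orig_lang)  # 3. translate back to the original language

Because the LLM only ever sees and produces pivot-language text, the final translation step acts as a strong paraphrase of the watermarked tokens.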

Experiment


Fig 3: ROC curves of the watermark detection under different attack methods.



Fig 4: Text quality under different attack methods.


Fig 3 and Fig 4 show the ROC curves of watermark detection and the text quality under different attack methods, respectively. Among all compared methods, CWRA stands out: it achieves the lowest detection AUC and the highest text quality.
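
For reference, the detection AUC in Fig 3 can be computed from watermark strength scores with scikit-learn's roc_auc_score; the scores below are made-up illustrations, not numbers from the paper:

from sklearn.metrics import roc_auc_score

# Watermark strength scores from the detector; values are illustrative only.
human_scores = [0.2, 0.5, 0.1, 0.4]     # label 0: human-written text
attacked_scores = [0.6, 0.3, 0.7, 0.5]  # label 1: watermarked text after an attack

labels = [0] * len(human_scores) + [1] * len(attacked_scores)
auc = roc_auc_score(labels, human_scores + attacked_scores)
print(auc)  # an AUC near 0.5 means the detector can no longer separate the two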

Defense:
Improving Cross-lingual Consistency


Fig 5: ROC curves of X-SIR and SIR under CWRA and no attack.


Fig 6: Text quality of X-SIR and SIR.


We first analyze two key factors essential for achieving cross-lingual consistency (detailed in the paper). By addressing them, our defense method, X-SIR, substantially improves both the AUC and the text quality under CWRA (Fig 5 and Fig 6). X-SIR still has limitations, however, and we call for further research on cross-lingual consistency in text watermarking.
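
The intuition behind one of these factors can be shown in a small sketch. This is a conceptual illustration only, not the actual X-SIR implementation: the watermark bias is attached to semantic clusters that group a token together with its translations (e.g., clusters built from a bilingual dictionary), rather than to individual token IDs:

def watermark_bias(token_id: int,
                   token_to_cluster: dict[int, int],
                   cluster_bias: dict[int, float]) -> float:
    """Look up the logit bias via the token's semantic cluster (conceptual sketch)."""
    cluster = token_to_cluster[token_id]  # a word and its translations share one cluster
    return cluster_bias[cluster]          # so the same bias applies in every language

Under such a scheme, a translated token tends to land inside the same cluster as the original, so the bias, and hence the watermark signal, is better preserved across languages.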

BibTeX

@article{he2024can,
  title={Can Watermarks Survive Translation? On the Cross-lingual Consistency of Text Watermark for Large Language Models},
  author={He, Zhiwei and Zhou, Binglin and Hao, Hongkun and Liu, Aiwei and Wang, Xing and Tu, Zhaopeng and Zhang, Zhuosheng and Wang, Rui},
  journal={arXiv preprint arXiv:2402.14007},
  year={2024}
}

Reference

[1] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. ICML 2023.

[2] Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, Heng Huang. Unbiased Watermark for Large Language Models. arXiv 2023.

[3] Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, Lijie Wen. A Semantic Invariant Robust Watermark for Large Language Models. ICLR 2024.