Recent research has uncovered that up to 46.6% of the Chinese tokens in the vocabularies of advanced large language models, such as GPT-4, are polluted terms, primarily related to pornography, gambling, and other inappropriate content. These tokens significantly degrade model behavior, cutting performance by roughly 50% on tasks that ask the model to understand or simply repeat them. The research team formally defined and categorized these polluted tokens, trained an automated model to identify them, and proposed a lightweight method for inferring training-data contamination directly from the vocabulary. The study also notes that a controlled amount of contaminated data may actually aid the model's safety alignment, underscoring the need for future work to balance alignment benefits against the harms of excessive contamination.
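To make the vocabulary-inspection step concrete, here is a minimal sketch of how one might scan a tokenizer's vocabulary for suspicious long Chinese tokens. It uses the publicly available tiktoken library with the cl100k_base encoding (the GPT-4 tokenizer); the seed keyword list and the length threshold are illustrative assumptions standing in for the paper's taxonomy and trained classifier, not the authors' actual method.

```python
# Sketch: flag candidate polluted Chinese tokens in a tokenizer vocabulary.
# The keyword list below is a tiny illustrative placeholder, NOT the
# study's classifier or category definitions.
import re
import tiktoken

# cl100k_base is the GPT-4 tokenizer exposed by tiktoken.
enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical seed keywords for the pornography/gambling categories.
SEED_KEYWORDS = ["色情", "赌博", "博彩"]

HAN = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs

suspicious = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        # Skip unused ids and tokens that are partial UTF-8 byte sequences.
        continue
    # Long, mostly-Chinese tokens are the population of interest; simple
    # substring matching stands in for an automated recognition model.
    if len(HAN.findall(text)) >= 3 and any(k in text for k in SEED_KEYWORDS):
        suspicious.append((token_id, text))

print(f"flagged {len(suspicious)} candidate tokens")
for token_id, text in suspicious[:20]:
    print(token_id, text)
```

Because such multi-character spam phrases only earn dedicated tokens when they appear very frequently in the tokenizer's training corpus, counting them offers a cheap proxy for estimating how contaminated that corpus was, which is the intuition behind the study's vocabulary-based contamination estimate.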