
Text Clustering With DeepSeek Reasoning

By 李睿
2025-03-31 08:28

Translator | 李睿

Reviewer | 重楼

Developers need to build and understand a new approach to text clustering, and use the DeepSeek reasoning model to explain the clustering results.

This article explores reasoning in large language models (LLMs) and introduces DeepSeek, an excellent tool for explaining inference results and building machine learning systems that end users can trust more readily.

By default, a machine learning model is a black box that offers no out-of-the-box explanation (XAI) for its decisions. This article shows how to use the DeepSeek model to bring explanation, or reasoning, capabilities into the machine learning world.

Method

First, custom embeddings and an embedding function are built to create a vector data store, and the DeepSeek model is then used to perform the reasoning.

The simple flowchart below shows the overall process.

[Figure: flowchart of the text clustering pipeline with DeepSeek reasoning]

Data

(1) A news article dataset is used to identify the category of each article. The dataset can be downloaded from Kaggle.

(2) From the dataset, the short_description field is used for the vector embeddings, and the category feature assigns the appropriate label to each article.

(3) The dataset is fairly clean and requires no preprocessing.

(4) Load the dataset with pandas and split it into training and test sets with scikit-learn.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_json('./News_Category_Dataset_v3.json', lines=True)

# Separate features (X) and target (y)
X = df.drop('category', axis=1)
y = df['category']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

Generating Text Embeddings

The following libraries are used for the text embeddings:

  • langchain: creates the example prompts and the semantic similarity selector.
  • langchain_chroma: creates the embeddings and stores them in a data store.

# json, requests, and yaml are required by the custom embedding class below.
import json

import requests
import yaml

from chromadb import Documents, EmbeddingFunction, Embeddings
from langchain_chroma import Chroma
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

Next, a custom embeddings class and embedding function are built. These allow querying a model deployed on either a local or a remote instance.

For a model deployed on a remote instance, readers can incorporate the necessary security mechanisms (HTTPS, data encryption, and so on) and call a REST endpoint to retrieve the model embeddings; a hedged sketch of such a hardened request follows the class definitions below.

class MyEmbeddings(Embeddings):

    def __init__(self):
        # Server address and port (replace with your actual values)
        self.url = ""
        # Request headers
        self.headers = {
            "Content-Type": "application/json"
        }

        self.data = {
            # Use any text embedding model of your choice
            "model": "text-embedding-nomic-embed-text-v1.5",
            "input": None,
            "encoding_format": "float"
        }

    def embed_documents(self, texts):
        # Embed each text individually by calling the REST endpoint.
        embeddings = []
        for text in texts:
            embeddings.append(self.embed_query(text))
        return embeddings

    def embed_query(self, input):
        self.data['input'] = input
        with requests.post(self.url, headers=self.headers, data=json.dumps(self.data)) as response:
            res = response.text
            yaml_object = yaml.safe_load(res)
            embeddings = yaml_object['data'][0]['embedding']
        return embeddings


class MyEmbeddingFunction(EmbeddingFunction):

    def __call__(self, input: Documents) -> Embeddings:
        # Return the list of embeddings for the input documents.
        return MyEmbeddings().embed_documents(input)
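
As an illustration of that hardening, here is a minimal sketch of a secured request helper. It assumes an HTTPS endpoint and a bearer token read from an environment variable; the function name, URL shape, and EMBEDDINGS_API_KEY variable are all hypothetical, not part of the original setup.

import os

import requests


def post_embeddings_securely(url, payload):
    # Hypothetical hardened call to a remote embedding endpoint: HTTPS plus
    # a bearer token pulled from the environment (EMBEDDINGS_API_KEY is an
    # assumed variable name).
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['EMBEDDINGS_API_KEY']}",
    }
    # requests validates TLS certificates by default (verify=True).
    return requests.post(url, json=payload, headers=headers, timeout=30)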

Next, a simple function is defined that creates a semantic similarity selector for the news articles. The selector is used to create the vector embeddings from the training dataset.

def create_semantic_similarity_selector(train_df):

    example_prompt = PromptTemplate(
        input_variables=["input", "output"],
        template="Input: {input}\nOutput: {output}",
    )

    # Build (short_description, category) examples from the training articles.
    examples = []
    for _, row in train_df.iterrows():
        example = {}
        example['input'] = row['short_description']
        example['output'] = row['category']
        examples.append(example)

    semantic_similarity_selector = SemanticSimilarityExampleSelector.from_examples(
        # The list of examples available to select from.
        examples,
        # The embedding class used to produce embeddings which are used to measure semantic similarity.
        MyEmbeddings(),
        # The VectorStore class that is used to store the embeddings and do a similarity search over.
        Chroma,
        # The number of examples to produce.
        k=1,
    )

    return semantic_similarity_selector

Call the function above to generate the embeddings for the news articles. Note that this training step can be time-consuming; it can be parallelized to run faster, as sketched after the call below.

semantic_similarity_selector = create_semantic_similarity_selector(train_df)
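
One possible way to parallelize the embedding step (an assumption; the original article does not show this) is to issue the per-text REST calls from a thread pool, since each embedding is an independent HTTP request. The subclass below also builds a fresh payload per call so that concurrent threads never share the mutable self.data dictionary.

from concurrent.futures import ThreadPoolExecutor


class MyParallelEmbeddings(MyEmbeddings):

    def _embed_one(self, text):
        # Build a per-call payload so concurrent threads never mutate the
        # shared self.data dictionary used by the base class.
        payload = dict(self.data, input=text)
        response = requests.post(self.url, headers=self.headers, data=json.dumps(payload))
        return yaml.safe_load(response.text)['data'][0]['embedding']

    def embed_documents(self, texts):
        # Each embedding is an independent HTTP request, so the calls can
        # run in parallel threads; max_workers is a tuning knob.
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(self._embed_one, texts))

Passing MyParallelEmbeddings() in place of MyEmbeddings() inside create_semantic_similarity_selector would then parallelize the vector store build, provided the embedding endpoint tolerates concurrent requests.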

The Chroma vector data store holds the vector representations of the news articles and their associated labels. The embeddings in the data store are then used to perform a semantic similarity search against the articles in the test dataset and to check the accuracy of the approach (a minimal accuracy check is sketched after the evaluation loop below).

Next, the DeepSeek REST endpoint is called, passing along the response received from the semantic similarity selector and the actual answer from the test dataset. A context is created that contains the information the DeepSeek model needs to perform its reasoning.

def explain_model_result(text, model_answer, actual_answer):
    # REST end point for the DeepSeek model.
    url = ""

    # Request headers
    headers = {
        "Content-Type": "application/json"
    }

    promptJson = {
        "question": 'Using the text, can you explain why the model answer and actual answer match or do not match ?',
        "model_answer": model_answer,
        "actual_answer": actual_answer,
        "context": text,
    }
    prompt = json.dumps(promptJson)

    # Request data (replace with your prompt)
    data = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "stream": True
    }
    captured_explanation = ""
    with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as response:
        if response.status_code == 200:
            for chunk in response.iter_content(chunk_size=None):
                if chunk:
                    # Attempt to decode the chunk as UTF-8
                    decoded_chunk = chunk.decode('utf-8')
                    # Parse the chunk and append the streamed content to
                    # the captured_explanation string.
                    captured_explanation += yaml.safe_load(decoded_chunk)['data']['choices'][0]['delta']['content']
        else:
            print(f"Request failed with status code {response.status_code}")

    return captured_explanation

The following code iterates over the test dataset and obtains an explanation from the DeepSeek model for each article.

results_df = pd.DataFrame()
results_df['input'] = None
results_df['model_answer'] = None
results_df['actual_answer'] = None
results_df['explanation'] = None

for _, row in test_df.iterrows():
    example = {}
    example['input'] = row['short_description']
    model_result_category = semantic_similarity_selector.select_examples(example)
    example['explanation'] = explain_model_result(example['input'], model_result_category[0]['output'], row['category'])
    example['model_answer'] = model_result_category[0]['output']
    # Record the labeled category from the test set.
    example['actual_answer'] = row['category']
    results_df.loc[len(results_df)] = example
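
The accuracy check mentioned earlier can then be read directly off results_df. A minimal sketch (not shown in the original article):

# Fraction of test articles whose nearest-neighbor category matches the label.
accuracy = (results_df['model_answer'] == results_df['actual_answer']).mean()
print(f"Semantic-similarity accuracy: {accuracy:.2%}")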

Results

The results for a few scenarios from the test dataset are shared below.

The first example is one where the answer produced by semantic similarity does not match the actual answer in the test dataset. This kind of insight shows whether the model was reasonable in making its prediction. The think tag exposes the DeepSeek model's initial thought process as it works through the problem statement before composing the final answer for the end user.

In this first example, DeepSeek can explain why the model got the wrong answer: the model considered only the travel (TRAVEL) aspect of the sentence and missed the entertainment (ENTERTAINMENT) aspect. This suggests the embedding model needs more entertainment examples for training.

<think>

Okay, so I'm trying to figure out why the model answer is "TRAVEL" and the actual answer is "ENTERTAINMENT." The context given is about going to Central Park and the Delacorte Theater to see Shakespeare under the night sky, with rain expected.

First, I need to understand what each term refers to. TRAVEL relates to visiting places, moving around, or tourism in general. ENTERTAINMENT is more about enjoyable activities, like theater, concerts, movies, etc., that provide pleasure.

Looking at the context: the person went to Central Park, which is a place they traveled to. Then they went to the Delacorte Theater for a Shakespeare performance. So, the main activities are both TRAVEL and ENTERTAINMENT because they're going out, seeing a play.

But why would there be a mismatch between model and actual answers? Maybe the model didn't consider that travel can also include attending theater or other performances as part of the entertainment aspect. So the actual answer is broader, including both aspects, while the model focuses only on TRAVEL.

Alternatively, perhaps the context doesn't emphasize one over the other. The key points are moving to a place (TRAVEL) and attending an event for enjoyment (ENTERTAINMENT). Depending on the focus, either could be considered the main answer, but typically, when someone goes somewhere to attend an event, both travel and entertainment elements come into play.

So, the model answer might have only picked TRAVEL, not considering ENTERTAINMENT. The actual answer includes both aspects, hence the mismatch.

</think>

The discrepancy between the model answer "TRAVEL" and the actual answer "ENTERTAINMENT" arises from the context provided. While travel is indeed part of the narrative—visiting Central Park—the primary focus is on attending a theater performance, which falls under entertainment. The actual answer captures both elements: moving to a place (travel) and engaging in an enjoyable event (entertainment). Thus, the model may have overlooked the entertainment aspect, considering only travel, whereas the actual response encompasses both activities.

**Answer:**

The model answer "TRAVEL" focuses on visiting places, while the actual answer "ENTERTAINMENT" includes attending a theater performance. The context highlights both elements, leading to the mismatch in answers.

Next, consider a scenario where the model's answer and the actual answer do not match, yet the model's prediction is still reasonable.

In the second example, an armed robbery that took place in France is a crime (CRIME), so a model that labels the article as crime is not necessarily wrong. DeepSeek's reasoning can provide a sound justification for the result.

<think>

Okay, so I'm trying to figure out why the model answer is "CRIME" and the actual answer is "WORLD NEWS" given the context: "French police recovered the merchandise stolen in an armed robbery on Wednesday."

First, I need to understand what each part of this question is asking. The user wants me to explain why these two answers match or don't match based on the provided text. So, maybe I should start by analyzing the context and seeing how it relates to both CRIME and WORLD NEWS.

Looking at the context: It's about French police recovering stolen merchandise from an armed robbery. An armed robbery is definitely a crime, so that makes sense why "CRIME" is an answer. The model answer being CRIME aligns with the event described because it's a criminal act—a robbery.

Now, why is WORLD NEWS also an answer? Well, if I think about news in general, events like this are newsworthy. So perhaps the actual answer is considering that this incident is part of world news because it involves police action and theft, which can be significant enough to be reported internationally.

But wait, does the context specify whether it's considered world news or not? It just mentions French police recovering stolen goods in an armed robbery on Wednesday. Without more details about the scale or international implications, I'm not sure if it's explicitly world news. Maybe WORLD NEWS is a broader category that includes such events as part of their reporting.

So, why do they match and why don't they? They both are relevant because CRIME is directly related to the event, while WORLD NEWS is considering the newsworthiness of the event. But perhaps the model answer is more focused on the specific event being a crime, whereas WORLD NEWS is taking a broader view of news events.

I'm also thinking about how each answer approaches the context differently. The model answer is zeroing in on the type of event (crime), while the actual answer is considering it as a newsworthy event that might be reported in world news sections or channels.

So, to sum up, both answers are relevant but approach the context from different angles—Crime specifically relates to the event's nature, whereas WORLD NEWS broadens it into the realm of news reporting.

</think>

The model answer "CRIME" and the actual answer "WORLD NEWS" both address the provided context, but they do so from different perspectives.

**CRIME**: The context explicitly describes an armed robbery, which is a criminal act. Therefore, "CRIME" directly relates to the nature of the event.

**WORLD NEWS**: While the event is indeed newsworthy, the term "WORLD NEWS" broadens the scope beyond the specific details, considering it as part of international or general news reporting.

In essence, both answers are relevant but focus on different aspects: "CRIME" highlights the event's nature as a criminal act, while "WORLD NEWS" recognizes its potential newsworthiness.

Finally, consider the reasoning when the model answer and the actual answer match.

The reasoning behind a matching result strengthens trust and builds confidence in the model's predictions.

<think>

Okay, so I'm trying to understand why the model answer and the actual answer both say "MEDIA" when explaining the context. The question is asking me to explain whether they match or not based on the given text.

First, let me look at the context provided: "The newsroom is awash with fears about the casino mogul's intentions after his refusal to admit ownership." This sentence mentions a casino mogul and their refusal to own up to something. That refusal is causing fear in the newsroom regarding their intentions.

Now, looking at both the model answer and the actual answer, they both are "MEDIA". I think this might be an abbreviation or shorthand for media-related reasons. The context talks about the newsroom, which is a place associated with media organizations or journalists. So, when there's fear about someone's intentions, especially related to ownership, it makes sense that the media would be concerned and involved.

Wait, but why specifically does it say "MEDIA"? Maybe it's short for something else or maybe it's just a coincidence. However, given that the context is in a newsroom setting, it's likely referring to media-related issues.

Let me break it down: The casino mogul refusing ownership could imply some kind of hidden agenda or control. Since this is in a newsroom, which deals with reporting and information, the refusal might be seen as trying to hide something. This fear could lead to media coverage or reporting on it, hence the mention of MEDIA.

So, putting it together, both answers refer to the involvement or concern of media outlets regarding the casino mogul's actions. Therefore, they match because the context clearly ties into a media-related situation.

</think>

The model answer and actual answer both state "MEDIA" because the context provided refers to a newsroom setting, where concerns about a casino mogul's intentions are likely related to media involvement or reporting. The refusal to admit ownership sparks fear, suggesting possible media coverage or concerns, thus aligning both answers with the media-related context.

**Answer:** Both answers match because the context involves a newsroom, indicating media-related concerns over a casino mogul's intentions.

Conclusion

The power of reasoning gives users valuable insight into how a model functions and highlights its strengths and weaknesses. It deepens understanding of the system and offers a path toward improving it.

Imagine a model that only delivers inference decisions without any explanation: users would never understand why the model makes those decisions and might struggle to build confidence in it.

Original title: Text Clustering With Deepseek Reasoning, by Kalpan Dharamshi
