首页 hanlp文本解析
文章
取消

hanlp文本解析

在有的业务中,我们常常需要接受用户录入的一串文本,解析成程序需要的数据模型,用于传输或存储。例如快递收货地址填写、订单录入,甚至是大数据个性化推荐等等。

hanlp 是国内一个比较流行的自然语言处理工具,本文以词性分析为例进行实战。词性分析能够帮助开发者从用户输入的文本,包括语音转换的文本,提取关键信息,从而获得结构化的数据。

我们知道一个完整清晰的句子是由不同词汇组合成的,如 “我非常喜欢 Python,它很酷”,这里面就包含了代词“我”、副词“非常”、动词“喜欢”,名词“Python”,标点符号“,”、代词“它”、副词“很”以及形容词“酷”。

hanlp 支持很多机器学习模型来对句子进行划分和词性分析,我们现在使用 SIGHAN2005_PKU_BERT_BASE_ZH 来进行分词处理,PKU_POS_ELECTRA_SMALL 来进行词性标注。

hanlp 用法很简单,先用核心功能定义一个工具类:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
class TextParser:
    def __init__(self):
        # 分词
        self.tok = hanlp.load("SIGHAN2005_PKU_BERT_BASE_ZH", output_key="tok")
        # 词性标注
        self.pos = hanlp.load(
            "PKU_POS_ELECTRA_SMALL", output_key="pos", input_key="tok"
        )
        print("inited TextParser")


    def parse(self, text: str):
        """
        解析并分词处理
        :param text: 待解析文本
        :return:分词与词性
        """
        # 将文本拆成词汇
        tok_res = self.tok(text)
        
        # 分词词汇的词性
        pos_res = self.pos(tok_res)

        result = {"tok": tok_res, "pos": pos_res}
        return result

由于机器学习模型文件较大,加载到内存后占用空间较多,为了节约内存,我们定义一个装饰器实现单例模式:

1
2
3
4
5
6
7
8
9
def singleton(cls):
    instances = {}

    def _wrapper(*args, **kwargs):
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return _wrapper

由于语言很灵活,词汇不停地被创造出来,机器学习模型无法记录到所有领域和最新的词汇,我们需要根据需要自定义新词汇来满足业务的需要,比如“芭比Q”、“喜茶”、“茶百道”、“给力”等等。

关于词性请查看文末参考链接。

词性约束:

annotations = {
    "Ag",
    "a",
    "ad",
    "an",
    "Bg",
    "b",
    "c",
    "Dg",
    "d",
    "e",
    "f",
    "h",
    "i",
    "j",
    "k",
    "l",
    "Mg",
    "m",
    "Ng",
    "n",
    "nr",
    "ns",
    "nt",
    "nx",
    "nz",
    "o",
    "p",
    "q",
    "Rg",
    "r",
    "s",
    "Tg",
    "t",
    "u",
    "Vg",
    "v",
    "vd",
    "vn",
    "w",
    "x",
    "Yg",
    "y",
    "z",
}

我们给类添加个方法:

def set_tok(self, words: dict):
    """
    设置自定义词汇
    :param words:{词汇:词性}形式,如{"白菜":"n"}
    :return:
    """
    w = set()
    for word, anno in words.items():
        if anno not in annotations:
            raise Exception(f"unrecognized annotation in words:{anno}")
        w.add(word)
    if words:
        self.tok.dict_combine = w
        self.pos.dict_tags = words

为了避免设置自定义词汇后,其它线程调用时产生词汇污染情况,我们还需要清空自定义词汇:

1
2
3
4
5
6
7
def clear_tok(self):
    """
    清空自定义词汇
    :return:
    """
    self.tok.dict_combine = None
    self.pos.dict_tags = {}

hanlp 有个缺点,解析过程不是线程安全的,因此我们加个线程锁:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
class ThreadLock:
    """
    线程锁
    """

    def __init__(self, meta_info):
        """
        初始化
        :param meta_info:标记线程锁的元信息
        """
        self._lock = threading.Lock()
        self._meta_info = meta_info

    def __enter__(self):
        self._lock.acquire(blocking=False)

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._lock.release()
        if exc_type is None:
            print(f"lock:{self._meta_info},released")
        else:
            # 发生了异常
            print(f"lock:{self._meta_info}, released with err:{exc_val}")

最后重新整合下工具类:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
import threading
import hanlp

annotations = {
    "Ag",
    "a",
    "ad",
    "an",
    "Bg",
    "b",
    "c",
    "Dg",
    "d",
    "e",
    "f",
    "h",
    "i",
    "j",
    "k",
    "l",
    "Mg",
    "m",
    "Ng",
    "n",
    "nr",
    "ns",
    "nt",
    "nx",
    "nz",
    "o",
    "p",
    "q",
    "Rg",
    "r",
    "s",
    "Tg",
    "t",
    "u",
    "Vg",
    "v",
    "vd",
    "vn",
    "w",
    "x",
    "Yg",
    "y",
    "z",
}


def singleton(cls):
    instances = {}

    def _wrapper(*args, **kwargs):
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return _wrapper


class ThreadLock:
    """
    线程锁
    """

    def __init__(self, meta_info):
        """
        初始化
        :param meta_info:标记线程锁的元信息
        """
        self._lock = threading.Lock()
        self._meta_info = meta_info

    def __enter__(self):
        self._lock.acquire(blocking=False)

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._lock.release()
        if exc_type is None:
            print(f"lock:{self._meta_info},released")
        else:
            # 发生了异常
            print(f"lock:{self._meta_info}, released with err:{exc_val}")


@singleton
class TextParser:
    def __init__(self):
        # 分词
        self.tok = hanlp.load("SIGHAN2005_PKU_BERT_BASE_ZH", output_key="tok")
        # 词性标注
        self.pos = hanlp.load(
            "PKU_POS_ELECTRA_SMALL", output_key="pos", input_key="tok"
        )
        print("inited TextParser")

    def set_tok(self, words: dict):
        """
        设置自定义词汇
        :param words:{词汇:词性}形式,如{"白菜":"n"}
        :return:
        """
        w = set()
        for word, anno in words.items():
            if anno not in annotations:
                raise Exception(f"unrecognized annotation in words:{anno}")
            w.add(word)
        if words:
            self.tok.dict_combine = w
            self.pos.dict_tags = words

    def clear_tok(self):
        """
        清空自定义词汇
        :return:
        """
        self.tok.dict_combine = None
        self.pos.dict_tags = {}

    def parse(self, work_id: str, text: str, func_tok: callable = None) -> dict:
        """
        解析并分词处理
        :param work_id:调用者任务 ID,多线程并发情况下的锁标识
        :param text: 待解析文本
        :param func_tok: 创建自定义词汇与词性的函数,需返回 {词汇:词性} 形式的数据
        :return:分词与词性
        """
        if func_tok:
            if hasattr(func_tok, "__call__"):
                # 获取并设置自定义词汇
                words = func_tok()
                self.set_tok(words)
            else:
                raise Exception("func_tok is not callable")

        # 加线程锁,hanlp 非线程安全
        with ThreadLock(work_id):
            # 将文本拆成词汇
            tok_res = self.tok(text)

            # 分词词汇的词性
            pos_res = self.pos(tok_res)

            result = {"tok": tok_res, "pos": pos_res}

            # 清空自定义词汇
            self.clear_tok()

            return result

测试:

p1 = TextParser()
r1 = p1.parse("t1", "我想吃红色大萝卜")
print("r1,", r1)

p2 = TextParser()
r2 = p2.parse("t2", "我想吃红色大萝卜", lambda: {"红色大萝卜": "n"})
print("r2,", r2)
r3 = p2.parse("t2", "我非常不喜欢红色大萝卜")
print("r3,", r3)

输出:

1
2
3
4
5
6
7
inited TextParser
lock:t1,released
r1, {'tok': ['我', '想', '吃', '红色', '大', '萝卜'], 'pos': ['r', 'v', 'v', 'n', 'a', 'n']}
lock:t2,released
r2, {'tok': ['我', '想', '吃', '红色大萝卜'], 'pos': ['r', 'v', 'v', 'n']}
lock:t2,released
r3, {'tok': ['我', '非常', '不', '喜欢', '红色', '大', '萝卜'], 'pos': ['r', 'd', 'd', 'v', 'n', 'a', 'n']}

参考链接:https://hanlp.hankcs.com/docs/annotations/pos/pku.html

本文由作者按照 CC BY 4.0 进行授权