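Text on Twitter and Weibo follows no strict format and is highly free-form. Posts often contain URLs, HTML tags, topic tags, and @mentions, all of which distort word segmentation and word-frequency statistics, so this noise should be removed before any text analysis. A web search did not turn up a fully satisfactory solution, so this post collects what the search did find into a set of functions that can be called directly the next time a similar task comes up. Real-world data is messy, so the approaches here cannot cover every case and will need adjusting to the data at hand.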

Removing URLs

import re

def clean_url(text):
    # Split on spaces so that each token is cleaned independently
    sentences = text.split(' ')
    # Links with a scheme, e.g. http://... or https://...
    url_pattern = re.compile(r'(https?)?://[\w./?=&%\-]*\b')
    # Links without a scheme, e.g. example.com or example.cn
    domain_pattern = re.compile(r'.*?\.(com|cn)')
    result = []
    for item in sentences:
        item = re.sub(url_pattern, '', item)
        item = re.sub(domain_pattern, '', item)
        result.append(item)
    # str.split always returns at least one item, so no empty-list branch is needed
    return ' '.join(result)

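The overall idea is to split the text on spaces into a list, strip links from each item, and then join the items back into a string. The split matters because the scheme-less pattern matches lazily from the start of its input: without it, everything before a .com or .cn would be deleted, not just the token that contains the domain.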
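
One limitation worth noting: in Python 3, `\w` also matches Chinese characters, so a link followed directly by Chinese text with no space in between can take that text with it. The sketch below is one possible variant, not the method used above. It assumes that links consist of ASCII characters only and that scheme-less domains end in .com or .cn; the name `clean_url_ascii` is hypothetical.

```python
import re

# re.ASCII restricts \w and \b to ASCII, so a link stops at the first Chinese character.
URL_ASCII = re.compile(r'(?:https?)?://[\w./?=&%\-]*\b', re.ASCII)
# Scheme-less domains: an ASCII host name ending in .com or .cn (assumed suffix list).
DOMAIN_ASCII = re.compile(r'[\w.\-]*\.(?:com|cn)\b', re.ASCII)


def clean_url_ascii(text):
    """Remove links without splitting the text on spaces first."""
    text = URL_ASCII.sub('', text)
    return DOMAIN_ASCII.sub('', text)


print(clean_url_ascii('看看http://t.cn/abc你好'))   # 看看你好
print(clean_url_ascii('访问example.com了解更多'))   # 访问了解更多
```

Because the patterns can only consume ASCII characters, the surrounding Chinese text survives even when the post contains no spaces at all.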

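Removing HTML tags

The first case removes only the HTML tags themselves and keeps the text inside them. The code is taken directly from the web and does not seem to need any additions.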

def clean_html(text):
    # Match any opening, closing, or self-closing tag such as <p>, </p>, or <br/>
    html_pattern = re.compile(r'</?\w+[^>]*>', re.S)
    text = re.sub(html_pattern, '', text)
    return text
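
One thing tag stripping leaves behind is HTML entities such as `&amp;` or `&lt;`. The following is a minimal sketch of one way to decode them with the standard library's `html.unescape` after the tags are gone. The function name `clean_html_and_entities` is hypothetical, and whether entities appear at all depends on how the posts were collected.

```python
import html
import re

TAG_PATTERN = re.compile(r'</?\w+[^>]*>')


def clean_html_and_entities(text):
    """Strip tags first, then decode entities such as &amp; and &lt;."""
    text = TAG_PATTERN.sub('', text)
    return html.unescape(text)


print(clean_html_and_entities('<p>Tom &amp; Jerry</p>'))   # Tom & Jerry
```

Decoding after stripping is deliberate: an escaped `&lt;b&gt;` is literal text written by the author, and decoding it first would make it look like a tag and get it deleted.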

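The second case removes not only the HTML tags but also the text they enclose. Captions such as '网页链接' ("web link") and '转推' ("retweet") are clickable link text, and the first approach leaves them behind. This approach is aggressive, however, so it should be applied only to inline tags such as a and b, not to p or div tags.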

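def clean_html_strict(text):
    # \b keeps <a and <b from also matching tags such as <abbr> or <br>;
    # \1 requires the closing tag to match the opening one
    html_pattern = re.compile(r'<(a|b)\b[^>]*>.*?</\1>', re.S)
    text = re.sub(html_pattern, '', text)
    return text.strip()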

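Besides the approach above, '网页链接', '转推', and similar captions can also be handled in two steps: apply the first approach to strip the tags, then remove each caption separately.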
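
The two-step alternative could look roughly like this. It is a sketch under assumptions: the caption list `LINK_CAPTIONS` holds only the two examples mentioned in the text and would need extending for real data, and the function name is hypothetical.

```python
import re

TAG_PATTERN = re.compile(r'</?\w+[^>]*>')
# Assumed list of leftover link captions; extend it to match the data at hand.
LINK_CAPTIONS = ('网页链接', '转推')


def clean_html_then_captions(text):
    """Strip tags first, then delete known link captions left behind."""
    text = TAG_PATTERN.sub('', text)
    for caption in LINK_CAPTIONS:
        text = text.replace(caption, '')
    return text.strip()


print(clean_html_then_captions('今天天气不错<a href="http://t.cn/x">网页链接</a>'))   # 今天天气不错
```

Compared with the strict variant, this keeps the text inside every other tag and deletes only the captions that are explicitly listed.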

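Removing topic tags

Topic tags in these posts are usually wrapped in symbols such as '【】', '##', or '[]', so they can be filtered out by matching these delimiters. Twitter-style hashtags, which carry only a leading '#', are not wrapped this way and are not handled correctly by the pattern below.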

def clean_tag(text):
    # Pair each opening symbol with its own closing symbol: [...], #...#, 【...】
    tag_pattern = re.compile(r'\[.*?\]|#.*?#|【.*?】', re.S)
    text = re.sub(tag_pattern, '', text)
    return text.strip()
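
For Twitter-style text, where a hashtag has a leading '#' but no closing one, a separate pattern is needed. The sketch below is one possible approach, with a hypothetical function name. It is meant for text that does not use the wrapped '#...#' form, since on such text it would leave the closing '#' behind.

```python
import re

# A '#' followed by word characters, not preceded by a word character,
# so that fragments such as 'abc#def' are left alone.
HASHTAG_PATTERN = re.compile(r'(?<!\w)#\w+')


def clean_hashtag(text):
    """Remove Twitter-style hashtags and collapse the spaces left behind."""
    text = HASHTAG_PATTERN.sub('', text)
    return re.sub(r' {2,}', ' ', text).strip()


print(clean_hashtag('Great game tonight #NBA #Finals'))   # Great game tonight
```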

Removing @mentions

def clean_at(text):
    # Remove '@' and everything after it up to the next whitespace character
    at_pattern = re.compile(r'@\S*')
    text = re.sub(at_pattern, '', text)
    return text.strip()
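
Because `\S*` runs until the next whitespace character, a mention inside Chinese text written without spaces can take the rest of the sentence with it. The sketch below stops at common punctuation instead. The delimiter set is an assumption made for illustration, not an official username grammar, and the function name is hypothetical.

```python
import re

# Stop at whitespace, another '@', or common ASCII/full-width punctuation
# (an assumed delimiter set).
MENTION_PATTERN = re.compile(r'@[^\s@:：,，。!！?？]+')


def clean_at_narrow(text):
    """Remove @mentions but keep the text that follows the delimiter."""
    return MENTION_PATTERN.sub('', text).strip()


print(clean_at_narrow('回复@张三:今天见'))   # 回复:今天见
```

A mention followed directly by Chinese text with no delimiter at all still cannot be separated this way, since the username and the text are indistinguishable without a list of known users.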

Keeping only Chinese text during word segmentation

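import re
import jieba
# article (the input text) and stop_words (a collection of stop words) are assumed to be defined elsewhere
# Characters to filter out: ASCII letters, digits, ASCII and full-width punctuation, whitespace
pattern = re.compile(r'[a-zA-Z0-9!"#$%&\'()*+,\-./:;<=>?@，。★、…【】《》？“”‘’！\[\]^_`{|}~\\\s]+')
# Keep tokens longer than one character that are not stop words, not digits, and contain no filtered character
words = " ".join(
    word for word in jieba.lcut(article)
    if word not in stop_words and not word.isdigit() and len(word) > 1 and not pattern.search(word)
)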
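
The filter above is a blacklist: it drops tokens containing any listed character, so symbols that are not listed still get through. A whitelist is a possible alternative, sketched below under the assumption that "Chinese" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The function `keep_chinese` and the sample tokens are hypothetical; in practice the tokens would come from `jieba.lcut`.

```python
import re

# CJK Unified Ideographs, the basic block that covers most modern Chinese text.
CHINESE_ONLY = re.compile(r'[\u4e00-\u9fff]+')


def keep_chinese(tokens, stop_words=frozenset(), min_len=2):
    """Return tokens made up solely of Chinese characters."""
    return [
        tok for tok in tokens
        if len(tok) >= min_len
        and tok not in stop_words
        and CHINESE_ONLY.fullmatch(tok)
    ]


# Tokens as a segmenter might return them (hand-written sample, not real jieba output).
sample = ['今天', '天气', 'very', '好', '2024', '。', '我们', '出发']
print(' '.join(keep_chinese(sample, stop_words={'我们'})))   # 今天 天气 出发
```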

Text processing for tweets and Weibo posts: removing URLs, HTML tags, and topic tags
