一个中国小姐姐讲python自然语言处理的课程,真的讲得太好了,强烈推荐。
一、概述
数据科学应该掌握的三种技能:
数据的两种格式:
二、英语
profanity n. 亵圣; 对神灵的亵渎; (亵圣的) 诅咒语
corpus n. (书面或口语的) 文集,文献,汇编; 语料库;
Dreyfus model 德雷福斯模型
potty train 对(幼儿)作坐便训练 n.(幼儿的) 便盆 adj. 发疯的; 癫狂的; 喜爱; 对…痴迷
三、代码
#抓取p标签内容 # Scrapes transcript data from scrapsfromtheloft.com def url_to_transcript(url): '''Returns transcript data specifically from scrapsfromtheloft.com.''' page = requests.get(url).text soup = BeautifulSoup(page, "lxml") text = [p.text for p in soup.find(class_="post-content").find_all('p')] print(url) return text #生成字典 data_combined = {key: [combine_text(value)] for (key, value) in data.items()} #对文字进行处理 import re import string def clean_text_round1(text): '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.''' text = text.lower() text = re.sub('\[.*?\]', '', text) text = re.sub('[%s]' % re.escape(string.punctuation), '', text) text = re.sub('\w*\d\w*', '', text) return text round1 = lambda x: clean_text_round1(x) # Apply a second round of cleaning def clean_text_round2(text): '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.''' text = re.sub('[‘’“”…]', '', text) text = re.sub('\n', '', text) return text round2 = lambda x: clean_text_round2(x)
https://www.youtube.com/watch?v=xvqsFTUsOmc
https://github.com/adashofdata/nlp-in-python-tutorial
四、获得数据的地方