[Tutorial] Scraping and Tallying Article Information for a Given Keyword on Google Scholar

Category: Entertainment News | Date: 2023-08-02

To scrape article information from Google Scholar, you can use Python's requests, BeautifulSoup, and re libraries.

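Below is a simple code example that fetches the HTML of a Google Scholar search results page and extracts the title, authors, abstract, and link of each article from it: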

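```python
import re

import requests
from bs4 import BeautifulSoup

# Search keywords
query = 'python web scraping'

# Build the query-string parameters
params = {'q': query}

# Define the request headers (a browser-like User-Agent)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Send the GET request and stop early on an HTTP error status
response = requests.get('https://scholar.google.com/scholar',
                        params=params, headers=headers, timeout=10)
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the information for each article
articles = soup.find_all('div', {'class': 'gs_ri'})

for article in articles:
    # Extract the title; skip results that have no title element
    title_tag = article.find('h3', {'class': 'gs_rt'})
    if title_tag is None:
        continue
    title = title_tag.text.strip()

    # Extract the author line, replace non-breaking spaces (\xa0)
    # with ordinary spaces, then split it on ' - '
    authors_tag = article.find('div', {'class': 'gs_a'})
    authors = authors_tag.text.strip() if authors_tag else ''
    authors = re.sub(r'\xa0', ' ', authors)
    authors = re.split(' - ', authors)

    # Extract the abstract snippet (some results have none)
    abstract_tag = article.find('div', {'class': 'gs_rs'})
    abstract = abstract_tag.text.strip() if abstract_tag else ''

    # Extract the link (some results have no linked title)
    link_tag = title_tag.find('a')
    link = link_tag['href'] if link_tag else ''

    # Print the results
    print('Title:', title)
    print('Authors:', authors)
    print('Abstract:', abstract)
    print('Link:', link)
    print('-------------------')
```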

This code prints the title, authors, abstract, and link of each article. You can modify it as needed to extract more or less information.
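
The script above only prints each result; it does not tally anything. As a rough sketch of the counting step, the example below tallies articles by publication year. It assumes the author line (the `gs_a` text) has the shape `authors - venue, year - source`, which real results do not always follow. The function names `parse_byline` and `count_by_year` and the sample bylines are made up for illustration and are not part of any library.

```python
import re
from collections import Counter


def parse_byline(byline):
    """Split an author line into authors, venue text, and year.

    Assumes the shape 'authors - venue, year - source'; anything that
    does not match simply yields an empty venue or a year of None.
    """
    parts = [p.strip() for p in byline.replace('\xa0', ' ').split(' - ')]
    authors = [a.strip() for a in parts[0].split(',') if a.strip()]
    venue = parts[1] if len(parts) > 1 else ''
    # Look for a four-digit year starting with 19 or 20 in the venue part
    match = re.search(r'\b(19|20)\d{2}\b', venue)
    year = int(match.group(0)) if match else None
    return {'authors': authors, 'venue': venue, 'year': year}


def count_by_year(bylines):
    """Count how many bylines fall in each year, ignoring unknown years."""
    years = (parse_byline(b)['year'] for b in bylines)
    return Counter(y for y in years if y is not None)


# Hypothetical sample bylines, for illustration only
sample_bylines = [
    'A Author, B Author - Journal of Examples, 2019 - example.com',
    'C Author\xa0- Example Conference, 2021 - example.org',
    'D Author - 2021 - example.net',
    'E Author - example.edu',
]

year_counts = count_by_year(sample_bylines)
for year, count in sorted(year_counts.items()):
    print(year, count)
```

To use this with the scraper, collect the raw `gs_a` text of each result into a list inside the loop and pass that list to `count_by_year` once the loop finishes.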