[script] Build a Crawler from a Google Sheet list of URLs

Working in-house for a publisher (which, by definition, publishes a lot), I was looking for a way to review everything released over the past X days and quickly scan the SEO titles.

The solution I came up with (and I’m giving you today) is:

  • Crawl a list of URLs from Google Sheet
  • Retrieve and scrape some useful data
  • Export the result back to Google Sheet

It doesn’t cover:

  • how to get the list into Google Sheets, since that method will vary depending on the platform you’re using
  • how to run the script automatically every day (I will in a future post; spoiler alert: it’s a cron job running on AWS, and for FREE 🙂 )

1 – The first step (and probably the least fun) is to set up your Google account

I’m using the excellent gspread Python library, and its documentation is very clear, so instead of copy/pasting I will just give you the right link: https://gspread.readthedocs.io/en/latest/oauth2.html#oauth-client-id

Summary:

  • Go to Google developer console https://console.developers.google.com/project
  • Create a project (or use an existing one)
  • Activate Google Drive API and Google Sheets API
  • APIs & Services > Credentials | Create credentials > Service account key | Create
  • Actions>Create Key>JSON

Congratulations! You now have a JSON file on your computer. I renamed mine ‘secret.json’ and placed it in the same directory as my Python code.

Copy the service account email from the Service accounts menu and share your spreadsheet with that email. It looks something like this:

{account name}@{project name}-266558842.iam.gserviceaccount.com

You can also find it in the previously downloaded JSON file, under the “client_email” key.
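If you’d rather grab it programmatically, here is a minimal sketch. The JSON string below is a made-up stand-in for the key file (a real one also contains private_key, project_id, etc.):

```python
import json

# Made-up sample mirroring the structure of the downloaded key file
sample = (
    '{"type": "service_account", '
    '"client_email": "crawler@my-project-266558842.iam.gserviceaccount.com"}'
)

key_data = json.loads(sample)
print(key_data["client_email"])
# With the real file you would use: json.load(open('secret.json'))
```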

2 – Now the fun part where Python does all the work for you

Disclaimer: I’m not a professional developer, so the code is probably (absolutely?) suboptimal, but it works. I’m more than open to hearing your thoughts and optimizations!

I assume it’s not your first Python gig so I won’t explain how to install Python or libraries. You’ll need:

  • gspread https://gspread.readthedocs.io/en/latest/index.html
  • requests https://requests.readthedocs.io/en/master/
  • BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/

pip install requests gspread beautifulsoup4

In the code below, you just have to replace the values where it’s commented:

  • json file name
  • spreadsheet name
  • source sheet name + column number
  • destination sheet name
  • if you don’t have a header you can remove the ‘pop’

import gspread
from oauth2client.service_account import ServiceAccountCredentials
import requests
from bs4 import BeautifulSoup

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('secret.json', scope)  # your JSON file name
client = gspread.authorize(creds)

ssheet = client.open('SEO - Daily <title> check')  # name of your spreadsheet
sheet = ssheet.worksheet('urls source')  # name of your source sheet

urls = sheet.col_values(7)  # I'm getting the URLs from the 7th column; there's a good chance it'll be different for you
urls.pop(0)  # removes the first item (i.e. the header); if you have no header, just remove this line

result = []

for url in urls:
    if len(url) > 0:
        print(url)
        try:
            page = requests.get(url, timeout=10)
            soup = BeautifulSoup(page.content, 'html.parser')
            title = soup.find('title')
            # I'm only getting URL + STATUS CODE + <TITLE>,
            # but your imagination (and bs4 here) is the limit
            result.append([url, page.status_code, title.text if title else ''])
        except requests.RequestException as e:
            print(f"error on {url}: {e}")

ssheet.values_clear('data!A2:D')  # clear the destination sheet first
ssheet.values_update(
    'data!A2',  # name of your destination sheet
    params={
        'valueInputOption': 'USER_ENTERED'
        # https://developers.google.com/sheets/api/reference/rest/v4/ValueInputOption
    },
    body={
        'values': result
    }
)
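As the comment in the loop says, bs4 can pull out more than the title. Here is a self-contained sketch of how you might also grab the meta description for each row — the HTML below is made up for illustration, standing in for a fetched page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for page.content from requests
html = """
<html><head>
  <title>My Article</title>
  <meta name="description" content="A short summary for search engines.">
</head><body></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title')
desc = soup.find('meta', attrs={'name': 'description'})

# One extra column per extra element you scrape
row = [
    title.text if title else '',
    desc['content'] if desc else '',
]
print(row)
```

Each extra column you add here just becomes one more value per row in the `values` list you send back to the sheet.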

Voilà! The next step is to run a cron job every night so the data magically appears in the morning.
