[script] Build a Crawler from a Google Sheet list of URLs

Working in-house for a publisher (which, by definition, publishes a lot), I was looking for a way to review everything released over the past X days and quickly scan the SEO titles.

The solution I came up with (and I’m giving you today) is:

  • Crawl a list of URLs from Google Sheet
  • Retrieve and scrape some useful data
  • Export the result back to Google Sheet

It doesn’t cover:

  • how to get the list into Google Sheets, since that method will vary depending on the platform you’re using
  • how to run the script automatically every day (I will in a future post; spoiler alert: it’s a cron job running on AWS, and for FREE 🙂 )

1 – The first step (and probably the least fun) is to set up your Google account

I’m using the excellent gspread Python library, and its documentation is very clear, so instead of copy/pasting I will just give you the right link: https://gspread.readthedocs.io/en/latest/oauth2.html#oauth-client-id

Summary:

  • Go to Google developer console https://console.developers.google.com/project
  • Create a project (or use an existing one)
  • Activate Google Drive API and Google Sheets API
  • APIs & Services > Credentials | Create credentials > Service account key | Create
  • Actions>Create Key>JSON

Congratulations! You now have a JSON file on your computer. I renamed mine ‘secret.json’ and placed it in the same directory as my Python code.

Copy the service account email from the Service accounts menu and share your spreadsheet with that email. It looks something like this:

{account name}@{project name}-266558842.iam.gserviceaccount.com

You can also find it in the previously downloaded JSON file, under the “client_email” key.
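If you’d rather grab it programmatically, here is a minimal sketch. The JSON string below is a made-up stand-in for the key file (a real one also contains private_key, project_id, etc.):

```python
import json

# Made-up sample mirroring the structure of the downloaded key file
sample = (
    '{"type": "service_account", '
    '"client_email": "crawler@my-project-266558842.iam.gserviceaccount.com"}'
)

key_data = json.loads(sample)
print(key_data["client_email"])
# With the real file you would use: json.load(open('secret.json'))
```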

2 – Now the fun part where Python does all the work for you

Disclaimer: I’m not a professional developer, so the code is probably (absolutely?) suboptimal, but it works. I’m more than open to hearing your thoughts and optimizations!

I assume it’s not your first Python gig so I won’t explain how to install Python or libraries. You’ll need:

  • gspread https://gspread.readthedocs.io/en/latest/index.html
  • requests https://requests.readthedocs.io/en/master/
  • BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/

pip install requests gspread beautifulsoup4

In the code below, you just have to replace the values where it’s commented:

  • json file name
  • spreadsheet name
  • source sheet name + column number
  • destination sheet name
  • if you don’t have a header you can remove the ‘pop’

import gspread
from oauth2client.service_account import ServiceAccountCredentials
import requests
from bs4 import BeautifulSoup

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('secret.json', scope)  # your JSON file name
client = gspread.authorize(creds)

ssheet = client.open('SEO - Daily <title> check')  # name of your spreadsheet
sheet = ssheet.worksheet('urls source')  # name of your source sheet

urls = sheet.col_values(7)  # I'm getting the URLs from the 7th column; there's a good chance it'll be different for you
urls.pop(0)  # removes the first item (i.e. the header); if you have no header, just remove this line

result = []

for url in urls:
    if len(url) > 0:
        print(url)
        try:
            page = requests.get(url, timeout=10)
            soup = BeautifulSoup(page.content, 'html.parser')
            title = soup.find('title')
            # I'm only getting URL + STATUS CODE + <TITLE>,
            # but your imagination (and bs4 here) is the limit
            result.append([url, page.status_code, title.text if title else ''])
        except requests.RequestException as e:
            print(f"error on {url}: {e}")

ssheet.values_clear('data!A2:D')  # clear the destination sheet first
ssheet.values_update(
    'data!A2',  # name of your destination sheet
    params={
        'valueInputOption': 'USER_ENTERED'
        # https://developers.google.com/sheets/api/reference/rest/v4/ValueInputOption
    },
    body={
        'values': result
    }
)
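As the comment in the loop says, bs4 can pull out more than the title. Here is a self-contained sketch of how you might also grab the meta description for each row — the HTML below is made up for illustration, standing in for a fetched page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for page.content from requests
html = """
<html><head>
  <title>My Article</title>
  <meta name="description" content="A short summary for search engines.">
</head><body></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title')
desc = soup.find('meta', attrs={'name': 'description'})

# One extra column per extra element you scrape
row = [
    title.text if title else '',
    desc['content'] if desc else '',
]
print(row)
```

Each extra column you add here just becomes one more value per row in the `values` list you send back to the sheet.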

Voilà! The next step is to run a cron job every night so the data magically appears in the morning.
