Here’s what I learned from building a scraper and a (semi) automated website

I have built a lot of websites just for fun. This one was by far the most challenging, and there’s still a lot of work to be done!

The basic idea was to see if I could scrape enough data from review websites to make solid buying recommendations (spoiler alert: not really). Of course, the goal is not necessarily to rank in Google (they are not big fans of auto-generated data) but rather to build a product database; the front-end part was really just for fun.

I will write a post with more details (and code) later, but for now, I wanted to share my first learnings from this experience.

Python was probably not the best choice for the project

I love Python, and the idea was sparked by the famous BeautifulSoup, but it was the first time I tried to build something bigger than just a script. The only projects of this scale I had worked on in the past were written in other languages, mostly PHP. I found myself lost in my folder structure and quickly reached some of the language’s limits. I can easily blame my lack of professional experience, but still, I found it harder than in PHP to organize my code elegantly.

The Internet is a mess!!

It’s close to impossible to build scraping rules that work for a vast majority of websites. Even the basics are unreliable: missing canonical tags, multiple title tags, random headings, etc.
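To give an idea of what "even the basics" means, here is a minimal sketch of a defensive extractor. It is not the code from my scraper: it uses Python's built-in `html.parser` instead of BeautifulSoup so it runs without any dependency, and the names `PageBasics` and `extract_basics` are made up for this example. It assumes nothing about the page and simply reports what it finds.

```python
from html.parser import HTMLParser


class PageBasics(HTMLParser):
    """Collects every <title>, every canonical link, and the number of <h1> tags."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.canonicals = []
        self.h1_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.titles.append("")
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "link":
            # rel can hold several space-separated values, or be missing entirely
            rel_values = (attrs.get("rel") or "").lower().split()
            if "canonical" in rel_values and attrs.get("href"):
                self.canonicals.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title
        if self._in_title and self.titles:
            self.titles[-1] += data


def extract_basics(html):
    """Return the first usable title/canonical plus counters for the oddities."""
    parser = PageBasics()
    parser.feed(html)
    parser.close()
    titles = [t.strip() for t in parser.titles if t.strip()]
    return {
        "title": titles[0] if titles else None,
        "extra_titles": max(len(titles) - 1, 0),
        "canonical": parser.canonicals[0] if parser.canonicals else None,
        "h1_count": parser.h1_count,
    }
```

Every field can come back as `None` or with an unexpected count, and the rest of the pipeline has to cope with that. Taking the first title and the first canonical is an arbitrary choice for the sketch, not a rule.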

It becomes a real challenge when you try to scrape (at scale) some unstructured data like superlatives (e.g. best X for Y), pros and cons. I hear you, smarty-pants in the back: what about the schema??? True, schema helps and I thought that would be the easy part but…

Schema implementation is also a mess!

Unfortunately, even a structured language leaves some room for flexibility, and that makes scraping it a real brain-teaser.
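A small illustration of that flexibility, assuming the schema comes as JSON-LD that has already been pulled out of the page as a string. The same product can show up as a single object, as an item in a top-level list, inside an `@graph` array, or nested under another node, and `@type` can be a string or a list. The sketch below is one possible way to normalize all of that; `iter_nodes` and `find_by_type` are hypothetical helper names, not part of any library.

```python
import json


def iter_nodes(data):
    """Yield every JSON object found anywhere in a parsed JSON-LD payload."""
    if isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)
    elif isinstance(data, dict):
        yield data
        for value in data.values():
            yield from iter_nodes(value)


def find_by_type(raw_json_ld, wanted):
    """Return all nodes whose @type includes `wanted`, or [] if the JSON is invalid."""
    try:
        data = json.loads(raw_json_ld)
    except json.JSONDecodeError:
        # Broken JSON-LD is common enough that it should not crash the run
        return []
    found = []
    for node in iter_nodes(data):
        types = node.get("@type")
        if isinstance(types, str):
            types = [types]
        elif not isinstance(types, list):
            types = []
        if wanted in types:
            found.append(node)
    return found
```

For example, `find_by_type(raw, "Product")` returns the product nodes whichever of those shapes the site picked. It only solves the "where is it" problem: the fields inside each node can still be missing or filled in differently from one site to the next.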

GitHub is not intuitive

It took me a while to learn how to use it, and every time I stop using it for a couple of weeks, I forget how to… I love what it does but I hate using it!!!

I love Raspberry Pis

I have 9 active Pis in my home, so I knew that already. The scraper and database run automatically on a Raspberry Pi 4 (4 GB RAM) + SSD, and it’s been running for months now without any trouble (and without a fan… I got rid of it after 2 days). I love the idea of having a tiny machine browsing the web all day from my desk drawer.

Sleeping helps

I did this project mostly at night when my son was (finally) asleep. For about 3 months I didn’t sleep well and got stuck on some trivial parts just because of a foggy brain. Usually, taking a two-day break to recharge the batteries with good sleep was enough to unlock productivity. So, SLEEP!

You can take a look at the current version here: https://www.ahoyreviews.com/ (I never said I was a designer! 🙂 )

Vincent Malischewski

SEO/Data Enthusiast

I help international organizations and large-scale websites to grow intent-driven audiences on transactional content and to develop performance-based strategies.

Currently @ZiffDavis – Lifehacker, Mashable, PCMag
ex @DotdashMeredith, @FuturePLC