I have built a lot of websites just for fun. This one was by far the most challenging project, and there’s still a lot of work to be done!
The basic idea was to see if I could scrape enough data from review websites to make solid buying recommendations (spoiler alert: not really). Of course, the ambition is not necessarily to rank in Google (they are not big fans of auto-generated data) but rather to build a product database; the front-end part was really just for fun.
I will write a post with more details (and code) later, but for now, I wanted to share my first learnings from this experience.
Python was probably not the best choice for the project
I love Python, and the idea was sparked by the famous BeautifulSoup, but it was the first time I tried to build something bigger than just a script. The only projects of this scale I worked on in the past were written in other languages, mostly PHP. I found myself lost in my folder structure and quickly reached some of the language’s limits. I can easily blame my lack of professional experience, but still, I found it harder than PHP to organize my code elegantly.
The Internet is a mess!!
It’s close to impossible to build scraping rules that work for the vast majority of websites. Even the basics are unreliable: missing canonical tags, multiple title tags, random headings, etc.
It becomes a real challenge when you try to scrape (at scale) non-structured data like superlatives (e.g. best X for Y) or pros and cons. I hear you, smarty-pants in the back: what about schema??? True, schema helps, and I thought that would be the easy part, but…
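To make the "basics" problem concrete, here is a minimal sketch of a defensive extractor that collects every title, canonical link, and h1 instead of assuming there is exactly one of each. It is not the code behind the site: it uses Python's standard `html.parser` instead of BeautifulSoup to stay dependency-free, and the names `BasicsParser` and `extract_basics` and the fallback rules are assumptions made for illustration.

```python
from html.parser import HTMLParser


class BasicsParser(HTMLParser):
    """Collect every <title>, canonical link, and <h1> found in a page."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.canonicals = []
        self.h1s = []
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._capture = tag
            self._buffer = []
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "canonical" in rel and attrs.get("href"):
                self.canonicals.append(attrs["href"])

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            # Collapse whitespace and ignore empty tags.
            text = " ".join("".join(self._buffer).split())
            if text:
                target = self.titles if tag == "title" else self.h1s
                target.append(text)
            self._capture = None


def extract_basics(html):
    """Return a best-guess title and canonical URL, plus warnings about oddities."""
    parser = BasicsParser()
    parser.feed(html)
    parser.close()

    warnings = []
    if len(parser.titles) > 1:
        warnings.append(f"multiple title tags ({len(parser.titles)})")
    if not parser.canonicals:
        warnings.append("missing canonical")
    if len(parser.h1s) > 1:
        warnings.append(f"multiple h1 headings ({len(parser.h1s)})")

    # Assumed fallback order: first <title>, then first <h1>, then nothing.
    if parser.titles:
        title = parser.titles[0]
    elif parser.h1s:
        title = parser.h1s[0]
    else:
        title = None

    canonical = parser.canonicals[0] if parser.canonicals else None
    return {"title": title, "canonical": canonical, "warnings": warnings}
```

Returning the warnings next to the values lets the caller decide whether a page is trustworthy enough to keep, rather than silently picking the first match.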
Schema implementation is also a mess!
Unfortunately, even a structured language leaves some flexibility, which makes the scraping a real brain-teaser.
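As an illustration of that flexibility, assuming the markup in question is schema.org JSON-LD: the same product can arrive as a single object, as a list, or nested under `@graph`, `@type` can be a string or a list, and a rating value can be a number or a string. The sketch below normalizes those shapes. The helper names (`iter_nodes`, `types_of`, `find_ratings`) are made up for this example and are not from the actual scraper.

```python
import json


def iter_nodes(data):
    """Yield every dict node, whether data is an object, a list, or an @graph wrapper."""
    if isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)
    elif isinstance(data, dict):
        yield data
        # "@graph" may hold a list of nodes or a single node; None yields nothing.
        yield from iter_nodes(data.get("@graph"))


def types_of(node):
    """Return @type as a list, since it may be a string, a list, or missing."""
    value = node.get("@type", [])
    if isinstance(value, str):
        return [value]
    return list(value) if isinstance(value, list) else []


def find_ratings(raw_json):
    """Return (name, rating) for every Product node; rating is None when unusable."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return []  # invalid JSON inside a script tag: skip it

    results = []
    for node in iter_nodes(data):
        if "Product" not in types_of(node):
            continue
        rating = node.get("aggregateRating")
        value = rating.get("ratingValue") if isinstance(rating, dict) else None
        try:
            value = float(value)  # accepts 4, 4.5, or "4.5"
        except (TypeError, ValueError):
            value = None  # e.g. missing, or a localized string like "4,5"
        results.append((node.get("name"), value))
    return results
```

Each of these inputs describes a product, yet all three have a different outer shape:

```python
find_ratings('{"@type": "Product", "name": "A", "aggregateRating": {"ratingValue": "4.5"}}')
find_ratings('[{"@type": ["Product", "Thing"], "name": "B", "aggregateRating": {"ratingValue": 4}}]')
find_ratings('{"@graph": [{"@type": "WebPage"}, {"@type": "Product", "name": "C"}]}')
```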
GitHub is not intuitive
It took me a while to learn how to use it, and every time I stop using it for a couple of weeks, I forget how to… I love what it does but I hate using it!!!
I love Raspberry Pis
I have 9 active Pis in my home, so I knew that already. The scraper and database run automatically on a Raspberry Pi 4 (4 GB RAM) + SSD, and it’s been running for months now without any trouble (and without a fan… I got rid of it after 2 days). I love the idea of having a tiny machine browsing the web all day from my desk drawer.
I did this project mostly at night when my son was (finally) asleep. For about 3 months I didn’t sleep well and got stuck on some trivial parts just because of a foggy brain. Usually, taking a two-day break to recharge the batteries with good sleep was just enough to unlock productivity. So, SLEEP!
You can take a look at the current version here: https://www.ahoyreviews.com/ (I never said I was a designer! 🙂 )