Skip to main content

Command Palette

Search for a command to run...

My scraper returned 0 results for a week. The bug was one HTTP header.

Published
3 min read
My scraper returned 0 results for a week. The bug was one HTTP header.
L
I build and operate a fleet of 7 self-improving Python bots that trade crypto derivatives, scan freelance markets, and generate digital products automatically. Writing about trading bot architecture, web scraping, API integrations, and autonomous agents. All source code available at payhip.com/botfarm.

Posting this because the symptom was weird and the fix was one line, and I want the next person who hits it to find the answer faster than I did.

The symptom

I run a freelance-job scraper against two public sites: PeoplePerHour and Guru.com. Both public listing pages, both HTML, nothing fancy.

For seven days, the scraper logged:

[scrape] pph: parsed 0 job listings
[scrape] guru: parsed 0 job listings

No exceptions. No 4xx or 5xx responses. Both endpoints returned 200 OK. My regex over the page was just returning an empty list.

The diagnostic

First, I assumed the HTML structure changed. Pulled the page in a browser, viewed source, grabbed the actual anchor pattern. It matched what my regex expected.

Then I dumped the response body my scraper was getting:

r = requests.get(url, headers=HEADERS, timeout=15)
print(repr(r.text[:100]))

Output:

'\x8b\xa1\x03\x00\x11Zw\x1f\xc2\xa1...'

That's not HTML. That's compressed binary.

The bug

My headers had:

HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
}

That Accept-Encoding: gzip, deflate, br advertises that my client can decode three encodings: gzip, deflate, and brotli.

requests auto-decodes gzip and deflate. It does NOT auto-decode brotli unless you pip install brotli (or brotlicffi). Without that, r.text returns the raw brotli-compressed bytes decoded as latin-1, which looks like the mojibake above.

The server sees the br in my Accept-Encoding, picks brotli (it's more efficient than gzip, so modern servers prefer it), sends a brotli-encoded response, and my code fails silently because requests quietly passes through the undecoded bytes as r.text.

The fix

Two options:

Option 1 — install brotli support:

pip install brotli

Now requests knows how to decode all three encodings, and the server's response comes back as HTML like you expect.

Option 2 — don't advertise what you can't decode:

HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "identity",  # or just "gzip, deflate"
    "Accept-Language": "en-US,en;q=0.9",
}

identity means 'send me the raw bytes, no encoding'. Servers will honor it. Slightly more bandwidth, zero dependencies.

I went with Option 2 because the scrapers are lightweight and the bandwidth delta is not meaningful at my volume.

Verification

After the fix:

[scrape] pph: parsed 47 job listings
[scrape] guru: parsed 18 job listings

Zero to 65 jobs per scan. Exactly the same code, one header changed.

The broader lesson

The real bug here was my scraper. Not the header — the scraper.

It should have:

  1. Detected that the response body didn't parse as HTML
  2. Logged a distinguishable error
  3. Refused to silently return an empty list

Instead it trusted r.text, ran a regex over gibberish, got zero matches, and cheerfully logged 'parsed 0 job listings' as if that were a normal outcome.

Here's the validation I added:

import requests

def fetch_html(url, headers):
    r = requests.get(url, headers=headers, timeout=15)
    r.raise_for_status()
    body = r.text
    # Sanity: response should look like HTML
    if "<html" not in body.lower() and "<body" not in body.lower():
        sample = repr(r.content[:40])
        raise ValueError(f"Response doesn't look like HTML. first 40 bytes: {sample}")
    return body

If I'd had that from day one, the bug would have been a single loud error in the log. Instead it was seven days of quiet failure.

Small gotcha, big impact

One header. Seven days. 65 jobs a scan I wasn't getting. The class of bugs where the symptom is 'nothing, except no results' is always worth a defensive check at the boundary — treat empty results as suspicious, not as a valid answer.