17,844 questions
1
vote
1
answer
94
views
Why are HTML files not being generated?
My version of Scrapy is 2.11.0
I am learning Scrapy, and the code they give as an example to try is this:
from pathlib import Path
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "...
3
votes
1
answer
141
views
How to stop/kill achieved Scrapy spider instance within RStudio
I'm making a tutorial on how to scrape with Scrapy. For that, I use Quarto/RStudio and the website https://quotes.toscrape.com/. For pedagogic purposes, I need to run a first crawl on the first page, ...
Advice
0
votes
0
replies
62
views
Can Cloudflare be bypassed from unrendered browsers using basic techniques like setting proper headers or cookies?
I’m building a Scrapy-based crawler and facing Cloudflare protection on some sites.
Here’s my current setup:
I have a separate API service that can bypass Cloudflare by simulating a real browser (e.g....
0
votes
1
answer
79
views
Scrapy handle status 202
I'm quite new to web scraping, and in particular to using Scrapy's spiders, pipelines...
I'm getting a 202 status in the responses to some spider requests, so the page content is not available yet
...
2
votes
1
answer
67
views
Scrapy Playwright freezes after initialization ([scrapy.middleware] INFO: Enabled item pipelines:['carscraper.pipelines.PostgreSQLPipeline'])
After starting a spider, it freezes at the stage where the item pipelines are enabled. There are no errors, just the scrapy-playwright script, but it stops at the beginning before it even starts ...
0
votes
1
answer
98
views
ModuleNotFoundError using macOS Brew installed module
I'm far from a Python expert and this is my first Scrapy project. I installed Scrapy using Brew. I've been able to do some basics with Scrapy and am making progress. I need to add Beautiful Soup to clean ...
0
votes
0
answers
108
views
Unexpected(?) availability of child elements during start events in lxml.etree.iterparse
I’m writing a sitemap XML parser using lxml.etree.iterparse
class Sitemap:
"""Class to parse Sitemap (type=urlset) and Sitemap Index
(type=sitemapindex) files"""
...
0
votes
1
answer
78
views
A way to defer yielding a Request in Scrapy?
My Scrapy logic is as follows:
get all rows from child_page_table where parent_page_id is null
for each row, if parent_page_id is (still) null, yield a Request with callback scrape_page
[scrape_page] ...
2
votes
1
answer
75
views
Stopping Scrapy from fetching enqueued requests after timeout or Keyboard Interrupt
I am trying to make a web crawler with Scrapy which fetches some HTML pages and saves them via the default Request callback, i.e. parse().
The thing is, I want the spider to stop crawling pending or ...
-1
votes
2
answers
99
views
How can I scrape content that's loaded dynamically on Sainsbury's product pages?
Trying to build a scraper that extracts nutritional information from each product page on Sainsbury's (e.g., scraping energy values out of https://www.sainsburys.co.uk/gol-ui/product/sainsburys-...
0
votes
1
answer
74
views
Getting error in using Scrapy for scraping a simple website
I ran the command scrapy or scrapy crawl bookspider -o bookdata.csv, and the error looks like this:
Traceback (most recent call last):
File "C:\Users\Tunansh Vatsa\AppData\Local\...
-4
votes
1
answer
68
views
Scrapy web crawler refuses to crawl http on localhost [closed]
I had a small web crawler written using Scrapy, and since I didn't want to run it against a real site during development, I used a local mirror. The mirror was served with python -m http.server 8000 ...
2
votes
1
answer
79
views
Scrapy CrawlSpider does not work with 507 status code
I have a Scrapy CrawlSpider that parses reviews, using scrapy-rotating-proxies.
But when I tried to connect to the site I got the 507 status code. In ...
0
votes
1
answer
76
views
What is the proper way to process None values in Scrapy?
I'm using Scrapy v2.12.0 and Item Loader. My spider returns None for some item fields in certain items. For instance, the field 'exterior_color' is processed as follows:
in the spider, in ...
1
vote
0
answers
69
views
Problems downloading files with Selenium + Scrapy
This code is supposed to download some documents it must locate within a series of given links.
While it does seemingly locate the link to the PDF file, it's failing to download it. What might be the ...
