Newest 'scrapy' Questions

1 vote

1 answer

94 views

Why are HTML files not being generated

My version of Scrapy is 2.11.0. I am learning Scrapy, and the code they give as an example to try is this: from pathlib import Path import scrapy class QuotesSpider(scrapy.Spider): name = "...

crawlingdev

279

asked Dec 3 at 13:24

3 votes

1 answer

141 views

How to stop/kill achieved Scrapy spider instance within RStudio

I'm making a tutorial on how to scrape with Scrapy. For that, I use Quarto/RStudio and the website https://quotes.toscrape.com/. For pedagogic purposes, I need to run a first crawl on the first page, ...

Didier mac cormick

237

asked Nov 20 at 10:00

Advice

0 votes

0 replies

62 views

Can Cloudflare be bypassed from unrendered browsers using basic techniques like setting proper headers or cookies?

I’m building a Scrapy-based crawler and facing Cloudflare protection on some sites. Here’s my current setup: I have a separate API service that can bypass Cloudflare by simulating a real browser (e.g....

Muhammad Sameer

50

asked Nov 5 at 16:50

0 votes

1 answer

79 views

Scrapy handle status 202

I'm quite new to web scraping, and in particular to using Scrapy's spiders, pipelines... I'm getting a 202 status in the responses to some spider requests, hence the page content is not available yet ...

Manu310

178

asked Oct 28 at 11:27

2 votes

1 answer

67 views

Scrapy Playwright freezes after initialization ([scrapy.middleware] INFO: Enabled item pipelines:[‘carscraper.pipelines.PostgreSQLPipeline’])

After starting a spider, it freezes at the stage where the pipeline should be enabled. There are no errors, just the scrapy-playwright script, but it stops at the beginning before it even starts ...

Max Plakushko

23

asked Oct 18 at 21:58

0 votes

1 answer

98 views

ModuleNotFoundError using macOS Brew installed module

I'm far from a Python expert and this is my first Scrapy project. I installed Scrapy using Brew. I've been able to do some basics with Scrapy and am making progress. I need to add Beautiful Soup to clean ...

JReekes

1

asked Sep 27 at 22:55

0 votes

0 answers

108 views

Unexpected(?) availability of child elements during start events in lxml.etree.iterparse

I’m writing a sitemap XML parser using lxml.etree.iterparse class Sitemap: """Class to parse Sitemap (type=urlset) and Sitemap Index (type=sitemapindex) files""" ...

abebus

1

asked Sep 3 at 11:05
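
A minimal sketch related to the sitemap question above, not the asker's code. It swaps in the standard library's `xml.etree.ElementTree.iterparse` rather than `lxml.etree.iterparse`, and the function name `iter_sitemap_locs` and the sample document are hypothetical. The standard library documents that on a `start` event only the tag and attributes are defined, and that children may or may not be present yet; whether lxml behaves identically is an assumption here. The sketch therefore uses `start` only to read the root tag and reads child content on `end` events.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by sitemap and sitemap index files.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_sitemap_locs(source):
    """Yield (kind, loc) pairs, where kind is 'urlset' or 'sitemapindex'."""
    kind = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # The first start event belongs to the root element; its tag
            # is reliable here even though its children are not.
            if kind is None:
                kind = elem.tag.split("}")[-1]
            continue
        # On an end event the element and all of its children are complete.
        if elem.tag in (NS + "url", NS + "sitemap"):
            loc = elem.findtext(NS + "loc")
            if loc:
                yield kind, loc.strip()
            elem.clear()  # drop the processed entry's children and text


# Hypothetical sample input for illustration only.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

locs = list(iter_sitemap_locs(io.BytesIO(SAMPLE)))
print(locs)
```

Reading entries on `end` rather than `start` avoids depending on children that happen to be parsed early because of input buffering.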

0 votes

1 answer

78 views

A way to defer yielding a Request in scrapy?

My Scrapy logic is as follows: get all rows from child_page_table where parent_page_id is null; for each row, if parent_page_id is (still) null, yield a Request with callback scrape_page; [scrape_page] ...

user1713450

1,513

asked Jul 9 at 21:43

2 votes

1 answer

75 views

Stopping Scrapy from fetching enqueued requests after timeout or Keyboard Interrupt

I am trying to make a web crawler with Scrapy which fetches some HTML pages and saves them via the default Request callback, i.e. parse(). The thing is, I want the spider to stop crawling pending or ...

srajan0149

43

asked Jun 9 at 10:21

-1 votes

2 answers

99 views

How can I scrape content that's loaded dynamically on Sainsbury's product pages?

Trying to build a scraper that extracts nutritional information from each product page on Sainsbury's (e.g., scraping energy values out of https://www.sainsburys.co.uk/gol-ui/product/sainsburys-...

Siddharth Gianchandani

11

asked Jun 6 at 14:32

0 votes

1 answer

74 views

Getting error in using Scrapy for scraping a simple website

I ran the command scrapy or scrapy crawl bookspider -o bookdata.csv and the error looks like this: Traceback (most recent call last): File "C:\Users\Tunansh Vatsa\AppData\Local\...

noob_coder123

1

asked May 19 at 4:00

-4 votes

1 answer

68 views

scrapy webcrawler refuses to crawl http on localhost [closed]

I had a small web crawler written using Scrapy, and since I didn't want to run it against the real site during development, I used a local mirror. The mirror was served with python -m http.server 8000 ...

Anton

132

asked May 19 at 3:11

2 votes

1 answer

79 views

Scrapy CrawlSpider does not work with 507 status code

I have a Scrapy CrawlSpider that parses reviews, using scrapy-rotating-proxies. But when I tried to connect to the site I got the 507 status code. In ...

CollonelDain

31

asked May 3 at 13:08

0 votes

1 answer

76 views

What is the proper way to process None values in Scrapy?

I'm using Scrapy v2.12.0 and Item Loader. My spider returns None for some item fields in certain items. For instance, the field 'exterior_color' is processed as follows: in the spider, in ...

Dmitry Borisoglebsky

143

asked Apr 5 at 7:49

1 vote

0 answers

69 views

Problems downloading files with Selenium + Scrapy

This code is supposed to download some documents it must locate within a series of given links. While it does seemingly locate the link to the PDF file, it's failing to download it. What might be the ...

42WaysToAnswerThat

371

asked Apr 2 at 1:03
