Scrapy – Building Web Crawlers With Scrapy

Scrapy is a Python framework that helps developers build web crawling applications. It supports the common functionality you would expect from a scraping tool — scheduling requests, parsing responses, and exporting data — so you can focus on developing your project.

It comes with many useful tools to manage your scraping tasks, including the scrapy command-line tool, which makes it easy to run your spiders and collect data. It also provides detailed logging to help you monitor your spider’s progress and detect any errors that occur.

The core architecture of every Scrapy-based scraper is simple: spider classes define generator methods that yield either new requests (each with a callback) or scraped items to be saved to storage. You run these spiders with the scrapy command or from your own Python script.

Basically, your spiders need to define the initial requests, how to follow links on a page, and how to parse the downloaded content. The start_requests method (or the start_urls shortcut attribute) supplies an iterable of Request objects from which the spider begins to crawl.

Once your spider yields its first request, the engine passes it through the downloader middleware to the Downloader, which fetches the page and generates a response. That response travels back through the engine and the spider middleware to your spider’s callback.

Your spider can handle multiple domains in parallel and can be tuned for different types of websites to control the number of requests it sends. It also supports XPath and CSS selectors, which are very powerful and enable you to extract specific information from a page.

Under the hood, the CSS selectors you use are actually converted to XPath expressions — not the other way around — so both mechanisms query the same document tree and the same matching rules apply. This means the selectors you develop in the Scrapy shell will match the same HTML elements when used in your spider.

You’ll also learn about Python’s built-in logging package, which Scrapy uses internally, and how to use it to log every aspect of your crawl. You can configure different log levels, and with Scrapy’s mail extensions you can even send email notifications when certain events happen.

Logging in Scrapy is configured through settings such as LOG_LEVEL and LOG_FILE, defined in your project’s settings.py. You can also override these for a single spider by setting a custom_settings class attribute.

For example, if you want to scrape a list of products, you’d override start_requests to yield a Request object for each listing page, and give each request a callback that extracts the items you want.

The parse method is the default callback: it receives the response for each downloaded URL unless the request names a different callback. By overriding start_requests and parse you control which URLs the crawl starts from and how each response is handled.

The Spider class is the base class Scrapy uses for crawling and scraping tasks. It’s very easy to get started with, and it’s the one you’ll subclass most of the time when building your own spiders. It also provides convenience attributes and methods — such as logger and settings — to assist with your spider’s work.