RoachPHP: Mastering Web Scraping with PHP
- Understanding Web Scraping and Crawling
- Getting Started with RoachPHP
- Creating Your First Spider
- Diving Deeper: Scraping for More Information
- Best Practices and Precautions
- Conclusion
Hey there, fellow PHP enthusiast! Welcome. In today's digital world, data is the new gold mine (Meta, I'm watching you 😏), and web scraping is one trusty old way we use to dig it up. Whether it's for market research, competitive analysis, or simply satisfying your curiosity, web scraping empowers you to access and organize data like never before.
Now, you might be wondering, "Why PHP?" Well, PHP has grown more robust with time, thanks to a vibrant community and a plethora of tools at our disposal. In this article, we'll explore how PHP can be your go-to language for web scraping, with a focus on the fantastic package, RoachPHP.
Understanding Web Scraping and Crawling
Let's take a brief detour and discuss what web scraping and crawling mean, and why they're so essential in today's digital landscape.
- Web Scraping is akin to precision surgery in the digital realm. It involves extracting specific data from web pages. Think of it as plucking the ripest fruit from a tree without bothering with the leaves and branches. Whether it's fetching product prices, weather forecasts, or real estate listings, web scraping allows us to target and extract just what we need.
- Web Crawling, on the other hand, is similar to a swarm of data-collecting worker bees across the internet. It involves systematically navigating through multiple web pages to gather information. Instead of handpicking individual data points, web crawling sweeps up vast amounts of data. This approach is ideal for tasks like indexing web content for search engines.
These processes have a shared goal: to harness the vast wealth of data the internet holds. Data is currency in the digital world, and web scraping and crawling act as one of the means of minting it. Data collected can empower businesses and researchers to amass large and diverse datasets for analysis. This abundant data also fuels the recent surge in AI, serving as lifeblood powering AI algorithms and machine learning models.
Getting Started with RoachPHP
Now that we've got scraping and crawling down pat, let's get started with RoachPHP. But first, what is RoachPHP?
Roach is your one-stop shop for web scraping in PHP. This isn't your average web crawler; it's a powerhouse inspired by Scrapy for Python and tailored for modern PHP. This powerful package combines some of the best tools from the PHP ecosystem, including Symfony's DomCrawler (with XPath support), Guzzle, and Browsershot, to give you the edge in web scraping.
It allows you to define spiders that crawl and scrape web documents. But that's not all: RoachPHP goes the extra mile. It includes an entire data processing pipeline to clean, persist, and fine-tune the data you extract, all within the PHP ecosystem.
One of RoachPHP's standout features is its framework-agnostic nature. It liberates you from the constraints of a specific PHP framework, offering the flexibility to leverage its capabilities in whatever PHP environment you prefer. Let's look at how to use it in a simple, vanilla PHP project.
Setting Up Your PHP Project
First things first, let's set up a new PHP project using Composer. Create a new folder for our app and initialize composer:
composer init
During the initialization, Composer will prompt you for various project details, but you can leave most of them empty if you like since you can always edit the generated composer.json later. Just make sure you add the PSR-4 autoload mapping and leave it at its default.
After initialization, Composer will have generated two directories and one file in your project root: src, vendor, and composer.json. I like to set my PSR-4 base namespace to App\ in that composer.json. Here's how mine looks:
composer.json
{ "name": "codewithkyrian/web-crawler", "type": "project", "autoload": { "psr-4": { "App\\": "src/" } }, "authors": [ { "name": "Kyrian Obikwelu", } ], "require": { "php": "^8.1", }}
Once everything is set up, let's install RoachPHP. Roach has wrappers for Laravel and Symfony, but since we're in a vanilla PHP project, we'll use the core package.
composer require roach-php/core
If you're like me and you prefer a more visually appealing way to inspect data than the standard var_dump, Symfony's var-dumper package comes to the rescue. It provides cleaner and more structured output, making debugging a breeze. I always include it in any new project of mine.
composer require symfony/var-dumper
Creating Your First Spider
Let's get our hands dirty and create our very first web-scraping spider. Before we dive into the code, let's keep things neat and organized. We'll create a new folder under src called Spiders to house our spider classes. This way, we can maintain a clean project structure.
Create a new PHP class named ImdbTopMoviesSpider inside that folder:
src/Spiders/ImdbTopMoviesSpider.php
<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use Symfony\Component\DomCrawler\Crawler;

class ImdbTopMoviesSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://www.imdb.com/chart/top/'
    ];

    /**
     * Parses the response and returns a generator of items.
     */
    public function parse(Response $response): Generator
    {
        $title = $response->filter('h1.title__text')->text();
        $description = $response->filter('div.ipc-title__description')->text();

        yield $this->item([
            'title' => $title,
            'description' => $description,
        ]);
    }
}
Let's dissect this code to understand the different parts:
- Namespace and Class Definition: We define a PHP namespace for our spider class to keep our code organized. The class extends BasicSpider, a RoachPHP base class that streamlines spider creation.
- Start URL: In the $startUrls property, we specify the URL where our spider begins its journey. RoachPHP initiates the process by sending requests to all the URLs defined in the $startUrls property. Here, it starts by navigating to IMDb's Top 250 Movies page.
- Parsing: The parse method is where the real action happens. It receives a response from the website and extracts data.
- Data Extraction: The Response object inside the parse method is built on top of Symfony's DomCrawler, which means all methods of the Crawler class are available on the Response class as well. In this example, we're using CSS selectors to extract the title and description of the page.
- Yielding Items: We yield each extracted item using $this->item(...), which sends items through the item processing pipeline one by one. This is where the power of PHP Generators comes into play. Instead of returning all items at once, yield allows us to efficiently generate and send items one at a time, optimizing memory usage.
This is an efficient way to handle data extraction, especially when dealing with large datasets. If you're new to PHP Generators, check out the documentation to learn more about this powerful feature. The extracted data can further be manipulated, processed or stored, but we'll get to that much later in this article.
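If you just want the gist of generators, here's a tiny standalone sketch (not part of the spider) showing what yield buys us: each value is produced lazily, only when the consumer asks for it.

<?php

function numbers(): Generator
{
    foreach ([1, 2, 3] as $number) {
        echo "Producing $number\n"; // runs only when the next value is requested
        yield $number;
    }
}

foreach (numbers() as $number) {
    echo "Consuming $number\n";
}

Running this prints the "Producing" and "Consuming" lines interleaved, which shows that each value is generated on demand rather than built up front in an array.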
To run our spider, create an index.php file in the root folder of your project and import the Composer autoload file to load all the necessary dependencies:
index.php
<?php

require_once 'vendor/autoload.php';
Next, we'll import our ImdbTopMoviesSpider class and the Roach class:
index.php
use App\Spiders\ImdbTopMoviesSpider;
use RoachPHP\Roach;
To collect data using our spider, we'll run ImdbTopMoviesSpider through Roach's collectSpider method, store the results in the $topMovieDetails variable, and use dump to display the collected data:
index.php
$topMovieDetails = Roach::collectSpider(ImdbTopMoviesSpider::class);

dump($topMovieDetails);
With index.php set up, open your terminal and navigate to the root folder of your project. Then, run the following command to execute the PHP script:
php index.php
This command will trigger the web scraping spider to crawl IMDb's Top 250 Movies page, extract data, and display it in your terminal... or not.
You're encountering an error, right? You should see an error resembling this:
Fatal error: Uncaught InvalidArgumentException: The current node list is empty...
This issue occurs because some websites implement security measures to verify the legitimacy of incoming requests. IMDb, for example, checks for a valid user agent in the request headers to ensure that the request is organic.
To bypass this issue, we need to modify the request headers before they are sent, ensuring that we include a user agent. One way to achieve this is by overriding the initialRequests method in our spider class. The initialRequests method allows us to create an array of Request objects, each representing a request to be sent. The Request class constructor takes several parameters, including the HTTP method, the URI, a callable parse method, and an array of Guzzle request options.
Here's an example implementation:
// Don't forget to import RoachPHP\Http\Request at the top of the spider class.

/** @return Request[] */
protected function initialRequests(): array
{
    return [
        new Request(
            'GET',
            "https://www.imdb.com/chart/top/",
            [$this, 'parse'],
            [
                'headers' => [
                    'User-Agent' => 'Mozilla/5.0 (compatible; RoachPHP/0.1.0)'
                ]
            ]
        ),
    ];
}
While manually creating requests works, I'm personally not a fan of that approach. It makes the $startUrls property redundant, so I end up constructing every request by hand. RoachPHP provides a more elegant solution using middleware. Middleware in RoachPHP allows us to intercept and modify both outgoing requests and incoming responses. For our purpose, we can use the built-in UserAgentMiddleware, which lets us define a custom user agent for all requests.
Here's how to use it:
src/Spiders/ImdbTopMoviesSpider.php
// Imports needed at the top of the spider:
// use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
// use RoachPHP\Downloader\Middleware\UserAgentMiddleware;

/**
 * The downloader middleware that should be used for runs of this spider.
 */
public array $downloaderMiddleware = [
    RequestDeduplicationMiddleware::class,
    [UserAgentMiddleware::class, ['userAgent' => 'Mozilla/5.0 (compatible; RoachPHP/0.1.0)']],
];
By adding this middleware to our spider class, we ensure that the user agent is attached to every request automatically. While we were at it, I also added another built-in middleware, RequestDeduplicationMiddleware, which helps avoid sending duplicate requests. This approach simplifies the process and keeps the code clean and organized, something I'm rather obsessive about.
Now, with these adjustments in place, you can confidently run your spider once more. Your result should resemble the following:
[2023-09-06T07:33:05.787523+00:00] roach.INFO: Run starting [] []
[2023-09-06T07:33:05.789017+00:00] roach.INFO: Dispatching request {"uri":"https://www.imdb.com/chart/top/"} []
[2023-09-06T07:33:15.811204+00:00] roach.INFO: Run statistics {"duration":"00:00:10","requests.sent":1,"requests.dropped":0,"items.scraped":0,"items.dropped":0} []
[2023-09-06T07:33:15.811463+00:00] roach.INFO: Run finished [] []

^ array:1 [
  0 => RoachPHP\ItemPipeline\Item^ {#201
    -data: array:2 [
      "title" => "IMDb Top 250 Movies"
      "description" => "IMDb Top 250 as rated by regular IMDb voters"
    ]
    -dropReason: ""
    -dropped: false
  }
]
Enhancing Our Data Extraction
We've taken our first steps into web scraping, but let's face it, what we've done so far is a bit basic and not so practical or useful. It's time to put our newfound skills to proper use and extract something more meaningful.
Let's extract the titles and URLs of the top 250 movies on that IMDb page. Clear out what we had previously in the parse method and add this:
src/Spiders/ImdbTopMoviesSpider.php
// Collect the URL and title for every movie entry on the page.
$items = $response
    ->filter('ul.ipc-metadata-list div.ipc-title > a')
    ->each(fn(Crawler $node) => [
        'url' => $node->link()->getUri(),
        'title' => $node->children('h3')->text(),
    ]);
- We're targeting the <ul class="ipc-metadata-list"> element that contains the movie list (according to the structure of that page), and looking for <div class="ipc-title"> elements that have direct <a> children. The $node variable represents each matched element (wrapped in a Crawler, so we can crawl it further).
- We extract the URL using $node->link()->getUri(), which ensures we get the complete, resolved URL. We could have used $node->attr('href') to read the href directly, but that attribute often contains relative paths, which may not be practical for our purposes (IMDb does use relative paths here!). The difference is illustrated in the small sketch after this list.
- We also extract the title by accessing the direct h3 child of the <a> tag with $node->children('h3')->text().
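To make that URL difference concrete, here's a small illustrative sketch; the example path is just a stand-in, and the actual href depends on the page being crawled.

// Inside the ->each() callback, $node wraps a single <a> element.
$relativeHref = $node->attr('href');     // raw attribute value, e.g. "/title/tt0111161/"
$absoluteUrl  = $node->link()->getUri(); // resolved against the page's base URI,
                                         // e.g. "https://www.imdb.com/title/tt0111161/"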
Now that we've collected the data for each movie, we need to yield these items efficiently. Since we're working with a Generator, we'll yield each item one by one:
src/Spiders/ImdbTopMoviesSpider.php
// Yield each extracted item
foreach ($items as $item) {
    yield $this->item($item);
}
When you run your project once again, you'll see all 250 items are elegantly displayed in the console (and yes, it's quite a substantial list! 😅).
Item Processors
If you've observed the extracted movie titles, you'll notice they come with unnecessary numbers, such as "4. The Godfather Part II." We don't need those extra digits cluttering our data, so let's tidy it up.
Now, we could do the housecleaning right there in our Spider class, but I'm a fan of the Single Responsibility Principle (SRP) and of keeping classes simple. Luckily for people like me, RoachPHP provides an elegant solution for post-processing our data after extraction, called "Item Processors." These processors work by sending the extracted data through a series of sequential steps, making it easy to perform various data cleaning and enhancement tasks.
When we call yield $this->item($item) in our Spider, the item is passed through a pipeline of processors that are invoked sequentially to process it. Post-processing our data includes things like persisting it to a database, validating items and dropping those that don't meet our criteria, cleaning the data, and adding extra metadata to an item (computed values, for example). The use cases are endless.
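As a quick illustration of the "extra metadata" use case, here's a hedged sketch of a processor that stamps each item with the time it was scraped. The class name and the scraped_at field are my own invention for this example, not something RoachPHP provides; it only relies on the same set() method we'll use below.

<?php

namespace App\Processors;

use RoachPHP\ItemPipeline\ItemInterface;
use RoachPHP\ItemPipeline\Processors\ItemProcessorInterface;

// Hypothetical example: add a "scraped_at" timestamp to every item in the pipeline.
class AddScrapedAtTimestamp implements ItemProcessorInterface
{
    public function configure(array $options): void
    {
    }

    public function processItem(ItemInterface $item): ItemInterface
    {
        $item->set('scraped_at', date(DATE_ATOM));

        return $item;
    }
}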
It's recommended to keep your item processors simple and focused on one task only, so we'll create one whose sole job is to clean up those movie titles. Create a folder called "Processors" under src to house all our processors. Inside this folder, create a new class named CleanMovieTitle.
src/Processors/CleanMovieTitle.php
<?php

namespace App\Processors;

use RoachPHP\ItemPipeline\ItemInterface;
use RoachPHP\ItemPipeline\Processors\ItemProcessorInterface;

class CleanMovieTitle implements ItemProcessorInterface
{
    public function configure(array $options): void
    {
    }

    public function processItem(ItemInterface $item): ItemInterface
    {
        $item->set('title', preg_replace('/^\d+\.\s/', '', $item->get('title')));

        return $item;
    }
}
I'm using preg_replace with a regular expression to perform the removal, but you can use any other method you like; there are many ways to kill a rat!
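For instance, if regex isn't your thing, a hedged non-regex alternative is to cut everything up to the first ". " separator, assuming the titles always follow the "N. Title" format we saw above:

// Non-regex variant: drop everything up to and including the first ". ".
$title = $item->get('title');
$prefixEnd = strpos($title, '. ');

if ($prefixEnd !== false) {
    $item->set('title', substr($title, $prefixEnd + 2));
}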
We'll, of course, still need to plug our new processor into our Spider using the following code. If the $itemProcessors property doesn't exist yet, create it. Don't forget to import the CleanMovieTitle class.
src/Spiders/ImdbTopMoviesSpider.php
/**
 * The item processors that emitted items will be sent through.
 */
public array $itemProcessors = [
    CleanMovieTitle::class,
];
With this setup, our items will now undergo data cleaning as soon as they are ready in the generator. Running our application again, you'll notice that the movie titles are now clean and free from unnecessary numbers.
Diving Deeper: Scraping for More Information
We could stop here, but I'm a perfectionist, and I want us to cover all ground. Well, it's practically impossible in one article, but let's cover as much as we can. Let's explore a more complex example—scraping additional information about the trending books on Open Library. In this section, we'll cover advanced techniques such as handling pagination, dealing with nested data, and saving the scraped data to a JSON file.
To get started, we need a new Spider to handle crawling the Open Library website. If you're up for a challenge, you can try creating this Spider yourself. Simply visit the Open Library's trending page, study its structure, and give it a shot. Extract data about each book, including its title, URL, author, and cover image. You should have something similar to this:
src/Spiders/OpenLibrarySpider.php
<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\UserAgentMiddleware;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use Symfony\Component\DomCrawler\Crawler;

class OpenLibrarySpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://openlibrary.org/trending/forever'
    ];

    /**
     * The downloader middleware that should be used for runs of this spider.
     */
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        [UserAgentMiddleware::class, ['userAgent' => 'Mozilla/5.0 (compatible; RoachPHP/0.1.0)']],
    ];

    /**
     * Parses the response and returns a generator of items.
     */
    public function parse(Response $response): Generator
    {
        $items = $response
            ->filter('ul.list-books > li')
            ->each(fn(Crawler $node) => [
                'title' => $node->filter('.resultTitle a')->text(),
                'url' => $node->filter('.resultTitle a')->link()->getUri(),
                'author' => $node->filter('.bookauthor a')->text(),
                'cover' => $node->filter('.bookcover img')->attr('src'),
            ]);

        foreach ($items as $item) {
            yield $this->item($item);
        }
    }
}
Then let's go back to our index.php file and replace the ImdbTopMoviesSpider with our new Spider.
index.php
$trendingBooks = Roach::collectSpider(OpenLibrarySpider::class);

dump($trendingBooks);
Easy peasy lemon squeezy! 🍋 ...or not.
Handling Pagination
Well, things get a bit more interesting when we're dealing with paginated data. Unlike the IMDb example, where all the information was on a single page, the Open Library's list of trending books is paginated. This means we need to navigate through multiple pages to collect all the data we want. But don't worry; it's not as complicated as it might sound, seriously 😏.
The strategy you use to tackle this depends hugely on the pagination structure of the website you're crawling. Taking a closer look at Open Library's pagination, we see that the "Next" button is conveniently the last child of the .pagination div, except on the last page, where that last child is a span containing the last page's number. This structure simplifies our task.
To handle pagination, append the following code to the end of your parse method:
src/Spiders/OpenLibrarySpider.php
// Try to get the next page URL and yield a request for it if it exists.
// (Requires `use Exception;` at the top of the file, or catch \Exception.)
try {
    $nextPageUrl = $response->filter('div.pager div.pagination > :last-child')->link()->getUri();

    yield $this->request('GET', $nextPageUrl);
} catch (Exception) {
    // The last child wasn't a link, so we're on the last page: nothing to do.
}
Here's what's happening: we attempt to get the last child of the .pagination div, regardless of what type of element it is. If it's an anchor (a link), we yield a request for the next page. If it isn't, the link() call throws an exception, which we catch and simply ignore.
With this approach, we can easily traverse through all the pages, and our items will be processed as soon as they're ready for each page. Voilà! Pagination, sorted. 📖📖📖
Crawling even deeper 🪲
Up to this point, we've been working with the information available in the book listings. But what if we want more detailed information about each book, like the description, number of pages, and publication date? Not a problem at all! We can easily fetch this additional data by diving deeper into the individual book pages. Here's how you can do it:
Modify the foreach loop over the items to yield a request for each book page:
src/Spiders/OpenLibrarySpider.php
foreach ($items as $item) {
    yield $this->request('GET', $item['url'], 'parseBookPage', ['item' => $item]);
}
In this loop, we're creating a request for each book page and passing along the data we extracted earlier. The 'parseBookPage' argument indicates which method should handle the response, and the 'item' data is passed as a request option.
Then, create the parseBookPage method to handle parsing individual book pages:
src/Spiders/OpenLibrarySpider.php
/**
 * Parses the book page and returns a generator of items.
 */
public function parseBookPage(Response $response): Generator
{
    $item = $response->getRequest()->getOptions()['item'];

    $descriptionArray = $response
        ->filter('div.book-description-content p')
        ->each(fn(Crawler $node) => $node->text());

    $item['description'] = implode("\n", $descriptionArray);

    $item['pages'] = $response->filter('span[itemprop="numberOfPages"]')->innerText();
    $item['publishDate'] = $response->filter('span[itemprop="datePublished"]')->innerText();

    yield $this->item($item);
}
In the parseBookPage method, we start by getting the item data from the request's options. Then, we crawl through the response to extract the description, number of pages, and publication date of the book. For the description, because it's often split into multiple paragraphs, we use implode to join them into one string with line breaks (\n) as separators. I also used innerText() instead of the regular text() because the 'pages' and 'publishDate' information is missing on some book pages, and the former doesn't throw an exception in that case.
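If you'd rather make the missing fields explicit instead of relying on that behaviour, a small defensive variant is to check whether the node exists before reading it; the Crawler is countable, so count() is available:

// Defensive sketch: store null when the element isn't present on the page.
$pagesNode = $response->filter('span[itemprop="numberOfPages"]');
$item['pages'] = $pagesNode->count() > 0 ? $pagesNode->text() : null;

$dateNode = $response->filter('span[itemprop="datePublished"]');
$item['publishDate'] = $dateNode->count() > 0 ? $dateNode->text() : null;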
Finally, set your spider loose on the web 🕷️ by running your project once more. FYI, be prepared for a bit of a wait, or better still, grab a cup of coffee, relax, and let your Spider do its magic! ☕🕰️🕷️
Storage and Persistence
Crawling and merely printing data to the console is just the beginning. In real-life scenarios, we often want to store that precious data for future use. There are a plethora of options on how to save your data - database, CSV, JSON, etc. For demonstration purposes, let's save it to a JSON file.
First, create a new folder in your project's root directory to house your output JSON files. You can name it something like 'output'. Next, replace the dump line in your index.php file with the following code:
index.php
$trendingBooks = array_map(fn($item) => $item->all(), $trendingBooks);

file_put_contents(
    './output/trending-books.json',
    json_encode($trendingBooks, JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT)
);
The array_map part is necessary because the collectSpider method returns an array of ItemInterface objects, and json_encode doesn't know how to handle such a custom class. An easy, cost-effective fix is to map over the array and call the all() method on each item to get the underlying array before passing it all to json_encode.
The JSON_UNESCAPED_SLASHES flag prevents slashes in URLs from being escaped, and JSON_PRETTY_PRINT formats the JSON for readability.
Now, when you run your project, it will generate an output JSON file in the 'output' folder. Open the JSON file, and you'll see your scraped data beautifully formatted and ready for further analysis. Your data is now preserved and can be used for various purposes – what a satisfying feeling! 📂📊🎉
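And whenever you want to work with that data again, reading it back in is a one-liner (the path matches the one we wrote to above):

// Load the saved JSON back into an array of associative arrays for further analysis.
$trendingBooks = json_decode(file_get_contents('./output/trending-books.json'), true);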
Best Practices and Precautions
We've journeyed through the exciting world of web scraping, but before you embark on your own scraping adventures, it's essential to understand best practices, precautions, and the responsible use of this powerful tool. Let's delve into some key considerations to keep in mind:
- Respect Website Policies and Robots.txt: First and foremost, always check whether the website you're scraping has a robots.txt file that defines which parts of the site may be crawled. Respect the rules laid out there. Even if scraping is not explicitly prohibited, always scrape responsibly and avoid causing unnecessary traffic or burden to the website.
- Rate Limiting and Throttling: Many websites employ rate limiting to prevent excessive requests from a single IP address. Implement rate limiting in your scraping scripts to avoid being blocked. Spread your requests over time to mimic human behavior and avoid overwhelming the server (see the sketch after this list).
- User Agents: Use a User-Agent header in your HTTP requests to identify your scraping bot. However, avoid impersonating well-known browsers or tools, as this can be misleading and might violate website terms.
- IP Rotation: If you encounter IP blocking or restrictions, consider using a proxy or rotating your IP addresses to avoid detection. But remember, using proxies can be a complex task and might have associated costs.
- Monitoring and Maintenance: Websites' structures can change over time, causing your scrapers to break. Implement regular monitoring to detect any issues promptly and update your scraping scripts accordingly.
- Data Privacy: Be mindful of the data you're collecting. Respect privacy laws and avoid scraping personal or sensitive information without consent.
- Error Handling: Prepare your scraper to handle various errors gracefully. This includes HTTP errors, timeouts, and changes in website structure.
- Legal Considerations: Understand the legal implications of web scraping in your jurisdiction. Consult legal experts if you have doubts about the legality of scraping a specific website or the data you intend to collect.
- Ethical Use: Use web scraping ethically and responsibly. Scraping for competitive analysis, research, or personal projects is generally acceptable, but avoid scraping for malicious purposes, such as spamming, fraud, or misinformation.
- Documentation and Attribution: Document your scraping process, including the websites you scrape, the data you collect, and your scraping methodology. If you use scraped data publicly, provide proper attribution to the source.
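On the rate limiting point, RoachPHP has spider-level knobs for this. As a hedged sketch, assuming the concurrency and requestDelay properties from RoachPHP's spider configuration (double-check the names against the version you're using), you could slow a spider down like this:

// Inside a BasicSpider subclass: send one request at a time,
// and wait two seconds between consecutive requests.
public int $concurrency = 1;

public int $requestDelay = 2;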
Web scraping is a powerful tool, but it should always be used with care and responsibility. 🌐🕸️🧹
Conclusion
Finally, we've reached the end of our web scraping journey today. Thank you for joining me on this exploration, and I hope you've learnt enough to embark on your own web scraping adventures. Check out the Roach documentation for more details on RoachPHP, as well as the documentation for DomCrawler and GuzzleHttp, for more resources on the topic.
You can find the complete code for this article on GitHub. Feel free to explore, experiment, and adapt it to your own projects. For any questions, feedback, or if you simply want to connect with a fellow enthusiastic PHP developer, please don't hesitate to reach out to me at [email protected]. Happy coding to all you passionate PHP devs out there, and may your code always run smoothly! 🚀🐘