How to Do Web Scraping with PHP

From WikiHTP

In this tutorial, we will learn two ways to do web scraping with PHP: using only PHP's built-in functions, and using PHP together with cURL. We will also see what web scraping can be used for.

Web scraping

For those who are not familiar with the term, web scraping is a technique for extracting information from websites. With a few lines of code, you can retrieve the source code of a web page as the browser receives it and then, among other things, save it in a database, display it unchanged under your own URL, or extract only the important information from it.

Web scraping with PHP

First, we create the index.php file and write the following code:

<?php
  $html = file_get_contents('https://www.wikihtp.com/'); // Read the contents of the URL into a string (false on failure)
  echo $html;
?>

When we run this code, we see that it outputs a copy of the web page https://www.wikihtp.com/. Let's review the PHP code to understand what happened.

The result of the file_get_contents function is stored in the $html variable. file_get_contents reads the entire contents of a file into a string; in this case, the file is the web page https://www.wikihtp.com/.

Then we show this string in our document using echo.
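The example above assumes the download always succeeds. As a minimal sketch of a more defensive variant, the hypothetical helper fetch_page below (the function name, the timeout value, and the user agent string are illustrative assumptions, not part of the original tutorial) passes a stream context to file_get_contents and returns null when the page cannot be read. Fetching remote URLs this way assumes allow_url_fopen is enabled in the PHP configuration.

```php
<?php
// Hypothetical helper, not part of the original tutorial.
// Returns the page source as a string, or null if it could not be read.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // Options for the HTTP(S) stream wrapper; both values are illustrative.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,      // give up after this many seconds
            'user_agent' => 'ExampleScraper/1.0', // assumed, made-up user agent
        ],
    ]);

    // @ hides the warning PHP emits on failure; the return value is checked instead.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

Calling fetch_page('https://www.wikihtp.com/') should then give either the page source or null, so the calling code can show an error message instead of echoing an empty result.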

Web scraping with cURL and PHP

Now let's see how to do web scraping with the cURL library. To do this, we create a curl.php file inside our web scraping folder and write the following code:

    <?php
    // Define a helper function that downloads a URL with cURL
    function curl($url) {
        $ch = curl_init($url); // Initialize a cURL session for the URL
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the response as a string instead of printing it
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Skip verification of the server's SSL certificate (insecure; only for local testing)
        $info = curl_exec($ch); // Execute the request and store the response (false on failure) in $info
        curl_close($ch); // Close the cURL session
        return $info; // Return the downloaded content
    }

    $website = curl("https://www.wikihtp.com/"); // Scrape https://www.wikihtp.com/ and store the result in $website
    echo $website;


The comments in the code explain each line. To summarize: first we created a function called curl (the same steps could also be written directly, without a function). Inside it, we started a session with curl_init and then set a couple of options. CURLOPT_RETURNTRANSFER makes curl_exec return the result as a string so that we can use it. Setting CURLOPT_SSL_VERIFYPEER to false makes cURL skip verification of the server's certificate, which lets the request succeed on setups where that verification fails. cURL supports HTTPS either way, and because skipping verification is insecure, it is best kept to local testing.

Finally, we call the curl function we created and display the scraped web page.
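The curl function above returns whatever curl_exec gives back, including false when the request fails. The following is a hedged sketch of a variant that reports errors and the HTTP status code. The name fetch_with_curl, the timeout, and the user agent string are hypothetical choices made for illustration. Unlike the tutorial's function, this sketch leaves certificate verification at cURL's default (enabled), which assumes a working CA bundle on the machine.

```php
<?php
// Hypothetical helper, not part of the original tutorial.
// Returns an array describing the outcome instead of a bare string.
function fetch_with_curl(string $url, int $timeoutSeconds = 10): array
{
    $ch = curl_init($url);
    if ($ch === false) {
        return ['ok' => false, 'status' => 0, 'body' => null, 'error' => 'curl_init failed'];
    }

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);            // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);            // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);        // overall time limit in seconds
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleScraper/1.0'); // assumed, made-up user agent

    $body = curl_exec($ch); // string on success, false on failure

    // The handle is released automatically when $ch goes out of scope.
    return [
        'ok'     => $body !== false,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no HTTP response arrived
        'body'   => $body === false ? null : $body,
        'error'  => $body === false ? curl_error($ch) : null,
    ];
}
```

A caller could check $result['ok'] and $result['status'] before echoing $result['body'], and print $result['error'] otherwise.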

Conclusion

In conclusion, there is more than one way to do web scraping with PHP. This tutorial is only an introduction to the topic; there are many more ways to process the scraped information.
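As one illustration of processing scraped information, the sketch below uses PHP's DOMDocument class to pull the page title and link targets out of an HTML string. The sample markup and the function name extract_title_and_links are made up for this example; in practice the string would come from one of the download methods described in this tutorial.

```php
<?php
// Sample markup standing in for a downloaded page (made up for this example).
$html = '<html><head><title>Example page</title></head>'
      . '<body><a href="/first">First</a> <a href="/second">Second</a></body></html>';

// Hypothetical helper: returns the page title and the href of every link.
function extract_title_and_links(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often invalid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }

    return [
        'title' => $titleNode !== null ? trim($titleNode->textContent) : null,
        'links' => $links,
    ];
}

$data = extract_title_and_links($html);
echo $data['title'], "\n";               // prints "Example page"
echo implode("\n", $data['links']), "\n"; // prints "/first" and "/second" on separate lines
```

This assumes the dom extension is available, which is the case in most PHP installations.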

About This Tutorial

This page was last edited on 6 October 2020, at 00:18.