Hello.

While looking for a way to run my web scraper (code below) repeatedly and periodically, so that ONLY THE WORK ON THE $already_crawled VARIABLE IS REPEATED, I discovered a PHP library called Crunz (https://github.com/lavary/crunz) that lets you schedule recurring tasks directly in PHP.
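
From what I understand of the Crunz README, you drop task files (named *Tasks.php by default) into a tasks/ directory and each file returns a Crunz\Schedule instance. So I imagine the scheduling side would look roughly like this, where recrawl.php is a hypothetical script of mine and the hourly frequency and the paths are only placeholders:

Code:
<?php
// tasks/recrawlTasks.php -- picked up by "vendor/bin/crunz schedule:run"
use Crunz\Schedule;

$schedule = new Schedule();

// Launch my (hypothetical) re-crawl script with the PHP CLI; the path is a placeholder.
$task = $schedule->run('php /path/to/project/recrawl.php');
$task->hourly()
     ->description('Re-download the URLs already crawled');

return $schedule;

The README also says the scheduler itself is triggered by a single cron entry, something like * * * * * cd /path/to/project && vendor/bin/crunz schedule:run.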

The $already_crawled variable is simply an array holding the URLs that have already been crawled.
So what I want is to RE-CRAWL (re-download) the URLs that were already crawled, i.e. the ones stored in the $already_crawled array.
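
To make that concrete, my rough idea is that the scheduled task would be a small script that reloads the list of already crawled URLs from disk and downloads each one again, reusing the get_details() function from the crawler below. Here is a minimal sketch; the file names already_crawled.json, recrawl.php and crawler_functions.php are only names I made up for the example:

Code:
<?php
// recrawl.php -- hypothetical script launched by the Crunz task.
// It assumes the crawler saved $already_crawled as JSON at the end of its run.
require __DIR__ . '/crawler_functions.php'; // assumed file containing get_details()

$file = __DIR__ . '/already_crawled.json';
if (!is_file($file)) {
    exit("No crawl list found, nothing to re-crawl.\n");
}

// Reload the array of previously crawled URLs.
$already_crawled = json_decode(file_get_contents($file), true) ?: array();

// Re-download and re-process every URL that was crawled before.
foreach ($already_crawled as $url) {
    echo get_details($url) . "\n";
}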

So my question is: how do I hook this library into my web crawler so that it re-downloads and re-processes the URLs that have already been downloaded and processed?

Please help me.

Code:
// This is our starting point. Change this to whatever URL you want.
$start = "";
// Our 2 global arrays containing our links to be crawled.
$already_crawled = array();
$crawling = array();
function get_details($url) {
  // The array that we pass to stream_context_create() to modify our User Agent.
  $options = array('http'=>array('method'=>"GET", 'header'=>"User-Agent: chegBot/0.1\r\n"));
  // Create the stream context.
  $context = stream_context_create($options);
  // Create a new instance of PHP's DOMDocument class.
  $doc = new DOMDocument();
  // Use file_get_contents() to download the page, pass the output of file_get_contents()
  // to PHP's DOMDocument class.
  @$doc->loadHTML(@file_get_contents($url, false, $context));
  // Create an array of all of the title tags.
  $title = $doc->getElementsByTagName("title");
  // There should only be one <title> on each page, so our array should have only 1 element.
  $title = $title->item(0)->nodeValue;
  // Give $description and $keywords no value initially. We do this to prevent errors.
  $description = "";
  $keywords = "";
  // Create an array of all of the pages <meta> tags. There will probably be lots of these.
  $metas = $doc->getElementsByTagName("meta");
  // Loop through all of the <meta> tags we find.
  for ($i = 0; $i < $metas->length; $i++) {
    $meta = $metas->item($i);
    // Get the description and the keywords.
    if (strtolower($meta->getAttribute("name")) == "description")
      $description = $meta->getAttribute("content");
    if (strtolower($meta->getAttribute("name")) == "keywords")
      $keywords = $meta->getAttribute("content");
  }
  // Return our JSON string containing the title, description, keywords and URL.
  return '{ "Title": "'.str_replace("\n", "", $title).'", "Description": "'.str_replace("\n", "", $description).'", "Keywords": "'.str_replace("\n", "", $keywords).'", "URL": "'.$url.'"},';
}
function follow_links($url) {
  // Give our function access to our crawl arrays.
  global $already_crawled;
  global $crawling;
  // The array that we pass to stream_context_create() to modify our User Agent.
  $options = array('http'=>array('method'=>"GET", 'header'=>"User-Agent: chegBot/0.1\r\n"));
  // Create the stream context.
  $context = stream_context_create($options);
  // Create a new instance of PHP's DOMDocument class.
  $doc = new DOMDocument();
  // Use file_get_contents() to download the page, pass the output of file_get_contents()
  // to PHP's DOMDocument class.
  @$doc->loadHTML(@file_get_contents($url, false, $context));
  // Create an array of all of the links we find on the page.
  $linklist = $doc->getElementsByTagName("a");
  // Loop through all of the links we find.
  foreach ($linklist as $link) {
    $l =  $link->getAttribute("href");
    // Process all of the links we find. This is covered in part 2 and part 3 of the video series.
    if (substr($l, 0, 1) == "/" && substr($l, 0, 2) != "//") {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].$l;
    } else if (substr($l, 0, 2) == "//") {
      $l = parse_url($url)["scheme"].":".$l;
    } else if (substr($l, 0, 2) == "./") {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].dirname(parse_url($url)["path"]).substr($l, 1);
    } else if (substr($l, 0, 1) == "#") {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].parse_url($url)["path"].$l;
    } else if (substr($l, 0, 3) == "../") {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"]."/".$l;
    } else if (substr($l, 0, 11) == "javascript:") {
      continue;
    } else if (substr($l, 0, 5) != "https" && substr($l, 0, 4) != "http") {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"]."/".$l;
    }
    // If the link isn't already in our crawl array add it, otherwise ignore it.
    if (!in_array($l, $already_crawled)) {
        $already_crawled[] = $l;
        $crawling[] = $l;
        // Output the page title, descriptions, keywords and URL. This output is
        // piped off to an external file using the command line.
        echo get_details($l)."\n";
    }
  }
  // Remove an item from the array after we have crawled it.
  // This prevents infinitely crawling the same page.
  array_shift($crawling);
  // Follow each link in the crawling array.
  foreach ($crawling as $site) {
    follow_links($site);
  }
}
// Begin the crawling process by crawling the starting link first.
follow_links($start);
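
For that to work, I suppose the crawler itself would also have to save $already_crawled somewhere at the end of a run so the scheduled task can find it, for instance as JSON (again, the file name is only an example):

Code:
// My assumption, not part of the original tutorial code: persist the crawl list
// so the scheduled re-crawl script can reload it later.
file_put_contents(__DIR__ . '/already_crawled.json', json_encode($already_crawled));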