Hello.

I have a PHP web scraper based on DOMElement, but it doesn't respect robots.txt. What I'd like is for my code to honor the robots.txt policy and skip any link that is covered by a Disallow rule for its user agent.

Please help me modify this web crawler so that it checks each site's robots.txt file and doesn't download links or directories that the Disallow directives for its User-agent forbid robots from crawling.
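
From what I understand of the protocol, this means downloading /robots.txt from each host, finding the group whose User-agent line matches my bot (or *), and skipping any URL whose path begins with one of that group's Disallow values. Here is a rough sketch of the check I have in mind (is_url_allowed is just a name I made up, and this only does naive prefix matching; it ignores Allow rules, wildcards and Crawl-delay):

Code:
<?php
// Sketch: fetch and parse robots.txt, then test a URL against the
// Disallow rules that apply to our user agent. Naive prefix matching
// only: Allow rules, wildcards and Crawl-delay are ignored.
function is_url_allowed($url, $agent = "howBot") {
  $parts = parse_url($url);
  $robots_url = $parts["scheme"]."://".$parts["host"]."/robots.txt";
  $content = @file_get_contents($robots_url);
  // No readable robots.txt: assume everything is allowed.
  if ($content === false) return true;
  $disallowed = array();
  $applies = false;
  foreach (preg_split('/\r\n|\r|\n/', $content) as $line) {
    // Strip comments and surrounding whitespace.
    $line = trim(preg_replace('/#.*/', '', $line));
    if ($line === "" || strpos($line, ":") === false) continue;
    list($field, $value) = array_map('trim', explode(":", $line, 2));
    $field = strtolower($field);
    if ($field == "user-agent") {
      // A group applies to us if it names our agent or the wildcard *.
      $applies = ($value == "*" || stripos($agent, $value) !== false);
    } else if ($field == "disallow" && $applies && $value !== "") {
      $disallowed[] = $value;
    }
  }
  $path = isset($parts["path"]) ? $parts["path"] : "/";
  // The URL is blocked if its path starts with any collected Disallow path.
  foreach ($disallowed as $rule) {
    if (strpos($path, $rule) === 0) return false;
  }
  return true;
}

I suppose the robots.txt content should also be cached per host instead of being re-downloaded for every single link. Here is my current crawler: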

Code:
// This is our starting point. Change this to whatever URL you want.
$start = "";
// Our 2 global arrays containing our links to be crawled.
$already_crawled = array();
$crawling = array();
function get_details($url) {
  // The array that we pass to stream_context_create() to modify our User Agent.
  // Note: the stream context option is 'header' (singular), not 'headers'.
  $options = array('http'=>array('method'=>"GET", 'header'=>"User-Agent: howBot/0.1\r\n"));
  // Create the stream context.
  $context = stream_context_create($options);
  // Create a new instance of PHP's DOMDocument class.
  $doc = new DOMDocument();
  // Use file_get_contents() to download the page, pass the output of file_get_contents()
  // to PHP's DOMDocument class.
  @$doc->loadHTML(@file_get_contents($url, false, $context));
  // Get the list of all of the title tags.
  $title = $doc->getElementsByTagName("title");
  // There should only be one <title> per page; guard against pages without one.
  $title = ($title->item(0) !== null) ? $title->item(0)->nodeValue : "";
  // Give $description and $keywords no value initially. We do this to prevent errors.
  $description = "";
  $keywords = "";
  // Create an array of all of the pages <meta> tags. There will probably be lots of these.
  $metas = $doc->getElementsByTagName("meta");
  // Loop through all of the <meta> tags we find.
  for ($i = 0; $i < $metas->length; $i++) {
    $meta = $metas->item($i);
    // Get the description and the keywords.
    if (strtolower($meta->getAttribute("name")) == "description")
      $description = $meta->getAttribute("content");
    if (strtolower($meta->getAttribute("name")) == "keywords")
      $keywords = $meta->getAttribute("content");
  }
  // Return our JSON string containing the title, description, keywords and URL.
  // json_encode() handles the escaping of quotes and newlines for us.
  return json_encode(array("Title" => $title, "Description" => $description, "Keywords" => $keywords, "URL" => $url)).",";
}
function follow_links($url) {
  // Give our function access to our crawl arrays.
  global $already_crawled;
  global $crawling;
  // The array that we pass to stream_context_create() to modify our User Agent.
  // Note: the stream context option is 'header' (singular), not 'headers'.
  $options = array('http'=>array('method'=>"GET", 'header'=>"User-Agent: howBot/0.1\r\n"));
  // Create the stream context.
  $context = stream_context_create($options);
  // Create a new instance of PHP's DOMDocument class.
  $doc = new DOMDocument();
  // Use file_get_contents() to download the page, pass the output of file_get_contents()
  // to PHP's DOMDocument class.
  @$doc->loadHTML(@file_get_contents($url, false, $context));
  // Create an array of all of the links we find on the page.
  $linklist = $doc->getElementsByTagName("a");
  // Loop through all of the links we find.
  foreach ($linklist as $link) {
    $l =  $link->getAttribute("href");
    // Process all of the links we find, converting each one to an absolute URL.
    if (substr($l, 0, 1) == "/" && substr($l, 0, 2) != "//") {
      // Root-relative link: prepend the scheme and host.
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].$l;
    } else if (substr($l, 0, 2) == "//") {
      // Protocol-relative link: prepend the scheme only.
      $l = parse_url($url)["scheme"].":".$l;
    } else if (substr($l, 0, 2) == "./") {
      // Link relative to the current directory.
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].dirname(parse_url($url)["path"]).substr($l, 1);
    } else if (substr($l, 0, 1) == "#") {
      // Fragment link: append it to the current page's URL.
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].parse_url($url)["path"].$l;
    } else if (substr($l, 0, 3) == "../") {
      // Parent-relative link: resolve it against the host root.
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"]."/".$l;
    } else if (substr($l, 0, 11) == "javascript:") {
      // Skip javascript: pseudo-links entirely.
      continue;
    } else if (substr($l, 0, 5) != "https" && substr($l, 0, 4) != "http") {
      // Any other relative link: resolve it against the host root.
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"]."/".$l;
    }
    // If the link isn't already in our crawl array add it, otherwise ignore it.
    if (!in_array($l, $already_crawled)) {
        $already_crawled[] = $l;
        $crawling[] = $l;
        // Output the page title, descriptions, keywords and URL. This output is
        // piped off to an external file using the command line.
        echo get_details($l)."\n";
    }
  }
  // Remove an item from the array after we have crawled it.
  // This prevents infinitely crawling the same page.
  array_shift($crawling);
  // Follow each link in the crawling array.
  foreach ($crawling as $site) {
    follow_links($site);
  }
}
// Begin the crawling process by crawling the starting link first.
follow_links($start);
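
If a helper like the is_url_allowed() sketched above is the right approach, I imagine it would be called in follow_links() right before a new link is queued, something like this:

Code:
    // Only queue the link if robots.txt does not disallow it for our agent.
    if (!in_array($l, $already_crawled) && is_url_allowed($l, "howBot")) {
        $already_crawled[] = $l;
        $crawling[] = $l;
        echo get_details($l)."\n";
    }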
Thanks in advance.