Improving my web crawler to crawl more files

**cheboy** · 10/11/2019, 21h38

Good evening, everyone.

Sorry to bother you. I have a web scraper (a web robot) that currently retrieves only text: the title (<title>), the description (<meta name="description">) and the URL.

Code:

function get_details($url) {
 
	// The array that we pass to stream_context_create() to modify our User Agent.
	$options = array('http'=>array('method'=>"GET", 'header'=>"User-Agent: chegBot/0.1\r\n"));
	// Create the stream context.
	$context = stream_context_create($options);
	// Create a new instance of PHP's DOMDocument class.
	$doc = new DOMDocument();
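	// Use file_get_contents() to download the page and stop early if the download fails
	// (recent PHP versions refuse to parse an empty string); otherwise pass the HTML
	// to PHP's DOMDocument class.
	$html = @file_get_contents($url, false, $context);
	if ($html === false || $html === "") return "";
	@$doc->loadHTML($html);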
 
	// Get the list (a DOMNodeList) of all of the title tags.
	$titles = $doc->getElementsByTagName("title");
	// There should only be one <title> on each page; fall back to an empty string if there is none.
	$title = $titles->length > 0 ? $titles->item(0)->nodeValue : "";
	// Give $description and $keywords no value initially. We do this to prevent errors.
	$description = "";
	$keywords = "";
	// Get the list of all of the page's <meta> tags. There will probably be lots of these.
	$metas = $doc->getElementsByTagName("meta");
	// Loop through all of the <meta> tags we find.
	for ($i = 0; $i < $metas->length; $i++) {
		$meta = $metas->item($i);
		// Get the description and the keywords.
		if (strtolower($meta->getAttribute("name")) == "description")
			$description = $meta->getAttribute("content");
		if (strtolower($meta->getAttribute("name")) == "keywords")
			$keywords = $meta->getAttribute("content");
 
	}
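	// Return our JSON string containing the title, description, keywords and URL.
	// json_encode() escapes quotes and backslashes that would otherwise break the JSON.
	return json_encode(array("Title" => str_replace("\n", "", $title), "Description" => str_replace("\n", "", $description), "Keywords" => str_replace("\n", "", $keywords), "URL" => $url), JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_INVALID_UTF8_SUBSTITUTE).',';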
 
}

1 - I would like your help modifying it so that it also retrieves, from the document body (the <body> tag), every file it references, whatever the extension (.docx, .pdf, .jpeg, .png, .svg, .mp3, .mp4, etc.; in short, any text, video or image file found inside <body>). All of these files should end up in a PHP variable: $file.
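To make item 1 concrete, here is a minimal sketch of the kind of result I am after. It is only an illustration under assumptions: the helper name collect_body_files(), the sample HTML and the extension list are hypothetical and not part of my script. It uses DOMXPath to pick every element inside <body> that has an href or src attribute, then keeps the links whose path ends in one of the wanted extensions.

```php
<?php
// Hypothetical helper (not in the original script): returns the href/src
// values found inside <body> whose path ends in one of $extensions.
function collect_body_files(DOMDocument $doc, array $extensions) {
    $xpath = new DOMXPath($doc);
    $files = array();
    // Every element below <body> that carries an href or a src attribute.
    foreach ($xpath->query('//body//*[@href or @src]') as $node) {
        foreach (array('href', 'src') as $attr) {
            if (!$node->hasAttribute($attr)) {
                continue;
            }
            $link = trim($node->getAttribute($attr));
            // Inspect only the path so a query string does not hide the extension.
            $path = parse_url($link, PHP_URL_PATH);
            if (!is_string($path)) {
                continue;
            }
            $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
            if (in_array($ext, $extensions, true)) {
                $files[] = $link;
            }
        }
    }
    // Drop duplicates while keeping document order.
    return array_values(array_unique($files));
}

// Made-up sample page, used only to show the call.
$html = '<html><head><link rel="icon" href="favicon.png"></head><body>'
      . '<a href="/docs/report.pdf">Report</a>'
      . '<img src="images/logo.png?v=2">'
      . '<a href="/about">About</a>'
      . '<video src="clip.mp4"></video>'
      . '</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ hides parser warnings about HTML5 tags

// Assumed extension list; extend it as needed.
$file = collect_body_files($doc, array('docx', 'pdf', 'jpeg', 'jpg', 'png', 'svg', 'mp3', 'mp4'));
```

In get_details() the same helper could presumably be called on the existing $doc right after loadHTML(). The links are returned exactly as written in the page, so relative ones would still need to be resolved against $url before downloading.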

2 - Please also help me collect, in a variable $icon, the href of every icon declared in a <link> tag whose rel attribute has the value icon, for example <link rel="icon" type="image/png" href="favicon.png" />.
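For item 2, a minimal sketch along the same lines. Again, the helper name collect_icons() and the sample HTML are hypothetical. It assumes that rel may hold several space-separated values (for example "shortcut icon") and keeps a <link> as soon as one of them is icon.

```php
<?php
// Hypothetical helper (not in the original script): returns the href of
// every <link> whose rel attribute contains the token "icon".
function collect_icons(DOMDocument $doc) {
    $icons = array();
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel is a space-separated list of tokens; compare case-insensitively.
        $rel = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        if (in_array('icon', $rel, true) && $link->hasAttribute('href')) {
            $icons[] = $link->getAttribute('href');
        }
    }
    return $icons;
}

// Made-up sample page, used only to show the call.
$html = '<html><head>'
      . '<link rel="icon" type="image/png" href="favicon.png" />'
      . '<link rel="shortcut icon" href="/favicon.ico">'
      . '<link rel="stylesheet" href="style.css">'
      . '</head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);

$icon = collect_icons($doc);
```

Matching on the exact token icon deliberately skips values such as apple-touch-icon; whether those should count is a choice I have not made yet.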

So, to be clearer: first, I want to retrieve every link found in the href and src attributes inside the page's <body> tag; second, the href of each <link> whose rel attribute has the value icon.

Please help me.

Thanks in advance.
