[JSP/SERVLET] Search engine for PDF files

**crewstyle** · 19/04/2007, 17h26

Hello everyone,

I am developing a J2EE web application that offers a simple search-engine feature over PDF files.
The form is a conventional one, like Google's for example (a text field and a submit button; I will handle the options on my own once this problem is solved), and the PDF files are listed dynamically.

Searching the web, I read that the input stream of the PDF files has to be read byte by byte... but I could not find out how!
Could you help me, guide me, or put me back on the right track, please?

Thanks in advance.
Achraf

**crewstyle** · 24/04/2007, 11h47

Hello everyone,
I have kept researching since I joined you here, and I ended up finding a library that most of you have adopted: iText.

It can convert several formats to PDF, concatenate the contents of two PDF files, and so on. However, it does not do what I am after, namely searching inside a PDF.

So I would like you to point me to what I should do, please, because I am completely stuck.

Which class should I base my work on to read a PDF and compare its words with a search term (entered in an input field and retrieved with request.getParameter())?

**deadstar62** · 24/04/2007, 11h52

I have never tackled this problem myself, but if you can convert a PDF file to .txt, for example, you can then scan through it to run your search. Just an idea :p
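
A minimal sketch of that idea, assuming the PDF text has already been extracted into a plain `String` by some other tool. The class `TextSearch` and the method `findLines` are hypothetical names used only for illustration; the match is a simple case-insensitive substring test, line by line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TextSearch {

    /** Returns the 1-based numbers of the lines that contain the term, ignoring case. */
    static List<Integer> findLines(String text, String term) {
        List<Integer> hits = new ArrayList<Integer>();
        String needle = term.toLowerCase(Locale.ROOT);
        String[] lines = text.split("\\r?\\n");
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String extracted = "First line\nA line about Servlets\nLast line";
        System.out.println(findLines(extracted, "servlet")); // prints [2]
    }
}
```

This only works as well as the text extraction itself: if the converted text is garbled, the search will miss matches.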

**crewstyle** · 24/04/2007, 11h56

Originally posted by deadstar62

I have never tackled this problem myself, but if you can convert a PDF file to .txt, for example, you can then scan through it to run your search. Just an idea :p

Thanks for helping

If it were that simple, I admit I would have gone with that idea a long time ago ^^

But here is a counterexample:
Attachment: a PDF file and the same file converted to TXT

**crewstyle** · 26/04/2007, 14h53

Hello,

I have kept trying to develop this search engine for PDF files... but nothing works.
Between PDFBox, iText, JPedal and bfopdf, I no longer know where to turn.
So if anyone has a processing example, or knows how to go about it, could they please share it? I am completely lost.

Thanks in advance for your support

PS: here is what I could not get to work:

Code:

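	// A StringWriter collects the extracted text (a null Writer would fail in writeText)
	Writer textExtrait = new StringWriter();
	File fichierPDF = new File(this.getServletContext().getRealPath("1.pdf"));

	PDDocument lucenePDF = PDDocument.load(fichierPDF);
	PDFTextStripper stripper = new PDFTextStripper();
	// Write the text of the whole document into the writer
	stripper.writeText(lucenePDF, textExtrait);

	// Print the file name if the extracted text contains the search term
	if( textExtrait.toString().contains(recherche) ) {
		writer.println(fichierPDF.getName());
	}

	lucenePDF.close();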

Error:

Code:

ATTENTION: "Servlet.service()" pour la servlet SearchPdf a généré une exception
org.pdfbox.exceptions.WrappedIOException
	at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
	at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
	at servlet.SearchPdf.service(SearchPdf.java:71)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
	at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
	at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
	at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
	at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
	at java.lang.Thread.run(Thread.java:619)

SearchPdf.java:71 (the servlet I wrote, using PDFBox):

Code:

PDFTextStripper stripper = new PDFTextStripper();

**crewstyle** · 26/04/2007, 15h48

New error (on top of the previous one!)

Code:

java.lang.Throwable: Warning: You did not close the PDF Document
	at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
	at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
	at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
	at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160)

I am completely stuck... if anyone would be kind enough to enlighten me...
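
The warning above is raised from a finalizer, which suggests that the document was loaded but `close()` was never reached, most likely because the `PDFTextStripper` constructor threw first. A common way to avoid this is to close the document in a `finally` block. The sketch below shows the pattern with a hypothetical `FakeDocument` stand-in (not a PDFBox class) so that it runs without any library; with PDFBox, the same shape would wrap the `PDDocument.load(...)` and `close()` calls.

```java
import java.io.Closeable;

final class CloseInFinally {

    /** Hypothetical stand-in for a document that must be closed after use. */
    static final class FakeDocument implements Closeable {
        boolean closed = false;
        public void close() { closed = true; }
    }

    /** Runs a step that may fail, and closes the document either way. */
    static FakeDocument process(boolean failDuringExtraction) {
        FakeDocument doc = new FakeDocument();
        try {
            if (failDuringExtraction) {
                throw new IllegalStateException("simulated extraction failure");
            }
        } catch (IllegalStateException e) {
            // Swallowed here only so the sketch can return the document for inspection
        } finally {
            // Runs whether or not the step above threw
            doc.close();
        }
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(process(true).closed);  // prints true
        System.out.println(process(false).closed); // prints true
    }
}
```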

**crewstyle** · 27/04/2007, 09h17

Hello,

So I am carrying on

I found this by delph1983. Unfortunately I could not adapt it.
I would like to do exactly the same thing, except that instead of displaying the number of times the word appears, I would just like to retrieve the sentence it occurs in (something like a few words before and after).

Thanks for your help.
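
One possible way to get "a few words before and after" is to split the extracted text on whitespace and take a window of words around each match. This is only a sketch under that assumption: `SnippetExtractor`, `snippets` and the `context` parameter are hypothetical names, the input is assumed to be text already extracted from the PDF, and only single-word search terms are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SnippetExtractor {

    /** For each word containing the term, returns that word plus up to `context` words on each side. */
    static List<String> snippets(String text, String term, int context) {
        List<String> result = new ArrayList<String>();
        String trimmed = text.trim();
        if (trimmed.length() == 0 || term.length() == 0) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        String needle = term.toLowerCase(Locale.ROOT);
        for (int i = 0; i < words.length; i++) {
            if (words[i].toLowerCase(Locale.ROOT).contains(needle)) {
                // Clamp the window to the bounds of the text
                int from = Math.max(0, i - context);
                int to = Math.min(words.length, i + context + 1);
                StringBuilder sb = new StringBuilder();
                for (int j = from; j < to; j++) {
                    if (j > from) {
                        sb.append(' ');
                    }
                    sb.append(words[j]);
                }
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        System.out.println(snippets(text, "fox", 2)); // prints [quick brown fox jumps over]
    }
}
```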

**crewstyle** · 27/04/2007, 14h14

Hello,

I still have the same problem. Could you help me, please?

[ EDIT ]
I thought it best to repost the code, cleaned up:

Code:

package servlet;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.pdfbox.exceptions.CryptographyException;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.exceptions.WrappedIOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class SearchPdf extends HttpServlet {
	private static final long serialVersionUID = 1L;

	protected void service(HttpServletRequest request, HttpServletResponse response) throws ServletException, WrappedIOException, IOException {
		// Set the content type before obtaining the writer
		response.setContentType("text/html");
		PrintWriter writer = response.getWriter();
		String recherche = request.getParameter("recherche");

		// Search form
		writer.println("<form method=\"POST\" action=\"" + response.encodeURL( request.getContextPath() + "/SearchPdf" ) + "\">");
		writer.println("Search: <input name='recherche' /> <input type='submit' value='Search' />");
		writer.println("</form>");

		// Nothing to search for (an empty term would also make the counting loop below spin forever)
		if( recherche == null || recherche.length() == 0 )
		{
			writer.println("<p>No document contains the search term(s)</p>");
		}
		else
		{
			String resultat = "";
			String permissions = "";
			String textTotal = "";
			int nombreTotal = 0;

			File fichierPDF = new File(this.getServletContext().getRealPath("1.pdf"));
			System.out.println("*** Start processing PDF file: " + fichierPDF.getName() + " ***\n");

			PDDocument pdf = null;
			try {
				pdf = PDDocument.load(fichierPDF.toString());

				boolean lisible = true;
				if( pdf.isEncrypted() ) {
					permissions = "Some files are encrypted.\n";
					try {
						// Try the empty password; extraction continues below if this succeeds
						pdf.decrypt("");
					}
					catch( InvalidPasswordException e ) {
						lisible = false;
						permissions += "- " + pdf.getDocumentInformation().getTitle() + " could not be read.\n";
					}
				}

				if( lisible ) {
					PDFTextStripper txt = new PDFTextStripper();
					// With the default page range, getText() extracts the whole document
					textTotal = txt.getText(pdf);

					// Count the occurrences of the search term in the extracted text
					int index = textTotal.indexOf(recherche);
					while( index != -1 ) {
						nombreTotal++;
						index = textTotal.indexOf(recherche, index + recherche.length());
					}

					System.out.println("matches: " + nombreTotal);

					if( nombreTotal > 0 ) {
						resultat += "The search term appears " + nombreTotal + " time(s) in ";
						resultat += pdf.getDocumentInformation().getTitle();
					}
				}
			}
			catch (CryptographyException e) {
				e.printStackTrace();
			}
			finally
			{
				// Always close the document, even if extraction failed
				if( pdf != null ) {
					pdf.close();
				}
			}

			System.out.println("******** Finished processing PDF file ********\n");

			writer.println("<p>Result: </p>" + resultat);
			writer.println("<p>Permissions: </p>" + permissions);
			writer.println("<p>Extracted text: </p>" + textTotal);
		}
	}
}

Error raised:

Code:

27 avr. 2007 14:09:58 org.apache.catalina.core.StandardWrapperValve invoke
ATTENTION: "Servlet.service()" pour la servlet SearchPdf a généré une exception
org.pdfbox.exceptions.WrappedIOException
	at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
	at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
	at servlet.SearchPdf.service(SearchPdf.java:73)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
	at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
	at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
	at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
	at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
	at java.lang.Thread.run(Thread.java:619)

Line of code that throws the error:

Code:

						PDFTextStripper txt = new PDFTextStripper();

**crewstyle** · 27/04/2007, 16h18

Well... thanks to those who took part, I got it working.
The error came from the fact that a PDFTextStripper.properties file has to be added to the project's classpath if you add the PDFBox libraries by hand (that is, without going through the .JAR file!).

Have a nice day, everyone.
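
For anyone hitting the same exception, a quick way to confirm the diagnosis is to check whether the properties file is visible to the class loader. The sketch below is a generic classpath check, not part of PDFBox; `ClasspathCheck` and `isOnClasspath` are hypothetical names, and the default path `Resources/PDFTextStripper.properties` is an assumption about where this PDFBox version looks for the file, so adjust it to match your copy of the library.

```java
import java.net.URL;

final class ClasspathCheck {

    /** Returns true if the named resource can be found by the current class loader. */
    static boolean isOnClasspath(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        URL location = loader.getResource(resourceName);
        return location != null;
    }

    public static void main(String[] args) {
        // Assumed resource path; pass another one as the first argument if needed
        String name = args.length > 0 ? args[0] : "Resources/PDFTextStripper.properties";
        System.out.println(name + " found: " + isOnClasspath(name));
    }
}
```

Inside a servlet container the check should be run from within the web application (for example from the servlet itself), since each application has its own class loader.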
