Regular expression: filter everything except one expression

**thomine** · 22/07/2010, 09h17

Hello,

Calling all regular-expression experts. I need to write a regular expression to filter URLs, and it has to express a negation.

Goal: let every URL through except those of the form http://test/admin* and http://test/login.aspx

Test set:
1. http://test/admin
2. http://test/admin/
3. http://test/admin_client/
4. http://test/admin/login.aspx
5. http://test/login.aspx
6. http://test/toto.aspx
7. http://test/titi/tata.aspx

Expected result: only URLs 6 and 7 of the test set should be accepted.

Regular expression: http://(.*)/(?!(admin(.*)))(?!login.aspx)(.*)

=> This works for expression 4, which is correctly filtered out. However, I'm struggling with expressions 1, 2, 3, 4 of the test set.
I'm using http://regexlib.com/RETester.aspx to test my regular expressions easily.

Thanks in advance for your help!

**thomine** · 23/07/2010, 08h14

I got the solution on another forum: http://regexadvice.com/forums/69876/...ead.aspx#69876

Ty:

http://([^/]*)/(?!admin.*)(?!login.aspx).*

I've cleaned up the lookaheads a bit, just to get rid of the unnecessary parentheses and make them a bit more readable.

To see what was going wrong with your pattern, I'll explain it a section at a time:

http:// - all OK here

(.*)/ - this is where things start to go wrong. What I think you are intending to do is to match everything up to the next "/" character, i.e. skip the "domain" part of the URL. However, you have used the '*' quantifier, which means "match zero or more characters, matching as many as possible". Because the regex engine works through the pattern one operator at a time, it will do what you say and match everything from after the "http://" to the end of the string (or of the line, if there are multiple lines and you are not using the "singleline" or "dot matches newline" option).

At this point, it tries to match the "/" but is at the end of the string and so has to backtrack, releasing 1 character at a time from those matched by the '.*' until it finds a "/" character. As you can see, this effectively searches for the LAST "/" in the string. In your test cases #2 and #3, that is the last character of the URL. The things that follow are negative lookaheads, which will always succeed while matching nothing (the lookahead can't match - fail - and so the negation turns this into a "succeed"), and another ".*", which is quite happy to match nothing at all.

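In test cases #1, #4 and #5, the backtracking leaves "admin" (case #1) or "login.aspx" (cases #4 and #5) right after the last "/", so one of the lookaheads rejects the match at that position. Backing off to an earlier "/" does not help either (in #4 the text that follows is then "admin/login.aspx"), and so the pattern works for those cases.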

For test cases #6 and #7, the negative lookaheads both succeed and so you get the match.
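The greedy behaviour described above can be checked with a short script. This is only a sketch: it assumes that Python's `re` module backtracks through this pattern the same way as the .NET engine the thread is about, and the names `GREEDY`, `URLS` and `check` are introduced here purely for illustration.

```python
import re

# Original pattern from the question, unchanged.
# Assumption: Python's re is used for illustration; the thread targets .NET.
GREEDY = re.compile(r"http://(.*)/(?!(admin(.*)))(?!login.aspx)(.*)")

# The seven test URLs from the question, in order.
URLS = [
    "http://test/admin",
    "http://test/admin/",
    "http://test/admin_client/",
    "http://test/admin/login.aspx",
    "http://test/login.aspx",
    "http://test/toto.aspx",
    "http://test/titi/tata.aspx",
]


def check(pattern, url):
    """Return (accepted, group 1) for one URL; group 1 is None on failure."""
    m = pattern.match(url)
    return (m is not None, m.group(1) if m else None)


greedy_results = [check(GREEDY, url) for url in URLS]

for number, (url, (ok, group1)) in enumerate(zip(URLS, greedy_results), start=1):
    print(number, url, "accepted" if ok else "rejected", repr(group1))
```

Group 1 shows how far the greedy `(.*)` ran. For #2 it captures "test/admin", i.e. everything up to the last "/", so the lookaheads are evaluated at the very end of the string and the URL is wrongly accepted.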

The "typical" correction for the greediness of the '.*' operator is to use '.*?', which means "match zero or more of any character, matching as few as possible". However, this doesn't work in this case because of the way the regex engine actually does the laziness checking.

When it sees '.*?/', the first thing it does is to not try to match anything with the '.' operator but to see if what follows can match - in this case the '/' of the pattern. If this fails, the regex engine goes back to the '.' and lets it try to match - which it will in nearly every case. This carries on until it tests the first "/" in (say) test case #2 (after the "http://test" part). Now the '/' will match, and so it tries to move on in the pattern, getting to the first negative lookahead. In this case it matches the "admin" part and so the negative lookahead returns a "fail".

The regex engine then backtracks to see if there is some other path that will lead to a match. That means it backs off the "/" it has matched and gets to the '.*?' again. As we've reached this as a result of a failure further on, the regex engine uses the '.' to match the "/" character and the process described in the previous paragraph starts all over again, this time matching the "admin" characters until it again gets to the "/" at the end. We are now in the situation where neither lookahead can match and so both succeed, and the final '.*' can also succeed, and so an overall match is declared.
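The same kind of check works for the lazy variant. Again this is only a sketch under the assumption that Python's `re` handles '.*?' like the .NET engine does for this pattern; `LAZY`, `URLS` and `lazy_results` are illustrative names, and the pattern is simply the one from the question with '.*?' in the first group.

```python
import re

# The question's pattern with a lazy first group: '.*?' instead of '.*'.
# Assumption: Python's re is used for illustration; the thread targets .NET.
LAZY = re.compile(r"http://(.*?)/(?!(admin(.*)))(?!login.aspx)(.*)")

# The seven test URLs from the question, in order.
URLS = [
    "http://test/admin",
    "http://test/admin/",
    "http://test/admin_client/",
    "http://test/admin/login.aspx",
    "http://test/login.aspx",
    "http://test/toto.aspx",
    "http://test/titi/tata.aspx",
]

# For each URL, record whether it matched and what the lazy group captured.
lazy_results = []
for url in URLS:
    m = LAZY.match(url)
    lazy_results.append((m is not None, m.group(1) if m else None))

for number, (url, (ok, group1)) in enumerate(zip(URLS, lazy_results), start=1):
    print(number, url, "accepted" if ok else "rejected", repr(group1))
```

For #2 the lazy group first stops at "test", the "admin" lookahead fails, and the group is then extended to "test/admin", which is exactly the backtracking path described above. URLs #2 and #3 are therefore still accepted.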

My solution involves explicitly matching all non-"/" characters and then the "/" character. There is no alternate path that the regex engine can use to backtrack past this, so the negative lookaheads are forced to operate on the required part of every test case and return the required matches.
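A quick way to confirm this against the test set is sketched below. It assumes Python's `re` module rather than .NET; `FIXED`, `URLS` and `accepted` are names made up for the example.

```python
import re

# Corrected pattern from the answer, unchanged.
# Assumption: Python's re is used for illustration; the thread targets .NET.
FIXED = re.compile(r"http://([^/]*)/(?!admin.*)(?!login.aspx).*")

# The seven test URLs from the question, in order.
URLS = [
    "http://test/admin",
    "http://test/admin/",
    "http://test/admin_client/",
    "http://test/admin/login.aspx",
    "http://test/login.aspx",
    "http://test/toto.aspx",
    "http://test/titi/tata.aspx",
]

# '[^/]*' cannot cross a "/", so the lookaheads are always evaluated
# right after the first "/" that follows the host name.
accepted = [url for url in URLS if FIXED.match(url)]

for url in accepted:
    print(url)
```

Only URLs 6 and 7 are printed. One possible tightening, not part of the original answer: the "." in `login.aspx` is unescaped and matches any character, so `login\.aspx` would be stricter.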

Susan

Thanks to Susan :-)
