Use "\A \z", not "^ $" with Python regular expressions

Posted by todsacerdoti 1 day ago

Comments

Comment by flufluflufluffy 1 day ago

The vast majority of the times I use ^/$, I actually want the behavior of matching start/end of lines. If I had some multi-line text, and only wanted to update or do something with the actual beginning or end of the entire text, I’d typically just do it manually.

Comment by theamk 1 day ago

A lot of the time I want to check for a valid identifier:

    if not re.match('^[a-z0-9_]+$', user):
        raise SomeException("invalid username")
As written, the code above is incorrect: it will happily accept "john\n", which can cause all sorts of havoc down the line.
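A minimal sketch of the problem described above, assuming a hypothetical helper named `looks_valid` that wraps the same pattern. It only illustrates that `$` also matches just before a single trailing newline:

```python
import re

# Same pattern as in the comment above; `looks_valid` is a hypothetical helper.
USERNAME_RE = re.compile(r"^[a-z0-9_]+$")

def looks_valid(user):
    """Return True if the pattern matches at the start of `user`."""
    return USERNAME_RE.match(user) is not None

plain = looks_valid("john")               # accepted, as intended
one_newline = looks_valid("john\n")       # also accepted: "$" matches before a final newline
two_newlines = looks_valid("john\n\n")    # rejected: only one trailing newline slips through
```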

Comment by extraduder_ire 1 day ago

Shouldn't you use the match returned from the string? Or use .fullmatch() (added in 3.4) to match the whole string.
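One reading of the first suggestion, sketched with a hypothetical helper `matched_text`: take the text the match object actually covered rather than the raw input. The trailing newline is outside the match, so it is dropped:

```python
import re

def matched_text(user):
    """Return the portion of `user` covered by the match, or None if there is no match."""
    m = re.match(r"^[a-z0-9_]+$", user)
    return m.group(0) if m else None

cleaned = matched_text("john\n")   # the newline is not part of the match
rejected = matched_text("John")    # uppercase is not in the character class
```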

Comment by theamk 1 day ago

In general no, you should not use the match from the string. If you are getting input from a user, you want more complex processing (like stripping all whitespace), and if you are getting input from API calls, you want to either use the specified name as-is, or fail.

Yes, fullmatch() will help, and so will \Z. It's just that it is so easy to forget...
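A short sketch of the two stricter checks mentioned here, using hypothetical function names. Both reject a name with a trailing newline instead of silently accepting it:

```python
import re

def is_valid_fullmatch(user):
    """Whole-string check via re.fullmatch (available since Python 3.4)."""
    return re.fullmatch(r"[a-z0-9_]+", user) is not None

def is_valid_anchored(user):
    """Whole-string check via explicit \\A and \\Z anchors."""
    return re.match(r"\A[a-z0-9_]+\Z", user) is not None

results = {
    name: (is_valid_fullmatch(name), is_valid_anchored(name))
    for name in ("john", "john\n")
}
```

With `fullmatch()` the anchors can be left out of the pattern entirely, which removes the chance of picking the wrong one.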

Comment by Joker_vD 1 day ago

Regular expressions as we basically know them today were made for ed. In that context, '$' absolutely had to match the terminating newline or it would've been completely useless.

Comment by seanwilson 1 day ago

I wish one of those regex libraries that replace the regex symbols with human-readable words would become standard. Or do they not work well?

Regex is one of those things where I have to look up to remind myself what the symbols are, and by the time I need this info again I've forgotten it all.

I can't think of anywhere else in general programming where we have something so terse and symbol heavy.

Comment by db48x 1 day ago

It’s been done. Emacs, for example, has rx notation. From the manual:

    35.3.3 The ‘rx’ Structured Regexp Notation
    ------------------------------------------
    
    As an alternative to the string-based syntax, Emacs provides the
    structured ‘rx’ notation based on Lisp S-expressions.  This notation is
    usually easier to read, write and maintain than regexp strings, and can
    be indented and commented freely.  It requires a conversion into string
    form since that is what regexp functions expect, but that conversion
    typically takes place during byte-compilation rather than when the Lisp
    code using the regexp is run.
    
       Here is an ‘rx’ regexp that matches a block comment in the C
    programming language:
    
         (rx "/*"                    ; Initial /*
             (zero-or-more
              (or (not "*")          ;  Either non-*,
                  (seq "*"           ;  or * followed by
                       (not "/"))))  ;     non-/
             (one-or-more "*")       ; At least one star,
             "/")                    ; and the final /
    
    or, using shorter synonyms and written more compactly,
    
         (rx "/*"
             (* (| (not "*")
                   (: "*" (not "/"))))
             (+ "*") "/")
    
    In conventional string syntax, it would be written
    
         "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
Of course, it does have one disadvantage. As the manual says:

       The ‘rx’ notation is mainly useful in Lisp code; it cannot be used in
    most interactive situations where a regexp is requested, such as when
    running ‘query-replace-regexp’ or in variable customization.
Raku also has advanced the state of the art considerably.
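For comparison in Python, the standard `re.VERBOSE` flag allows a similar commented layout. This is a sketch that transcribes the C block-comment pattern quoted above; the names `C_BLOCK_COMMENT` and `find_comments` are made up for illustration:

```python
import re

# Transcription of the rx example above into Python's verbose regex syntax.
C_BLOCK_COMMENT = re.compile(
    r"""
    /\*                 # initial /*
    (?:
        [^*]            # either a non-* character,
      | \*[^/]          # or a * followed by a non-/
    )*
    \*+                 # at least one star,
    /                   # and the final /
    """,
    re.VERBOSE,
)

def find_comments(source):
    """Return every C block comment found in `source`."""
    return C_BLOCK_COMMENT.findall(source)

comments = find_comments("int x; /* one */ int y; /* two * three */")
```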

Comment by zahlman 1 day ago

For this to matter, it seems that I would have to be in the situation of:

* running a regex not in multi-line mode

* on input that was presumably split from multiple lines, or within a line of multi-line input

* wherein I care whether the line in question is the last line of input without a trailing newline

* but I didn't check, or `.strip()` or anything

I can't say I recall ever being bitten by this.

And there is also nothing here to justify \A over ^.

Comment by eviks 1 day ago

so why \A instead of ^?

Comment by tkocmathla 1 day ago

\A always matches the start of the string, but in multiline mode, ^ will match both the start of the string and the start of each line:

https://docs.python.org/3/library/re.html#re.MULTILINE
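A small sketch of that difference, using an arbitrary two-line sample string:

```python
import re

text = "first\nsecond"

# In multiline mode, ^ matches at the start of every line.
caret_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores the flag and only matches at the very start of the string.
anchor_multiline = re.findall(r"\A\w+", text, re.MULTILINE)

# Without the flag, ^ behaves like \A.
caret_default = re.findall(r"^\w+", text)
```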

Comment by svilen_dobrev 1 day ago

It's in the spec. Since forever, like v 1.3? I don't remember.

And it is the same in Perl; from `man perlre`:

    ^   Match the beginning of the string  (or line, if /m is used)

Comment by autoexec 1 day ago

I've said it before and I'll say it again, I'd like Python a lot more if it abandoned re and handled regex like perl did.

Comment by edflsafoiewq 1 day ago

I've never used perl. What's the difference?

Comment by autoexec 1 day ago

It doesn't need an import at all. It's just a normal part of the language's syntax and can be used just about anywhere:

    $foo =~ /regex/
    $result = $foo =~ /regex/
    if ($foo =~ /regex/) {whatever;}
    while (/regex/) {whatever;}
The captures ($1, $2, etc.) are global and usable wherever you need them.

In this particular case the default is much like Python's: $ matches at the end of the string or just before a final newline, and /m makes it also match before every embedded newline:

    $foo =~ /regex$/  # end of string, or just before a final newline
    $foo =~ /regex$/m # end of any line, as well as end of string
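On the Python side, the positions where `$` can match are easy to inspect directly. A sketch, with `anchor_positions` as a hypothetical helper and an arbitrary sample string:

```python
import re

def anchor_positions(pattern, text, flags=0):
    """Return the offsets at which a zero-width pattern matches in `text`."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

text = "a\nb\n"  # offsets: a=0, \n=1, b=2, \n=3, end of string=4

default_dollar = anchor_positions(r"$", text)                  # before the final newline, and at the end
multiline_dollar = anchor_positions(r"$", text, re.MULTILINE)  # before every newline, and at the end
strict_end = anchor_positions(r"\Z", text)                     # only at the very end
```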

Comment by instig007 1 day ago

ABC: Always. Build on. Parser Combinators.

Python ecosystem has several options, for instance: https://parsy.readthedocs.io/en/latest/tutorial.html
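To give a flavour of the approach without depending on any library, here is a hand-rolled sketch. It is not parsy's API; every name in it is hypothetical. A parser is a function taking `(text, pos)` and returning the new position or `None`, and the end of input is an explicit step rather than an anchor with special cases:

```python
import string

ALLOWED = set(string.ascii_lowercase + string.digits + "_")

def char_in(allowed):
    """Parser that consumes one character from `allowed`."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in allowed:
            return pos + 1
        return None
    return parse

def many1(parser):
    """Parser that applies `parser` one or more times."""
    def parse(text, pos):
        new_pos = parser(text, pos)
        if new_pos is None:
            return None
        while new_pos is not None:
            pos = new_pos
            new_pos = parser(text, pos)
        return pos
    return parse

def eof(text, pos):
    """Parser that succeeds only at the very end of the input."""
    return pos if pos == len(text) else None

def seq(*parsers):
    """Parser that runs each parser in turn, failing if any of them fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

username = seq(many1(char_in(ALLOWED)), eof)

def is_valid_username(text):
    return username(text, 0) is not None
```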

Comment by az09mugen 1 day ago

They could simply advise using word boundaries ('\b') instead.

Comment by notpushkin 1 day ago

Which would also match whitespace in addition to the \n they’re trying to avoid matching?
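A quick sketch of why `\b` does not work as an end-of-string anchor, using a hypothetical helper `accepts_with_word_boundaries`. A word boundary only requires that the next character is not a word character, so anything may follow:

```python
import re

def accepts_with_word_boundaries(user):
    """Check `user` with \\b in place of string anchors."""
    return re.match(r"\b[a-z0-9_]+\b", user) is not None

trailing_newline = accepts_with_word_boundaries("john\n")      # accepted
trailing_words = accepts_with_word_boundaries("john smith")    # accepted as well
whole_string = re.fullmatch(r"[a-z0-9_]+", "john smith")       # None: rejected
```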

Comment by tomhow 7 hours ago

We detached this comment from https://news.ycombinator.com/item?id=46804436 and marked it off topic.

Please don't follow people around the site to continue political arguments from unrelated threads.

Comment by zahlman 1 day ago

This thread is about regular expressions in Python.

Comment by queenkjuul 20 hours ago

What you said is not wrong. Here's the article, in case you missed it

https://www.reuters.com/world/us/evidence-contradicts-trump-...