Unexpected behavior in from/stop-before #26

Closed
opened 3 years ago by tgbugs · 7 comments
tgbugs commented 3 years ago (Migrated from github.com)

The code block below illustrates two issues with from/stop-before (implementation here: https://github.com/mbutterick/brag/blob/6c161ae31df9b4ae7f55a14f754c0b216b60c9a6/brag-lib/brag/support.rkt#L115-L121).

The primary issue is that from/stop-before only stops before the last element in a :seq and does not correctly back up to before the first element of the stop-before pattern.

There is a secondary issue, possibly just a matter of documentation: from/stop-before will match eof as the stop-before pattern.

I'm fairly certain that the second issue is due to the behavior of complement that is used in the implementation.

I am not sure about the underlying cause of the first issue.

#lang racket

(require brag/support)

(define bad-lexer
  (lexer-srcloc
   ; from/stop-before does not correctly backtrack to the point before "\n*"
   [(from/stop-before "start" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token 'ARGH lexeme)]
   [any-char (token 'AC lexeme)]))

(define bad-lexer-2
  ; from/stop-before matches an empty string at end of file
  (lexer-srcloc
   [(from/stop-before "\n" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token 'ARGH lexeme)]
   [any-char (token 'AC lexeme)]))

(define (get-tokens tokenizer)
  ; drain the tokenizer, collecting tokens until it returns eof
  (define (sigh next-token [accum '()])
    (let ([t (next-token)])
      (if (eq? t eof)
          accum
          (sigh next-token (cons t accum)))))
  (reverse (sigh tokenizer)))

(define (make-bad-lexer lexer port)
  ; wrap a lexer and a port as a zero-argument tokenizer
  (define (next-token)
    (lexer port))
  next-token)

(module+ test
  (define (make-port) (open-input-string "start
a
* H
p
argh"))

  (println "start issues")
  (get-tokens (make-bad-lexer bad-lexer (make-port)))
  (println "end issues")
  (get-tokens (make-bad-lexer bad-lexer-2 (make-port)))

  )

Output

"start issues"
(list
 (srcloc-token (token-struct 'ARGH "start\na\n*" #f #f #f #f #f) (srcloc 'string #f #f 1 9))
 (srcloc-token (token-struct 'AC " " #f #f #f #f #f) (srcloc 'string #f #f 10 1))
 (srcloc-token (token-struct 'AC "H" #f #f #f #f #f) (srcloc 'string #f #f 11 1))
 (srcloc-token (token-struct 'AC "\n" #f #f #f #f #f) (srcloc 'string #f #f 12 1))
 (srcloc-token (token-struct 'AC "p" #f #f #f #f #f) (srcloc 'string #f #f 13 1))
 (srcloc-token (token-struct 'AC "\n" #f #f #f #f #f) (srcloc 'string #f #f 14 1))
 (srcloc-token (token-struct 'AC "a" #f #f #f #f #f) (srcloc 'string #f #f 15 1))
 (srcloc-token (token-struct 'AC "r" #f #f #f #f #f) (srcloc 'string #f #f 16 1))
 (srcloc-token (token-struct 'AC "g" #f #f #f #f #f) (srcloc 'string #f #f 17 1))
 (srcloc-token (token-struct 'AC "h" #f #f #f #f #f) (srcloc 'string #f #f 18 1)))
"end issues"
(list
 (srcloc-token (token-struct 'AC "s" #f #f #f #f #f) (srcloc 'string #f #f 1 1))
 (srcloc-token (token-struct 'AC "t" #f #f #f #f #f) (srcloc 'string #f #f 2 1))
 (srcloc-token (token-struct 'AC "a" #f #f #f #f #f) (srcloc 'string #f #f 3 1))
 (srcloc-token (token-struct 'AC "r" #f #f #f #f #f) (srcloc 'string #f #f 4 1))
 (srcloc-token (token-struct 'AC "t" #f #f #f #f #f) (srcloc 'string #f #f 5 1))
 (srcloc-token (token-struct 'ARGH "\na\n*" #f #f #f #f #f) (srcloc 'string #f #f 6 4))
 (srcloc-token (token-struct 'AC " " #f #f #f #f #f) (srcloc 'string #f #f 10 1))
 (srcloc-token (token-struct 'AC "H" #f #f #f #f #f) (srcloc 'string #f #f 11 1))
 (srcloc-token (token-struct 'ARGH "\np\nargh" #f #f #f #f #f) (srcloc 'string #f #f 12 7)))
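For comparison, here is a sketch (my reading of the issue, not output from the thread) of what the first token from bad-lexer would look like if from/stop-before backed up to before the whole stop pattern, consistent with the ok-lexer workaround later in this thread:

 (srcloc-token (token-struct 'ARGH "start\na" #f #f #f #f #f) (srcloc 'string #f #f 1 7))

i.e. the lexeme would end just before the "\n" that begins the "\n*" stop pattern, and lexing would resume from that "\n".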
tgbugs commented 3 years ago (Migrated from github.com)

I think that the implementation of complement is in https://github.com/mbutterick/br-parser-tools/blob/32fc3b68d14a027ec70fb5cca38471ebdfed9ee7/br-parser-tools-lib/br-parser-tools/private-lex/stx.rkt#L84-L88.

edit: Further exploration of the expanded form suggests that the issue may be in the definition of lexer-body, here: https://github.com/mbutterick/br-parser-tools/blob/32fc3b68d14a027ec70fb5cca38471ebdfed9ee7/br-parser-tools-lib/br-parser-tools/lex.rkt#L215-L283. If this is ultimately an issue in br-parser-tools, let me know if I should create a new issue there.

aside: The expansion of lexer includes an unused reference to lexeme-srcloc-p that bloats the expanded form and probably slows down compile times.

mbutterick commented 3 years ago (Migrated from github.com)

As you can see, from/stop-before is a dumb little lexer macro. So the question is whether you can get the matching behavior you want without using from/stop-before:

  • If so, then that should become the new implementation for from/stop-before.

  • If not, then it’s just a limitation of the underlying pattern matchers in br-parser-tools (which is basically just a fork of parser-tools). In that case, the only fix is to update the documentation to describe the limitation.

tgbugs commented 3 years ago (Migrated from github.com)

As I suspected. I'll dig around to see if I can find a way to implement the behavior I'm looking for and report back. If there is a lexer form that can do limited lookahead it should be possible.

tgbugs commented 3 years ago (Migrated from github.com)

As suspected, there isn't a way to fix this using lexer abbreviations alone, so the documentation of the behavior should be updated.

However, there is a way to fix the issue on the action side by using file-position to set the port to the correct position. The solution resolves both issues, but with some trade-offs, and I do not think it is sufficiently general for inclusion in brag/support, though maybe with a few tweaks it could be?

For my narrow use case, find-last works because I only need to search backward to the last newline in the lexeme; however, more complicated stop-before patterns would likely require running something like regexp-match-positions over the lexeme to find the stop pattern.

(define (find-last char str)
  ; we don't want to reverse the string because that is slow; we just want
  ; to find the first matching char going backward, and string-ref backward
  ; seems ok. If we really want this to go fast, maybe we could chunk bytes
  ; into cacheline-sized pieces and search for a sub-offset within each chunk?
  (letrec ([search (λ (look-at)
                     (println (string-ref str look-at)) ; debug output
                     (if (eq? (string-ref str look-at) char)
                         (values (substring str 0 look-at) (sub1 look-at))
                         (search (sub1 look-at))))])
    (search (sub1 (string-length str)))))

(define (token-stop-before TOKEN TOKEN-EOF lexeme char input-port start-pos)
  "Encapsulates the machinery needed to correctly reset the location of input-port when
using from/stop-before where the stop-before pattern contains multiple characters."
  (if (eq? (peek-char input-port) eof)
      (token TOKEN-EOF lexeme)
      (let-values ([(lexeme-correct offset) (find-last char lexeme)])
        (file-position input-port (+ (position-offset start-pos) offset))
        (token TOKEN lexeme-correct))))

(define ok-lexer
  (lexer
   [(from/stop-before "start" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token-stop-before 'ARGH 'ARGH-EOF lexeme #\newline input-port start-pos)]
   [any-char (token 'AC lexeme)]))
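A minimal usage sketch (not part of the thread's code; the names below are only for illustration), reusing the test input from the first comment. With ok-lexer, the first token should carry the corrected lexeme "start\na", and the port should be left just before the "\n*" that begins the stop pattern:

(define p (open-input-string "start\na\n* H\np\nargh"))
(ok-lexer p)   ; expected: (token-struct 'ARGH "start\na" #f #f #f #f #f)
(peek-char p)  ; expected: #\newline, i.e. the "\n" just before the "*"

And a hypothetical generalization of token-stop-before along the lines of the regexp-match-positions idea above; it is a sketch, not anything in brag/support. The caller passes a regexp (stop-rx) matching the portion of the stop pattern that ends up inside the lexeme (everything except its final element, given the behavior described in this issue), and the port is rewound to where that match begins. A more careful version might search for the last such match, or anchor it to the end of the lexeme, rather than taking the first one:

(define (token-stop-before/rx TOKEN TOKEN-EOF lexeme stop-rx input-port start-pos)
  ; stop-rx is assumed to match the part of the stop pattern left in the lexeme,
  ; e.g. #rx"\n\\*+" for the (:seq "\n" (:+ "*") (:or " " "\t")) pattern above
  (if (eq? (peek-char input-port) eof)
      (token TOKEN-EOF lexeme)
      (let ([m (regexp-match-positions stop-rx lexeme)])
        (if m
            (let ([cut (caar m)]) ; 0-based index where the stop pattern starts
              ; position-offset is 1-based and file-position is 0-based
              (file-position input-port (+ (sub1 (position-offset start-pos)) cut))
              (token TOKEN (substring lexeme 0 cut)))
            ; no match found: fall back to the uncorrected lexeme
            (token TOKEN lexeme)))))

(define ok-lexer/rx
  (lexer
   [(from/stop-before "start" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token-stop-before/rx 'ARGH 'ARGH-EOF lexeme #rx"\n\\*+" input-port start-pos)]
   [any-char (token 'AC lexeme)]))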
mbutterick commented 3 years ago (Migrated from github.com)

OK. I will update the docs. I think token-stop-before would make more sense as an add-on or patch for the underlying parser-tools package, because that’s the primary residence of those lexer matchers.

mbutterick commented 3 years ago (Migrated from github.com)

Closed by 243f32a7bb8c52bf330f2eaabc80a5c6715bace9

tgbugs commented 3 years ago (Migrated from github.com)

Just a note in case anyone comes across this in the future. from/stop-before can be used with :seq or any pattern that can match more than a single literal token, but it only ever stops before the final element of the pattern.
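A small self-contained sketch illustrating the note (behavior as observed in the output in the first comment; note-lexer is just an illustrative name): with the multi-element stop pattern from this issue, the match stops only before the pattern's final element, not before the whole pattern.

#lang racket
(require brag/support)

(define note-lexer
  (lexer
   ; the stop pattern has three elements; the match stops before the last one only
   [(from/stop-before "start" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token 'FROM lexeme)]
   [any-char (token 'AC lexeme)]))

; lexeme is "start\na\n*" (everything up to, but not including, the " "),
; not "start\na" (which would stop before the whole "\n* " stop pattern)
(note-lexer (open-input-string "start\na\n* H"))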
