Rule name identifiers maybe shouldn't have source locations #34

Open
opened 2 years ago by jackfirth · 2 comments
jackfirth commented 2 years ago (Migrated from github.com)

So given a grammar like this:

#lang brag

program: statement*
statement: LITERAL-INTEGER

And an appropriate lexer-based tokenizer, using (parse path (make-tokenizer port)) produces syntax objects that look like this:

(program (statement 42) (statement 58) (statement 92))

All well and good. The source locations are even correct, assuming the lexer uses lexer-srcloc. Specifically, the following syntax objects have source locations:

  • The whole (program ...) syntax object has a source location
  • Each (statement ...) syntax object has a source location
  • Each number literal syntax object has a source location
  • Each occurrence of the program and statement identifiers has a source location

That last part seems off to me. The program identifier gets the same source location as the surrounding (program ...) syntax object. But the identifier itself is more of an implicitly-inserted thing from the user's perspective, like #%app or #%datum.

Where this matters to me is that I use the source locations of original syntax objects in my resyntax tool to figure out how to copy their original source code text into the refactored output code. So if one of those program or statement identifiers ends up in the output syntax object of my refactoring tool - perhaps because it was rearranging pieces of the enclosing (program ...) expression - the tool will duplicate the whole original expression when it tries to figure out how to render the output program identifier in refactored source code.
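To make the duplication concrete, here is a minimal sketch in plain Racket. It does not use brag or resyntax: the `source` string, the `stx-at` helper, and the `source-text` function are all hypothetical stand-ins, and the rule-name identifier is built by hand with the same location as its enclosing form, as described above.

```racket
#lang racket/base

;; Hypothetical program text that the parse tree was produced from.
(define source "42 58 92")

;; Hypothetical helper: wrap a datum in a syntax object with an explicit
;; source location. The vector is (source line column position span).
(define (stx-at datum pos span)
  (datum->syntax #f datum (vector 'example 1 (sub1 pos) pos span)))

;; Copy the original text that a syntax object claims to cover.
;; Positions are 1-based, so subtract one to index into the string.
(define (source-text stx)
  (define start (sub1 (syntax-position stx)))
  (substring source start (+ start (syntax-span stx))))

;; A literal with its own accurate location.
(define literal (stx-at 42 1 2))

;; The rule-name identifier, given the location of the whole (program ...) form.
(define program-id (stx-at 'program 1 8))

(source-text literal)    ; the text of the literal only
(source-text program-id) ; the text of the entire program
```

A tool that trusts these locations has no way to tell that the second result is not the text of the identifier itself.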

I think the rule name identifiers shouldn't have any source location information. Maybe they shouldn't even be syntax-original?, but that I'm less sure on.
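As a rough sketch of what that could look like from the outside, the following post-processing function rebuilds each head identifier without a source location. `strip-head-srcloc` is a hypothetical name, not part of brag, and the sketch assumes every list in the tree starts with a rule name.

```racket
#lang racket/base

;; Hypothetical helper: drop the source location from the identifier at the
;; head of every list form, leaving all other locations untouched.
(define (strip-head-srcloc stx)
  (define parts (syntax->list stx))
  (cond
    [(and (pair? parts) (identifier? (car parts)))
     (define head (car parts))
     ;; Same symbol and lexical context, but no srcloc and no properties.
     (define bare-head (datum->syntax head (syntax-e head) #f))
     ;; Rebuild the list, keeping the enclosing form's location and properties.
     (datum->syntax stx
                    (cons bare-head (map strip-head-srcloc (cdr parts)))
                    stx
                    stx)]
    [else stx]))

;; Hand-built stand-in for one (statement 42) node whose pieces share a location.
(define loc (vector 'example 1 0 1 2))
(define stmt
  (datum->syntax #f
                 (list (datum->syntax #f 'statement loc)
                       (datum->syntax #f 42 loc))
                 loc))

(define stripped (strip-head-srcloc stmt))
(syntax-position (car (syntax->list stripped)))  ; head identifier: no position
(syntax-position (cadr (syntax->list stripped))) ; literal keeps its position
```

Because the head is rebuilt with `datum->syntax`, it is also no longer `syntax-original?`, so this sketch takes the stronger of the two options mentioned above.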

mbutterick commented 2 years ago (Migrated from github.com)

> I think the rule name identifiers shouldn't have any source location information

  1. Does this happen with ragg too, or just brag?

  2. If brag handles source locations in a way that’s contrary to documentation or syntax-object norms, I welcome supporting evidence that this is so. Otherwise I would invoke the existing Racket norm against changing the behavior of a package in a backward-incompatible way.

jackfirth commented 2 years ago (Migrated from github.com)
  1. Just checked, and yes this happens with ragg too.
  2. Consider this code:
(require syntax/modread)

(with-module-reading-parameterization
  (λ ()
    (with-input-from-string "#lang racket/base 42"
      (λ ()
        (read-syntax)))))

It produces this syntax object:

(module anonymous-module racket/base
  (#%module-begin 42))

Both the module and anonymous-module identifiers have a span of zero and are not original. The racket/base identifier and the 42 literal each have correct starts and spans, pointing to the racket/base and 42 substrings of #lang racket/base 42, and they're both original. The #%module-begin identifier is an odd one: it's not original but it does have a source location that is the same as the enclosing (#%module-begin 42) form. Given how the module and anonymous-module identifiers are handled, I suspect that's just a bug.

The whole form has a start position of 7 and a span of 14, pointing to the racket/base 42 substring, and it is not original because it contains the unoriginal module, anonymous-module, and #%module-begin pieces. The (#%module-begin 42) form also isn't original and it has the same start location and span, which I suspect is another bug since it claims to represent the racket/base 42 substring of the program code but the (#%module-begin 42) form doesn't actually contain the racket/base identifier. It should probably only claim to contain the 42 substring of the code.
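These observations can be checked with a small walker that prints the position, span, and originality of every piece of the form. This is only a sketch: `report` is a hypothetical helper, and the exact values printed may depend on the Racket version.

```racket
#lang racket/base
(require racket/port syntax/modread)

;; Read the same one-line module as in the example above.
(define module-stx
  (with-module-reading-parameterization
    (λ ()
      (with-input-from-string "#lang racket/base 42"
        (λ ()
          (read-syntax))))))

;; Hypothetical helper: print location and originality for a syntax object,
;; then recur into its sub-forms if it is a proper list.
(define (report stx)
  (printf "~s: position ~a, span ~a, original? ~a\n"
          (syntax->datum stx)
          (syntax-position stx)
          (syntax-span stx)
          (syntax-original? stx))
  (define parts (syntax->list stx))
  (when parts
    (for-each report parts)))

(report module-stx)
```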

It's a bit tricky to say for sure what the "intent" here is because source locations are tricky to produce and mistakes in them are rarely noticed. I think for syntax objects produced by a language's read-syntax function, these are some good guidelines:

  1. A syntax object with a source location shouldn't contain syntax objects with source locations that are outside the container syntax object's location.
  2. Identifiers shouldn't have a source location unless the actual text of that identifier appears in the program's code at that location.
  3. Syntax objects shouldn't be original if they contain any unoriginal syntax objects.
  4. If the identifier after the #lang line is used for the module's initial bindings, it should be original and have a source location.
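
Guideline 1 is mechanical enough to check automatically. Below is a minimal sketch of such a check; `containment-violations`, `located?`, `within?`, and `stx-at` are hypothetical names, and the sketch compares only positions and spans, ignoring whether the source names match.

```racket
#lang racket/base

;; True when the syntax object has both a position and a span.
(define (located? stx)
  (and (syntax-position stx) (syntax-span stx) #t))

;; True when inner's character range lies inside outer's range.
(define (within? inner outer)
  (define outer-start (syntax-position outer))
  (define inner-start (syntax-position inner))
  (and (>= inner-start outer-start)
       (<= (+ inner-start (syntax-span inner))
           (+ outer-start (syntax-span outer)))))

;; Collect every sub-form whose location escapes its nearest located ancestor.
;; Only proper lists are traversed.
(define (containment-violations stx [container #f])
  (define here-bad?
    (and container (located? stx) (not (within? stx container))))
  (define next-container (if (located? stx) stx container))
  (define parts (or (syntax->list stx) '()))
  (append (if here-bad? (list stx) '())
          (apply append
                 (for/list ([part (in-list parts)])
                   (containment-violations part next-container)))))

;; Hand-built examples: in `bad`, the b identifier claims a location
;; outside the enclosing form.
(define (stx-at datum pos span)
  (datum->syntax #f datum (vector 'example #f #f pos span)))
(define good (stx-at (list (stx-at 'a 2 1) (stx-at 'b 4 1)) 1 5))
(define bad (stx-at (list (stx-at 'a 2 1) (stx-at 'b 9 1)) 1 5))

(map syntax->datum (containment-violations good)) ; no violations
(map syntax->datum (containment-violations bad))  ; reports b
```

Guideline 2 can't be checked this way without the program text, since it depends on what actually appears at the claimed location.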
Reference: mbutterick/brag#34