Operator Name "Gotchas" (Function Application Operators, Others) #75

Open
5 tasks
rljacobson opened this issue Sep 12, 2024 · 12 comments

Comments

@rljacobson

rljacobson commented Sep 12, 2024

This issue is to discuss and track the following work I propose to do. For the sake of limiting scope, this issue is restricted to working on "incorrectly" named operators (explained below).

  • Identify and list operators in Mathics with problematic names (as described below).
  • Evaluate the impact of renaming operators on the codebase and dependencies.
  • Implement name changes and update all references.
  • Test to ensure functionality is preserved.
  • Document any operators left unchanged and reasons for exclusion.

The purpose of this work is to make Mathics' operator database more complete, correct, and compatible.

Background

In a small handful of cases, Mathematica gives the wrong name to an operator. The primary reason, as far as I can tell, is that Mathematica gives preference to the typographical representation of operators rather than to their computational/semantic meaning. Thus, the naming convention used in Mathematica maps the name of the operator to the function that formats an expression typographically like the operator rather than the underlying function implementing the operator's computational function. There are 13 cases.

| Suggested Name | Mathematica's Name | Parse | FullForm | Usage String | Comments |
|---|---|---|---|---|---|
| OverscriptBox | Overscript | {"OverscriptBox", "[", "expr1", ",", "expr2", "]"} | | expr1&expr2 | |
| UnderscriptBox | Underscript | {"UnderscriptBox", "[", "expr1", ",", "expr2", "]"} | | expr1+expr2 | |
| OverunderscriptBox | Underoverscript | {"UnderoverscriptBox", "[", "expr1", ",", "expr3", ",", "expr2", "]"} | | expr1&expr2\%expr3 | |
| UnderoverscriptBox | Underoverscript | {"UnderoverscriptBox", "[", "expr1", ",", "expr2", ",", "expr3", "]"} | | expr1+expr2\%expr3 | |
| InterpretBoxes | None | | | *expr | |
| SubscriptBox | Subscript | {"SubscriptBox", "[", "expr1", ",", "expr2", "]"} | | expr1_expr2 | |
| SubsuperscriptBox | Subsuperscript | {"SubsuperscriptBox", "[", "expr1", ",", "expr2", ",", "expr3", "]"} | | expr1_expr2\%expr3 | |
| FunctionApplyPrefix | Prefix | {"expr1", "[", "expr2", "]"} | expr1[expr2] | expr1@expr2 | Operator Notations includes usages with invisible unicode characters. |
| FunctionApplyInfix | Infix | {"expr2", "[", "expr1", ",", "expr3", "]"} | expr2[expr1, expr3] | expr1~expr2~expr3 | Infix[f[x,y]] will display as x~f~y. Precedence identifies Infix with this operator, and Precedence[Infix]==30 which is almost correct. |
| SupersubscriptBox | SubsuperscriptBox | {"SubsuperscriptBox", "[", "expr1", ",", "expr3", ",", "expr2", "]"} | | expr1\^expr2\%expr3 | |
| SqrtBox | Sqrt | {"SqrtBox", "[", "expr", "]"} | | \@expr | |
| Integrate | Integral | {"Integrate", "[", "expr1", ",", "expr2", "]"} | Integrate[expr1, expr2] | ∫expr1expr2 | |
| FunctionApplyPostfix | Postfix | {"expr2", "[", "expr1", "]"} | expr2[expr1] | expr1//expr2 | Postfix[f[x]] will display as x//f. Precedence identifies Postfix with this operator. |

For some of these, it is obvious that they are misnamed based on how they are parsed. OverscriptBox, for example, is called Overscript by Mathematica but is parsed as OverscriptBox. The case of what I call FunctionApplyInfix, for example, is fundamentally the same, but is easy to misunderstand because the underlying semantic meaning is function application which has no corresponding named function. The Mathematica function Infix, despite being the name Mathematica gives this operator, is not the corresponding functional meaning of this operator! The Infix function is concerned with how a function is displayed.

An interesting case is FunctionApplyPrefix, which Mathematica calls Prefix. Again, the Mathematica function Prefix is a directive for displaying a function. The @ operator is actually an alias (sort of) for the square brackets operator [ ], which does not have a name in Mathematica—at least it didn't have a name until the Construct function was introduced!
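
To make the point about the three function application operators concrete, here is a minimal Python sketch of the desugaring described above. The tuple representation and the helper names (apply_prefix, apply_infix, apply_postfix, full_form) are made up for illustration; they are not Mathics or Wolfram APIs.

```python
# Toy M-expressions: (head, (arg1, arg2, ...)) with plain strings as atoms.
# All helper names here are hypothetical, chosen only for this illustration.

def apply_prefix(f, x):
    """f@x parses to f[x]."""
    return (f, (x,))

def apply_infix(x, f, y):
    """x~f~y parses to f[x, y]; the head is the middle operand."""
    return (f, (x, y))

def apply_postfix(x, f):
    """x//f parses to f[x]; the head is the right operand."""
    return (f, (x,))

def full_form(expr):
    """Render a toy M-expression in FullForm-like notation."""
    if isinstance(expr, str):
        return expr
    head, args = expr
    return f"{full_form(head)}[{', '.join(full_form(a) for a in args)}]"

print(full_form(apply_prefix("f", "x")))      # f[x]
print(full_form(apply_infix("x", "f", "y")))  # f[x, y]
print(full_form(apply_postfix("x", "f")))     # f[x]
```

Note that none of the results contains a Prefix, Infix, or Postfix head, which is the reason for suggesting different names for these operators.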

Challenges

I am assuming in this discussion that operators should be given the name of their underlying (functional) semantic meaning, that is, they should have the same name as the function they are parsed into. There are a few realities that challenge this assumption:

  • Some operators don't have a single function they correspond to.
  • We wish distinct operators to have distinct names, but there are cases where multiple operators correspond to the same underlying function.
  • Some distinct operators share a common lexical representation AND underlying function but have different properties, like arity and precedence. (These operators are necessarily context sensitive.) An example is UnaryPlus vs. Plus, both of which Mathematica just calls Plus.
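
The last two points can be sketched as a small operator table in Python. This is only an illustration under stated assumptions: the Operator record, the lookup helper, and the precedence numbers are placeholders invented for the example, not the contents of any real operator database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str        # the distinct name we choose for the operator
    token: str       # lexical representation
    affix: str       # "prefix" or "infix"
    arity: int
    precedence: int  # placeholder value, for illustration only
    head: str        # function the operator is parsed into

# Two distinct operators that share a token AND an underlying function.
OPERATORS = [
    Operator("UnaryPlus", "+", "prefix", 1, 480, "Plus"),
    Operator("Plus", "+", "infix", 2, 310, "Plus"),
]

BY_TOKEN_AND_AFFIX = {(op.token, op.affix): op for op in OPERATORS}

def lookup(token, previous_is_operand):
    """Context-sensitive lookup: the same token resolves differently
    depending on whether an operand precedes it."""
    affix = "infix" if previous_is_operand else "prefix"
    return BY_TOKEN_AND_AFFIX[(token, affix)]

print(lookup("+", previous_is_operand=False).name)  # UnaryPlus
print(lookup("+", previous_is_operand=True).name)   # Plus
```

Keying the table on (token, affix) rather than on the head keeps the names distinct even though both entries parse into Plus.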

I haven't made a thorough survey of what Mathics is doing with the 13 operators I identify in the table above, but I think just choosing reasonable alternative names will not present any problems. Mathics already does this for postfix &, for example. But again, part of the work is to figure out what might break.

@rocky should check that this all makes sense.

@mmatera
Contributor

mmatera commented Sep 12, 2024

A few comments:

  • Expressions like Subscript[x,y] are parsed as they are, not as their "boxed" version SubscriptBox:
(* WMA both CLI and Notebook interface *)

In[1]:= X=Subscript[a,b]                                                        

Out[1]= a
         b

Internally, it is stored as an expression, with Subscript as its Head:

In[2]:= X//FullForm                                                             

Out[2]//FullForm= Subscript[a, b]

Its InputForm is still a Subscript:

In[3]:= X//InputForm                                                            

Out[3]//InputForm= Subscript[a, b]

Only during formatting is it converted into a SubscriptBox, which is then rendered as text:

In[4]:= X//OutputForm                                                           

Out[4]//OutputForm= a
                                           b

In[5]:= X//StandardForm                                                         

Out[5]//StandardForm= a
                                              b

The conversion to SubscriptBox happens in the formatting. The result can be shown using ToBoxes (which evaluates the expression and then applies "MakeBoxes"):

In[8]:= X//ToBoxes                                                              

Out[8]= SubscriptBox[a, b]

Other boxes work in the same way.
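
The distinction shown in the session above (the expression keeps Subscript as its head, and the box head appears only during formatting) can be sketched in Python as follows. The tuple representation, the BOX_HEADS table, and make_boxes are hypothetical stand-ins, not the Mathics implementation.

```python
# Toy expressions: (head, arg1, arg2, ...) tuples with plain strings as atoms.
# BOX_HEADS is a hypothetical formatting table covering only a few heads.
BOX_HEADS = {
    "Subscript": "SubscriptBox",
    "Superscript": "SuperscriptBox",
    "Overscript": "OverscriptBox",
    "Underscript": "UnderscriptBox",
}

def make_boxes(expr):
    """Formatting step: only here does Subscript become SubscriptBox.
    Heads not in the table are left unchanged in this sketch; a real
    formatter would build a RowBox for them."""
    if isinstance(expr, str):
        return expr
    head, *args = expr
    boxed_args = tuple(make_boxes(arg) for arg in args)
    return (BOX_HEADS.get(head, head), *boxed_args)

x = ("Subscript", "a", "b")
print(x[0])           # Subscript  (the stored expression is untouched)
print(make_boxes(x))  # ('SubscriptBox', 'a', 'b')
```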

@mmatera
Contributor

mmatera commented Sep 12, 2024

  • Regarding InterpretBoxes, it does not exist:

In[1]:= ?? InterpretBox                                                         

Out[1]= Missing[UnknownSymbol, InterpretBox]

What does exist is InterpretationBox:

In[2]:= ?? InterpretationBox                                                    

Out[2]= InterpretationBox[boxes, expr]
         
        >    is a low-level box construct that displays as boxes
         
        >     but is interpreted on input as expr. 


        Attributes[InterpretationBox]=
         
        >   {HoldAllComplete, Protected, ReadProtected}

InterpretationBox is created when an expression is formatted in a way that makes it hard to reinterpret as an expression. For example, if we apply MakeBoxes to something that is formatted as "OutputForm", we get an InterpretationBox whose first element holds the "formatted" output and whose second element holds the original expression:

In[3]:= OutputForm[a/b]//MakeBoxes                                                                                            

Out[3]= InterpretationBox[PaneBox["a\n-\nb"], OutputForm[a/b], Editable -> False]

Then

In[4]:= ToExpression[%]                                                                                                                                                       

        a
Out[4]= -
        b

On the other hand, if we try to recover the expression from the string obtained from OutputForm, we get something different:

In[6]:= ToExpression["a\n-\nb"]                                                                                                                                               

Out[6]= -b

What seems to happen when we input expressions in the Notebook frontend of WMA is that the interface first interprets our keystrokes and stores them as "Boxes".
For example, when we type the following keystroke sequence in a Cell
[Esc]int[Esc]F[x][Esc]dd[Esc]x
we obtain
[image: the typeset integral]

If we save the notebook, and look inside with a text editor, we see this:

Cell[BoxData[
 RowBox[{
  RowBox[{"\[Integral]", 
   RowBox[{
    RowBox[{"F", "[", "x", "]"}], 
    RowBox[{"\[DifferentialD]", "x"}]}]}]}]], "Input",
 CellChangeTimes->{{3.935133535804398*^9, 3.935133607630911*^9}},
 CellLabel->"In[8]:=",ExpressionUUID->"cc0a804b-a0c9-41f1-b87c-52b6edd6a001"]

Then, when we press [Ctrl]+[Enter], that box expression is "interpreted" (by using MakeExpression rules) into

Integrate[F[x], x]

Then, it is evaluated, and finally, formatted using MakeBoxes rules into

RowBox[{"\[Integral]", RowBox[{RowBox[{"F","[","x","]"}], RowBox[{"\[DifferentialD]","x"}]}]}]
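
As a rough illustration of the "interpretation" step described above (boxes into Integrate[F[x], x]), here is a toy Python rule set. It assumes boxes are ("RowBox", [items]) tuples and expressions are (head, arg, ...) tuples; row and make_expression are names invented for this sketch and handle only the shapes that occur in this one example.

```python
def row(*items):
    """Build a toy RowBox."""
    return ("RowBox", list(items))

def make_expression(box):
    """Interpret a toy box structure as an expression (hypothetical rules)."""
    if isinstance(box, str):
        return box
    head, items = box
    if head != "RowBox":
        raise ValueError(f"unsupported box: {head}")
    # RowBox[{single}] just wraps its only element.
    if len(items) == 1:
        return make_expression(items[0])
    # RowBox[{f, "[", x, "]"}] is function application f[x].
    if len(items) == 4 and items[1] == "[" and items[3] == "]":
        return (make_expression(items[0]), make_expression(items[2]))
    # RowBox[{"\[Integral]", RowBox[{integrand, RowBox[{"\[DifferentialD]", var}]}]}]
    if len(items) == 2 and items[0] == "\\[Integral]":
        integrand, differential = items[1][1]
        marker, variable = differential[1]
        if marker != "\\[DifferentialD]":
            raise ValueError("expected a differential")
        return ("Integrate", make_expression(integrand), make_expression(variable))
    raise ValueError("no rule for this box")

cell = row(row("\\[Integral]",
               row(row("F", "[", "x", "]"),
                   row("\\[DifferentialD]", "x"))))
print(make_expression(cell))  # ('Integrate', ('F', 'x'), 'x')
```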

Notice that the WMA CLI (and the Mathics CLI and Django frontends) does not need to convert our keystrokes into boxes. In Mathics (and I guess the same happens in the WMA CLI), we just collect the user's keystrokes into a string and parse it as an (InputForm) expression. When the expression is evaluated, the steps are similar: first the evaluation rules are applied and the result is stored; then the output is formatted using MakeBoxes rules to produce a boxed expression; finally, the boxed expression is rendered as a string (in mathics-django), which is shown in the front end.

@mmatera
Contributor

mmatera commented Sep 12, 2024

Regarding Infix, Prefix, and Postfix, they are used as intermediate steps in the formatting process: they are a handy way to represent expressions involving different operators as an abstract expression tree that keeps information about how the expression must be rendered. Then, we can use the same MakeBoxes rules for Plus, Times, CircleTimes, etc.

In Mathics we still have a rudimentary implementation of these symbols, but I guess there is no need to store in MathicsScanner specific details about how to parse and render them.

@rljacobson
Author

rljacobson commented Sep 12, 2024

> Expressions like Subscript[x,y] are parsed as it is, not as its "boxed" version SubscriptBox

To clarify, I am not concerned in this discussion with how different (M-expression) functions are evaluated. Rather, I am talking about the M-expression representation of operator forms. So for example, in the expression expr1\_expr2, the operator \_ is named ~~SubsciptBox~~ Subscript by Mathematica, but it has the semantics of SubscriptBox, by which I mean that it can be thought of as syntactic sugar for SubscriptBox[expr1, expr2]. For this reason, I argue that the operator \_ should be named SubscriptBox.

@rljacobson
Author

> Regarding InterpretBoxes, it does not exist

Oops, I included it in the table by accident.

The operator \* certainly exists, but Mathematica does not give it a name. There are several operators with this status, but my intention is to limit the scope of this issue to exclude those for now.

@mmatera
Contributor

mmatera commented Sep 12, 2024

> > Expressions like Subscript[x,y] are parsed as it is, not as its "boxed" version SubscriptBox
>
> To clarify, I am not concerned in this discussion about how different (M-expression) functions are evaluated. Rather, I am talking about the M-expression representation of operator forms. So for example, in the expression expr1\_expr2, the operator \_ is named SubsciptBox by Mathematica, but it has the semantics of SubscriptBox, by which I mean that it can be thought of as syntactic sugar for SubscriptBox[expr1, expr2]. For this reason, I argue that the operator \_ should be named SubscriptBox.

Ah, OK, what you are talking about is the "string representation of boxes"
https://reference.wolfram.com/language/tutorial/TextualInputAndOutput.html#28564

So, when you write some string between \( ... \) the way in which the inner text is parsed is different:

In[1]:= A\_B                                                                    

Syntax::syntyp: \ operators can only be used between \( \).

Syntax::sntxf: "A" cannot be followed by "\_B".

In[2]:= A_B                                                                     

Out[2]= A_B

In[3]:= \(A_B\)                                                                 

Out[3]= A_B

In[4]:= \(A_B\)//FullForm                                                       

Out[4]//FullForm= "A_B"

In[5]:= \(A\_B\)                                                                

Out[5]= SubscriptBox[A, B]

In Mathics, it works in exactly the same way:

(*Mathics CLI*)
In[1]:= P\_Q
Syntax::sntxf: "P" cannot be followed by "\_Q" (line 1 of "<stdin>").

In[2]:= \(A\_B\)
Out[2]= SubscriptBox[A, B]

In[3]:= \(A\^B\)
Out[3]= SuperscriptBox[A, B]

In[4]:= \(P_Q\)
Out[4]= "P_Q"
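
The behavior in both sessions above (box operators are rejected outside \( ... \) and produce boxes inside) can be mimicked with a very small Python sketch. The function parse_box_input and the BOX_OPERATORS table are invented for this illustration; it handles a single binary box operator and is not how the Mathics or WMA parsers are implemented.

```python
# Hypothetical table covering only the two box operators used above.
BOX_OPERATORS = {r"\_": "SubscriptBox", r"\^": "SuperscriptBox"}

def parse_box_input(source):
    """Toy model: \\_ and \\^ are only legal between \\( and \\)."""
    inside = source.startswith(r"\(") and source.endswith(r"\)")
    body = source[2:-2] if inside else source
    for token, box_head in BOX_OPERATORS.items():
        if token in body:
            if not inside:
                raise SyntaxError(r"\ operators can only be used between \( \)")
            left, right = body.split(token, 1)
            return (box_head, left, right)
    if inside:
        # No box operator: the content stays a plain string, as in "P_Q".
        return body
    raise NotImplementedError("ordinary input is not modelled in this sketch")

print(parse_box_input(r"\(A\_B\)"))  # ('SubscriptBox', 'A', 'B')
print(parse_box_input(r"\(A\^B\)"))  # ('SuperscriptBox', 'A', 'B')
print(parse_box_input(r"\(P_Q\)"))   # P_Q
```

Calling parse_box_input(r"P\_Q") raises SyntaxError in this sketch, mirroring the Syntax::sntxf message above.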

@rljacobson
Author

Maybe it's worth talking about what it means to "parse" a Mathematica expression. Here are a few different meanings, all of which are valid and useful for different purposes.

  1. Give a tree representation of the raw text of the source code that preserves whitespace, comments, operator representations used, etc., but supplies the code with structure according to operator precedence and so forth. Mathematica's frontend does this—although it modifies the representation dynamically as the user types.
  2. Same as (1) except whitespace and comments are disregarded. This is useful as an intermediate step for further transformation.
  3. Representing the code in a FullForm-like M-expression representation that does not preserve formatting. In this form, the operator /@ is represented by Map (though typically in some internal tree representation), and likewise all "special" operators are represented by their underlying semantically equivalent M-expressions. For this reason, information about how the original expression was input—using an operator form or just an M-expression—is lost.

For my own purposes, I spend most of my time thinking about (3). But all three require that the parser have knowledge of operator properties. Observe that once (1) is obtained, the operators can just be represented by their own (quoted) textual representation, which is what the frontend does.
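
A tiny Python sketch of meaning (3) may help show why the input form is lost. The OPERATOR_HEADS table is a small illustrative subset and both helper functions are hypothetical names; the only fact taken from the discussion is that /@ corresponds to Map.

```python
# Meaning (3): operator forms and explicit M-expressions end up as the same tree.
# OPERATOR_HEADS is an illustrative subset, not a real operator database.
OPERATOR_HEADS = {"/@": "Map", "@@": "Apply"}

def from_operator_form(left, operator, right):
    """Parse 'left operator right' into a toy (head, arg, arg) tuple."""
    return (OPERATOR_HEADS[operator], left, right)

def from_m_expression(head, *args):
    """Parse 'head[args...]' into the same toy representation."""
    return (head, *args)

a = from_operator_form("f", "/@", "xs")  # f /@ xs
b = from_m_expression("Map", "f", "xs")  # Map[f, xs]
print(a == b)  # True
```

Because both inputs produce the same tuple, nothing in the result records which surface form the user typed.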

So there is a sense in which the names of expressions are unimportant. But in a way I feel like this work is part of making (3) happen, which is a necessary step for evaluation.

@mmatera
Contributor

mmatera commented Sep 12, 2024

> Maybe it's worth talking about what it means to "parse" a Mathematica expression. Here are a few different meanings, all of which are valid and useful for different purposes.

OK, but in any case, in the table, where you write expr1&expr2 it should be \(expr1\&expr2\), and so forth, shouldn't it? Then I agree that inside a \(...\) block \& is parsed as OverscriptBox and, for the sake of clarity, the operator should have that name.

@rocky
Member

rocky commented Sep 14, 2024

> In a small handful of cases, Mathematica gives the wrong name to an operator.

It took me a while to understand this, and without @mmatera's follow-up discussion I doubt I would have. I agree with @mmatera that the "Usage String" should be corrected: \(expr1\&expr2\) should appear instead of expr1&expr2.

While the idea of "boxing" is not new (it comes from Knuth's TeX, and CSS adopts it), using operators in boxing expressions, as found in WL, is somewhat new and rare. It is not the kind of thing that I suspect most people would think of when thinking about WL or Mathics3 parsing.

As far as I can tell, the "wrong name" for the association between \& and OverscriptBox[] is only evident from looking at WolframLanguageData[], and that use is not documented. There is a table of correspondences for the boxing operators in the docs, but that lists "OverscriptBox", not "Overscript". And the documentation for Overscript does not mention a \& correspondence.

So in sum, while this may be a minor problem in what WolframLanguageData[] reports, it doesn't strike me as a big deal.

As for thoughts on how we should address this: yes, I guess these should be added to operators.yml, with their ASCII or Unicode equivalent which currently is commented out in the {N,L,O}-tokens.

@rocky
Member

rocky commented Sep 14, 2024

> Regarding Infix, Prefix, and Postfix, they are used as intermediate steps in the formatting process: it is a hand[y] way to represent expressions involving different operators into an abstract tree expression keeping information about how the expression must be rendered. Then, we can use the same MakeBoxes rules for Plus, Times , CircleTimes, etc.

Apparently, there are two uses we have in Mathics3. In WL these are built-in functions (Infix, Postfix), that do some sort of printing.

In operators.yml there is a "precedence" field, which the Mathics-core parser reads via JSON to decide how to build a parse tree.

> In Mathics we still have a rudimentary implementation of these symbols, but I guess there is no need to store in MathicsScanner specific details about how to parse and render them.

Things are not ideal, so let me explain.

Initially, MathicsScanner was created because Mathics-core was too large and I wanted to break off pieces of this. So right from the start, I knew things were a little coarse.

Before the recent addition of YAML for operators, the MathicsScanner github repository had two things primarily: the Python Mathics3 scanner code, and "Named Character" information named-characters.yml. The intent was for the tokenizer in MathicsScanner to pick up information from named-characters.yml via some intermediary JSON extraction of the data. I see this currently has not been done, although it would be good to do so.

The projects mathicsscript, mathics-pygments, and mathics-core do use information from named-characters.yml.

Of course, some named characters also happen to be operators. And this is also noted — we indicate when a named character is used as either operator or part of an operator.

Recently, I added a new YAML table from Robert Jacobson's CSV. This is used to gather information about operators that we can use in a machine-readable way. In fact, right now it is used in Mathics-core for operator precedence information.

In an ideal world, we would split the data portion from mathics-scanner. (It should have been done back in January 2021, but things were messier then and we had far fewer unit tests, so it would have been beyond my capabilities.)

Right now, though, that hasn't been a big priority for me. If someone else wants to do it, go for it! But it will be a bit of work, since 3 other repositories will have to be adjusted to point to the new, split-off repository.

@rljacobson
Author

Very productive discussion!

I might have confused matters with a typo:

> ...the operator \_ is named ~~SubsciptBox~~ Subscript by Mathematica, but it has the semantics of SubscriptBox...

Oops.

> ...what you are talking about is the "string representation of boxes"...

Yes, also called the box sublanguage. The fact that operators like \_ can only be used within \(...\) is a grammatical feature of the language not tied to that specific operator, as arbitrary code can appear inside \(...\). Nonetheless, it does make sense to indicate which operators are part of the box sublanguage. And if it makes things more clear for people to also include \(...\) in the "usage" string, then it's fine with me to add them.

But this raises a software engineering question of how to design the parser for the box sublanguage. This is because \(a+b^c\) parses to RowBox[{"a", "+", RowBox[{"b", "^", "c"}]}]. So operator precedence is still respected, but the interpretation is different. I don't know what Mathics does, if it first produces a syntax tree, or if it parses to the box representation directly, or what. I'm not sure it matters for the present discussion.

@mmatera
Contributor

mmatera commented Sep 17, 2024

Indeed, it seems that the WMA parser takes into account the precedence when it parses the "sublanguage". For example,

In[1]:= \(a+b*c^2+d\)                                                          

Out[1]= RowBox[{a, +, RowBox[{b, *, RowBox[{c, ^, 2}]}], +, d}]

Also, if you continue the expression with a "normal" multiplication, what you get is the same as if you had written the Box expression explicitly:

In[2]:= \(a+b*c^2+d\) *3                                                       

Out[2]= 3 RowBox[{a, +, RowBox[{b, *, RowBox[{c, ^, 2}]}], +, d}]

As for whether it actually matters, I think it does, because our parser should be able to parse this kind of input.
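
One possible way to get precedence-respecting RowBox output like Out[1] above is precedence climbing that emits boxes instead of M-expressions. The Python sketch below is a toy under stated assumptions: the three precedence ranks are arbitrary, only +, * and ^ are supported, and tokenize, parse_row, and show are names made up here. It says nothing about how WMA or Mathics actually implement this.

```python
import re

# Illustrative ranks only; real precedence values differ.
PRECEDENCE = {"+": 1, "*": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}

def tokenize(source):
    """Split into identifiers/numbers and the three supported operators."""
    return re.findall(r"[A-Za-z0-9]+|[+*^]", source)

def parse_row(tokens, min_precedence=1):
    """Precedence climbing that builds ("RowBox", [items]) tuples.
    A chain of the same operator is flattened into a single RowBox."""
    left = tokens.pop(0)
    while tokens and PRECEDENCE[tokens[0]] >= min_precedence:
        operator = tokens.pop(0)
        precedence = PRECEDENCE[operator]
        next_min = precedence if operator in RIGHT_ASSOCIATIVE else precedence + 1
        right = parse_row(tokens, next_min)
        if isinstance(left, tuple) and left[1][1] == operator:
            left[1].extend([operator, right])  # a + b + c stays one row
        else:
            left = ("RowBox", [left, operator, right])
    return left

def show(box):
    """Print a toy box the way the sessions above display it."""
    if isinstance(box, str):
        return box
    return "RowBox[{" + ", ".join(show(item) for item in box[1]) + "}]"

print(show(parse_row(tokenize("a+b*c^2+d"))))
# RowBox[{a, +, RowBox[{b, *, RowBox[{c, ^, 2}]}], +, d}]
```

The only difference from an ordinary expression parser is the node constructor: the operator token is kept as an item of the row rather than being replaced by a head such as Plus or Times.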
