|
I haven't been working with compilers for a number of years, so maybe there are younger species out there that do things in a different way - I know the "classical" way of doing it, believing that today's compilers are roughly the same:
First, you break the source text into tokens. Then you try to identify structures in the sequence of tokens so that you can form a tree of hierarchical groups representing e.g. functions at some intermediate level, statements at a lower level, terms of a mathematical expression even further down. The term DAG - Directed Acyclic Graph - is commonly used for the parse tree. Nodes in the DAG commonly consist of 3-tuples or 4-tuples in a more or less common format for all nodes: some semantic / operation code, two or three operands, or whatever else the compiler writer finds necessary.
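Those two stages can be sketched in a few lines of Python - a toy tokenizer and a recursive-descent parser that builds the tree of tuples described above. The grammar (just `+`, `*` and parentheses) is my own invention for illustration, not from any real compiler:

```python
import re

def tokenize(src):
    # Break the source text into tokens: names, numbers, operators.
    return re.findall(r"[A-Za-z_]\w*|\d+|[+*()]", src)

def parse_expr(tokens):
    # expr := term ('+' term)*  -- builds 3-tuples: (op, left, right)
    node = parse_term(tokens)
    while tokens and tokens[0] == '+':
        tokens.pop(0)
        node = ('+', node, parse_term(tokens))
    return node

def parse_term(tokens):
    # term := factor ('*' factor)*
    node = parse_factor(tokens)
    while tokens and tokens[0] == '*':
        tokens.pop(0)
        node = ('*', node, parse_factor(tokens))
    return node

def parse_factor(tokens):
    tok = tokens.pop(0)
    if tok == '(':
        node = parse_expr(tokens)
        tokens.pop(0)  # discard the closing ')'
        return node
    return tok  # a variable name or constant is a leaf
```

For "a + b * c" this yields the nested tuple ('+', 'a', ('\*', 'b', 'c')) - multiplication binds tighter, reflecting the hierarchy of statements, terms and factors described above.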
Many kinds of optimisation are done by restructuring the DAG: recognizing identical sub-trees (e.g. common subexpressions) that need to be evaluated only once, identifying statements within a loop that will have identical effect in every iteration so that the sub-tree can be moved out of the loop, etc. etc. Unreachable code is pruned off the DAG. All such operations are done on an abstract level - a variable X is treated as X without regard to its location in memory, number of bits (unless the language makes special requirements) etc. etc. The DAG is completely independent of the word length, byte ordering, one's- or two's-complement arithmetic, register IDs or field structure of the instruction code of any specific machine architecture. You may think of variables and locations as still being in a sort of "symbolic" form (lots of symbolic labels were never visible in the source code, so this certainly is "sort of").
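One way to see how identical sub-trees collapse into shared DAG nodes is hash-consing: before a new (op, left, right) node is created, it is looked up in a table, and an existing identical node is reused. A toy sketch - the node format and table are invented for illustration, not taken from any real compiler:

```python
class DagBuilder:
    """Builds a DAG from expression tuples, sharing identical sub-trees."""
    def __init__(self):
        self.table = {}   # (op, left_id, right_id) -> node id
        self.nodes = []   # node id -> tuple

    def node(self, op, left=None, right=None):
        key = (op, left, right)
        if key in self.table:          # common subexpression: reuse it
            return self.table[key]
        self.nodes.append(key)
        self.table[key] = len(self.nodes) - 1
        return len(self.nodes) - 1

builder = DagBuilder()
a = builder.node('a')
b = builder.node('b')
s1 = builder.node('+', a, b)
s2 = builder.node('+', a, b)   # identical sub-tree: same node comes back
prod = builder.node('*', s1, s2)
```

Building (a + b) * (a + b) this way produces only four nodes, because the second a + b is recognized as the first one - that shared node is the common subexpression that now needs to be evaluated only once.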
Once you have done all the restructuring of the DAG that you care for, you traverse the tree to generate the actual machine instructions. (This part of the compiler is commonly called the "back end".) Now you assign memory addresses, decide on the use of registers, and choose the fastest sequence of machine instructions for that specific machine. You can still do some optimization, e.g. keeping values in registers (now that you know which registers you've got), but it is essentially very local. The DAG indicates which sub-trees are semantically independent of each other, so that you may reorder them, run them in parallel, or e.g. assemble six independent multiplication operations into one vector multiply if your CPU allows. All internal symbolic references can be peeled off; the only symbols retained are external entry points and references to external modules.
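A back end in miniature: a postorder walk over the tuple tree, handing out "registers" and emitting three-address instructions. The instruction names and register scheme here are made up for illustration - a sketch of the idea, not any real machine's code:

```python
import itertools

def codegen(node, code, fresh):
    """Postorder walk: children first, then the node's own instruction."""
    if isinstance(node, str):              # leaf: a variable or constant
        reg = f"r{next(fresh)}"
        code.append(f"LOAD {reg}, {node}")
        return reg
    op, left, right = node
    rl = codegen(left, code, fresh)        # the two sub-trees are
    rr = codegen(right, code, fresh)       # independent of each other
    reg = f"r{next(fresh)}"
    code.append(f"{'ADD' if op == '+' else 'MUL'} {reg}, {rl}, {rr}")
    return reg

code = []
codegen(('+', 'a', ('*', 'b', 'c')), code, itertools.count())
```

Because the walk visits semantically independent sub-trees one after the other, a cleverer back end could just as well evaluate them in a different order or in parallel, exactly as described above.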
The back end may produce machine instructions for a hypothetical CPU that does not exist in silicon. Yet it has (or may have) its registers, word length, binary address space etc. There could be a machine having this instruction set as its native one. Many years ago, someone wrote alternative microcode for a PDP-11 architecture so that it could execute the P4 bytecodes directly - but it was dead slow! Usually, you make a software virtual machine that pretends to be a "real" CPU for those instructions, interpreting the bytecodes one by one. The JVM is such a virtual machine. For many years, this was The Way to run Java.
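At its core, such a software virtual machine is just a dispatch loop over an instruction stream. A toy stack machine with three invented opcodes (not real JVM or P4 opcodes) shows the principle of interpreting bytecodes one by one:

```python
def run(bytecode):
    """Interpret a tiny stack-machine bytecode, one instruction at a time."""
    PUSH, ADD, MUL = 0, 1, 2           # invented opcodes for this sketch
    stack, pc = [], 0
    while pc < len(bytecode):
        op = bytecode[pc]; pc += 1
        if op == PUSH:
            stack.append(bytecode[pc]); pc += 1   # operand follows opcode
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
    return stack.pop()

# (2 + 3) * 4: PUSH 2, PUSH 3, ADD, PUSH 4, MUL
program = [0, 2, 0, 3, 1, 0, 4, 2]
```

Note that every instruction here is self-contained: the loop never needs to look anywhere but at the current opcode and its operand - which is exactly the property discussed further down in this thread.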
Compilers for dotNET essentially have no back end - they do not generate anything ready for execution. Essentially, their output is a linearization of the DAG, i.e. the abstract 4-tuple DAG nodes. The compiler back end is in the dotNET implementation: when a module is requested for the first time, the dotNET back end will do the last stages of compilation, creating machine-specific binary code in the native instruction set of that specific machine, assigning specific locations to the named variables etc. The compiled result is stored in a cache, so that next time the same code is requested, no new compilation is required.
So, while Java bytecode is meant to be complete, ready-to-run code (symbolic linking to other modules may still be required, but that is not code generation), dotNET assemblies are only half baked, requiring a final compilation step. This takes a little bit of time (it is surprisingly little!), but the generated code is native, requiring no interpretation.
Java compilers can generate binary code, rather than bytecode, but then it is for a specific machine. Or, the JVM may look at a (sequence of) bytecode(s) at run time and translate it to native instructions, but that is like interpreting Motorola 68000 instructions on a different CPU (Apple did that when they switched processor architectures, to run old binary software!) - you will always be bound by the limitations of the bytecode instruction set.
To a plain user with limited computer knowledge, there is little "visible" difference between the way the JVM and dotNET work, but at the internal level, the architectures are significantly different.
|
|
|
|
|
That was a lovely, clear and comprehensive elaboration. Thank you (assuming of course, that it is correct!).
|
|
|
|
|
All correct, but are you disagreeing with 'C# is compiled to a type of bytecode'?
|
|
|
|
|
My immediate reaction: Yes, I would disagree. Bytecodes are ready for execution, while the dotNET output from a C# compiler is not.
You could say that I am using a "narrow" definition of the term, but the term could have a more general meaning. Yes, it could - it could mean any code representation that is built up of bytes. Like source code. We could even generalize the "byte" concept: the old Univac mainframes could work with 9-bit bytes (4 to the word) or 6-bit bytes (6 to the word), while the DEC-10 and DEC-20 had 7-bit bytes (5 to the word and one spare bit).
But that is not the common "compiler guy" interpretation of "bytecode". The linearized DAG is not directly executable the way a bytecode is. Obviously, you could, at run time, do a just-in-time compilation into a bytecode for an interpreter, rather than compiling into native binary code. But at least as far as I know, there are no virtual machines directly interpreting dotNET assemblies with no pre-execution processing step.
In my student days, we were a group of students making an attempt to build a direct interpreter for the intermediate language from another front-end compiler (for the CHILL programming language), having a similar architecture. We soon realized that the data structures required to maintain the current execution state would be immensely large and complex; the task of building the interpreter would far exceed making a complete back-end compiler. You couldn't do without a unified symbol table. You couldn't do without a label-to-location mapping. You couldn't do without a lot of state information for various objects. You couldn't do without ... So we never completed the project. (It was a hobby project, not a course assignment.)
|
|
|
|
|
Nah, op-codes are ready for execution, bytecodes are not. I like this definition: Bytecode is a form of hardware-independent machine language that is executed by an interpreter. It can also be compiled into machine code for the target platform for better performance.
|
|
|
|
|
If you are right, then terminology is changing.
In my book, the op-code is that field in the binary instruction code that indicates what is to be done: Add, shift, jump, ... Usually, the rest of the binary instruction code is operand specifications, such as memory addresses or constants.
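For illustration, a made-up 16-bit instruction word with a 4-bit op-code field and two 6-bit operand fields can be picked apart with shifts and masks. The layout is invented for this sketch, not any real machine's:

```python
def decode(word):
    """Split a made-up 16-bit instruction word: [op:4][src:6][dst:6]."""
    opcode = (word >> 12) & 0xF    # 'what is to be done'
    src    = (word >> 6)  & 0x3F   # operand specifications:
    dst    = word         & 0x3F   # 'with what it is to be done'
    return opcode, src, dst

# op-code 3 ('add', say), source register 5, destination register 9
word = (3 << 12) | (5 << 6) | 9
```

The op-code proper is only that 4-bit field; the remaining 12 bits are operand specifications, which is the distinction argued here.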
In more high-level contexts I have seen "op-code" used for a field in a structure, e.g. a protocol. Again, the op-code tells what is to be done (at the level of "withdraw from bank account", "turn on" etc.); the other fields tell with what it is to be done.
You suggest a new interpretation, that an op-code is both the 'what to do' and the 'with what to do it'. Maybe that is an upcoming understanding, but certainly not the traditional one.
JVM bytecodes are certainly ready for execution, once you find a machine for them. It is easier to build a virtual machine, a simulator, than to build a silicon one. So that is what we do.
You can build a translator from MC68000 instructions to 386 instructions. Or from IBM 360 instructions to AMD64 instructions. Or from JVM instructions to VAX instructions. Suggesting that the intention of compiling to MC68K instructions was to serve as an intermediate step to 386 code would be crazy - that was never the intention of the MC68K instruction set. Similarly, the intention of Java bytecodes was never to be translated into another instruction set.
If you first compile to one instruction set (including bytecode, such as Java or Pascal P4 bytecode), and then translate to another instruction set, there is generally a loss of information, so the final code is of poorer quality than if it had been compiled directly from the DAG, which usually contains a lot of info that is lost (i.e. used and then discarded) in the back end. Some of it may be recovered by extensive code analysis, but expect to lose a significant part, in the sense that you will not utilize the target CPU fully - especially if the first/bytecode architecture has a different register philosophy (general? special?), interrupt system or I/O mechanisms. So, if at all possible, generate the target code from the intermediate level, not from some fully compiled instruction set.
|
|
|
|
|
You're agreeing with me: bytecodes are an abstraction; they're for the JVM, not the CPU.
|
|
|
|
|
The JVM IS a CPU!
Just like any microcoded processor is a CPU. There is no principal difference between microcode breaking down the instruction code into activation of the various circuits, and (compiled) C code doing the same thing.
Years ago, I was working on a machine which didn't provide BCD instructions in silicon. Cobol users could either buy a floppy disc with the microcode to give the CPU BCD instructions (microcode was kept in RAM), or they could use the software package that emulated BCD (triggered by the 'Illegal Instruction Code' interrupt).
How would you describe the BCD instructions? As "abstractions" like the Java bytecodes? Or as integral to the CPU (even though they triggered an Illegal Instruction Code if the microcode was not installed)? Are all microcoded instructions "abstractions"? If so, then this CPU, as well as a lot of others, is an abstraction.
The Java bytecodes are just like those BCD instructions, except that they cover the complete instruction set. And I am quite sure that it would be possible to write microcode (for this machine with the BCD) to make the bytecodes the "native instruction set" of the machine - it did provide logarithmic/trigonometric functions and malloc/free as instructions, and microcode was developed so that it executed Lisp more or less directly (after a tokenization, of course).
|
|
|
|
|
JVM bytecode is abstract - i.e. it's not your CPU's native opcodes. That's it.
|
|
|
|
|
Not even if I have a machine microcoded to handle them?
Years ago, someone did write microcode for a PDP-11 to directly execute Pascal P4 bytecodes. With that microcode, the machine could execute P4 instructions and nothing else. Load a P4 file into RAM, set the instruction pointer to the starting point, and run: The program would execute.
JVM bytecodes are quite similar to P4 bytecodes; there are no essential principal differences that make one abstract and the other machine instructions.
...Unless you say that "native opcodes" are those where each bit in the instruction corresponds directly to one physical control signal steering the transistor logic. If you do, then you reject every CPU that has any microcode at all; you accept only 100% hardwired logic as a true CPU. Even though less microcode is used in today's CPUs than in the golden days of microcoding (like in the VAX era), almost all general CPUs today (as well as many specialized ones) are to some degree microcoded. By your logic, the binary instructions fed to those machines are not the CPU's native opcodes, but only an abstraction.
You have the right to say so, but renaming binary programs for almost all machines to "abstractions" does not contribute anything of significance.
|
|
|
|
|
If the definition of bytecode is that it's abstract, & non-natively-executable by the CPU, therefore requires another layer to turn it into the op-codes your given CPU can understand, then sure both your Java bytecode and your CIL/MSIL meet that definition? You say 'bytecodes are ready for execution', I don't agree. You need the JVM - right?
|
|
|
|
|
If I have a VAX executable file, containing VAX instructions, I need an interpreter for those codes. There never was a CPU that interpreted VAX instructions in pure silicon; every VAX on the market ran an interpreter, implemented in microcode.
If I have a JVM executable file, containing Java bytecodes, I need an interpreter for those codes. There never was a CPU that interpreted Java bytecodes in pure silicon. But there was a CPU interpreting Pascal P4 bytecodes, running an interpreter implemented in microcode. There is no reason why you couldn't do exactly the same for Java bytecodes.
So I'll agree with you on the condition that we agree that VAX instructions, IAPX 386 instructions, IBM 360 instructions, Java bytecodes and Pascal P4 bytecodes are all in the same group. None of them is natively executed by silicon; all require an interpreter implemented at a lower level.
The distinction between instructions being directly implemented in silicon and those being interpreted by some code at a lower level may be essential with regard to speed and the physical size of the silicon die. For the user, for the programmer, and for the system architecture as seen at the program interface, the difference is marginal.
A far more essential question is whether you need to do any preprocessing to a file before you submit it for execution. For a Java bytecode (or Pascal P4) file, you need not do any further processing: The interpreter, whether written in microcode or as, say, conventional PDP-11 machine code, can start churning bytecodes right away, one by one.
With CIL, you can NOT feed the tuples to an interpreter one by one and have it interpret them as you go. Actually, I have tried to do so - not with CIL, but with a very similar intermediate code coming out of the front-end compiler for the CHILL language. At the outset, it looked doable - after all, each tuple indicates an operation and some operands and stuff, sort of like an instruction. But the deeper you dig into it, the more you find that is yet-to-be-determined. Stuff that is dependent on the context, but that context must be built up from information in other parts of the DAG structure. Lots of things you cannot do without traversing major parts of the graph to do a single operation ... unless you do a preprocessing before you start doing any execution at all. THAT is an essential difference. You MUST do a preprocessing pass before you can start any execution, and that preprocessing will reshape the code into something else that can be interpreted one by one, like traditional binary VAX instructions.
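A concrete example of that yet-to-be-determined context: a branch to a symbolic label in a linearized tuple stream cannot be executed as you go, because the target position is only known after a pass over the whole stream. A hypothetical sketch (the tuple forms are invented, far simpler than real CIL) of the minimum preprocessing step:

```python
def preprocess(tuples):
    """First pass over the whole stream: build the label-to-location map
    that the individual tuples do not carry by themselves."""
    labels = {}
    for i, t in enumerate(tuples):
        if t[0] == 'LABEL':
            labels[t[1]] = i
    return labels

def execute(tuples):
    labels = preprocess(tuples)    # cannot be skipped: 'L1' below is
    pc, trace = 0, []              # meaningless without this map
    while pc < len(tuples):
        t = tuples[pc]
        if t[0] == 'GOTO':
            pc = labels[t[1]]      # forward branch: target known only
            continue               # because of the preprocessing pass
        if t[0] == 'PRINT':
            trace.append(t[1])
        pc += 1
    return trace

program = [('GOTO', 'L1'), ('PRINT', 'skipped'),
           ('LABEL', 'L1'), ('PRINT', 'reached')]
```

Even in this trivial case, the very first tuple cannot be executed without having scanned the rest of the stream first - and the real intermediate code adds symbol tables, object state and much more on top of the label map.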
Sure, there are a lot of preparations to be done to set up, say, a Windows process before it can execute the first instruction of the user program, but that is OS business, not CPU business. The Windows process setup does not restructure the instructions the way CIL is transformed into a different format. All the machine instructions in the application code are perfectly valid as machine instructions even without the OS preparations.
If you have a PDP-11 with the microcode to interpret Pascal bytecodes, you are in the same situation: every single one of the bytecodes is valid and can be executed one by one as it stands, without being restructured in any way, and is fully defined by itself, without having to analyze any large context.
You may still believe that even if my group failed to make a direct interpreter for that intermediate code from the CHILL compiler, it could be done, if we had been clever enough. It turned out that the language and compiler designers said that if we had been clever enough we wouldn't have started the project at all. The intermediate code was never meant for interpretation. No intermediate code is. But bytecode is.
Feel free to implement a direct interpreter for CIL! Report back when you have completed your work.
|
|
|
|
|
To be fair, if you type "bytecode" into Google, most of the links returned refer to Java rather than the more generic usage. The former would suggest a definition of 'type of' whereas the latter would not require the comparison.
|
|
|
|
|
If Google had existed in the early 1980s, a search for "bytecode" would have returned thousands of references to Pascal and its P4 bytecode format. The Pascal compiler was distributed as open source, with a back end for a virtual machine (also available as open source for a couple of architectures). You could either adapt the VM to the architecture of your machine and keep the compiler unchanged, or you could replace the P4 code generating parts of the compiler with binary code generation for your own machine.
Actually, lots of interpreters for today's non-compiled languages do some compilation into some sort of bytecode, which is cached internally so that e.g. a loop body needs to be symbolically analyzed only on the first iteration. But Java is the only language (after Pascal and its P4) to really focus on this, making "Java Virtual Machine" a marketing concept, and really pushing "write once, run anywhere" as The Selling Point of the language (more so 20 years ago than today). So you are right: Java is very prominent in bytecode references.
|
|
|
|
|
Member 7989122 wrote: If Google had existed in the early 1980s, a search for "bytecode" would have returned thousands of references to Pascal and its P4 bytecode format
Agreed. But that is not relevant at all.
Member 7989122 wrote: Actually, lots of interpreters for non-compiled languages of today do some compilation into some sort of bytecode
Since I have written two compilers and a number of interpreters and taken the requisite academic courses related to compilers, I am quite familiar with how they work.
However, most programmers - certainly the many I have worked with - have not done either of those, so I wouldn't expect them to be familiar with that. So I have no expectation that the average programmer would be either.
|
|
|
|
|
That reference is saying the similarity is with the .NET Framework to the Java platform.
Don't confuse the C# Language (or any language) with the .NET Framework which facilitates "managed code" and JIT compilation, similar (and in response) to the 'Java platform' i.e. CIL(MSIL) is synonymous with Java bytecode, CLR synonymous with JRE/JVM etc.
|
|
|
|
|
Member 10815573 wrote: Don't confuse the C# Language (or any language) with the .NET Framework which facilitates "managed code" and JIT compilation, similar (and in response) to the 'Java platform' i.e. CIL(MSIL) is synonymous with Java bytecode, CLR synonymous with JRE/JVM etc.
I wasn't confusing anything. C# and Java were very similar languages at the point when C# was launched and with good reason - Anders Hejlsberg (who is the lead architect on the team developing the C# language) had previously developed Microsoft's J++ language (Microsoft's discontinued implementation of Java). C# really took many of the good bits of Java while the .NET Framework mirrored the Java platform.
C Sharp (programming language) - Wikipedia
Quote: James Gosling, who created the Java programming language in 1994, and Bill Joy, a co-founder of Sun Microsystems, the originator of Java, called C# an "imitation" of Java; Gosling further said that "[C# is] sort of Java with reliability, productivity and security deleted."[17][18] Klaus Kreft and Angelika Langer (authors of a C++ streams book) stated in a blog post that "Java and C# are almost identical programming languages. Boring repetition that lacks innovation,"[19] "Hardly anybody will claim that Java or C# are revolutionary programming languages that changed the way we write programs," and "C# borrowed a lot from Java - and vice versa. Now that C# supports boxing and unboxing, we'll have a very similar feature in Java."[20] In July 2000, Anders Hejlsberg said that C# is "not a Java clone" and is "much closer to C++" in its design.[21]
Since the release of C# 2.0 in November 2005, the C# and Java languages have evolved on increasingly divergent trajectories, becoming somewhat less similar. One of the first major departures came with the addition of generics to both languages, with vastly different implementations. C# makes use of reification to provide "first-class" generic objects that can be used like any other class, with code generation performed at class-load time.[22] Furthermore, C# has added several major features to accommodate functional-style programming, culminating in the LINQ extensions released with C# 3.0 and its supporting framework of lambda expressions, extension methods, and anonymous types.[23] These features enable C# programmers to use functional programming techniques, such as closures, when it is advantageous to their application. The LINQ extensions and the functional imports help developers reduce the amount of "boilerplate" code that is included in common tasks like querying a database, parsing an xml file, or searching through a data structure, shifting the emphasis onto the actual program logic to help improve readability and maintainability.[24]
Right now - thanks both to Microsoft's constant work improving .NET and C# and Oracle's lack of effort on Java - C# is years ahead of Java and is a very different language.
Now is it bad enough that you let somebody else kick your butts without you trying to do it to each other? Now if we're all talking about the same man, and I think we are... it appears he's got a rather growing collection of our bikes.
modified 31-Aug-21 21:01pm.
|
|
|
|
|
Not a citation, but...
When I first read the C# spec in 1999, someone asked me, "isn't that just Microsoft Java?"
|
|
|
|
|
I think everyone thought that at the time..
C Sharp (programming language) - Wikipedia
Quote: James Gosling, who created the Java programming language in 1994, and Bill Joy, a co-founder of Sun Microsystems, the originator of Java, called C# an "imitation" of Java; Gosling further said that "[C# is] sort of Java with reliability, productivity and security deleted."[17][18] Klaus Kreft and Angelika Langer (authors of a C++ streams book) stated in a blog post that "Java and C# are almost identical programming languages.
|
|
|
|
|
That's what I meant. When they did not want the licenses for J anymore and replaced it with C#, everyone claimed that C# was just a Java clone, totally ignoring that there were plenty of things that went further from the beginning (no primitive data types, a common CLR across .Net languages, properties for objects...)
|
|
|
|
|
CodeWraith wrote: When they did not want the licenses for J anymore
Errr... after a long trial, MS agreed not to do Java anymore. If they didn't want it, they wouldn't have fought so long to keep it.
CodeWraith wrote: that went further from the beginning (no primitive data types, a common CLR across
Can't imagine that that wouldn't be required to avoid more legal trouble. For example, look at Java on Android and the Oracle suit about that. The technology itself just encapsulates the business need, but the business need required that it be different.
|
|
|
|
|
If they didn't want it then they wouldn't have fought so long to keep it.
Not a reasonable conclusion. On more than one occasion Microsoft has engaged in litigation to slow down the opposition, deplete their resources and demoralise them.
Microsoft did this to Sybase while Microsoft Access was being prepared, and was caught very much on the back foot when it unexpectedly won the rights to the source code for SQL Server.
PeterW
If you can spell and use correct grammar for your compiler, what makes you think I will tolerate less?
|
|
|
|
|
It was. I think the book author is misinformed. Unfortunately, people assume that just because something is printed it's always correct.
Jeremy Falcon
|
|
|
|
|
Slacker007 wrote: C# was modeled mostly after Java and C++
Says you.
Here is another quote from a few pages later: Quote: Because C# is a hybrid of numerous languages, the result is a product that is as syntactically clean (if not cleaner) as Java, is about as simple as VB, and provides just about as much power and flexibility as C++.
TROELSEN, ANDREW; Japikse, Philip. C# 6.0 and the .NET 4.6 Framework (Kindle Locations 3129-3131). Apress. Kindle Edition.
And
Quote: For example, like VB, C# supports the notion of class properties (as opposed to traditional getter and setter methods) and optional parameters.
TROELSEN, ANDREW; Japikse, Philip. C# 6.0 and the .NET 4.6 Framework (Kindle Locations 3124-3125). Apress. Kindle Edition.
There are two kinds of people in the world: those who can extrapolate from incomplete data.
There are only 10 types of people in the world, those who understand binary and those who don't.
|
|
|
|
|
Medical fact - quitting VB now will greatly increase your lifespan. No citations needed.
|
|
|
|
|