|
I've been working with C++ includes lately. Whilst examining those utilized in LLVM, I examined several .h header files and found they are not templated and are not merely a declaration but a definition containing code, and to my surprise they are included in several .cpp source files. Closer examination revealed they define classes only, which I suppose is okeedokee to include in multiple .cpp files. However, does this not result in code bloat, as each source file must now generate the binary for each class's implementation? In my own work I insist on placing the implementation of any non-templated class in its own .cpp source file and letting the linker sort it out. I assumed this was a sine qua non. Is there something I do not understand? The writers of LLVM are of course much more knowledgeable/experienced/clever/intelligent than myself. Thank you kindly.
|
|
|
|
|
Lately, there seems to be a move toward header-only libraries. Most probably, this is a reaction to dependency hell (I'm a survivor of "DLL hell"). As each project has more and more dependencies, it becomes more difficult to make sure binaries are built properly and consistently, and the easiest way out is to put everything in header files as inline functions and let the compiler deal with it. This makes for horrendous compile times and is clearly not a scalable solution, but for the time being, that's the way the cookie crumbles.
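For reference, here is a minimal sketch of the kind of header under discussion. The file name `counter.h`, the class `Counter` and the function `add_twice` are made up for illustration and are not taken from LLVM. Member functions defined inside the class body are implicitly inline, so the header can be included in many .cpp files without breaking the one-definition rule, and toolchains typically keep a single copy of any function that is not inlined away.

```cpp
// counter.h: a hypothetical header-only class (all names are illustrative)
#ifndef COUNTER_H
#define COUNTER_H

#include <cassert>

class Counter {
public:
    // Member functions defined inside the class body are implicitly inline,
    // so this definition may appear in every .cpp file that includes the header.
    void add(int n) {
        assert(n >= 0 && "this sketch only counts upward");
        total_ += n;
    }

    int total() const { return total_; }

private:
    int total_ = 0;
};

// A free function defined in a header needs an explicit 'inline' to be
// safely included from more than one .cpp file.
inline int add_twice(Counter& counter, int n) {
    counter.add(n);
    counter.add(n);
    return counter.total();
}

#endif // COUNTER_H
```

The price is the one mentioned above: every .cpp file that includes the header compiles these bodies again, which is where the build time goes.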
Mircea
|
|
|
|
|
I do not understand how placing the inner workings of each class or function in the header, without separating the two, would make things easier. Just the opposite has been my experience, as more possibly external identifiers are then exposed, which is why in my work I place, as stated, non-templated definitions in .cpp source files. Templated classes and functions are separated into declaring .h files and implementing impl.h files. Each .cpp source file first includes all the declaring .h files, templated or not, in the proper order to obtain a clean compile. Below these are all the impl.h files, in any order. One of my current projects is to automate the elimination of unneeded includes. The first prototype, which I got working last night, reduced a project's build time from 2m to 10s.
|
|
|
|
|
BernardIE5317 wrote: I do not understand how placing the inner workings of each class or function in the header, without separating the two, would make things easier
You know those pesky defines that you have in a header file? Imagine you have a library built with
#define OPTION_A 1
#define OPTION_B 0
but your coworker needs the same library built with different options:
#define OPTION_A 0
#define OPTION_B 1
Now you have two different libraries lib_opt10.lib and lib_opt01.lib .
Another project cannot be switched from Visual Studio 2019 to Visual Studio 2022. So now you have lib_opt01_19.lib, lib_opt10_19.lib, lib_opt01_22.lib and lib_opt10_22.lib. And so on and so forth (debug, release, x86, x64, arm, etc.). Congratulations on graduating from "DLL hell" and welcome to the new and improved "dependency hell".
Now if you have everything in the header file, all of a sudden there are no more libraries to manage and the only "small" side-effect is that you drink humongous amounts of coffee while waiting for your project to build.
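A rough sketch of how that plays out with a header-only library. The header name `widget.h` and the function `widget_process` are hypothetical; only OPTION_A and OPTION_B come from the example above. Each project picks its options at compile time and the same header adapts, so there is no prebuilt .lib per combination.

```cpp
// widget.h: a hypothetical header-only library (names are illustrative)
#include <cassert>

// The including project may define these before the #include;
// otherwise the defaults below are used.
#ifndef OPTION_A
#define OPTION_A 1
#endif
#ifndef OPTION_B
#define OPTION_B 0
#endif

inline int widget_process(int value) {
    assert(value >= 0 && "hypothetical precondition for this sketch");
#if OPTION_A
    value += 1;   // behaviour enabled by OPTION_A
#endif
#if OPTION_B
    value *= 2;   // behaviour enabled by OPTION_B
#endif
    return value;
}
```

A coworker who needs the other configuration would define OPTION_A as 0 and OPTION_B as 1 in their own build (for example on the compiler command line) and include the very same file. One caveat: all source files that end up in the same program still have to agree on the options, otherwise the inline definitions differ between translation units.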
Mircea
|
|
|
|
|
Thank you for kindly explaining the benefit of header-only libraries. I must conclude that programmers other than myself know how to define their classes with no external dependencies, so that order of declaration is not important. For me that has proven a nightmare, but one which I have finally overcome to my satisfaction. Kind regards
|
|
|
|
|
BernardIE5317 wrote: whilst examining those utilized in llvm i examined several h header files and found they are not templated and are not merely a declaration but a definition containing code
It would of course depend on what they were exactly doing.
It has been many years, but at least way back when, a 'template' ended up generating its own code for each usage. That was probably necessary because it was not up to the compiler to figure out where overlaps might cause problems.
But excluding actual templates then no I am not a fan of putting a lot of code in header files.
BernardIE5317 wrote: much more knowledgeable/experienced/clever/intelligent than myself
Perhaps...Or they just have a high opinion of their own opinion.
|
|
|
|
|
A C# program is compiled with a compiler written in c++
A C++ program is compiled with a compiler written in assembly
An assembly program is compiled with a compiler written with machine instructions
Is that roughly how it works?
How does programming with machine instructions take place?
Let’s say I have the ASM MOV command; translated to machine instructions, that’s probably one or two numbers. One tells the processor it’s an operation rather than a register or memory address, and the other says exactly which type of operation it is. I’m making up stuff for how things could work.
To get the above functioning on all processors a standard should be required where the numbers/machine instructions for MOV are recognized everywhere. I mean it should work like a hardware resource with the same ID present on old and new processors.
|
|
|
|
|
Calin Negru wrote: A C# program is compiled with a compiler written in c++
A C++ program is compiled with a compiler written in assembly
An assembly program is compiled with a compiler written with machine instructions
Is that roughly how it works?
Not exactly. It isn't a hierarchy where each language is translated into a lower-level language. Let's put C# aside for the moment because it isn't exactly compiled (I'll explain in a bit). For the other languages the compiler is a machine language program (the processor cannot execute anything else) that translates the source code directly into machine language. In general you don't care in what language the compiler was written. These days it's rare to have an assembler or compiler written in assembly language. Most/all of them are written in C, C++ or other high-level languages.
C# is a bit different because the compiler translates the source program into code for a virtual machine. This is called IL (intermediate language). When the program gets executed, the parts that need to be executed are translated into machine code. This is called Just-In-Time (JIT) compiling.
Edit: I left out a lot of details and exceptions. For instance the first C++ "compiler" was actually a preprocessor that translated C++ to plain C. This is a very rough sketch.
Mircea
|
|
|
|
|
Calin Negru wrote: A C# program is compiled with a compiler written in c++
A C++ program is compiled with a compiler written in assembly
An assembly program is compiled with a compiler written with machine instructions
Is that roughly how it works?
Nope, that's not how it works. For example, the Roslyn compiler platform for .NET is written in C#. The very first C# compiler was probably written in C/C++, but not subsequent versions.
A MOV instruction is not the same size on all platforms. For example, a MOV instruction, with operands, on an 8-bit CPU is not the same size as it is on 64-bit CPUs. The op-code value is determined by the available addressing modes for the instruction, the available registers and their width, the instruction decode logic and hardware in the CPU, the data bus width, the address bus width, and a sprinkle of the arbitrary. Since there are vast differences in CPU design, the hardware makes it impossible to have the same representation across all CPUs.
Think about it. On a 64-bit CPU, how are you going to have a "MOV r, imm" (MOVe immediate to register) be represented and work exactly the same on an 8-bit CPU when it doesn't have registers that can hold a 64-bit integer?
There's far more to this than what I've posted, but this scratches the surface of why that idea will never work.
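To make the size difference concrete, here is a small sketch that builds the bytes for "put a constant into a register" on three instruction sets. The function names are made up for this example. The opcode values used are 0xB8 for x86 `mov eax, imm32`, 0xA9 for 6502 `LDA #imm` and the MOVZ pattern 0x52800000 for AArch64. Treat it as an illustration of the point, not as an assembler.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Appends a 32-bit value to 'out' as four little-endian bytes.
void append_le32(Bytes& out, std::uint32_t value) {
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>(value >> shift));
    }
}

// x86 / x64: "mov eax, imm32" is opcode 0xB8 followed by the 32-bit
// immediate, five bytes in total.
Bytes encode_x86_mov_eax(std::uint32_t imm) {
    Bytes out{0xB8};
    append_le32(out, imm);
    return out;
}

// MOS 6502: there is no MOV; "LDA #imm8" (opcode 0xA9) loads an 8-bit
// immediate into the accumulator, two bytes in total.
Bytes encode_6502_lda_imm(std::uint8_t imm) {
    return Bytes{0xA9, imm};
}

// AArch64: "mov wN, #imm16" (the MOVZ encoding) is one fixed-width 32-bit
// word: 0x52800000 | imm16 << 5 | register number.
Bytes encode_aarch64_mov_w(unsigned reg, std::uint16_t imm) {
    assert(reg <= 30 && "register 31 is not a general-purpose destination here");
    Bytes out;
    append_le32(out, 0x52800000u | (static_cast<std::uint32_t>(imm) << 5) | reg);
    return out;
}
```

Same idea, three different byte sequences of three different lengths (5, 2 and 4 bytes), and the 6502 version cannot even express a constant wider than 8 bits.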
|
|
|
|
|
A compiler for "any" programming language can be written in "any" language.
Quite a few compilers throughout history have been written in themselves. Usually, you cannot start out with that (I'll come back to that below): You must write the very first compiler in some other language. Often, that first compiler handles only a small subset of the new language. When developing Pascal, Wirth tried to write this very first compiler in Fortran, but gave up: While you can write a compiler in Fortran, it certainly isn't a language well suited for the task. So Wirth changed to assembly for the very first small-subset-Pascal compiler.
(I know of an operating system that was written in Fortran, but most people refuse to believe that!)
Once you've got that subset-Pascal (or whatever language we are talking about) up and running, you program the next compiler version in that subset-Pascal, but now you make a more advanced compiler, maybe for the entire, un-subsetted language. Now you have a full compiler written in itself.
Most likely, that subset-Pascal was so limited that you had to program in less elegant ways to get around the limitations. So maybe you program a third Pascal compiler, but since you have now got a full-featured compiler at your hand, you can program version 3 using all the great new features of your new language.
This process of going from a first (here: programmed in assembly) compiler to the second (programmed in subset-Pascal) to a third (programmed in full-featured Pascal) is referred to as 'bootstrapping'.
I know of one case where a full-featured compiler was written in itself, using all the features of the language, and there never was another compiler involved: The language was even more primitive than K&R C, called 'NPL'. Its developer wrote the NPL compiler in NPL, so he knew very well what an NPL line would compile to. So he started at the top of the NPL compiler source code, and typed into a new file the machine instructions that the compiler should generate for the first line. And for the second line. And for the third ... Down to the last line of the NPL compiler source code. When he ran the compiler source through that program he had just been typing in, he got a new file with the same contents as what he had typed in, instruction by instruction. So the compiler worked as expected!
(That guy was slightly crazy: I was once complaining to him about a bug in the OS, which was written in NPL. He dug up the OS source code - this was in the days of hardcopy printout - and found the function I was complaining about. After some grunting and huffing, he spotted the error, and dug out a ball point pen to write a correction into the printout. Did he write the corrected statements in NPL, the language of the printout? No. Did he write it in symbolic assembly code? No. Did he write it as the octal representation of the binary instruction codes? Yes, with offsets and all as octal values!)
|
|
|
|
|
You might want to double-check who you're replying to.
|
|
|
|
|
trønderen wrote: I know of one case where a full-featured compiler was written in itself, using all the features of the language, and there never was another compiler involved: The language was even more primitive than K&R C, called 'NPL'.
Forth is very close to that. And designed specifically like that. The most primitive basics are written (or were written in) assembler. Then some other items are added, written in those primitives. Then more are written which are based on the prior sets. And so it goes.
|
|
|
|
|
> A MOV instruction is not the same size on all platforms
That’s why you tell the compiler which architecture you’re targeting, I think I get it.
|
|
|
|
|
It's possible to do. GCC is like that where it is a kind of modular approach. You can specify a platform you're compiling for and it'll use the appropriate compiler "module" to target the platform you specify.
An assembler can be written the same way, targeting a certain chipset, but even then, there are differences inside the same chip, called a "stepping" in modern parlance, where there are bug fixes in the hardware that an assembler has to take into account to generate correct code.
For example, the original 1975 production 6502 CPU had three working shift/rotate instructions: ASL, LSR, and ROL. The ROR instruction was bugged and didn't work correctly. If the assembler didn't know about the bug and generated code that assumed the instruction was correct, your code would probably crash the system. The assembler had to know about the bug and generate equivalent code that didn't use the instruction.
The bug was fixed starting with the 1976 production.
|
|
|
|
|
Dave Kreskowiak wrote: The bug was fixed starting with the 1976 production.
Sidetracking just a little bit, but I can't resist:
At CERN, the same thing happened with two completely independent computer families: First, CERN had bought the VAX 780, the very first VAX. When DEC announced the smaller VAX 750, they proudly announced that the 780 bug that made one instruction give rather unexpected results (sorry, I don't remember which instruction) had been fixed. CERN immediately stood up in protest: No! We have written our software to specifically compensate for that bug! If you change it in the 750, so that it gives a different, 'correct' result, we must update our software for that, and we have to maintain different program versions for the 780 and the 750. That is out of the question, and we will not buy any VAX 750.
The bug was not fixed in the VAX 750.
And the story repeats: CERN had used NORD-10 computers for process control (much due to extremely good interrupt handling: The first instruction of the handler was running 900 ns after the arrival of the interrupt signal, which was super-fast in the mid-1970s). Then comes a complete reimplementation of the architecture, labeled ND-100, with a similar happy message: Finally, we have fixed the bug that has been with the NORD-10 since its introduction ... CERN reacted in the same way: If you fix that bug, we can no longer run our software on your new machines; we have adapted to that bug! We will not buy any ND-100!
So the bug from NORD-10 was retained in the ND-100 - just like with the VAX.
|
|
|
|
|
Usually, a machine instruction is a word of, say, 32 bits. The first 6 to 8 bits (typically) tell what this instruction does: Move data, jump to somewhere, add two values etc. The next few bits may indicate how you go about finding the operand, i.e. what to move, where to jump or which value to add. The interpretation of the following bits depends on those (often called the 'address mode' bits): Either as a register number, how far to jump, or the value to add. If the address mode bits say so, maybe the value to add is not in the instruction itself, but the instruction tells in which register you can find the address of the value to add.
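A sketch of that layout in code, for a made-up machine: 8 opcode bits, 4 address mode bits and 20 operand bits. No real CPU is being described here; the field widths, the `AddressMode` values and the function names are all assumptions chosen to keep the example short.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: [31..24] opcode, [23..20] address mode, [19..0] operand.
enum class AddressMode : std::uint8_t {
    Immediate = 0,        // operand is the value itself
    Register = 1,         // operand is a register number
    RegisterIndirect = 2  // operand names a register holding the address
};

struct Instruction {
    std::uint8_t opcode;
    AddressMode mode;
    std::uint32_t operand;
};

// Packs the three fields into one 32-bit instruction word.
std::uint32_t encode(const Instruction& ins) {
    assert(ins.operand < (1u << 20) && "operand must fit in 20 bits");
    return (static_cast<std::uint32_t>(ins.opcode) << 24) |
           (static_cast<std::uint32_t>(ins.mode) << 20) |
           ins.operand;
}

// Splits a 32-bit instruction word back into its fields.
Instruction decode(std::uint32_t word) {
    Instruction ins;
    ins.opcode = static_cast<std::uint8_t>(word >> 24);
    ins.mode = static_cast<AddressMode>((word >> 20) & 0xFu);
    ins.operand = word & 0xFFFFFu;
    return ins;
}
```

With this layout, opcode 0x12 in register mode with operand 5 becomes the word 0x12100005, and decoding that word gives the three fields back.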
The compiler breaks down your code into more primitive operations, such as "add the value of X", without being concerned about what the proper machine instruction will look like. Not until it gets down to the very bottom, the 'code generator'. A compiler may provide different code generators for different machine types. A given code generator knows what an 'add' instruction looks like on x64 processors, knows the proper address mode bits and how to put the address of X into the instruction. Another code generator, for ARM, knows another code for ADD on the ARM, but it also knows that you cannot directly add something in memory. First the code generator must look up an unused register (and if there is none, it must generate an instruction to flush the value in one register back to memory to free it up), then generate an instruction to load X into the free register, and then generate an instruction to do the actual add of the newly loaded value.
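Very roughly, the last step could be sketched like this. The `Target` enum and `generate_add` are invented for the example, the emitted text is pseudo-assembly rather than exact assembler syntax, and the register bookkeeping described above is left out entirely.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Target { X64, Arm };

// Emits pseudo-assembly for "accumulator += variable held in memory".
// The instruction text is illustrative, not exact assembler syntax.
std::vector<std::string> generate_add(Target target, const std::string& variable) {
    assert(!variable.empty());
    switch (target) {
    case Target::X64:
        // x64 can add a memory operand directly.
        return {"add eax, [" + variable + "]"};
    case Target::Arm:
        // ARM is a load/store architecture: load into a register, then add.
        // Finding (or freeing up) the scratch register r1 is skipped here.
        return {"ldr r1, [" + variable + "]", "add r0, r0, r1"};
    }
    return {};
}
```

The front end asks for the same primitive operation in both cases; only the code generator knows that one target needs a single instruction and the other needs two.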
There is no common, standardized code for move, jump or add on different machines. The code generator knows the codes for this machine. You may tell the compiler to switch to another code generator knowing the codes for another machine (that is commonly referred to as 'cross compiling'), but the program that comes out of it cannot be run on this machine; you must copy it to another machine of the right type to run it.
In the good old days, there were dozens of different machine types out there, each with its own instruction codes. The last 35 years or so, the 'x86' has pushed most others out. Every PC in the world can understand x86 codes.
But x86 is for 32-bit machines! 64-bit PCs understand 'x64' codes (and address modes), which are different! If you want that move, jump or add to work on a 64-bit PC, your compiler must select a code generator knowing the proper codes for x64. The program will not work on an old x86 PC.
Relax ... The x64 CPUs can be told to forget the x64 codes, and run the x86 ones instead. The .exe file tells which instruction set should be activated, so you can run 35-year-old programs on your brand new 64-bit PC (provided that your current OS will honor all the requests made by that old program, which is not guaranteed - but 20-25 years is probably on the safe side).
Then, what about your smartphone - will it know x64 move, jump or add instruction codes? No. Will it know x86 instruction codes? No. You have to ask your compiler to use the code generator that makes ARM instruction codes. ARM has 32-bit and 64-bit codes too, and they are different ... Besides: If your program was written for Windows, it expects that it can ask the OS (i.e. Windows) for this and that service - and Android says Huh?? The services provided by Android are quite different.
Bottom line: The rosy, cosy days when x86 worked everywhere are over.
Then comes dotNet. When a dotNet 'assembly' (informally you may call it a module) is loaded into your machine, smartphone or whatever, you'll see an incompletely compiled program: The compiler has left a message: Here I should have generated the code for adding the value of X, but I didn't know what kind of machine this will be running on! So please, before running this program, generate the proper instructions for adding X, will you?
dotNet for a given machine has the proper code generator for that machine. dotNet on an ARM32 generates ARM32 codes, dotNet on an ARM64 generates ARM64 codes, dotNet on a 64 bit PC generates x64 codes. They are all different.
For now, Windows itself is not dotNet, so it must be compiled separately for every machine architecture. A growing number of applications are dotNet, incompletely compiled, and the last step of compilation, code generation, is not done until you know for sure which codes to use, just in time for execution. So the dotNet code generator is frequently referred to as the "just in time compiler", or "jitter".
ps.
If you want to look at instruction codes and addressing modes and such to see what they are like, my recommendation is to stay away from x64 and x86. They are both a mess, having grown and been extended and grown more and been extended ... into a crow's nest. ARM64 (aka AArch64) is certainly not the simple, easily understood thing that the early 32-bit ARMs were, yet it has retained a much more manageable structure.
|
|
|
|
|
tronderen and Dave, thanks for your extensive explanations. At this point I don’t understand all the bits but I think I’m closing in
|
|
|
|
|
All you really need to understand is:
- A compiler can be written in anything (more or less), from machine code, through assembler, up to most high-level languages.
- The output of the compiler must be code that is compatible with the machine that will run the final executable.
- The term "machine" can mean the actual hardware, a virtual machine (like the Java Virtual Machine), or a framework such as .NET.
- The actual hardware instructions do not have to be the same across all platforms, but it would be nice. Just as USB connectors keep changing, so hardware platforms keep evolving.
|
|
|
|
|
|
You can write your own compiler.
The simplest way to do it, without doing much studying, is to write a 'calculator' which takes tokens like numbers, the plus sign and the equals sign.
After you do that then read up a bit more on compilers and apply some of what you read to what you previously wrote.
Then add variables.
If you really want to keep going after that, then add 'if-then-else', because that structure has problems which compiler theory talks about quite a bit.
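The first step above (the 'calculator') could be sketched roughly as follows. This is one possible shape, assuming input such as "1 + 2 =", non-negative integers only and no overflow checks; the names `tokenize`, `evaluate` and `Token` are invented for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

enum class TokenKind { Number, Plus, Equals };

struct Token {
    TokenKind kind;
    long value;  // only meaningful for Number tokens
};

// Lexer: turns text such as "1 + 2 =" into a list of tokens.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {
            ++i;
        } else if (std::isdigit(c)) {
            long value = 0;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                value = value * 10 + (input[i] - '0');
                ++i;
            }
            tokens.push_back(Token{TokenKind::Number, value});
        } else if (c == '+') {
            tokens.push_back(Token{TokenKind::Plus, 0});
            ++i;
        } else if (c == '=') {
            tokens.push_back(Token{TokenKind::Equals, 0});
            ++i;
        } else {
            throw std::invalid_argument("unexpected character in input");
        }
    }
    return tokens;
}

// Parser/evaluator for the grammar: Number ('+' Number)* '='
long evaluate(const std::vector<Token>& tokens) {
    std::size_t pos = 0;
    // Consumes the next token if it has the expected kind, otherwise throws.
    auto expect = [&](TokenKind kind) -> const Token& {
        if (pos >= tokens.size() || tokens[pos].kind != kind) {
            throw std::invalid_argument("unexpected token");
        }
        return tokens[pos++];
    };
    long sum = expect(TokenKind::Number).value;
    while (pos < tokens.size() && tokens[pos].kind == TokenKind::Plus) {
        ++pos;
        sum += expect(TokenKind::Number).value;
    }
    expect(TokenKind::Equals);
    return sum;
}

// Small self-check showing the intended use.
void calculator_self_check() {
    assert(evaluate(tokenize("1 + 2 + 39 =")) == 42);
}
```

Splitting it into a lexer and an evaluator from the start makes the later steps easier: variables mostly touch the lexer and a symbol table, while 'if-then-else' forces the evaluator to become a real parser.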
|
|
|
|
|
The very short version...
In the beginning there was machine language very closely tied to the CPU. It worked with numeric opcodes that programmers had to literally memorize or look up. This got old real quick.
ML Pseudocode:
Instruction   Operation
00000000      Stop Program
00000001      Turn bulb fully on
00000010      Turn bulb fully off
00000100      Dim bulb by 10%
00001000      Brighten bulb by 10%
00010000      If bulb is fully on, skip over next instruction
00100000      If bulb is fully off, skip over next instruction
01000000      Go to start of program (address 0)
Enter assembly. It's not a compiler. It's an assembler and also a linker as part of a toolset. There's a difference. A compiler will translate code into something that's a one-to-one correlation with machine instructions. Assembly is already that. It's a language that basically gave human-memorable mnemonics to the opcodes. It was originally written in machine language. It's very CPU specific too. This too got old.
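As a toy illustration of "mnemonics for opcodes", here is a sketch of an assembler for the bulb machine in the table above. The opcode values come from that table; the mnemonic names (ON, DIM, SKIPOFF and so on) and the function `assemble` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Translates whitespace-separated mnemonics into bulb-machine opcodes,
// one byte per instruction.
std::vector<std::uint8_t> assemble(const std::string& source) {
    // Opcode values from the table; the mnemonic names are invented here.
    static const std::map<std::string, std::uint8_t> opcodes = {
        {"STOP", 0x00},    // 00000000 stop program
        {"ON", 0x01},      // 00000001 turn bulb fully on
        {"OFF", 0x02},     // 00000010 turn bulb fully off
        {"DIM", 0x04},     // 00000100 dim bulb by 10%
        {"BRIGHT", 0x08},  // 00001000 brighten bulb by 10%
        {"SKIPON", 0x10},  // 00010000 skip next instruction if fully on
        {"SKIPOFF", 0x20}, // 00100000 skip next instruction if fully off
        {"RESTART", 0x40}, // 01000000 go to start of program
    };

    std::vector<std::uint8_t> program;
    std::istringstream in(source);
    std::string mnemonic;
    while (in >> mnemonic) {
        const auto it = opcodes.find(mnemonic);
        if (it == opcodes.end()) {
            throw std::invalid_argument("unknown mnemonic: " + mnemonic);
        }
        program.push_back(it->second);
    }
    return program;
}

// Small self-check showing the intended use.
void assembler_self_check() {
    assert((assemble("ON DIM STOP") ==
            std::vector<std::uint8_t>{0x01, 0x04, 0x00}));
}
```

Each mnemonic maps to exactly one opcode, which is the one-to-one relationship that separates an assembler from a compiler.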
There were a ton of other languages made, presumably written in ASM, but this is where a compiler kicks in. To make a really long story short, I'll just mention C's history. C was based on B and B was based on BCPL. I don't know what BCPL was written in, but the first B compiler was written in BCPL. Eventually, the B compiler was re-written in B itself and then the first C compiler was written in that version of B.
A language's compiler being written in that same language happens more than you'd think. Anyway, these are still native compilers and eventually they still make their way down to machine code. Now, things like Java and C# I suspect are still written in native languages for obvious reasons, but don't be surprised if a native language's compiler is written in the same language.
Calin Negru wrote: a standard should be required where the numbers/machine instructions for MOV are recognized everywhere. I mean it should work like a hardware resource with the same ID present on old and new processors.
This sounds great in theory, but if you look at how bloated and not-fun the Win32 API is, if you always have to maintain backwards compatibility then you keep things bloated when attempting to advance. I mean, it's good on one level, but it's also good to wipe out the old and try something new, like Apple is doing with the M1 chips (even though nothing is ever really new, but you get the idea).
Do we really want processors in 100 years having similar constraints to ones designed in the 1960s? Rather than enforce that on the CPU, the industry has (correctly so) chosen to have compiler targets implemented instead. You use your preferred language and it compiles down to whatever the CPU expects, with optimizations, etc.
Jeremy Falcon
|
|
|
|
|
Jeremy Falcon wrote: C was based on B and B was based on BCPL. I don't know what BCPL was written in, but the first B compiler was written in BCPL
Whilst I, too, don't know what BCPL was written in, I did hear why it was called BCPL. Wikipedia says it stands for "Basic Combined Programming Language" and was invented at Cambridge University (UK). The story that I heard was that there was a more complex language jointly designed by universities in Cambridge and London - that was called CPL (Cambridge Plus London). I do not know if CPL saw the light of day; but a simplified version called Basic CPL (or just BCPL) was created.
It had a bizarre construct (definitely a candidate for CP's Weird and Wonderful forum) to resolve the 'dangling ELSE problem'. It was something like IF condition DO statement and TEST condition THEN statement OR statement. (See https://www.bell-labs.com/usr/dmr/www/bcpl.pdf)
Edit:
I've just read the Wikipedia article (I should have done that before posting!). It says the CPL language was originally named 'Cambridge Programming Language' and later renamed to 'Combined Programming Language'. No mention of London. But the Wikipedia article on CPL (programming language) does mention the involvement of London, and that it was nicknamed 'Cambridge Plus London'. Thus, the name I heard was not its real name.
modified 7-Feb-23 5:26am.
|
|
|
|
|
That's cool to know. Thanks for sharing.
Jeremy Falcon
|
|
|
|
|
|
|
|
|
|
|