IBM PC Assembly Language Tutorial 5
Learning the assembler
It is my feeling that many people can teach themselves to use the
assembler by reading the MACRO Assembler manual if
1. You have read and understood a book like Morse and thus have a
feeling for the instruction set
2. You know something about DOS services and so can
communicate with the keyboard and screen and do something
marginally useful with files. In the absence of this kind of knowledge,
you can't write meaningful practice programs and so will not
progress.
3. You have access to some good examples (the ones supplied with
the assembler are not good, in my opinion; I will try to supply you
with some more relevant ones).
4. You ignore the things which are most confusing and least useful.
Some of the most confusing aspects of the assembler include the
facilities for combining segments. But you can avoid using all but the
simplest of these facilities in many cases, even while writing quite
substantial applications.
5. The easiest kind of assembler program to write is a COM
program. COM programs might seem harder, at first, than EXE programs
because there is an extra step involved in creating the executable
file, but COM programs are structurally very much simpler.
At this point, it is necessary to talk about COM programs and EXE
programs. As you probably know, DOS supports two kinds of
executable files. EXE programs are much more general, can
contain many segments, and are generally built by compilers and
sometimes by the assembler. If you follow the lead given by the
samples distributed with the assembler, you will end up with EXE
programs. A COM program, in contrast, always contains just one
segment, and receives control with all four segment registers
containing the same value. A COM program, thus, executes in a
simplified environment, a 64K address space. You can go outside
this address space simply by temporarily changing one segment
register, but you don't have to, and that is the thing which makes
COM programs nice and simple. Let's look at a very simple one.
The classic text on writing programs for the C language says that the
first thing you should write is a program which says
HELLO, WORLD.
when invoked. What's sauce for C is sauce for assembler, so let's
start with a HELLO program of our own. My first presentation of this
will be bare bones, not stylistically complete, but just an illustration of
what an assembler program absolutely has to have:
HELLO   SEGMENT                 ;Set up HELLO code and data section
        ASSUME  CS:HELLO,DS:HELLO ;Tell assembler about conditions at entry
        ORG     100H            ;A .COM program begins with 100H byte prefix
MAIN:   JMP     BEGIN           ;Control must start here
MSG     DB      'Hello, world.$' ;But it is generally useful to put data first
BEGIN:  MOV     DX,OFFSET MSG   ;Let DX --> message.
        MOV     AH,9            ;Set DOS function code for printing a message
        INT     21H             ;Invoke DOS
        RET                     ;Return to system
HELLO   ENDS                    ;End of code and data section
        END     MAIN            ;Terminate assembler and specify entry point
First, let's attend to some obvious points. The macro assembler
uses the general form
name opcode operands
Unlike the 370 assembler, though, comments are NOT set off from
operands by blanks. The syntax uses blanks as delimiters within the
operand field (see line 6 of the example) and so all comments must
be set off by semi-colons.
Line comments are frequently set off with a semi-colon in column 1. I
use this approach for block comments too, although there is a
COMMENT statement which can be used to introduce a block
comment.
Being an old 370 type, I like to see assembler code in upper case,
although my comments are mixed case. Actually, the assembler is
quite happy with mixed case anywhere.
As with any assembler, the core of the opcode set consists of
opcodes which generate machine instructions but there are also
opcodes which generate data and ones which function as
instructions to the assembler itself, sometimes called pseudo-ops. In
the example, there are five lines which generate machine code
(JMP, MOV, MOV, INT, RET), one line which generates data (DB)
and five pseudo-ops (SEGMENT, ASSUME, ORG, ENDS, and
END).
We will discuss all of them.
Now, about labels. You will see that some labels in the example end
in a colon and some don't. This is just a bit confusing at first, but no
real mystery. If a label is attached to a piece of code (as opposed to
data), then the assembler needs to know what to do when you JMP
to or CALL that label. By convention, if the label ends in a colon, the
assembler will use the NEAR form of JMP or CALL. If the label does
not end in a colon, it will use the FAR form. In practice, you will
always use the colon on any label you are jumping to inside your
program because such jumps are always NEAR; there is no reason
to use a FAR jump within a single code section. I mention this,
though, because leaving off the colon isn't usually trapped as a
syntax error; it will generally cause something more abstruse to go
wrong.
On the other hand, a label attached to a piece of data or a pseudo-
op never ends in a colon.
Machine instructions will generally take zero, one or two operands.
Where there are two operands, the one which receives the result
goes on the left as in 370 assembler.
I tried to explain this before; maybe now it will be even clearer: there
are many more 8086 machine opcodes than there are assembler
opcodes to represent them. For example, there are five kinds of
JMP, four kinds of CALL, two kinds of RET, and at least five kinds of
MOV depending on how you count them. The macro assembler
makes a lot of decisions for you based on the form taken by the
operands or on attributes assigned to symbols elsewhere in your
program. In the example above, the assembler will generate the
NEAR DIRECT form of JMP because the target label BEGIN labels
a piece of code instead of a piece of data (this makes the JMP
DIRECT) and ends in a colon (this makes the JMP NEAR). The
assembler will generate the immediate forms of MOV because the
form OFFSET MSG refers to immediate data and because 9 is a
constant. The assembler will generate the NEAR form of RET
because that is the default and you have not told it otherwise.
The DB (define byte) pseudo-op is an easy one: it is used to put one
or more bytes of data into storage. There is also a DW (define word)
pseudo-op and a DD (define doubleword) pseudo-op; in the PC
MACRO assembler, the fact that a label refers to a byte of storage,
a word of storage, or a doubleword of storage can be very
significant in ways which we will see presently.
About that OFFSET operator: I guess the best way to make
the point about how the assembler decides what instruction to
assemble is an analogy with 370 assembler:
PLACE DC ......
...
LA R1,PLACE
L R1,PLACE
In 370 assembler, the first instruction puts the address of label
PLACE in register 1, the second instruction puts the contents of
storage at label PLACE in register 1. Notice that two different
opcodes are used. In the PC assembler, the analogous instructions
would be
PLACE DW ......
...
MOV DX,OFFSET PLACE
MOV DX,PLACE
If PLACE is the label of a word of storage, then the second
instruction will be understood as a desire to fetch that data into DX.
If X is a label, then "OFFSET X" means "the ordinary number which
represents X's offset from the start of the segment." And, if the
assembler sees an ordinary number, as opposed to a label, it uses
the instruction which is equivalent to LA.
If PLACE were the label of a DB pseudo-op, instead of a DW, then
MOV DX,PLACE
would be illegal. The assembler worries about length attributes of its
operands.
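A small program may make the length rule concrete. This is only a sketch: the section name LENDEMO and the labels WCOUNT and BFLAG are hypothetical names invented for the illustration, and the PTR operator, which overrides a label's length attribute, is described properly in the manual.

```asm
LENDEMO SEGMENT                 ;Hypothetical section name
        ASSUME  CS:LENDEMO,DS:LENDEMO
        ORG     100H
MAIN:   JMP     BEGIN
WCOUNT  DW      1234H           ;A word of storage (hypothetical label)
BFLAG   DB      5               ;A byte of storage (hypothetical label)
BEGIN:  MOV     DX,OFFSET WCOUNT ;DX = address of WCOUNT (like LA)
        MOV     DX,WCOUNT       ;DX = contents of the word at WCOUNT (like L)
        MOV     DL,BFLAG        ;Legal: byte label, byte register
;       MOV     DX,BFLAG        ;Illegal: byte label, word register
        MOV     DX,WORD PTR BFLAG ;Legal: fetches BFLAG plus the byte after it
        RET
LENDEMO ENDS
        END     MAIN
```

The WORD PTR form is shown only to illustrate that the length check can be overridden; fetching a word from a one-byte field picks up whatever happens to follow it in storage.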
Next, numbers and constants in general. The assembler's default
radix is decimal. You can change this, but I don't recommend it. If
you want to represent numbers in other forms of notation such as
hex or binary, you generally use a trailing letter. For example,
21H
is hexadecimal 21, and
00010000B
is the eight-bit binary number pictured.
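To illustrate the notation, here is a fragment meant to sit inside a code section. The first three instructions should assemble to the same machine code, since each operand is the value thirty-three. The last line shows a convention worth knowing: a hex constant must begin with a digit, so one whose first hex digit is a letter takes a leading zero.

```asm
        MOV     AL,33           ;Decimal (the default radix)
        MOV     AL,21H          ;Hex: 2*16 + 1 = 33
        MOV     AL,00100001B    ;Binary: 32 + 1 = 33
        MOV     AL,0FFH         ;Leading zero needed; FFH would look like a label
```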
The next elements we should point to are the SEGMENT...ENDS
pair and the END instruction. Every assembler program has to have
these elements.
SEGMENT tells the assembler you are starting a section of
contiguous material (code and/or data). The symmetrically named
ENDS statement tells the assembler you are finished with a section
of contiguous material. I wish they didn't use the word SEGMENT in
this context. To me, a "segment" is a hardware construct: it is the
64K of real storage which becomes addressable by virtue of having
a particular value in a segment register. Now, it is true that the
"segments" you make with the assembler often correspond to real
hardware "segments" at execution time. But, if you look at things like
the GROUP and CLASS options supported by the linker, you will
discover that this correspondence is by no means exact. So, at risk
of maybe confusing you even more, I am going to use the more
informal term "section" to refer to the area set off by means of the
SEGMENT and ENDS instructions.
The sections delimited by SEGMENT...ENDS pairs are really a lot
like CSECTs and DSECTs in the 370 world.
I strongly recommend that you be selective in your study of the
SEGMENT pseudo-op as described in the manual. Let me just
touch on it here.
name SEGMENT
name SEGMENT PUBLIC
name SEGMENT AT nnn
Basically, you can get away with just the three forms given above.
The first form is what you use when you are writing a single section
of assembler code which will not be combined with other pieces of
code at link time. The second form says that this assembly only
contains part of the section; other parts might be assembled
separately and combined later by the linker.
I have found that one can construct reasonably large modular
applications in assembler by simply making every assembly use the
same segment name and declaring the name to be PUBLIC each
time. If you read the assembler and linker documentation, you will
also be bombarded by information about more complex options
such as the GROUP statement and the use of other "combine types"
and "classes." I don't recommend getting into any of that. I will talk
more about the linker and modular construction of programs a little
later. The assembler manual also implies that a STACK segment is
required. This is not really true. There are numerous ways to assure
that you have a valid stack at execution time.
Of course, if you plan to write applications in assembler which are
more than 64K in size, you will need more than what I have told you;
but who is really going to do that? Any application that large is likely
to be coded in a higher level language.
The third form of the SEGMENT statement makes the delineated
section into something like a "DSECT;" that is, it doesn't generate
any code, it just describes what is present somewhere already in the
computer's memory. Sometimes the AT value you give is
meaningful. For example, the BIOS work area is located at segment
40 hex (absolute address 400 hex). So, you might see
BIOSAREA SEGMENT AT 40H ;Map BIOS work area
ORG BIOSAREA+10H
EQUIP DB ? ;Location of equipment flags, first byte
BIOSAREA ENDS
in a program which was interested in mucking around in the BIOS
work area.
At other times, the AT value you give may be arbitrary, as when you
are mapping a repeated control block:
PROGPREF SEGMENT AT 0 ;Really a DSECT mapping the program prefix
ORG PROGPREF+6
MEMSIZE DW ? ;Size of available memory
PROGPREF ENDS
Really, no matter whether the AT value represents truth or fiction, it is
your responsibility, not the assembler's, to set up a segment
register so that you can really reach the storage in question. So, you
can't say
MOV AL,EQUIP
unless you first say something like
MOV AX,BIOSAREA ;BIOSAREA becomes a symbol with value 40H
MOV ES,AX
ASSUME ES:BIOSAREA
Enough about SEGMENT. The END statement is simple. It goes at
the end of every assembly. When you are assembling a subroutine,
you just say
END
but when you are assembling the main routine of a program you say
END label
where 'label' is the place where execution is to begin.
Another pseudo-op illustrated in the program is ASSUME.
ASSUME is like the USING statement in 370 assembler. However,
ASSUME can ONLY refer to segment registers. The assembler
uses ASSUME information to decide whether to assemble segment
override prefixes and to check that the data you are trying to access
is really accessible. In this case, we can reassure the assembler that
both the CS and DS registers will address the section called
HELLO at execution time. Actually, the SS and ES registers will too,
but the assembler never needs to make use of this information.
I guess I have explained everything in the program except that ORG
pseudo-op. ORG means the same thing as it does in many
assembly languages. It tells the assembler to move its location
counter to some particular address. In this case, we have asked the
assembler to start assembling code hex 100 bytes from the start of
the section called HELLO instead of at the very beginning. This
simply reflects the way COM programs are loaded.
When a COM program is loaded by the system, the system sets up
all four segment registers to address the same 64K of storage. The
first 100 hex bytes of that storage contain what is called the
program prefix; this area is described in appendix E of the DOS
manual. Your COM program physically begins after this. Execution
begins with the first physical byte of your program; that is why the
JMP instruction is there.
Wait a minute, you say, why the JMP instruction at all? Why not put
the data at the end? Well, in a simple program like this I probably
could have gotten away with that. However, I have the habit of putting
data first and would encourage you to do the same because of the
way the assembler has of assembling different instructions
depending on the nature of the operand.
Unfortunately, sometimes the different choices of instruction which
can assemble from a single opcode have different lengths. If the
assembler has already seen the data when it gets to the instructions
it has a good chance of reserving the right number of bytes on the
first pass. If the data is at the end, the assembler may not have
enough information on the first pass to reserve the right number of
bytes for the instruction. Sometimes the assembler will complain
about this, something like "Forward reference is illegal" but at other
times, it will make some default assumption. On the second pass, if
the assumption turned out to be wrong, it will report what is called a
"Phase error," a very nasty error to track down. So get in the habit of
putting data and equated symbols ahead of code.
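For comparison, here is a sketch of how HELLO might look with the data at the end and no JMP. I believe this version assembles cleanly, because OFFSET MSG is an immediate word operand whatever MSG turns out to be, so the assembler can reserve the right number of bytes on the first pass. It is exactly the kind of simple case that happens to work, and it does not contradict the advice to put data first.

```asm
HELLO   SEGMENT                 ;Same section as before
        ASSUME  CS:HELLO,DS:HELLO
        ORG     100H
MAIN:   MOV     DX,OFFSET MSG   ;Forward reference to MSG, defined below
        MOV     AH,9            ;DOS function code for printing a message
        INT     21H             ;Invoke DOS
        RET                     ;Return to system
MSG     DB      'Hello, world.$' ;Data now follows the code
HELLO   ENDS
        END     MAIN
```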
OK. Maybe you understand the program now. Let's walk through the
steps involved in making it into a real COM file.
1. The file should be created with the name HELLO.ASM (actually
the name is arbitrary but the extension .ASM is conventional and
useful)
2. ASM HELLO,,;
(this is just one example of invoking the assembler; it uses the
small assembler ASM and produces an object file and a listing file
with the same name as the source file. I am not going exhaustively
into how to invoke the assembler, which the manual covers pretty
well. I guess this is the first time I have mentioned that there are
really two assemblers; the small assembler ASM will run in a 64K
machine and doesn't support macros. I used to use it all the time;
now that I have a bigger machine and a lot of macro libraries I use
the full function assembler MASM. You get both when you buy the
package).
3. If you issue DIR at this point, you will discover that you have
acquired HELLO.OBJ (the object code resulting from the assembly)
and HELLO.LST (a listing file). I guess I can digress for a second
here concerning the listing file. It contains TAB characters. I have
found there are two good ways to get it printed and one bad way.
The bad way is to use LPT1: as the direct target of the listing file or
to try copying the LST file to LPT1 without first setting the tabs on
the printer. The two good ways are to either
a. direct it to the console and activate the printer with CTRL-PRTSC.
In this case, DOS will expand the tabs for you.
b. direct to LPT1: but first send the right escape sequence to LPT1
to set the tabs every eight columns. I have found that on some early
serial numbers of the IBM PC printer, tabs don't work quite right,
which forces you to the first option.
4. LINK HELLO;
(again, there are lots of linker options but this is the simplest. It
takes HELLO.OBJ and makes HELLO.EXE). HELLO.EXE? I
thought we were making a COM program, not an EXE program.
Right. HELLO.EXE isn't really executable; it's just that the linker
doesn't know about COM programs. That requires another utility.
You don't have this utility if you are using DOS 1.0; you have it if you
are using DOS 1.1 or DOS 2.0. Oh, by the way, the linker will warn
you that you have no stack segment. Don't worry about it.
5. EXE2BIN HELLO HELLO.COM
This is the final step. It produces the actual program you will execute.
Note that you have to spell out HELLO.COM; for a nominally
rational but actually perverse reason, EXE2BIN uses the default
extension BIN instead of COM for its output file. At this point, you
might want to erase HELLO.EXE; it looks a lot more useful than it is.
Chances are you won't need to recreate HELLO.COM unless you
change the source and then you are going to have to redo the whole
thing.
6. HELLO
You type HELLO, which invokes the program, and it says
HELLO YOURSELF!!!
(oops, what did I do wrong....?)