
This seems to clearly indicate a bug in the system, and it's a bigger shock to find out that, no, that's the way it happens to work (except in computer algebra systems). This document explains such issues in detail.

Nearly all computer users understand the concept of a "bit", or in computer terms, a 1 or 0 value encoded by the setting of a switch. It's not much more difficult to see that if you take two bits, you can use them to represent four unique states:

- 00 01 10 11

If you have three bits, then you can use them to represent eight unique states:

- 000 001 010 011 100 101 110 111

With every bit you add, you double the number of states you can represent. Therefore, the expression for the number of states with n bits is 2^{n}.

Actually, in some cases 4 bits is a convenient number of bits to deal with, and this collection of bits is called, somewhat painfully, the "nybble". In this document, we will refer to "nybbles" often, but please remember that in reality the term "byte" is common, while the term "nybble" is not.

A nybble can encode 16 different values, such as the numbers 0 to 15. Any arbitrary sequence of bits could be used in principle, but in practice the most natural way is as follows:

- 0000 = decimal 0
- 0001 = decimal 1
- 0010 = decimal 2
- 0011 = decimal 3
- 0100 = decimal 4
- 0101 = decimal 5
- 0110 = decimal 6
- 0111 = decimal 7
- 1000 = decimal 8
- 1001 = decimal 9
- 1010 = decimal 10
- 1011 = decimal 11
- 1100 = decimal 12
- 1101 = decimal 13
- 1110 = decimal 14
- 1111 = decimal 15

This is natural because it follows our instinctive way of considering a normal decimal number. For example, given the decimal number:

- 7531
- = 7 × 1000 + 5 × 100 + 3 × 10 + 1 × 1
- = 7 × 10^{3} + 5 × 10^{2} + 3 × 10^{1} + 1 × 10^{0}

Each digit in the number represents a value from 0 to 9, which is ten different possible values, and that's why it's called a decimal or "base-10" number. Each digit also has a "weight" of a power of ten proportional to its position. This sounds complicated, but it's not. It's exactly what you take for granted when you look at a number. You know it without even having to think about it.

Similarly, in the binary number encoding scheme explained above, the value 13 is encoded as:

- 1101
- = 1 × 2^{3} + 1 × 2^{2} + 0 × 2^{1} + 1 × 2^{0}
- = 1 × 8 + 1 × 4 + 0 × 2 + 1 × 1 = 13 decimal

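The weights of the bit positions are the powers of two:

- 2^{0} = 1
- 2^{1} = 2
- 2^{2} = 4
- 2^{3} = 8
- 2^{4} = 16
- 2^{5} = 32
- 2^{6} = 64
- 2^{7} = 128
- 2^{8} = 256
- 2^{9} = 512
- 2^{10} = 1,024
- 2^{11} = 2,048
- 2^{12} = 4,096
- 2^{13} = 8,192
- 2^{14} = 16,384
- 2^{15} = 32,768
- 2^{16} = 65,536

Another thing to remember is that, aping the metric system, the value 2^{10} = 1,024 is referred to as "K" (for "kilo"), so higher powers of 2 are often written as multiples of it:

- 2^{11} = 2 K = 2,048
- 2^{12} = 4 K = 4,096
- 2^{13} = 8 K = 8,192
- 2^{14} = 16 K = 16,384
- 2^{15} = 32 K = 32,768
- 2^{16} = 64 K = 65,536

Similarly, the value 2^{20} = 1,048,576 is referred to as "M" (for "mega"):

- 2^{21} = 2 M
- 2^{22} = 4 M

and the value 2^{30} is referred to as "G" (for "giga").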

Use of these prefixes can get a bit confusing since in some documents it can be unclear if, say, "kilo" actually means 1,024 or if it means 1,000. Often in computer discussions it means 1,024 but some writers are sloppy on this point. In any case, we'll see these prefixes often as we continue.

There is a subtlety in this discussion. If we use 16 bits, we can have 65,536 different values, but the values are from 0 to 65,535. People start counting at one, machines start counting from zero, since it's simpler from their point of view. This small and mildly confusing fact even trips up computer mechanics every now and then.

Anyway, this defines a simple way to count with bits, but it has a few restrictions:

- You can only perform arithmetic within the bounds of the number of bits that you have. That is, if you are working with 16 bits at a time, you can't perform arithmetic that gives a result of 65,536 or more, or you get an error called a "numeric overflow". The formal term is that you are working with "finite precision" values.
- There's no way to represent fractions with this scheme. You can only work with non-fractional "integer" quantities.
- There's no way to represent negative numbers with this scheme. All the numbers are "unsigned".

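Let's take a side-trip to discuss representation of binary numbers. Computer mechanics often need to write out binary quantities, but in practice writing out a binary number like:

- 1001 0011 0101 0001

is tedious and prone to error, so binary quantities are usually written in a base-8 ("octal") or base-16 ("hexadecimal" or "hex") format instead. In the decimal system, we have 10 digits (0 through 9) and we count up through the sequence as follows:

- 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ...

In an octal system, we have 8 digits (0 through 7) and we count up through the sequence as follows:

- 0 1 2 3 4 5 6 7 10 11 12 13 14 15 16 17 20 21 22 23 24 25 26 ...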

In a hex system, we have 16 digits (0 through 9 followed, by convention, with a through f) and we count up through the sequence as follows:

- 0 1 2 3 4 5 6 7 8 9 a b c d e f 10 11 12 13 14 15 16 ...

Each of these number systems is a positional system, but while decimal weights are powers of 10, the octal weights are powers of 8 and the hex weights are powers of 16. For example:

- octal 756
- = 7 × 8^{2} + 5 × 8^{1} + 6 × 8^{0}
- = 7 × 64 + 5 × 8 + 6 × 1
- = 448 + 40 + 6 = decimal 494
- hex 3b2
- = 3 × 16^{2} + 11 × 16^{1} + 2 × 16^{0}
- = 3 × 256 + 11 × 16 + 2 × 1
- = 768 + 176 + 2 = decimal 946

Octal is convenient because each octal digit corresponds to exactly three binary bits:

- 000 = octal 0
- 001 = octal 1
- 010 = octal 2
- 011 = octal 3
- 100 = octal 4
- 101 = octal 5
- 110 = octal 6
- 111 = octal 7

Each hex digit corresponds to four binary bits:

- 0000 = hex 0
- 0001 = hex 1
- 0010 = hex 2
- 0011 = hex 3
- 0100 = hex 4
- 0101 = hex 5
- 0110 = hex 6
- 0111 = hex 7
- 1000 = hex 8
- 1001 = hex 9
- 1010 = hex a
- 1011 = hex b
- 1100 = hex c
- 1101 = hex d
- 1110 = hex e
- 1111 = hex f

So it is easy to convert a long binary number, such as 1001001101010001, to octal:

- 1 001 001 101 010 001 binary = 111521 octal

and easier to convert that number to hex:

- 1001 0011 0101 0001 = 9351 hex

but it takes a lot of figuring to convert it to decimal (37,713 decimal). Octal and hex make a convenient way to represent binary "machine" quantities.
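The grouping trick can be sketched in a few lines of Python. This is only an illustration: the helper name `regroup` is made up for this example, and it assumes the input is a string containing nothing but the characters 0 and 1.

```python
def regroup(bits: str, group: int) -> str:
    """Rewrite a binary string in octal (group=3) or hex (group=4)."""
    digits = "0123456789abcdef"
    # Pad with leading zeros so the bits split evenly into groups.
    width = -(-len(bits) // group) * group
    padded = bits.zfill(width)
    chunks = [padded[i:i + group] for i in range(0, width, group)]
    # Each group of bits maps to exactly one octal or hex digit.
    return "".join(digits[int(chunk, 2)] for chunk in chunks)


value = "1001001101010001"
octal_text = regroup(value, 3)   # groups of three bits
hex_text = regroup(value, 4)     # groups of four bits
decimal_value = int(value, 2)    # the conversion that takes "a lot of figuring"
print(octal_text, hex_text, decimal_value)
```

No arithmetic is needed for the octal and hex forms, only a table lookup per group, which is why these bases are preferred for writing out machine quantities.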

- 0000 = decimal 0
- 0001 = decimal 1
- 0010 = decimal 2
- 0011 = decimal 3
- 0100 = decimal 4
- 0101 = decimal 5
- 0110 = decimal 6
- 0111 = decimal 7
- 1000 = decimal -8
- 1001 = decimal -7
- 1010 = decimal -6
- 1011 = decimal -5
- 1100 = decimal -4
- 1101 = decimal -3
- 1110 = decimal -2
- 1111 = decimal -1

Now we have a "signed integer" number system, using a scheme known as, for reasons unimportant here, "two's complement". With a 16-bit signed integer number, we can encode numbers from -32,768 to 32,767. With a 32-bit signed integer number, we can encode numbers from -2,147,483,648 to 2,147,483,647.

This has some similarities to the sign-bit scheme in that a negative number has its topmost bit set to "1", but the two concepts are different. In sign-magnitude numbers, a "-5" is:

- 1101

while in two's complement numbers, it is:

- 1011

which in sign-magnitude numbers is "-3". Why two's complement is simpler for machines to work with will be explained in a later section.

So now we can represent unsigned and signed integers as binary quantities. Remember that these are just two ways of interpreting a pattern of bits. If a computer has a binary value in memory of, say:

- 1101

then this could be interpreted as a decimal "13" or a decimal "-3".
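To make the "same bits, different meaning" point concrete, here is a small Python sketch that reads one bit pattern three ways. The function names are invented for this illustration, and the input is assumed to be a non-empty string of 0s and 1s.

```python
def as_unsigned(bits: str) -> int:
    """Plain binary: every bit carries a positive weight."""
    return int(bits, 2)


def as_twos_complement(bits: str) -> int:
    """Two's complement: patterns with the top bit set sit 2**n lower."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value


def as_sign_magnitude(bits: str) -> int:
    """Sign-magnitude: the top bit is the sign, the rest is the size."""
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude


for pattern in ("1101", "1011"):
    print(pattern, as_unsigned(pattern),
          as_twos_complement(pattern), as_sign_magnitude(pattern))
```

Nothing stored in the pattern says which reading is intended; that choice belongs entirely to the program doing the reading.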

Fixed-point formats are often used in business calculations (such as with spreadsheets or COBOL), where floating-point with insufficient precision is unacceptable when dealing with money. It is helpful to study them to see how fractions can be stored in binary.

First, we have to decide how many bits we are using to store the fractional part of a number, and how many we are using to store the integer part. Let's say that we are using a 32-bit format, with 16 bits for the integer and 16 for the fraction.

How are the fractional bits used? They continue the pattern set by the integer bits: if the eight's bit is followed by the four's bit, then the two's bit, then the one's bit, then of course the next bit is the half's bit, then the quarter's bit, then the 1/8's bit, etc.

Examples (integer bits to the left of the point, fractional bits to the right):

- 0.5 = 1/2 = 00000000 00000000.10000000 00000000
- 1.25 = 1 1/4 = 00000000 00000001.01000000 00000000
- 7.375 = 7 3/8 = 00000000 00000111.01100000 00000000

Now for something tricky: try a fraction like 1/5 (in decimal, this is 0.2). You can't do it exactly. The best you can do is one of these:

- 13107/65536 = 00000000 00000000.00110011 00110011 = 0.1999969... in decimal
- 13108/65536 = 00000000 00000000.00110011 00110100 = 0.2000122... in decimal
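A rough Python sketch of this 16.16 format follows. The names `to_fixed`, `from_fixed` and `show` are hypothetical, and the sketch assumes non-negative values that fit in 16 integer bits; it simply stores a number as a count of 1/65536 units.

```python
FRACTION_BITS = 16
SCALE = 1 << FRACTION_BITS  # 65536 units per whole number


def to_fixed(x: float) -> int:
    """Return the nearest 16.16 fixed-point pattern, as a plain integer."""
    return round(x * SCALE)


def from_fixed(raw: int) -> float:
    """Recover the value a 16.16 pattern stands for."""
    return raw / SCALE


def show(raw: int) -> str:
    """Format the 32 bits with a point between integer and fraction."""
    bits = format(raw, "032b")
    return f"{bits[0:8]} {bits[8:16]}.{bits[16:24]} {bits[24:32]}"


for x in (0.5, 1.25, 7.375, 0.2):
    raw = to_fixed(x)
    print(x, show(raw), from_fixed(raw))
```

The first three values come back unchanged, while 0.2 comes back as the nearest multiple of 1/65536, which is the rounding the two candidates above illustrate.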

While both unsigned and signed integers are used in digital systems, even a 32-bit integer is not enough to handle all the range of numbers a calculator can handle, and that's not even including fractions. To obtain greater range we have to abandon signed integers and fixed-point numbers and go to a "floating-point" format.

In the decimal system, we are familiar with floating-point numbers of the form:

- 1.1030402 × 10^{5} = 1.1030402 × 100000 = 110304.02

or, more compactly:

- 1.1030402E5

which means "1.1030402 times 1 followed by 5 zeroes". We have a certain numeric value (1.1030402) known as a "significand", multiplied by a power of 10 (E5, meaning 10^{5}) given by the "exponent". A negative exponent means a correspondingly small number, for example:

- 2.3434E-6 = 2.3434 × 10^{-6} = 2.3434 × 0.000001 = 0.0000023434

A common 64-bit binary floating-point format (the one standardized as IEEE 754 double precision) consists of:

- an 11-bit binary exponent, using "excess-1023" format. Excess-1023 means the exponent appears as an unsigned binary integer from 0 to 2047, and you have to subtract 1023 from it to get the actual signed value
- a 52-bit significand, also an unsigned binary number, defining a fractional value with a leading implied "1"
- a sign bit, giving the sign of the number.

These bits are laid out across 8 bytes as follows:

- byte 0: S x10 x9 x8 x7 x6 x5 x4
- byte 1: x3 x2 x1 x0 m51 m50 m49 m48
- byte 2: m47 m46 m45 m44 m43 m42 m41 m40
- byte 3: m39 m38 m37 m36 m35 m34 m33 m32
- byte 4: m31 m30 m29 m28 m27 m26 m25 m24
- byte 5: m23 m22 m21 m20 m19 m18 m17 m16
- byte 6: m15 m14 m13 m12 m11 m10 m9 m8
- byte 7: m7 m6 m5 m4 m3 m2 m1 m0

where "S" denotes the sign bit, "x" denotes an exponent bit, and "m" denotes a significand bit. Once the bits here have been extracted, they are converted with the computation:

- <sign> × (1 + <fractional significand>) × 2^{<exponent> - 1023}
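The extraction and conversion can be tried out in Python, whose `float` is a 64-bit value of this kind on common platforms. The sketch below uses the standard `struct` module; `decode_double` and `rebuild` are names invented here, and `rebuild` deliberately handles only ordinary ("normal") numbers, not zero, the very smallest values, or the special values.

```python
import struct


def decode_double(x: float):
    """Split a 64-bit float into its sign, exponent and significand fields."""
    # Pack big-endian so that byte 0 holds the sign bit, as in the layout above.
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63                  # 1 bit
    exponent = (raw >> 52) & 0x7FF    # 11 bits, excess-1023
    fraction = raw & ((1 << 52) - 1)  # 52 bits after the implied leading 1
    return sign, exponent, fraction


def rebuild(sign: int, exponent: int, fraction: int) -> float:
    """Apply the conversion formula (normal numbers only)."""
    return (-1) ** sign * (1 + fraction / 2 ** 52) * 2.0 ** (exponent - 1023)


for x in (1.0, -2.5, 0.1):
    fields = decode_double(x)
    print(x, fields, rebuild(*fields))
```

For 1.0 the stored exponent is exactly 1023 and the fraction field is zero, since the leading "1" is implied rather than stored.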

This format gives the following range of values:

| | maximum | minimum |
|---|---|---|
| positive | 1.7976931348623157E+308 | 4.940656458412465E-324 |
| negative | -4.940656458412465E-324 | -1.7976931348623157E+308 |

The spec also defines several special values that are not defined numbers, and are known as "NaNs", for "Not A Number". These are used by programs to designate invalid results (such as dividing zero by zero) and the like. You will rarely encounter them and NaNs will not be discussed further here.

Some programs also use 32-bit floating-point numbers. The most common scheme uses a 23-bit significand with a sign bit, plus an 8-bit exponent in "excess-127" format, giving 7 valid decimal digits.

- byte 0: S x7 x6 x5 x4 x3 x2 x1
- byte 1: x0 m22 m21 m20 m19 m18 m17 m16
- byte 2: m15 m14 m13 m12 m11 m10 m9 m8
- byte 3: m7 m6 m5 m4 m3 m2 m1 m0

The bits are converted to a numeric value with the computation:

- <sign> × (1 + <fractional significand>) × 2^{<exponent> - 127}

This format gives the following range of values:

| | maximum | minimum |
|---|---|---|
| positive | 3.402823E+38 | 1.401298E-45 |
| negative | -1.401298E-45 | -3.402823E+38 |

Such floating-point numbers are known as "reals" or "floats" in general, but with a number of inconsistent variations, depending on context:

A 32-bit float value is sometimes called a "real32" or a "single", meaning "single-precision floating-point value".

A 64-bit float is sometimes called a "real64" or a "double", meaning "double-precision floating-point value".

The term "real" without any elaboration generally means a 64-bit value, while the term "float" similarly generally means a 32-bit value.

Once again, remember that bits are bits. If you have 8 bytes stored in computer memory, it might be a 64-bit real, two 32-bit reals, or four 16-bit signed or unsigned integers, or some other kind of data that fits into 8 bytes.

The only difference is how the computer interprets them. If the computer stored four unsigned integers and then read them back from memory as a 64-bit real, it almost always would be a perfectly valid real number, though it would be junk data.
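Python's standard `struct` module makes it easy to watch this reinterpretation happen. In the sketch below the four 16-bit integers are not arbitrary: they were picked for this example so that the resulting 64-bit real is recognizable rather than junk.

```python
import math
import struct

# Four unsigned 16-bit integers, chosen for this illustration.
four_ints = (0x4009, 0x21FB, 0x5444, 0x2D18)

# Store them as 8 consecutive bytes, most significant byte first.
raw = struct.pack(">4H", *four_ints)

# Read the very same 8 bytes back under different interpretations.
(as_double,) = struct.unpack(">d", raw)    # one 64-bit real
as_two_floats = struct.unpack(">2f", raw)  # two 32-bit reals
as_signed = struct.unpack(">4h", raw)      # four signed 16-bit integers

print(raw.hex(" "))
print(as_double, as_double == math.pi)
print(as_two_floats)
print(as_signed)
```

The bytes never change between the `unpack` calls; only the format string, that is, the interpretation, does.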

So now our computer can handle positive and negative numbers with fractional parts. However, even with floating-point numbers you run into some of the same problems that you did with integers:

- As with integers, you only have a finite range of values to deal with. Granted, it's a much bigger range of values than even a 32-bit integer, but if you keep multiplying numbers you'll eventually get one bigger than the real value can hold and have a "numeric overflow".
If you keep dividing you'll eventually get one with a negative exponent too big for the real value to hold and have a "numeric underflow". Remember that a negative exponent gives the number of places to the right of the decimal point and means a really small number.

The maximum real value is sometimes called "machine infinity", since that's the biggest value the computer can wrap its little silicon brain around.

- A related problem is that you have only limited "precision" as well. That is, you can only represent 15 decimal digits with a 64-bit real. If the result of a multiply or a divide has more digits than that, they're just dropped and the computer doesn't inform you of an error.
This means that if you add a very small number to a very large one, the result is just the large one. The small number was too small to even show up in 15 or 16 digits of resolution, and the computer effectively discards it. If you are performing computations and you start getting really insane answers from things that normally work, you may need to check the range of your data. It's possible to "scale" the values to get more accurate results.

It also means that if you do floating-point computations, there's likely to be a small error in the result since some lower digits have been dropped. This effect is unnoticeable in most cases, but if you do some math analysis that requires lots of computations, the errors tend to build up and can throw off the results.

The faction of people who use computers for doing math understand these errors very well, and have methods for minimizing the effects of such errors, as well as for estimating how big the errors are.

By the way, this "precision" problem is not the same as the "range" problem at the top of this list. The range issue deals with the maximum size of the exponent, while the resolution issue deals with the number of digits that can fit into the significand.

- Another more obscure error that creeps in with floating-point numbers is the fact that the significand is expressed as a binary fraction that doesn't necessarily perfectly match a decimal fraction.
That is, if you want to do a computation on a decimal fraction that is a neat sum of reciprocal powers of two, such as 0.75, the binary number that represents this fraction will be 0.11, or 1/2 + 1/4, and all will be fine.

Unfortunately, in many cases you can't get a sum of these "reciprocal powers of 2" that precisely matches a specific decimal fraction, and the results of computations will be very slightly off, way down in the very small parts of a fraction. For example, the decimal fraction "0.1" is equivalent to an infinitely repeating binary fraction: 0.000110011 ...
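Each of the problems in this list can be reproduced in a few lines. The sketch below assumes Python's `float` is the 64-bit binary format described earlier, which is the case on common platforms; the variable names are just for this example.

```python
from decimal import Decimal

# Range: multiplying past the largest value overflows to infinity,
# and dividing the smallest value underflows to zero.
overflowed = 1e308 * 10
underflowed = 5e-324 / 2

# Precision: 1.0 is too small to register next to 1e16.
absorbed = (1e16 + 1.0 == 1e16)

# Binary fractions: 0.75 is a sum of reciprocal powers of two, 0.1 is not.
three_quarters_exact = (0.5 + 0.25 == 0.75)
tenths_exact = (0.1 + 0.2 == 0.3)

# Decimal(float) shows the exact value the machine actually stored for 0.1.
stored_tenth = Decimal(0.1)

print(overflowed, underflowed, absorbed)
print(three_quarters_exact, tenths_exact)
print(stored_tenth)
```

Note that none of these operations raises an error in Python; the results are silently inexact, which is what makes them easy to overlook.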

However, high-level programming languages such as LISP and Python offer an abstract number that may be an expanded type such as *rational*, *bignum*, or *complex*. Programmers in LISP or Python (among others) have some assurance that their programming systems will Do The Right Thing with mathematical operations. Due to operator overloading, mathematical operations on any number (whether signed, unsigned, rational, floating-point, fixed-point, integral, or complex) are written exactly the same way. Others, such as Rexx or Java, provide decimal floating-point, which avoids many "unexpected" results.
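As a brief illustration using Python's built-in integers and its standard `fractions` and `decimal` modules (the variable names are arbitrary), note that the same `+`, `*` and `**` operators are used throughout:

```python
from decimal import Decimal
from fractions import Fraction

# Integers grow as needed ("bignum"), so 2**100 does not overflow.
big = 2 ** 100

# Rationals keep exact numerators and denominators.
rational_sum = Fraction(1, 10) + Fraction(2, 10)

# Decimal floating-point stores decimal digits, so 0.1 + 0.2 is exactly 0.3.
decimal_sum = Decimal("0.1") + Decimal("0.2")

# Complex numbers use the same operators as everything else.
product = (1 + 2j) * (3 - 1j)

print(big, rational_sum, decimal_sum, product)
```

The price is speed: these types are handled in software rather than directly by the hardware.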

- 0100 0110 (hex 46)

to the letter "F", for example. The computer sends such "character codes" to its display to print the characters that make up the text you see.

There is a standard binary encoding for western text characters, known as the "American Standard Code for Information Interchange (ASCII)". The following table shows the characters represented by ASCII, with each character followed by its value in decimal ("d"), hex ("h"), and octal ("o"):

ASCII Table

| ch | ctl | d | h | o | ch | d | h | o | ch | d | h | o | ch | d | h | o |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NUL | ^@ | 0 | 0 | 0 | sp | 32 | 20 | 40 | @ | 64 | 40 | 100 | ` | 96 | 60 | 140 |
| SOH | ^A | 1 | 1 | 1 | ! | 33 | 21 | 41 | A | 65 | 41 | 101 | a | 97 | 61 | 141 |
| STX | ^B | 2 | 2 | 2 | " | 34 | 22 | 42 | B | 66 | 42 | 102 | b | 98 | 62 | 142 |
| ETX | ^C | 3 | 3 | 3 | # | 35 | 23 | 43 | C | 67 | 43 | 103 | c | 99 | 63 | 143 |
| EOT | ^D | 4 | 4 | 4 | $ | 36 | 24 | 44 | D | 68 | 44 | 104 | d | 100 | 64 | 144 |
| ENQ | ^E | 5 | 5 | 5 | % | 37 | 25 | 45 | E | 69 | 45 | 105 | e | 101 | 65 | 145 |
| ACK | ^F | 6 | 6 | 6 | & | 38 | 26 | 46 | F | 70 | 46 | 106 | f | 102 | 66 | 146 |
| BEL | ^G | 7 | 7 | 7 | ' | 39 | 27 | 47 | G | 71 | 47 | 107 | g | 103 | 67 | 147 |
| BS | ^H | 8 | 8 | 10 | ( | 40 | 28 | 50 | H | 72 | 48 | 110 | h | 104 | 68 | 150 |
| HT | ^I | 9 | 9 | 11 | ) | 41 | 29 | 51 | I | 73 | 49 | 111 | i | 105 | 69 | 151 |
| LF | ^J | 10 | a | 12 | * | 42 | 2a | 52 | J | 74 | 4a | 112 | j | 106 | 6a | 152 |
| VT | ^K | 11 | b | 13 | + | 43 | 2b | 53 | K | 75 | 4b | 113 | k | 107 | 6b | 153 |
| FF | ^L | 12 | c | 14 | , | 44 | 2c | 54 | L | 76 | 4c | 114 | l | 108 | 6c | 154 |
| CR | ^M | 13 | d | 15 | - | 45 | 2d | 55 | M | 77 | 4d | 115 | m | 109 | 6d | 155 |
| SO | ^N | 14 | e | 16 | . | 46 | 2e | 56 | N | 78 | 4e | 116 | n | 110 | 6e | 156 |
| SI | ^O | 15 | f | 17 | / | 47 | 2f | 57 | O | 79 | 4f | 117 | o | 111 | 6f | 157 |
| DLE | ^P | 16 | 10 | 20 | 0 | 48 | 30 | 60 | P | 80 | 50 | 120 | p | 112 | 70 | 160 |
| DC1 | ^Q | 17 | 11 | 21 | 1 | 49 | 31 | 61 | Q | 81 | 51 | 121 | q | 113 | 71 | 161 |
| DC2 | ^R | 18 | 12 | 22 | 2 | 50 | 32 | 62 | R | 82 | 52 | 122 | r | 114 | 72 | 162 |
| DC3 | ^S | 19 | 13 | 23 | 3 | 51 | 33 | 63 | S | 83 | 53 | 123 | s | 115 | 73 | 163 |
| DC4 | ^T | 20 | 14 | 24 | 4 | 52 | 34 | 64 | T | 84 | 54 | 124 | t | 116 | 74 | 164 |
| NAK | ^U | 21 | 15 | 25 | 5 | 53 | 35 | 65 | U | 85 | 55 | 125 | u | 117 | 75 | 165 |
| SYN | ^V | 22 | 16 | 26 | 6 | 54 | 36 | 66 | V | 86 | 56 | 126 | v | 118 | 76 | 166 |
| ETB | ^W | 23 | 17 | 27 | 7 | 55 | 37 | 67 | W | 87 | 57 | 127 | w | 119 | 77 | 167 |
| CAN | ^X | 24 | 18 | 30 | 8 | 56 | 38 | 70 | X | 88 | 58 | 130 | x | 120 | 78 | 170 |
| EM | ^Y | 25 | 19 | 31 | 9 | 57 | 39 | 71 | Y | 89 | 59 | 131 | y | 121 | 79 | 171 |
| SUB | ^Z | 26 | 1a | 32 | : | 58 | 3a | 72 | Z | 90 | 5a | 132 | z | 122 | 7a | 172 |
| ESC | ^[ | 27 | 1b | 33 | ; | 59 | 3b | 73 | [ | 91 | 5b | 133 | { | 123 | 7b | 173 |
| FS | ^\ | 28 | 1c | 34 | < | 60 | 3c | 74 | \ | 92 | 5c | 134 | \| | 124 | 7c | 174 |
| GS | ^] | 29 | 1d | 35 | = | 61 | 3d | 75 | ] | 93 | 5d | 135 | } | 125 | 7d | 175 |
| RS | ^^ | 30 | 1e | 36 | > | 62 | 3e | 76 | ^ | 94 | 5e | 136 | ~ | 126 | 7e | 176 |
| US | ^_ | 31 | 1f | 37 | ? | 63 | 3f | 77 | _ | 95 | 5f | 137 | DEL | 127 | 7f | 177 |

The strange characters listed in the leftmost column, such as "FF" and "BS", do not correspond to text characters. Instead, they correspond to "control" characters that, when sent to a printer or display device, execute various control functions. For example, "FF" is a "form feed" or printer page eject, "BS" is a backspace, and "BEL" causes a beep ("bell"). In a text editor, they'll just be shown as a little white block or a blank space or (in some cases) little smiling faces, musical notes, and other bizarre items. To type them in, in many applications you can hold down the CTRL key and press an appropriate code. For example, pressing CTRL and entering "G" gives CTRL-G, or "^G" in the table above, the BEL character.

The ASCII table above only defines 128 characters, which implies that ASCII characters only need 7 bits. However, since most computers store information in terms of bytes, normally there will be one character stored to a byte. This extra bit allows a second set of 128 characters, an "extended" character set, to be defined beyond the 128 defined by ASCII.

In practice, there are a number of different extended character sets, providing such features as math symbols, cute little line-pattern building block characters for building forms, and extension characters for non-English languages. The extensions are not highly standardized and tend to lead to confusion.

This table serves to emphasize one of the main ideas of this document: bits are bits. In this case, you have bits representing characters. You can describe the particular code for a particular character in decimal, octal, or hexadecimal, but it's still the same code. The value that is expressed, whether it is in decimal, octal, or hex, is simply the same pattern of bits.
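In Python, the built-in `ord` and `chr` functions convert between a character and its code, and format specifiers print that one code in whichever base you like. The helper `describe` below is a made-up name for this sketch.

```python
def describe(ch: str) -> str:
    """Show a single character's code in decimal, hex, octal and binary."""
    code = ord(ch)
    return f"{ch!r}: decimal {code}, hex {code:x}, octal {code:o}, binary {code:08b}"


print(describe("F"))
print(describe("a"))

# Going the other way: the same code, however it is written, is the same character.
same_letter = chr(70) == chr(0x46) == chr(0o106) == "F"
print(same_letter)
```

The three literals `70`, `0x46` and `0o106` are simply three spellings of one bit pattern.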

Of course, you normally want to use many characters at once to display sentences and the like, such as:

- Tiger, tiger burning bright!

This is stored as a sequence of ASCII character codes, one per byte, shown here in hex:

- 54 69 67 65 72 2c 20 74 69 67 65 72 20 62 75 ...

Now let's consider a particularly confusing issue for the newcomer: the fact that you can represent a number in ASCII as a string, for example:

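- 1.537E3

This string is stored as the ASCII character codes (in hex):

- 31 2e 35 33 37 45 33

whereas the number itself is held in memory as a binary value, a pattern of bits such as:

- 10110011 10100000 00110110 11011111

and to show that bit pattern as text, each "1" and "0" must in turn be encoded as an ASCII character:

- 31 30 31 31 30 30 31 31 20 31 30 31 30 30 ...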

Confused? Don't feel too bad, even experienced people get subtly confused with this issue sometimes. The essential point is that the values the computer works on are just sets of bits. For you to actually see the values, you have to get an ASCII representation of them. Or to put it simply: machines work with bits and bytes, humans work with ASCII, and there has to be translation to allow the two to communicate.
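The three layers can be laid side by side in Python. This sketch assumes the number is stored as a 32-bit float via the standard `struct` module; the bit pattern it prints is the one for that particular format and is not meant to match any pattern shown purely for illustration.

```python
import struct

text = "1.537E3"

# Layer 1: the characters of the string, as ASCII codes in hex.
ascii_codes = text.encode("ascii").hex(" ")

# Layer 2: the number the string describes, translated for the machine.
number = float(text)
(raw,) = struct.unpack(">I", struct.pack(">f", number))
float_bits = format(raw, "032b")

# Layer 3: to display those bits, each "1" and "0" becomes a character again.
bits_as_text = float_bits.encode("ascii").hex(" ")

print(ascii_codes)
print(number, float_bits)
print(bits_as_text[:41], "...")
```

`float()` and `format()` are the translators here: one turns human-readable text into machine bits, the other turns bits back into text.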

8 bits is clearly not enough to allow representation of, say, Japanese characters, since their basic set is a little over 2,000 different characters. As a result, to encode Asian languages such as Japanese or Chinese, computers use a 16-bit code for characters. There are a variety of specs for encoding non-Western characters, the most widely used being "Unicode", which provides character codes for Western, Asian, Indic, Hebrew, and other character sets, including even Egyptian hieroglyphics.

- 10

- 101

- 46,535 greater than or equal to 32,768? Yes, subtract, write: 1
- 13,767 greater than or equal to 16,384? No, write: 0
- 13,767 greater than or equal to 8,192? Yes, subtract, write: 1
- 5,575 greater than or equal to 4,096? Yes, subtract, write: 1
- 1,479 greater than or equal to 2,048? No, write: 0
- 1,479 greater than or equal to 1,024? Yes, subtract, write: 1
- 455 greater than or equal to 512? No, write: 0
- 455 greater than or equal to 256? Yes, subtract, write: 1
- 199 greater than or equal to 128? Yes, subtract, write: 1
- 71 greater than or equal to 64? Yes, subtract, write: 1
- 7 greater than or equal to 32? No, write: 0
- 7 greater than or equal to 16? No, write: 0
- 7 greater than or equal to 8? No, write: 0
- 7 greater than or equal to 4? Yes, subtract, write: 1
- 3 greater than or equal to 2? Yes, subtract, write: 1
- 1 greater than or equal to 1? Yes, subtract, write: 1

This gives:

- 46,535 decimal = 1011 0101 1100 0111 binary = b5c7 hex

- b5c7 hex
- = 11 × 16^{3} + 5 × 16^{2} + 12 × 16 + 7 × 1
- = 11 × 4096 + 5 × 256 + 12 × 16 + 7
- = 45,056 + 1,280 + 192 + 7 = 46,535
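Both directions of the conversion are easy to mechanize. The following Python sketch mirrors the compare-and-subtract steps and the weighted sum shown above; `to_binary` and `from_digits` are hypothetical helper names, and the input is assumed to be a non-negative integer (or a lowercase digit string) that fits the stated width and base.

```python
def to_binary(value: int, bits: int = 16) -> str:
    """Convert to binary by testing each power of two, largest first."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the requested number of bits")
    written = []
    for power in range(bits - 1, -1, -1):
        weight = 1 << power
        if value >= weight:      # "Yes, subtract, write: 1"
            value -= weight
            written.append("1")
        else:                    # "No, write: 0"
            written.append("0")
    return "".join(written)


def from_digits(text: str, base: int) -> int:
    """Convert back by accumulating digit weights, left to right."""
    total = 0
    for ch in text:
        total = total * base + "0123456789abcdef".index(ch)
    return total


print(to_binary(46535))
print(from_digits("b5c7", 16))
```

In everyday code the built-ins `format(n, "b")` and `int(text, base)` do the same jobs; the loops are spelled out here only to match the hand method.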

The next interesting topic is the operations that can be performed on binary numbers. We'll consider signed and unsigned integers first.

- 1111 0000 OR 1010 1010 = 1111 1010
- 1111 0000 AND 1010 1010 = 1010 0000
- 1111 0000 XOR 1010 1010 = 0101 1010
- NOT 1010 1010 = 0101 0101

The rules for these operations are as follows:

- AND: The result is 1 if both values are 1.
- OR: The result is 1 if either value is 1.
- XOR: The result is 1 if exactly one of the two values is 1.
- NOT: The result is 1 if the value is 0 (this is also called "inverting").
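These four operations exist as operators in most languages. A short Python sketch, using the same 8-bit operands as the examples above; the `MASK` constant is an assumption of this example, needed because Python integers are not limited to 8 bits.

```python
MASK = 0xFF  # keep results to 8 bits

a = 0b11110000
b = 0b10101010

results = {
    "OR": a | b,
    "AND": a & b,
    "XOR": a ^ b,
    # Without the mask, ~b would be a negative number of unlimited width.
    "NOT": ~b & MASK,
}

for name, value in results.items():
    print(f"{name:>3}: {value:08b}")
```

Each operation works on every bit position independently; no carries pass between positions, unlike addition.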

Binary addition is a more interesting operation. If you remember the formal rules for adding a decimal number, you add the numbers one digit at a time, and if the result is ten or more, you perform a "carry" to add to the next digit. For example, if you perform the addition:

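- 374 + 452

you first add the rightmost digits:

- 4 + 2 = 6

then the next digits, which produces a carry:

- 7 + 5 = 12

and finally the leftmost digits, including the carry:

- ( 1 + ) 3 + 4 = 8

giving a result of 826.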

Performing additions in binary is essentially the same, except that the number of possible results of an addition of two bits is minimal:

- 0 + 0 = 0
- 0 + 1 = 1
- 1 + 0 = 1
- 1 + 1 = 10

- 0011 1010 (decimal 58)
- + 0001 1101 (decimal 29)
- = 0101 0111 (decimal 87)

In this addition a carry occurs out of three bit positions: the 8's bit, the 16's bit, and the 32's bit. The equivalent decimal values are shown in parentheses. Assuming that we are adding unsigned integers, notice what happens if we add:

- 1000 1001 (decimal 137)
- + 0111 1011 (decimal 123)
- = (1) 0000 0100 (decimal 256 + 4)

Here a carry occurs out of every bit position except the 4's bit, including the topmost bit. The result, equivalent to a decimal 260, is beyond the range of an 8-bit unsigned value (maximum of 255) and won't fit into 8 bits, so all you get is a value of 4, since the "carry-out" bit is lost. A "numeric overflow" has occurred.
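Python integers never overflow on their own, so an 8-bit register has to be imitated by masking. The sketch below does that; `add_unsigned` is a name invented for this illustration.

```python
def add_unsigned(a: int, b: int, bits: int = 8):
    """Add two unsigned values as a fixed-width machine would.

    Returns the bits that fit in the register and the carry-out bit.
    """
    total = a + b
    result = total & ((1 << bits) - 1)  # keep only the low `bits` bits
    carry_out = total >> bits           # 1 if the true sum did not fit
    return result, carry_out


print(add_unsigned(0b00111010, 0b00011101))  # 58 + 29
print(add_unsigned(0b10001001, 0b01111011))  # 137 + 123
```

A carry-out of 1 is the machine's only hint that the stored result is wrong; whether anything checks that hint depends on the program.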

OK, now to get really tricky. Remember how we defined signed integers as two's complement values? That is, we chop the range of binary integer values in half and assign the high half to negative values. In the case of 4-bit values:

| binary | 0000 | 0001 | ... | 0110 | 0111 | 1000 | 1001 | ... | 1110 | 1111 |
|---|---|---|---|---|---|---|---|---|---|---|
| decimal | 0 | 1 | ... | 6 | 7 | -8 | -7 | ... | -2 | -1 |

Now we can discuss exactly why this scheme makes life easier for the computer. Two's complement arithmetic has some interesting properties, the most significant being that it makes subtraction the same as addition. To see how this works, pretend that you have the binary values above written on a strip of stiff paper taped at the ends into a loop, with each binary value written along the loop, and the loop joined together between the "1111" and "0000" values.

Now further consider that you have a little slider on the loop that you can move in the direction of increasing values, but not backwards, sort of like a slide rule that looks like a ring, with 16 binary values on it.

If you wished to add, say, 2 to 4, you would just move the slider up two values from 4, and you would get 6. Now let's see what happens if you want to subtract 1 from 3. This is the same as adding -1 to 3, and since a -1 in two's complement is 1111, or the same as decimal 15, you move the slider up 15 values from 3. This takes you all the way around the ring to ... 2.

This is a bizarre way to subtract 1 from a value, or so it seems, but you have to admit from the machine point of view it's just dead simple. Try a few other simple additions and subtractions to see how this works.

The problem is that overflow conditions have become much trickier. Consider the following examples:

- 0111 0100 (116) + 0001 0101 (21) = 1000 1001 (-119, where 137 was expected)
- 1000 1110 (-114) + 1001 0001 (-111) = (1) 0001 1111 (31, where -225 was expected)

In the first case, we add two positive numbers and end up with a negative one. In the second case, we add two negative numbers and end up with a positive one. Both are absurd results. Again, the results have exceeded the range of values that can be represented in the coding system we have used. One nice thing is that you can't get into trouble adding a negative number to a positive one, since obviously the result is going to be within the range of allowed values.
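The sign-based test described here (same-signed inputs, differently signed result) is straightforward to write down. A Python sketch follows, with made-up helper names and an 8-bit width assumed by default:

```python
def to_signed(raw: int, bits: int = 8) -> int:
    """Read an unsigned bit pattern as a two's complement value."""
    return raw - (1 << bits) if raw >> (bits - 1) else raw


def add_signed(a: int, b: int, bits: int = 8):
    """Add two signed values in a fixed width; report result and overflow."""
    raw = (a + b) & ((1 << bits) - 1)   # plain addition, extra carry dropped
    result = to_signed(raw, bits)
    same_sign = (a >= 0) == (b >= 0)
    # Overflow is only possible when both inputs share a sign
    # and the result comes out with the other sign.
    overflow = same_sign and (result >= 0) != (a >= 0)
    return result, overflow


print(add_signed(to_signed(0b01110100), to_signed(0b00010101)))
print(add_signed(to_signed(0b10001110), to_signed(0b10010001)))
print(add_signed(3, -1))           # subtraction done as addition
print((3 + 0b1111) & 0b1111)       # the same subtraction on the 4-bit ring
```

The last line is the slider on the loop: moving up 15 places from 3 and discarding the wrap-around lands on 2.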

As an aside, if you want to convert a positive binary value to the corresponding negative two's complement value, all you have to do is invert (NOT) all the bits and add 1. For example, to convert a binary 7 to -7:

- 0000 0111 → 1111 1000 + 1 → 1111 1001

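This covers addition and subtraction, but what about multiplication and division? Well, take the binary value:

- 0010 1001 = 41

If we shift it to the left by one bit, we multiply it by 2:

- 0101 0010 = 82

and if we shift it to the right by one bit, we divide it by 2:

- 0001 0100 = 20 (losing a 1 that was shifted off the right side)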

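One interesting feature of this scheme is the complication that two's complement introduces. Suppose we have a two's complement number such as:

- 1001 1010 = -102

If we simply shift it to the right by one bit and fill the top bit with a 0, we get:

- 0100 1101 = 77

which is clearly wrong. What is needed is to keep the sign bit as it was while shifting the rest (an "arithmetic" shift), which gives:

- 1100 1101 = -51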
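Here is a minimal Python sketch of these shifts on 8-bit patterns. The constants and function names are choices made for this example; Python's own `>>` on a negative integer already behaves arithmetically, so the sketch works on unsigned bit patterns to make the difference visible.

```python
BITS = 8
MASK = (1 << BITS) - 1
SIGN = 1 << (BITS - 1)


def to_signed(raw: int) -> int:
    """Read an 8-bit pattern as a two's complement value."""
    return raw - (1 << BITS) if raw & SIGN else raw


def shift_left(raw: int) -> int:
    """Multiply by 2, dropping any bit pushed off the left side."""
    return (raw << 1) & MASK


def logical_shift_right(raw: int) -> int:
    """Divide an unsigned value by 2; the top bit is filled with 0."""
    return raw >> 1


def arithmetic_shift_right(raw: int) -> int:
    """Divide a signed value by 2; the top bit keeps its old value."""
    return (raw >> 1) | (raw & SIGN)


print(shift_left(0b00101001), logical_shift_right(0b00101001))
print(to_signed(logical_shift_right(0b10011010)))
print(to_signed(arithmetic_shift_right(0b10011010)))
```

For non-negative patterns the two right shifts agree, which is why the distinction only matters once negative numbers are involved.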

So that covers the basic operations for signed and unsigned integers. What about floating-point values?

The operations are basically the same. The only difference in addition and subtraction is that you have two quantities that have both a significand and an exponent, and they both have to have the same exponent to allow you to add or subtract. For example, using decimal math:

- 3.13E2 + 2.7E3 = 0.313E3 + 2.7E3 = 3.013E3

Multiplication is simpler, since the significands are multiplied and the exponents are added:

- 3.13E2 × 2.7E3 = (3.13 × 2.7)E(2 + 3) = 8.45E5
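The two rules can be sketched with numbers held as (significand, exponent) pairs. This is a toy model in decimal, using Python's standard `decimal` module so the significands stay exact; the function names are invented here, and no rounding or renormalizing of the result is attempted.

```python
from decimal import Decimal


def add(a, b):
    """Add two (significand, exponent) pairs by matching exponents first."""
    (sig_a, exp_a), (sig_b, exp_b) = a, b
    exp = max(exp_a, exp_b)
    # Rescale each significand so both use the larger exponent.
    sig_a = sig_a * Decimal(10) ** (exp_a - exp)
    sig_b = sig_b * Decimal(10) ** (exp_b - exp)
    return sig_a + sig_b, exp


def multiply(a, b):
    """Multiply significands, add exponents; no alignment is needed."""
    return a[0] * b[0], a[1] + b[1]


x = (Decimal("3.13"), 2)   # 3.13E2
y = (Decimal("2.7"), 3)    # 2.7E3

print(add(x, y))
print(multiply(x, y))
```

Real floating-point hardware does the same thing in binary, with the extra step of rounding the significand back to its fixed number of bits, which is where the small errors mentioned earlier come from.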

Finally, what about higher functions, like square roots and sines and logs and so on?

There are two approaches. First, there are standard algorithms that mathematicians have long known that use the basic add-subtract-multiply-divide operations in some combination and sequence to generate as good an approximation to such values as you would like. For example, sines and cosines can be approximated with a "power series", which is a sum of specified powers of the value to be converted divided by specific constants that follow a certain rule.

Second, you can use something equivalent to the "look-up tables" once used in math texts that give values of sine, cosine, and so on, where you would obtain the nearest values for, say, sine for a given value and then "interpolate" between them to get a good approximation of the value you wanted. (Younger readers may not be familiar with this technique, as calculators have made such tables generally obsolete, but such tables can be found in old textbooks.) In the case of the computer, it can store a table of sines or the like, look up values, and then interpolate between them to give the value you want.
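Both approaches can be sketched for the sine function. In this Python illustration, `sine_series` sums the first few terms of the power series, and `sine_lookup` interpolates in a table with one entry per degree. The names, the number of terms and the table spacing are arbitrary choices for this example, and the table is filled using the library sine purely for convenience.

```python
import math


def sine_series(x: float, terms: int = 10) -> float:
    """Approximate sin(x) as x - x^3/3! + x^5/5! - ... using basic arithmetic."""
    total = 0.0
    term = x
    for n in range(terms):
        total += term
        # Each term is the previous one times -x^2 / ((2n+2)(2n+3)).
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total


STEP = math.pi / 180                                  # one degree, in radians
TABLE = [math.sin(i * STEP) for i in range(91)]       # 0 to 90 degrees


def sine_lookup(x: float) -> float:
    """Approximate sin(x) for 0 <= x <= pi/2 by linear interpolation."""
    position = x / STEP
    index = min(int(position), 89)    # lower table entry
    weight = position - index         # how far toward the next entry
    return TABLE[index] + weight * (TABLE[index + 1] - TABLE[index])


for x in (0.5, 1.0):
    print(x, sine_series(x), sine_lookup(x), math.sin(x))
```

The series buys accuracy with more arithmetic, while the table buys speed with memory and a coarser result; which trade is better depends on the machine.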

That covers the basics of bits and bytes and math for a computer. However, just to confuse things a little bit, while computers normally use binary floating-point numbers, calculators normally don't.

Recall that there is a slight error in translating from decimal to binary and back again. While this problem is readily handled by someone familiar with it, calculators in general have to be simple to operate and it would be better if this particular problem didn't arise, particularly in financial calculations where minor discrepancies might lead to a tiresome audit.

So calculators generally perform their computations in decimal, using a scheme known as "binary-coded decimal (BCD)". This scheme uses groups of four bits to encode the individual digits in a decimal number:

- 0000 = decimal 0
- 0001 = decimal 1
- ...
- 0111 = decimal 7
- 1000 = decimal 8
- 1001 = decimal 9
- 1010 = ILLEGAL
- 1011 = ILLEGAL
- ...
- 1110 = ILLEGAL
- 1111 = ILLEGAL

- 0001 0000 = decimal 10
- 0001 0001 = decimal 11
- 0001 0010 = decimal 12
- 0001 0011 = decimal 13
- ...
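A short Python sketch of BCD encoding and decoding follows. The function names are hypothetical, and only non-negative whole numbers are handled; real calculators also need a sign and a decimal point position, which are left out here.

```python
def to_bcd(number: int) -> str:
    """Encode each decimal digit as its own group of four bits."""
    return " ".join(format(int(digit), "04b") for digit in str(number))


def from_bcd(bits: str) -> int:
    """Decode space-separated 4-bit groups, rejecting the illegal codes."""
    digits = []
    for group in bits.split():
        value = int(group, 2)
        if value > 9:
            raise ValueError(f"{group} is not a legal BCD digit")
        digits.append(str(value))
    return int("".join(digits))


print(to_bcd(13))
print(to_bcd(7531))
print(from_bcd("0001 0010"))
```

Compare `to_bcd(13)`, which needs eight bits, with the plain binary 1101, which needs four: BCD gives up some storage efficiency in exchange for exact decimal digits.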

One final comment on computer math: there are math software packages that offer "indefinite precision" math, that is, math taken out to a precision defined by the user. What these packages do is define ways of encoding numbers that can change in the number of bits they use to store values, as well as ways of performing math on them. They are very slow compared to normal computer computations, and many applications actually do fine with 64-bit floating-point math.