



A Tutorial on Data Representation

Integers, Floating-Point Numbers, and Characters

Number Systems

Humans use the decimal (base 10) and duodecimal (base 12) number systems for counting and measurements (probably because we have 10 fingers and two big toes). Computers use the binary (base 2) number system, as they are made from binary digital components (known as transistors) operating in two states - off and on. In computing, we also use the hexadecimal (base 16) or octal (base 8) number systems, as a compact form for representing binary numbers.

Decimal (Base 10) Number System

The decimal number system has ten symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, called digits. It uses positional notation. That is, the least-significant digit (right-most digit) is of the order of 10^0 (units or ones), the second right-most digit is of the order of 10^1 (tens), the third right-most digit is of the order of 10^2 (hundreds), and so on, where ^ denotes exponent. For example,

735 = 700 + 30 + 5 = 7×10^2 + 3×10^1 + 5×10^0

We shall denote a decimal number with an optional suffix D if ambiguity arises.

Binary (Base 2) Number System

The binary number system has two symbols: 0 and 1, called bits. It is also a positional notation, for example,

10110B = 10000B + 0000B + 100B + 10B + 0B = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0

We shall denote a binary number with a suffix B. Some programming languages denote binary numbers with prefix 0b or 0B (e.g., 0b1001000), or prefix b with the bits quoted (e.g., b'10001111').

A binary digit is called a bit. Eight bits is called a byte (why 8-bit unit? Probably because 8 = 2^3).

Hexadecimal (Base 16) Number System

The hexadecimal number system uses 16 symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F, called hex digits. It is a positional notation, for example,

A3EH = A00H + 30H + EH = 10×16^2 + 3×16^1 + 14×16^0

We shall denote a hexadecimal number (in short, hex) with a suffix H. Some programming languages denote hex numbers with prefix 0x or 0X (e.g., 0x1A3C5F), or prefix x with hex digits quoted (e.g., x'C3A4D98B').

Each hexadecimal digit is also called a hex digit. Most programming languages accept lowercase 'a' to 'f' as well as uppercase 'A' to 'F'.

Computers use the binary number system in their internal operations, as they are built from binary digital electronic components with two states - off and on. However, writing or reading a long sequence of binary bits is cumbersome and error-prone (try to read this binary string: 1011 0011 0100 0011 0001 1101 0001 1000B, which is the same as hexadecimal B343 1D18H). The hexadecimal system is used as a compact form or shorthand for binary bits. Each hex digit is equivalent to 4 binary bits, i.e., shorthand for 4 bits, as follows:

Hexadecimal Binary Decimal
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
A 1010 10
B 1011 11
C 1100 12
D 1101 13
E 1110 14
F 1111 15

Conversion from Hexadecimal to Binary

Replace each hex digit by the 4 equivalent bits (as listed in the above table), for example,

A3C5H = 1010 0011 1100 0101B
102AH = 0001 0000 0010 1010B

Conversion from Binary to Hexadecimal

Starting from the right-most bit (least-significant bit), replace each group of 4 bits by the equivalent hex digit (pad the left-most bits with zeros if necessary), for example,

1001001010B = 0010 0100 1010B = 24AH
10001011001011B = 0010 0010 1100 1011B = 22CBH
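These conversions can be checked programmatically. Below is a minimal Java sketch using the JDK's built-in radix support (Integer.parseInt with a radix, Integer.toHexString, and Integer.toBinaryString); note that these methods drop leading zeros and print hex digits in lowercase.

```java
public class HexBinary {
    public static void main(String[] args) {
        // Parse a binary string and print its hexadecimal shorthand
        int n = Integer.parseInt("1001001010", 2);
        System.out.println(Integer.toHexString(n));     // prints "24a" (24AH)

        // Parse a hex string and print its binary expansion
        int m = Integer.parseInt("A3C5", 16);
        System.out.println(Integer.toBinaryString(m));  // prints "1010001111000101"
    }
}
```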

It is important to note that a hexadecimal number provides a compact form or shorthand for representing binary bits.

Conversion from Base r to Decimal (Lowly 10)

Given an n-digit base r number: d(n-1) d(n-2) d(n-3) ... d2 d1 d0 (base r), the decimal equivalent is given by:

d(n-1)×r^(n-1) + d(n-2)×r^(n-2) + ... + d1×r^1 + d0×r^0

For examples,

A1C2H = 10×16^3 + 1×16^2 + 12×16^1 + 2×16^0 = 41410 (base 10)
10110B = 1×2^4 + 1×2^2 + 1×2^1 = 22 (base 10)

Conversion from Decimal (Base 10) to Base r

Use repeated division/remainder. For example,

To convert 261(base 10) to hexadecimal:
  261/16 => quotient=16 remainder=5
  16/16  => quotient=1  remainder=0
  1/16   => quotient=0  remainder=1 (quotient=0 stop)
  Hence, 261D = 105H (Collect the hex digits from the remainders in reverse order)

The above procedure is actually applicable to conversion between any two number systems. For example,

To convert 1023(base 4) to base 3:
  1023(base 4)/3 => quotient=25D remainder=0
  25D/3          => quotient=8D  remainder=1
  8D/3           => quotient=2D  remainder=2
  2D/3           => quotient=0   remainder=2 (quotient=0 stop)
  Hence, 1023(base 4) = 2210(base 3)
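The repeated division/remainder procedure can be sketched in Java. The helper name toBaseR below is hypothetical (not a library method), and the sketch assumes a non-negative input and a radix between 2 and 16.

```java
public class BaseConvert {
    // Convert a non-negative integer to base r (2 <= r <= 16) by
    // repeated division, collecting the remainders in reverse order.
    static String toBaseR(int n, int r) {
        if (n == 0) return "0";
        String digits = "0123456789ABCDEF";
        StringBuilder sb = new StringBuilder();
        while (n > 0) {
            sb.append(digits.charAt(n % r));  // remainder = next digit (right to left)
            n /= r;
        }
        return sb.reverse().toString();       // reverse to get the most significant digit first
    }

    public static void main(String[] args) {
        System.out.println(toBaseR(261, 16));  // prints "105" (261D = 105H)
        System.out.println(toBaseR(75, 3));    // prints "2210" (1023 base 4 = 75D = 2210 base 3)
    }
}
```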

Conversion between Two Number Systems with Fractional Part

  1. Separate the integral and the fractional parts.
  2. For the integral part, divide by the target radix repeatedly, and collect the remainders in reverse order.
  3. For the fractional part, multiply the fractional part by the target radix repeatedly, and collect the integral parts in the same order.

Example 1: Decimal to Binary

Convert 18.6875D to binary
Integral Part = 18D
  18/2 => quotient=9 remainder=0
  9/2  => quotient=4 remainder=1
  4/2  => quotient=2 remainder=0
  2/2  => quotient=1 remainder=0
  1/2  => quotient=0 remainder=1 (quotient=0 stop)
  Hence, 18D = 10010B
Fractional Part = .6875D
  .6875*2=1.375 => whole number is 1
  .375*2=0.75   => whole number is 0
  .75*2=1.5     => whole number is 1
  .5*2=1.0      => whole number is 1
  Hence .6875D = .1011B
Combine, 18.6875D = 10010.1011B
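The repeated-multiplication step for the fractional part can be sketched in Java. The helper name fracToBinary is hypothetical; the number of bits is capped because many decimal fractions do not terminate in binary.

```java
public class FracConvert {
    // Convert the fractional part of a decimal value to binary by
    // repeatedly multiplying by 2 and collecting the integral parts
    // in order, up to 'maxBits' bits.
    static String fracToBinary(double frac, int maxBits) {
        StringBuilder sb = new StringBuilder(".");
        for (int i = 0; i < maxBits && frac > 0; i++) {
            frac *= 2;
            int bit = (int) frac;   // the integral part is the next bit
            sb.append(bit);
            frac -= bit;            // keep only the fractional part
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fracToBinary(0.6875, 8));  // prints ".1011"
    }
}
```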

Example 2: Decimal to Hex

Convert 18.6875D to hexadecimal
Integral Part = 18D
  18/16 => quotient=1 remainder=2
  1/16  => quotient=0 remainder=1 (quotient=0 stop)
  Hence, 18D = 12H
Fractional Part = .6875D
  .6875*16=11.0 => whole number is 11D (BH)
  Hence .6875D = .BH
Combine, 18.6875D = 12.BH

Exercises (Number Systems Conversion)

  1. Convert the following decimal numbers into binary and hexadecimal numbers:
    1. 108
    2. 4848
    3. 9000
  2. Convert the following binary numbers into hexadecimal and decimal numbers:
    1. 1000011000
    2. 10000000
    3. 101010101010
  3. Convert the following hexadecimal numbers into binary and decimal numbers:
    1. ABCDE
    2. 1234
    3. 80F
  4. Convert the following decimal numbers into binary equivalent:
    1. 19.25D
    2. 123.456D

Answers: You could use the Windows Calculator (calc.exe) to carry out number system conversion, by setting it to the Programmer or Scientific mode. (Run "calc" ⇒ Select "Settings" menu ⇒ Choose "Programmer" or "Scientific" mode.)

  1. 1101100B, 1001011110000B, 10001100101000B, 6CH, 12F0H, 2328H.
  2. 218H, 80H, AAAH, 536D, 128D, 2730D.
  3. 10101011110011011110B, 1001000110100B, 100000001111B, 703710D, 4660D, 2063D.
  4. ?? (You work it out!)

Computer Memory & Data Representation

Computers use a fixed number of bits to represent a piece of data, which could be a number, a character, or others. An n-bit storage location can represent up to 2^n distinct entities. For example, a 3-bit memory location can hold one of these eight binary patterns: 000, 001, 010, 011, 100, 101, 110, or 111. Hence, it can represent at most 8 distinct entities. You could use them to represent numbers 0 to 7, numbers 8881 to 8888, characters 'A' to 'H', or up to 8 kinds of fruits like apple, orange, banana; or up to 8 kinds of animals like lion, tiger, etc.

Integers, for example, can be represented in 8-bit, 16-bit, 32-bit or 64-bit. You, as the programmer, choose an appropriate bit-length for your integers. Your choice will impose constraint on the range of integers that can be represented. Besides the bit-length, an integer can be represented in various representation schemes, e.g., unsigned vs. signed integers. An 8-bit unsigned integer has a range of 0 to 255, while an 8-bit signed integer has a range of -128 to 127 - both representing 256 distinct numbers.

It is important to note that a computer memory location merely stores a binary pattern. It is entirely up to you, as the programmer, to decide on how these patterns are to be interpreted. For example, the 8-bit binary pattern "0100 0001B" can be interpreted as an unsigned integer 65, or an ASCII character 'A', or some secret information known only to you. In other words, you have to first decide how to represent a piece of data in a binary pattern before the binary patterns make sense. The interpretation of binary pattern is called data representation or encoding. Furthermore, it is important that the data representation schemes are agreed upon by all the parties, i.e., industrial standards need to be formulated and strictly followed.

Once you decide on the data representation scheme, certain constraints, in particular, the precision and range, will be imposed. Hence, it is important to understand data representation to write correct and high-performance programs.
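The point that one bit pattern supports many interpretations can be demonstrated in Java, where the cast selects the interpretation:

```java
public class Interpret {
    public static void main(String[] args) {
        int pattern = 0b0100_0001;           // the 8-bit pattern 0100 0001B
        System.out.println(pattern);         // interpreted as an unsigned integer: 65
        System.out.println((char) pattern);  // interpreted as a character: A
    }
}
```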

Rosetta Stone and the Decipherment of Egyptian Hieroglyphs

[Image: Rosetta Stone and Egyptian hieroglyphs]

Egyptian hieroglyphs (next-to-left) were used by the ancient Egyptians since 4000BC. Unfortunately, since 500AD, no one could longer read the ancient Egyptian hieroglyphs, until the re-discovery of the Rosetta Stone in 1799 by Napoleon's troops (during Napoleon's Egyptian invasion) near the town of Rashid (Rosetta) in the Nile Delta.

The Rosetta Stone (left) is inscribed with a decree in 196BC on behalf of King Ptolemy V. The decree appears in three scripts: the upper text is Ancient Egyptian hieroglyphs, the middle portion Demotic script, and the lowest Ancient Greek. Because it presents essentially the same text in all three scripts, and Ancient Greek could still be understood, it provided the key to the decipherment of the Egyptian hieroglyphs.

The moral of the story is unless you know the encoding scheme, there is no way that you can decode the data.

Reference and images: Wikipedia.

Integer Representation

Integers are whole numbers or fixed-point numbers with the radix point fixed after the least-significant bit. They are in contrast to real numbers or floating-point numbers, where the position of the radix point varies. It is important to take note that integers and floating-point numbers are treated differently in computers. They have different representation and are processed differently (e.g., floating-point numbers are processed in a so-called floating-point processor). Floating-point numbers will be discussed later.

Computers use a fixed number of bits to represent an integer. The commonly-used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation schemes for integers:

  1. Unsigned Integers: can represent zero and positive integers.
  2. Signed Integers: can represent zero, positive and negative integers. Three representation schemes had been proposed for signed integers:
    1. Sign-Magnitude representation
    2. 1's Complement representation
    3. 2's Complement representation

You, as the programmer, need to decide on the bit-length and representation scheme for your integers, depending on your application's requirements. Suppose that you need a counter for counting a small quantity from 0 up to 200, you might choose the 8-bit unsigned integer scheme as there is no negative numbers involved.

n-bit Unsigned Integers

Unsigned integers can represent zero and positive integers, but not negative integers. The value of an unsigned integer is interpreted as "the magnitude of its underlying binary pattern".

Example 1: Suppose that n=8 and the binary pattern is 0100 0001B, the value of this unsigned integer is 1×2^0 + 1×2^6 = 65D.

Example 2: Suppose that n=16 and the binary pattern is 0001 0000 0000 1000B, the value of this unsigned integer is 1×2^3 + 1×2^12 = 4104D.

Example 3: Suppose that n=16 and the binary pattern is 0000 0000 0000 0000B, the value of this unsigned integer is 0.

An n-bit pattern can represent 2^n distinct integers. An n-bit unsigned integer can represent integers from 0 to (2^n)-1, as tabulated below:

n Minimum Maximum
8 0 (2^8)-1  (=255)
16 0 (2^16)-1 (=65,535)
32 0 (2^32)-1 (=4,294,967,295) (9+ digits)
64 0 (2^64)-1 (=18,446,744,073,709,551,615) (19+ digits)
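Java has no unsigned primitive types, but the unsigned value of a bit pattern can be recovered with helper methods such as Byte.toUnsignedInt (available since JDK 8). A small sketch showing the two readings of the all-ones 8-bit pattern:

```java
public class UnsignedDemo {
    public static void main(String[] args) {
        byte b = (byte) 0b1111_1111;                // the all-ones 8-bit pattern
        System.out.println(b);                      // default signed interpretation: -1
        System.out.println(Byte.toUnsignedInt(b));  // unsigned interpretation: 255
    }
}
```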

Signed Integers

Signed integers can represent zero, positive integers, as well as negative integers. Three representation schemes are available for signed integers:

  1. Sign-Magnitude representation
  2. 1's Complement representation
  3. 2's Complement representation

In all the above three schemes, the most-significant bit (msb) is called the sign bit. The sign bit is used to represent the sign of the integer - with 0 for positive integers and 1 for negative integers. The magnitude of the integer, however, is interpreted differently in different schemes.

n-bit Signed Integers in Sign-Magnitude Representation

In sign-magnitude representation:

  • The most-significant bit (msb) is the sign bit, with value of 0 representing positive integer and 1 representing negative integer.
  • The remaining n-1 bits represents the magnitude (absolute value) of the integer. The absolute value of the integer is interpreted as "the magnitude of the (n-1)-bit binary pattern".

Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D

Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is 000 0001B = 1D
Hence, the integer is -1D

Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D

Example 4: Suppose that n=8 and the binary representation is 1 000 0000B.
Sign bit is 1 ⇒ negative
Absolute value is 000 0000B = 0D
Hence, the integer is -0D

[Image: sign-magnitude representation]

The drawbacks of sign-magnitude representation are:

  1. There are two representations (0000 0000B and 1000 0000B) for the number zero, which could lead to inefficiency and confusion.
  2. Positive and negative integers need to be processed separately.

n-bit Signed Integers in 1's Complement Representation

In 1's complement representation:

  • Again, the most significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
  • The remaining n-1 bits represents the magnitude of the integer, as follows:
    • for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern".
    • for negative integers, the absolute value of the integer is equal to "the magnitude of the complement (inverse) of the (n-1)-bit binary pattern" (hence called 1's complement).

Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D

Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 000 0001B, i.e., 111 1110B = 126D
Hence, the integer is -126D

Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D

Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 111 1111B, i.e., 000 0000B = 0D
Hence, the integer is -0D

[Image: 1's complement representation]

Again, the drawbacks are:

  1. There are two representations (0000 0000B and 1111 1111B) for zero.
  2. The positive integers and negative integers need to be processed separately.

n-bit Signed Integers in 2's Complement Representation

In 2's complement representation:

  • Again, the most significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
  • The remaining n-1 bits represents the magnitude of the integer, as follows:
    • for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern".
    • for negative integers, the absolute value of the integer is equal to "the magnitude of the complement of the (n-1)-bit binary pattern plus one" (hence called 2's complement).

Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D

Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 000 0001B plus 1, i.e., 111 1110B + 1B = 127D
Hence, the integer is -127D

Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D

Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 111 1111B plus 1, i.e., 000 0000B + 1B = 1D
Hence, the integer is -1D

[Image: 2's complement representation]

Computers use 2's Complement Representation for Signed Integers

We have discussed three representations for signed integers: sign-magnitude, 1's complement and 2's complement. Computers use 2's complement in representing signed integers. This is because:

  1. There is only one representation for the number zero in 2's complement, instead of two representations in sign-magnitude and 1's complement.
  2. Positive and negative integers can be treated together in addition and subtraction. Subtraction can be carried out using the "addition logic".

Example 1: Addition of Two Positive Integers: Suppose that n=8, 65D + 5D = 70D

65D →    0100 0001B
 5D →    0000 0101B(+)
         0100 0110B    → 70D (OK)

Example 2: Subtraction is treated as Addition of a Positive and a Negative Integer: Suppose that n=8, 65D - 5D = 65D + (-5D) = 60D

65D →    0100 0001B
-5D →    1111 1011B(+)
         0011 1100B    → 60D (discard carry - OK)

Example 3: Addition of Two Negative Integers: Suppose that n=8, -65D - 5D = (-65D) + (-5D) = -70D

-65D →    1011 1111B
 -5D →    1111 1011B(+)
          1011 1010B    → -70D (discard carry - OK)

Because of the fixed precision (i.e., fixed number of bits), an n-bit 2's complement signed integer has a certain range. For example, for n=8, the range of 2's complement signed integers is -128 to +127. During addition (and subtraction), it is important to check whether the result exceeds this range, in other words, whether overflow or underflow has occurred.

Example 4: Overflow: Suppose that n=8, 127D + 2D = 129D (overflow - beyond the range)

127D →    0111 1111B
  2D →    0000 0010B(+)
          1000 0001B    → -127D (wrong)

Example 5: Underflow: Suppose that n=8, -125D - 5D = -130D (underflow - below the range)

-125D →    1000 0011B
  -5D →    1111 1011B(+)
           0111 1110B    → +126D (wrong)
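The overflow and underflow cases above can be reproduced in Java, since its byte type is an 8-bit 2's complement integer that wraps around:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        byte a = 127, b = 2;
        // a + b is computed as int 129, then truncated to 8 bits
        System.out.println((byte) (a + b));   // prints -127, not 129: overflow

        byte c = -125, d = -5;
        System.out.println((byte) (c + d));   // prints 126, not -130: underflow
    }
}
```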

The following diagram explains how the 2's complement works. By re-arranging the number line, values from -128 to +127 are represented contiguously by ignoring the carry bit.

[Image: 2's complement signed integer number line]

Range of n-bit 2's Complement Signed Integers

An n-bit 2's complement signed integer can represent integers from -2^(n-1) to +2^(n-1)-1, as tabulated. Take note that the scheme can represent all the integers within the range, without any gap. In other words, there is no missing integer within the supported range.

n Minimum Maximum
8 -(2^7)  (=-128) +(2^7)-1  (=+127)
16 -(2^15) (=-32,768) +(2^15)-1 (=+32,767)
32 -(2^31) (=-2,147,483,648) +(2^31)-1 (=+2,147,483,647)(9+ digits)
64 -(2^63) (=-9,223,372,036,854,775,808) +(2^63)-1 (=+9,223,372,036,854,775,807)(18+ digits)

Decoding 2's Complement Numbers

  1. Check the sign bit (denoted as S).
  2. If S=0, the number is positive and its absolute value is the binary value of the remaining n-1 bits.
  3. If S=1, the number is negative. You could "invert the n-1 bits and plus 1" to get the absolute value of the negative number.
    Alternatively, you could scan the remaining n-1 bits from the right (least-significant bit). Look for the first occurrence of 1. Flip all the bits to the left of that first occurrence of 1. The flipped pattern gives the absolute value. For example,
    n = 8, bit pattern = 1 100 0100B
    S = 1 → negative
    Scanning from the right and flipping all the bits to the left of the first occurrence of 1 ⇒ 011 1100B = 60D
    Hence, the value is -60D
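The decoding can be checked in Java: casting to byte forces an 8-bit 2's complement interpretation of the pattern:

```java
public class Decode2sComplement {
    public static void main(String[] args) {
        // The 8-bit pattern 1100 0100B decoded as 2's complement
        byte v = (byte) 0b1100_0100;
        System.out.println(v);   // prints -60
    }
}
```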

Big Endian vs. Little Endian

Modern computers store one byte of data in each memory address or location, i.e., byte addressable memory. A 32-bit integer is, therefore, stored in 4 memory addresses.

The term "Endian" refers to the order of storing bytes in computer memory. In "Big Endian" scheme, the most significant byte is stored first in the lowest memory address (or big end first), while "Little Endian" stores the least significant bytes in the lowest memory address.

For example, the 32-bit integer 12345678H (305419896D) is stored as 12H 34H 56H 78H in big endian; and 78H 56H 34H 12H in little endian. A 16-bit integer 00H 01H is interpreted as 0001H in big endian, and 0100H in little endian.
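Both byte orders can be inspected in Java via java.nio.ByteBuffer, whose ByteOrder setting controls how a multi-byte integer is laid out:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        int value = 0x12345678;
        // Lay the same 32-bit integer out in both byte orders
        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
        System.out.printf("big-endian:    %02X %02X %02X %02X%n", big[0], big[1], big[2], big[3]);
        System.out.printf("little-endian: %02X %02X %02X %02X%n", little[0], little[1], little[2], little[3]);
        // prints 12 34 56 78 for big endian, 78 56 34 12 for little endian
    }
}
```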

Exercises (Integer Representation)

  1. What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integer, in "unsigned" and "signed" representation?
  2. Give the value of 88, 0, 1, 127, and 255 in 8-bit unsigned representation.
  3. Give the value of +88, -88, -1, 0, +1, -128, and +127 in 8-bit 2's complement signed representation.
  4. Give the value of +88, -88, -1, 0, +1, -127, and +127 in 8-bit sign-magnitude representation.
  5. Give the value of +88, -88, -1, 0, +1, -127 and +127 in 8-bit 1's complement representation.
  6. [TODO] more.
Answers
  1. The range of unsigned n-bit integers is [0, 2^n - 1]. The range of n-bit 2's complement signed integer is [-2^(n-1), +2^(n-1)-1];
  2. 88 (0101 1000), 0 (0000 0000), 1 (0000 0001), 127 (0111 1111), 255 (1111 1111).
  3. +88 (0101 1000), -88 (1010 1000), -1 (1111 1111), 0 (0000 0000), +1 (0000 0001), -128 (1000 0000), +127 (0111 1111).
  4. +88 (0101 1000), -88 (1101 1000), -1 (1000 0001), 0 (0000 0000 or 1000 0000), +1 (0000 0001), -127 (1111 1111), +127 (0111 1111).
  5. +88 (0101 1000), -88 (1010 0111), -1 (1111 1110), 0 (0000 0000 or 1111 1111), +1 (0000 0001), -127 (1000 0000), +127 (0111 1111).

Floating-Point Number Representation

A floating-point number (or real number) can represent a very large (1.23×10^88) or a very small (1.23×10^-88) value. It could also represent very large negative number (-1.23×10^88) and very small negative number (-1.23×10^-88), as well as zero, as illustrated:

[Image: representation of floating-point numbers]

A floating-point number is typically expressed in the scientific notation, with a fraction (F), and an exponent (E) of a certain radix (r), in the form of F×r^E. Decimal numbers use radix of 10 (F×10^E); while binary numbers use radix of 2 (F×2^E).

Representation of floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number 123.4567 can be normalized as 1.234567×10^2; binary number 1010.1011B can be normalized as 1.0101011B×2^3.

It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinite number of real numbers (even within a small range of, say, 0.0 to 0.1). On the other hand, an n-bit binary pattern can represent a finite 2^n distinct numbers. Hence, not all the real numbers can be represented. The nearest approximation will be used instead, resulting in loss of accuracy.

It is also important to note that floating-point arithmetic is much less efficient than integer arithmetic. It could be sped up with a so-called dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers.

In computers, floating-point numbers are represented in scientific notation of fraction (F) and exponent (E) with a radix of 2, in the form of F×2^E. Both E and F can be positive as well as negative. Modern computers adopt IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.

IEEE-754 32-bit Single-Precision Floating-Point Numbers

In 32-bit single-precision floating-point representation:

  • The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
  • The following 8 bits represent exponent (E).
  • The remaining 23 bits represents fraction (F).

[Image: single-precision float bit layout]

Normalized Form

Let's illustrate with an example, suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:

  • S = 1
  • E = 1000 0001
  • F = 011 0000 0000 0000 0000 0000

In the normalized form, the actual fraction is normalized with an implicit leading 1 in the form of 1.F. In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D.

The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative number. In this example with S=1, this is a negative number, i.e., -1.375D.

In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because we need to represent both positive and negative exponents. With an 8-bit E, ranging from 0 to 255, the excess-127 scheme could provide actual exponents of -127 to 128. In this example, E-127 = 129-127 = 2D.

Hence, the number represented is -1.375×2^2=-5.5D.
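This worked example can be verified in Java with Float.intBitsToFloat. The hex literal 0xC0B00000 below is simply the 32 bits above (1 10000001 011 followed by 20 zeros) regrouped into hex digits:

```java
public class FloatBits {
    public static void main(String[] args) {
        // Bit pattern 1 10000001 01100000000000000000000 = 0xC0B00000
        System.out.println(Float.intBitsToFloat(0xC0B00000));  // prints -5.5
    }
}
```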

De-Normalized Form

Normalized form has a serious problem, with an implicit leading 1 for the fraction, it cannot represent the number zero! Convince yourself on this!

De-normalized form was devised to represent zero and other numbers.

For E=0, the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for the fraction; and the actual exponent is always -126. Hence, the number zero can be represented with E=0 and F=0 (because 0.0×2^-126=0).

We can also represent very small positive and negative numbers in de-normalized form with E=0. For example, if S=1, E=0, and F=011 0000 0000 0000 0000 0000, the actual fraction is 0.011 = 1×2^-2 + 1×2^-3 = 0.375D. Since S=1, it is a negative number. With E=0, the actual exponent is -126. Hence the number is -0.375×2^-126 = -4.4×10^-39, which is an extremely small negative number (close to zero).

Summary

In summary, the value (N) is calculated as follows:

  • For 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127). These numbers are in the so-called normalized form. The sign-bit represents the sign of the number. Fractional part (1.F) are normalized with an implicit leading 1. The exponent is bias (or in excess) of 127, so as to represent both positive and negative exponents. The range of exponent is -126 to +127.
  • For E = 0, N = (-1)^S × 0.F × 2^(-126). These numbers are in the so-called denormalized form. The exponent of 2^-126 evaluates to a very small number. Denormalized form is needed to represent zero (with F=0 and E=0). It can also represent very small positive and negative numbers close to zero.
  • For E = 255, it represents special values, such as ±INF (positive and negative infinity) and NaN (not a number). This is beyond the scope of this article.

Example 1: Suppose that the IEEE-754 32-bit floating-point representation pattern is 0 10000000 110 0000 0000 0000 0000 0000.

Sign bit S = 0 ⇒ positive number
E = 1000 0000B = 128D (in normalized form)
Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75D
The number is +1.75 × 2^(128-127) = +3.5D

Example 2: Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 01111110 100 0000 0000 0000 0000 0000.

Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D
The number is -1.5 × 2^(126-127) = -0.75D

Example 3: Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 01111110 000 0000 0000 0000 0000 0001.

Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.000 0000 0000 0000 0000 0001B (with an implicit leading 1) = 1 + 2^-23
The number is -(1 + 2^-23) × 2^(126-127) = -0.500000059604644775390625 (may not be exact in decimal!)

Example 4 (De-Normalized Form): Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 00000000 000 0000 0000 0000 0000 0001.

Sign bit S = 1 ⇒ negative number
E = 0 (in de-normalized form)
Fraction is 0.000 0000 0000 0000 0000 0001B (with an implicit leading 0) = 1×2^-23
The number is -2^-23 × 2^(-126) = -2^(-149) ≈ -1.4×10^-45
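Example 4 is the smallest-magnitude denormalized float, which Java exposes as the constant Float.MIN_VALUE; a quick check with Float.intBitsToFloat:

```java
public class DenormBits {
    public static void main(String[] args) {
        // Pattern 1 00000000 00000000000000000000001 = 0x80000001
        float f = Float.intBitsToFloat(0x80000001);
        System.out.println(f);                      // prints -1.4E-45
        System.out.println(f == -Float.MIN_VALUE);  // prints true
    }
}
```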

Exercises (Floating-Point Numbers)

  1. Compute the largest and smallest positive numbers that can be represented in the 32-bit normalized form.
  2. Compute the largest and smallest negative numbers that can be represented in the 32-bit normalized form.
  3. Repeat (1) for the 32-bit denormalized form.
  4. Repeat (2) for the 32-bit denormalized form.
Hints:
  1. Largest positive number: S=0, E=1111 1110 (254), F=111 1111 1111 1111 1111 1111.
    Smallest positive number: S=0, E=0000 0001 (1), F=000 0000 0000 0000 0000 0000.
  2. Same as above, but S=1.
  3. Largest positive number: S=0, E=0, F=111 1111 1111 1111 1111 1111.
    Smallest positive number: S=0, E=0, F=000 0000 0000 0000 0000 0001.
  4. Same as above, but S=1.
Notes For Java Users

You can use the JDK methods Float.intBitsToFloat(int bits) or Double.longBitsToDouble(long bits) to create a single-precision 32-bit float or double-precision 64-bit double with the specific bit patterns, and print their values. For example,

System.out.println(Float.intBitsToFloat(0x7fffff));
System.out.println(Double.longBitsToDouble(0x1fffffffffffffL));

IEEE-754 64-bit Double-Precision Floating-Point Numbers

The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:

  • The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
  • The following 11 bits represent exponent (E).
  • The remaining 52 bits represents fraction (F).

[Image: double-precision float bit layout]

The value (N) is calculated as follows:

  • Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023).
  • Denormalized form: For E = 0, N = (-1)^S × 0.F × 2^(-1022). These are in the denormalized form.
  • For E = 2047, N represents special values, such as ±INF (infinity) and NaN (not a number).
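The three cases can be illustrated with Double.longBitsToDouble (a sketch; the hex patterns and class name are mine):

```java
public class DoubleForms {
    public static void main(String[] args) {
        // Normalized: S=0, E=1023, F=0  =>  1.0 × 2^0 = 1.0
        System.out.println(Double.longBitsToDouble(0x3FF0000000000000L));  // 1.0
        // Denormalized: S=0, E=0, F=0...01  =>  smallest positive double
        System.out.println(Double.longBitsToDouble(0x0000000000000001L) == Double.MIN_VALUE);
        // Special: E=2047 (all 1's), F=0  =>  +INF
        System.out.println(Double.longBitsToDouble(0x7FF0000000000000L));  // Infinity
    }
}
```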

More on Floating-Point Representation

There are three parts in the floating-point representation:

  • The sign bit (S) is self-explanatory (0 for positive numbers and 1 for negative numbers).
  • For the exponent (E), a so-called bias (or excess) is applied so as to represent both positive and negative exponents. The bias is set at half of the range. For single precision with an 8-bit exponent, the bias is 127 (or excess-127). For double precision with an 11-bit exponent, the bias is 1023 (or excess-1023).
  • The fraction (F) (also called the mantissa or significand) is composed of an implicit leading bit (before the radix point) and the fractional bits (after the radix point). The leading bit for normalized numbers is 1; the leading bit for denormalized numbers is 0.
Normalized Floating-Point Numbers

In normalized form, the radix point is placed after the first non-zero digit, e.g., 9.8765D×10^-23D, 1.001011B×2^11B. For binary numbers, the leading bit is always 1, and need not be represented explicitly - this saves 1 bit of storage.

In IEEE 754's normalized form:

  • For single-precision, 1 ≤ E ≤ 254 with an excess of 127. Hence, the actual exponent ranges from -126 to +127. Negative exponents are used to represent small numbers (< 1.0); positive exponents are used to represent large numbers (> 1.0).
    N = (-1)^S × 1.F × 2^(E-127)
  • For double-precision, 1 ≤ E ≤ 2046 with an excess of 1023. The actual exponent ranges from -1022 to +1023, and
    N = (-1)^S × 1.F × 2^(E-1023)
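The single-precision formula can be applied by hand in Java: extract S, E and F from the bit pattern and evaluate (-1)^S × 1.F × 2^(E-127). A sketch (the class and method names are mine; it handles normalized values only):

```java
public class DecodeFloat {
    // Decode a 32-bit pattern using N = (-1)^S × 1.F × 2^(E-127), normalized form only
    static double decodeNormalized(int bits) {
        int s = (bits >>> 31) & 1;       // 1-bit sign
        int e = (bits >>> 23) & 0xFF;    // 8-bit biased exponent
        int f = bits & 0x7FFFFF;         // 23-bit fraction
        double mantissa = 1.0 + f / (double) (1 << 23);  // implicit leading 1
        return (s == 0 ? 1 : -1) * mantissa * Math.pow(2, e - 127);
    }

    public static void main(String[] args) {
        // 12.375 = 1100.011B = 1.100011B × 2^3 is exactly representable
        int bits = Float.floatToIntBits(12.375f);
        System.out.println(decodeNormalized(bits));  // 12.375
    }
}
```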

Note that an n-bit pattern has a finite number of combinations (=2^n), which can represent only finitely many distinct numbers. It is not possible to represent the infinitely many numbers on the real axis (even a small range such as 0.0 to 1.0 contains infinitely many numbers). That is, not all floating-point numbers can be accurately represented. Instead, the closest approximation is used, which leads to loss of accuracy.

The minimum and maximum normalized floating-point numbers are:

Precision  Normalized N(min)  Normalized N(max)
Single
  N(min): 0080 0000H
  = 0 00000001 00000000000000000000000B
  E = 1, F = 0
  N(min) = 1.0B × 2^-126 (≈1.17549435 × 10^-38)
  N(max): 7F7F FFFFH
  = 0 11111110 11111111111111111111111B
  E = 254, F = all 1's
  N(max) = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127 (≈3.4028235 × 10^38)
Double
  N(min): 0010 0000 0000 0000H
  N(min) = 1.0B × 2^-1022 (≈2.2250738585072014 × 10^-308)
  N(max): 7FEF FFFF FFFF FFFFH
  N(max) = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023 (≈1.7976931348623157 × 10^308)
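The double-precision limits in the table can be checked against the JDK constants Double.MIN_NORMAL and Double.MAX_VALUE (a sketch; the class name is mine):

```java
public class DoubleMinMax {
    public static void main(String[] args) {
        // N(min) = 0010 0000 0000 0000H  (E = 1, F = 0)
        System.out.println(Double.longBitsToDouble(0x0010000000000000L) == Double.MIN_NORMAL);
        // N(max) = 7FEF FFFF FFFF FFFFH  (E = 2046, F = all 1's)
        System.out.println(Double.longBitsToDouble(0x7FEFFFFFFFFFFFFFL) == Double.MAX_VALUE);
        System.out.println(Double.MIN_NORMAL);  // approximately 2.225E-308
        System.out.println(Double.MAX_VALUE);   // approximately 1.798E308
    }
}
```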

[Figure: ranges of representable floating-point numbers on the real axis]

Denormalized Floating-Point Numbers

If E = 0, but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:

  • For single-precision, E = 0,
    N = (-1)^S × 0.F × 2^(-126)
  • For double-precision, E = 0,
    N = (-1)^S × 0.F × 2^(-1022)

Denormalized form can represent very small numbers close to zero, and zero itself, which cannot be represented in normalized form, as shown in the above figure.

The minimum and maximum denormalized floating-point numbers are:

Precision  Denormalized D(min)  Denormalized D(max)
Single
  D(min): 0000 0001H
  = 0 00000000 00000000000000000000001B
  E = 0, F = 00000000000000000000001B
  D(min) = 0.0...1B × 2^-126 = 1 × 2^-23 × 2^-126 = 2^-149 (≈1.4 × 10^-45)
  D(max): 007F FFFFH
  = 0 00000000 11111111111111111111111B
  E = 0, F = 11111111111111111111111B
  D(max) = 0.1...1B × 2^-126 = (1 - 2^-23) × 2^-126 (≈1.1754942 × 10^-38)
Double
  D(min): 0000 0000 0000 0001H
  D(min) = 0.0...1B × 2^-1022 = 1 × 2^-52 × 2^-1022 = 2^-1074 (≈4.9 × 10^-324)
  D(max): 000F FFFF FFFF FFFFH
  D(max) = 0.1...1B × 2^-1022 = (1 - 2^-52) × 2^-1022 (≈2.225073858507201 × 10^-308)
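The denormalized limits for double-precision can be checked in the same way (a sketch; the class name is mine):

```java
public class DenormLimits {
    public static void main(String[] args) {
        // D(min) = 0000 0000 0000 0001H = 2^-1074, the smallest positive double
        System.out.println(Double.longBitsToDouble(1L) == Double.MIN_VALUE);  // true
        System.out.println(Double.MIN_VALUE);  // approximately 4.9E-324
        // D(max): E = 0, F = all 1's — just below the smallest normalized double
        double dmax = Double.longBitsToDouble(0x000FFFFFFFFFFFFFL);
        System.out.println(dmax > 0 && dmax < Double.MIN_NORMAL);  // true
    }
}
```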
Special Values

Zero: Zero cannot be represented in the normalized form, and must be represented in denormalized form with E=0 and F=0. There are two representations for zero: +0 with S=0 and -0 with S=1.

Infinity: The values +infinity (e.g., 1/0) and -infinity (e.g., -1/0) are represented with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision), F=0, and S=0 (for +INF) or S=1 (for -INF).

Not a Number (NaN): NaN denotes a value that cannot be represented as a real number (e.g. 0/0). NaN is represented with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision) and any non-zero fraction.
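The special values arise naturally from floating-point arithmetic in Java, and the bit patterns above can be confirmed with Float.intBitsToFloat (a sketch; the class name is mine):

```java
public class SpecialValues {
    public static void main(String[] args) {
        System.out.println(1.0f / 0.0f);   // Infinity
        System.out.println(-1.0f / 0.0f);  // -Infinity
        System.out.println(0.0f / 0.0f);   // NaN
        // E = 255, F = 0 is ±INF; E = 255 with non-zero F is NaN
        System.out.println(Float.intBitsToFloat(0x7F800000));               // Infinity
        System.out.println(Float.isNaN(Float.intBitsToFloat(0x7FC00000)));  // true
        // +0 and -0 have different bit patterns but compare equal
        System.out.println(0.0f == -0.0f);  // true
    }
}
```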

Character Encoding

In computer memory, characters are "encoded" (or "represented") using a chosen "character encoding scheme" (aka "character set", "charset", "character map", or "code page").

For example, in ASCII (as well as Latin1, Unicode, and many other character sets):

  • code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z', respectively.
  • code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z', respectively.
  • code numbers 48D (30H) to 57D (39H) represent '0' to '9', respectively.

It is important to note that the representation scheme must be known before a binary pattern can be interpreted. For example, the 8-bit pattern "0100 0010B" could represent anything under the sun, known only to the person who encoded it.

The most commonly-used character encoding schemes are: 7-bit ASCII (ISO/IEC 646) and 8-bit Latin-x (ISO/IEC 8859-x) for western European characters, and Unicode (ISO/IEC 10646) for internationalization (i18n).

A 7-bit encoding scheme (such as ASCII) can represent 128 characters and symbols. An 8-bit character encoding scheme (such as Latin-x) can represent 256 characters and symbols; whereas a 16-bit encoding scheme (such as Unicode UCS-2) can represent 65,536 characters and symbols.

7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)

  • ASCII (American Standard Code for Information Interchange) is one of the earlier character encoding schemes.
  • ASCII is originally a 7-bit code. It has been extended to 8-bit to better utilize the 8-bit computer memory organization. (The 8th bit was originally used for parity check in the early computers.)
  • Code numbers 32D (20H) to 126D (7EH) are printable (displayable) characters, tabulated (arranged in hexadecimal and decimal) as follows:
    Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
    2 SP ! " # $ % & ' ( ) * + , - . /
    3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
    4 @ A B C D E F G H I J K L M N O
    5 P Q R S T U V W X Y Z [ \ ] ^ _
    6 ` a b c d e f g h i j k l m n o
    7 p q r s t u v w x y z { | } ~

    Dec  0  1  2  3  4  5  6  7  8  9
    3          SP !  "  #  $  %  &  '
    4    (  )  *  +  ,  -  .  /  0  1
    5    2  3  4  5  6  7  8  9  :  ;
    6    <  =  >  ?  @  A  B  C  D  E
    7    F  G  H  I  J  K  L  M  N  O
    8    P  Q  R  S  T  U  V  W  X  Y
    9    Z  [  \  ]  ^  _  `  a  b  c
    10   d  e  f  g  h  i  j  k  l  m
    11   n  o  p  q  r  s  t  u  v  w
    12   x  y  z  {  |  }  ~
    • Code number 32D (20H) is the blank or space character.
    • '0' to '9': 30H-39H (0011 0000B to 0011 1001B), or (0011 xxxxB where xxxx is the equivalent integer value).
    • 'A' to 'Z': 41H-5AH (0100 0001B to 0101 1010B), or (010x xxxxB). 'A' to 'Z' are continuous without gap.
    • 'a' to 'z': 61H-7AH (0110 0001B to 0111 1010B), or (011x xxxxB). 'a' to 'z' are also continuous without gap. However, there is a gap between the uppercase and lowercase letters. To convert between uppercase and lowercase, flip the value of bit-5.
  • Code numbers 0D (00H) to 31D (1FH), and 127D (7FH), are special control characters, which are non-printable (non-displayable), as tabulated below. Many of these characters were used in the early days for transmission control (e.g., STX, ETX) and printer control (e.g., Form-Feed), and are now obsolete. The remaining meaningful codes today are:
    • 09H for Tab ('\t').
    • 0AH for Line-Feed or newline (LF or '\n') and 0DH for Carriage-Return (CR or '\r'), which are used as line delimiters (aka line separators, end-of-line) for text files. There is unfortunately no standard for the line delimiter: Unixes and Mac use 0AH (LF or "\n"); Windows uses 0D0AH (CR+LF or "\r\n"). Programming languages such as C/C++/Java (which were created on Unix) use 0AH (LF or "\n").
    • In programming languages such as C/C++/Java, line-feed (0AH) is denoted as '\n', carriage-return (0DH) as '\r', and tab (09H) as '\t'.
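The bit-5 trick and the 0011 xxxxB layout of the digits can be exercised directly in Java, since char values are just code numbers (a sketch; the class name is mine):

```java
public class AsciiTricks {
    public static void main(String[] args) {
        // Flip bit-5 (20H) to convert between uppercase and lowercase
        System.out.println((char) ('a' ^ 0x20));  // A
        System.out.println((char) ('Z' ^ 0x20));  // z
        // '0' to '9' are 0011 xxxxB: the low nibble is the digit's value
        System.out.println('7' & 0x0F);  // 7
        System.out.println('7' - '0');   // 7
    }
}
```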
DEC HEX Meaning                  DEC HEX Meaning
0   00  NUL Null                 17  11  DC1 Device Control 1
1   01  SOH Start of Heading     18  12  DC2 Device Control 2
2   02  STX Start of Text        19  13  DC3 Device Control 3
3   03  ETX End of Text          20  14  DC4 Device Control 4
4   04  EOT End of Transmission  21  15  NAK Negative Ack.
5   05  ENQ Enquiry              22  16  SYN Sync. Idle
6   06  ACK Acknowledgment       23  17  ETB End of Transmission Block
7   07  BEL Bell                 24  18  CAN Cancel
8   08  BS Back Space '\b'       25  19  EM End of Medium
9   09  HT Horizontal Tab '\t'   26  1A  SUB Substitute
10  0A  LF Line Feed '\n'        27  1B  ESC Escape
11  0B  VT Vertical Feed         28  1C  IS4 File Separator
12  0C  FF Form Feed '\f'        29  1D  IS3 Group Separator
13  0D  CR Carriage Return '\r'  30  1E  IS2 Record Separator
14  0E  SO Shift Out             31  1F  IS1 Unit Separator
15  0F  SI Shift In
16  10  DLE Datalink Escape      127 7F  DEL Delete

8-bit Latin-1 (aka ISO/IEC 8859-1)

ISO/IEC-8859 is a collection of 8-bit character encoding standards for the western languages.

ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used encoding scheme for western European languages. It has 191 printable characters from the Latin script, which covers languages such as English, German and Spanish, among other western European languages. Latin-1 is backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)) are the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned. Code numbers 160 (A0H) to 255 (FFH) are assigned as follows:

Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

ISO/IEC-8859 has 16 parts. Besides the most commonly-used Part 1, Part 2 is meant for Central European (Polish, Czech, Hungarian, etc.), Part 3 for South European (Turkish, etc.), Part 4 for North European (Estonian, Latvian, etc.), Part 5 for Cyrillic, Part 6 for Arabic, Part 7 for Greek, Part 8 for Hebrew, Part 9 for Turkish, Part 10 for Nordic, Part 11 for Thai, Part 12 was abandoned, Part 13 for Baltic Rim, Part 14 for Celtic, Part 15 for French, Finnish, etc., and Part 16 for South-Eastern European.

Other 8-bit Extensions of US-ASCII (ASCII Extensions)

Beside the standardized ISO-8859-x, there are many other 8-bit ASCII extensions, which are not compatible with each other.

ANSI (American National Standards Institute) (aka Windows-1252, or Windows Codepage 1252): for Latin alphabets used in the legacy DOS/Windows systems. It is a superset of ISO-8859-1, with code numbers 128 (80H) to 159 (9FH) assigned to displayable characters, such as "smart" single-quotes and double-quotes. A common problem in web browsers is that all the quotes and apostrophes (produced by "smart quotes" in some Microsoft software) are replaced with question marks or some strange symbols. This is because the document is labeled as ISO-8859-1 (instead of Windows-1252), where these code numbers are undefined. Most modern browsers and e-mail clients treat charset ISO-8859-1 as Windows-1252 in order to accommodate such mis-labeling.

Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
8 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž
9 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ

EBCDIC (Extended Binary Coded Decimal Interchange Code): Used in the early IBM computers.

Unicode (aka ISO/IEC 10646 Universal Character Set)

Before Unicode, no single character encoding scheme could represent characters in all languages. For example, western European languages use various encoding schemes (in the ISO-8859-x family). Even a single language like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many encoding schemes conflict with each other, i.e., the same code number is assigned to different characters.

Unicode aims to provide a standard character encoding scheme, which is universal, efficient, uniform and unambiguous. The Unicode standard is maintained by a non-profit organization called the Unicode Consortium (@ www.unicode.org). Unicode is an ISO/IEC standard 10646.

Unicode is backward compatible with the 7-bit US-ASCII and 8-bit Latin-1 (ISO-8859-1). That is, the first 128 characters are the same as US-ASCII; and the first 256 characters are the same as Latin-1.

Unicode originally used 16 bits (called UCS-2 or Unicode Character Set - 2 byte), which can represent up to 65,536 characters. It has since been expanded beyond 16 bits, and currently stands at 21 bits. The range of the legal codes in ISO/IEC 10646 is now from U+0000H to U+10FFFFH (21 bits, or about 2 million characters), covering all current and ancient historical scripts. The original 16-bit range of U+0000H to U+FFFFH (65536 characters) is known as the Basic Multilingual Plane (BMP), covering all the major languages in use currently. The characters outside BMP are called supplementary characters, which are not frequently used.

Unicode has two encoding schemes:

  • UCS-2 (Universal Character Set - 2 Byte): Uses 2 bytes (16 bits), covering 65,536 characters in the BMP. BMP is sufficient for most applications. UCS-2 is now obsolete.
  • UCS-4 (Universal Character Set - 4 Byte): Uses 4 bytes (32 bits), covering BMP and the supplementary characters.


UTF-8 (Unicode Transformation Format - 8-bit)

The 16/32-bit Unicode (UCS-2/4) is grossly inefficient if the document contains mainly ASCII characters, because each character occupies two bytes of storage. Variable-length encoding schemes, such as UTF-8, which uses 1-4 bytes to represent a character, were devised to improve the efficiency. In UTF-8, the 128 commonly-used US-ASCII characters use only 1 byte, but less-common characters may require up to 4 bytes. Overall, the efficiency improves for documents containing mainly US-ASCII text.

The transformation between Unicode and UTF-8 is as follows:

Bits Unicode UTF-8 Code Bytes
7 00000000 0xxxxxxx 0xxxxxxx 1 (ASCII)
11 00000yyy yyxxxxxx 110yyyyy 10xxxxxx 2
16 zzzzyyyy yyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx 3
21 000uuuuu zzzzyyyy yyxxxxxx 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx 4

In UTF-8, Unicode numbers corresponding to the 7-bit ASCII characters are padded with a leading zero; hence they have the same values as ASCII. Thus, UTF-8 can be used with all software using ASCII. Unicode numbers of 128 and above, which are less frequently used, are encoded using more bytes (2-4 bytes). UTF-8 generally requires less storage and is compatible with ASCII. The drawback of UTF-8 is that more processing power is needed to unpack the code due to its variable length. UTF-8 is the most popular format for Unicode.

Notes:

  • UTF-8 uses 1-3 bytes for the characters in BMP (16-bit), and 4 bytes for supplementary characters outside BMP (21-bit).
  • The 128 ASCII characters (basic Latin letters, digits, and punctuation signs) use one byte. Most European and Middle-Eastern characters use a 2-byte sequence, which includes extended Latin letters (with tilde, macron, acute, grave and other accents), Greek, Armenian, Hebrew, Arabic, and others. Chinese, Japanese and Korean (CJK) characters use three-byte sequences.
  • All the bytes, except the 128 ASCII characters, have a leading '1' bit. In other words, the ASCII bytes, with a leading '0' bit, can be identified and decoded easily.

Example: 您好 (Unicode: 60A8H 597DH)

Unicode (UCS-2) is 60A8H = 0110 000010 101000B ⇒ UTF-8 is 11100110 10000010 10101000B = E6 82 A8H
Unicode (UCS-2) is 597DH = 0101 100101 111101B ⇒ UTF-8 is 11100101 10100101 10111101B = E5 A5 BDH
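This transformation can be checked with String.getBytes — the JDK performs exactly this UTF-8 encoding (a sketch; the class name is mine; 您 is written as \u60A8 to keep the source encoding-independent):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        // U+60A8 (您) encoded in UTF-8
        byte[] bytes = "\u60A8".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("%02X ", b);  // E6 82 A8
        }
        System.out.println();
    }
}
```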

UTF-16 (Unicode Transformation Format - 16-bit)

UTF-16 is a variable-length Unicode character encoding scheme, which uses 2 or 4 bytes. UTF-16 is not commonly used. The transformation table is as follows:

Unicode UTF-16 Code Bytes
xxxxxxxx xxxxxxxx Same as UCS-2 - no encoding 2
000uuuuu zzzzyyyy yyxxxxxx
(uuuuu≠0)
110110ww wwzzzzyy 110111yy yyxxxxxx
(wwww = uuuuu - 1)
4

Note that for the 65536 characters in BMP, UTF-16 is the same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary characters outside the BMP.

For BMP characters, UTF-16 is the same as UCS-2. For supplementary characters, each character requires a pair of 16-bit values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
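Java's Character class performs this surrogate-pair transformation. For example, U+1D11E (a musical symbol outside the BMP) becomes the pair D834H DD1EH (a sketch; the class name is mine):

```java
public class SurrogatePair {
    public static void main(String[] args) {
        // U+1D11E lies outside the BMP, so it needs a surrogate pair
        char[] pair = Character.toChars(0x1D11E);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D834 DD1E
        // And back to the code point:
        System.out.printf("%X%n", Character.toCodePoint(pair[0], pair[1]));  // 1D11E
    }
}
```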

UTF-32 (Unicode Transformation Format - 32-bit)

Same as UCS-4, which uses 4 bytes for each character - unencoded.

Formats of Multi-Byte (e.g., Unicode) Text Files

Endianness (or byte-order): For a multi-byte character, you need to take care of the order of the bytes in storage. In big endian, the most significant byte is stored at the memory location with the lowest address (big byte first). In little endian, the most significant byte is stored at the memory location with the highest address (little byte first). For example, 您 (with Unicode number 60A8H) is stored as 60 A8 in big endian, and stored as A8 60 in little endian. Big endian, which produces a more readable hex dump, is more commonly used, and is often the default.

BOM (Byte Order Mark): BOM is a special Unicode character with code number FEFFH, which is used to differentiate big-endian and little-endian. For big-endian, BOM appears as FE FFH in the storage. For little-endian, BOM appears as FF FEH. Unicode reserves these two code numbers to prevent them from clashing with other characters.

Unicode text files can take on these formats:

  • Big Endian: UCS-2BE, UTF-16BE, UTF-32BE.
  • Little Endian: UCS-2LE, UTF-16LE, UTF-32LE.
  • UTF-16 with BOM. The first character of the file is a BOM character, which specifies the endianness. For big-endian, BOM appears as FE FFH in the storage. For little-endian, BOM appears as FF FEH.

A UTF-8 file is always stored in the same byte order; BOM plays no part. However, in some systems (in particular Windows), a BOM is added as the first character in the UTF-8 file as a signature to identify the file as UTF-8 encoded. The BOM character (FEFFH) is encoded in UTF-8 as EF BB BF. Adding a BOM as the first character of the file is not recommended, as it may be incorrectly interpreted in other systems. You can have a UTF-8 file without BOM.
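The BOM behavior is visible in Java's standard charsets: the "UTF-16" charset writes a big-endian BOM on encoding, while "UTF-16BE" and "UTF-16LE" write no BOM (a sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class BomCheck {
    public static void main(String[] args) {
        // Java's "UTF-16" charset writes a big-endian BOM (FE FF) first
        byte[] bytes = "A".getBytes(StandardCharsets.UTF_16);
        for (byte b : bytes) {
            System.out.printf("%02X ", b);  // FE FF 00 41
        }
        System.out.println();
        // "UTF-16BE" writes no BOM: just the 2-byte character
        System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length);  // 2
    }
}
```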

Formats of Text Files

Line Delimiter or End-Of-Line (EOL): Sometimes, when you use the Windows NotePad to open a text file created in Unix or Mac, all the lines are joined together. This is because different operating platforms use different characters as the so-called line delimiter (or end-of-line, EOL). Two non-printable control characters are involved: 0AH (Line-Feed or LF) and 0DH (Carriage-Return or CR).

  • Windows/DOS uses 0D0AH (CR+LF or "\r\n") as EOL.
  • Unix and Mac use 0AH (LF or "\n") only.

End-of-File (EOF): [TODO]

Windows' CMD Codepage

The character encoding scheme (charset) in Windows is called codepage. In a CMD shell, you can issue the command "chcp" to display the current codepage, or "chcp codepage-number" to change the codepage.

Observe that:

  • The default codepage 437 (used in the original DOS) is an 8-bit character set called Extended ASCII, which is different from Latin-1 for code numbers above 127.
  • Codepage 1252 (Windows-1252) is not exactly the same as Latin-1. It assigns code numbers 80H to 9FH to letters and punctuation, such as smart single-quotes and double-quotes. A common problem in web browsers that show quotes and apostrophes as question marks or boxes is that the page is supposed to be Windows-1252, but is mislabeled as ISO-8859-1.
  • For internationalization and Chinese character sets: codepage 65001 for UTF-8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE, codepage 936 for Chinese characters in GB2312, codepage 950 for Chinese characters in Big5.

Chinese Character Sets

Unicode supports all languages, including Asian languages like Chinese (both simplified and traditional characters), Japanese and Korean (together called CJK). There are more than 20,000 CJK characters in Unicode. Unicode characters are often encoded in the UTF-8 scheme, which, unfortunately, requires 3 bytes for each CJK character, instead of 2 bytes in the unencoded UCS-2 (UTF-16).

Worse still, there are also various Chinese character sets, which are not compatible with Unicode:

  • GB2312/GBK: for simplified Chinese characters. GB2312 uses 2 bytes for each Chinese character. The most significant bit (MSB) of both bytes is set to 1, to co-exist with 7-bit ASCII, whose MSB is 0. There are about 6700 characters. GBK is an extension of GB2312, which includes more characters as well as traditional Chinese characters.
  • BIG5: for traditional Chinese characters. BIG5 also uses 2 bytes for each Chinese character. The most significant bit of both bytes is also set to 1. BIG5 is not compatible with GBK, i.e., the same code number is assigned to different characters.

For example, the world is made more interesting with these many standards:

Standard             Characters  Codes
Simplified   GB2312  和谐        BACD D0B3
             UCS-2   和谐        548C 8C10
             UTF-8   和谐        E5928C E8B090
Traditional  BIG5    和諧        A94D BFD3
             UCS-2   和諧        548C 8AE7
             UTF-8   和諧        E5928C E8ABA7

Notes for Windows' CMD Users: To display the Chinese characters correctly in a CMD shell, you need to choose the correct codepage, e.g., 65001 for UTF-8, 936 for GB2312/GBK, 950 for Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, 437 for the original DOS. You can use the command "chcp" to display the current codepage and the command "chcp codepage_number" to change the codepage. You also have to choose a font that can display the characters (e.g., Courier New, Consolas or Lucida Console, NOT Raster font).

Collating Sequences (for Ranking Characters)

A string consists of a sequence of characters in upper or lower case, e.g., "apple", "BOY", "Cat". In sorting or comparing strings, if we order the characters according to the underlying code numbers (e.g., US-ASCII) character-by-character, the order for the example would be "BOY", "apple", "Cat", because uppercase letters have smaller code numbers than lowercase letters. This does not agree with the so-called dictionary order, where the same uppercase and lowercase letters have the same rank. Another common problem in ordering strings is that "10" (ten) is sometimes ordered in front of "1" to "9".

Hence, in sorting or comparing strings, a so-called collating sequence (or collation) is often defined, which specifies the ranks for letters (uppercase, lowercase), numbers, and special symbols. There are many collating sequences available. It is entirely up to you to choose a collating sequence that meets your application's specific requirements. Some case-insensitive dictionary-order collating sequences have the same rank for the same uppercase and lowercase letters, i.e., 'A', 'a' ⇒ 'B', 'b' ⇒ ... ⇒ 'Z', 'z'. Some case-sensitive dictionary-order collating sequences put each uppercase letter before its lowercase counterpart, i.e., 'A' ⇒ 'B' ⇒ 'C' ⇒ ... ⇒ 'a' ⇒ 'b' ⇒ 'c' ⇒ .... Typically, space is ranked before digits '0' to '9', followed by the alphabets.

Collating sequence is often language dependent, as different languages use disparate sets of characters (e.g., á, é, a, α) with their own orders.
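Java exposes language-dependent collation through java.text.Collator. The sketch below (class name mine) contrasts the code-number order with an English-locale dictionary order, using the "apple"/"BOY"/"Cat" example above:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        String[] words = {"apple", "BOY", "Cat"};

        // Plain code-number order ranks all uppercase letters before lowercase
        String[] byCode = words.clone();
        Arrays.sort(byCode);
        System.out.println(Arrays.toString(byCode));      // [BOY, Cat, apple]

        // A locale-aware collator gives dictionary order
        String[] byCollator = words.clone();
        Arrays.sort(byCollator, Collator.getInstance(Locale.ENGLISH));
        System.out.println(Arrays.toString(byCollator));  // [apple, BOY, Cat]
    }
}
```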

For Java Programmers - java.nio.Charset

JDK 1.4 introduced a new java.nio.charset package to support encoding/decoding of characters between the UCS-2 used internally in Java programs and any supported charset used by external devices.

Example: The following program encodes some Unicode text in various encoding schemes, and displays the hex codes of the encoded byte sequences.

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class TestCharsetEncodeDecode {
   public static void main(String[] args) {
      String[] charsetNames = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16",
                               "UTF-16BE", "UTF-16LE", "GBK", "BIG5"};
      String message = "Hi,您好!";   // message with ASCII and Chinese characters

      // Print the UCS-2 code of each character
      System.out.printf("%10s: ", "UCS-2");
      for (int i = 0; i < message.length(); i++) {
         System.out.printf("%04X ", (int)message.charAt(i));
      }
      System.out.println();

      // Encode the message in each charset and print the byte sequence in hex
      for (String charsetName : charsetNames) {
         Charset charset = Charset.forName(charsetName);
         System.out.printf("%10s: ", charset.name());
         ByteBuffer bb = charset.encode(message);
         while (bb.hasRemaining()) {
            System.out.printf("%02X ", bb.get());
         }
         System.out.println();
      }
   }
}
     UCS-2: 0048 0069 002C 60A8 597D 0021
  US-ASCII: 48 69 2C 3F 3F 21
ISO-8859-1: 48 69 2C 3F 3F 21
     UTF-8: 48 69 2C E6 82 A8 E5 A5 BD 21
    UTF-16: FE FF 00 48 00 69 00 2C 60 A8 59 7D 00 21
  UTF-16BE: 00 48 00 69 00 2C 60 A8 59 7D 00 21
  UTF-16LE: 48 00 69 00 2C 00 A8 60 7D 59 21 00
       GBK: 48 69 2C C4 FA BA C3 21
      Big5: 48 69 2C B1 7A A6 6E 21

For Java Programmers - char and String

The char data type is based on the original 16-bit Unicode standard called UCS-2. Unicode has since evolved to 21 bits, with a code range of U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is called the Basic Multilingual Plane (BMP). Characters above U+FFFF are called supplementary characters. A 16-bit Java char cannot hold a supplementary character.

Recall that in the UTF-16 encoding scheme, a BMP character uses 2 bytes; it is the same as UCS-2. A supplementary character uses 4 bytes, and requires a pair of 16-bit values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

In Java, a String is a sequence of Unicode characters. Java, in fact, uses UTF-16 for String and StringBuffer. For BMP characters, they are the same as UCS-2. For supplementary characters, each character requires a pair of char values.

Java methods that accept a 16-bit char value do not support supplementary characters. Methods that accept a 32-bit int value support all Unicode characters (in the lower 21 bits), including supplementary characters.

This is meant to be an academic discussion. I have yet to encounter the use of supplementary characters!

Displaying Hex Values & Hex Editors

At times, you may need to display the hex values of a file, especially when dealing with Unicode characters. A Hex Editor is a handy tool that a good programmer should have in his/her toolbox. There are many freeware/shareware Hex Editors available. Try googling "Hex Editor".

I used the following:

  • NotePad++ with Hex Editor Plug-in: Open-source and free. You can toggle between Hex view and Normal view by pushing the "H" button.
  • PSPad: Freeware. You can toggle to Hex view by choosing the "View" menu and selecting "Hex Edit Mode".
  • TextPad: Shareware without expiration period. To view the hex values, you need to "open" the file by choosing the file format of "binary" (??).
  • UltraEdit: Shareware, not free, 30-day trial only.

Let me know if you have a better choice, which is fast to launch, easy to use, can toggle between Hex and normal view, clear, ....

The following Java program can be used to display the hex codes for Java primitives (integer, character and floating-point):

public class PrintHexCode {
   public static void main(String[] args) {
      int i = 12345;
      System.out.println("Decimal is " + i);
      System.out.println("Hex is " + Integer.toHexString(i));
      System.out.println("Binary is " + Integer.toBinaryString(i));
      System.out.println("Octal is " + Integer.toOctalString(i));
      System.out.printf("Hex is %x\n", i);
      System.out.printf("Octal is %o\n", i);

      char c = 'a';
      System.out.println("Character is " + c);
      System.out.printf("Character is %c\n", c);
      System.out.printf("Hex is %x\n", (short)c);
      System.out.printf("Decimal is %d\n", (short)c);

      float f = 3.5f;
      System.out.println("Decimal is " + f);
      System.out.println(Float.toHexString(f));

      f = -0.75f;
      System.out.println("Decimal is " + f);
      System.out.println(Float.toHexString(f));

      double d = 11.22;
      System.out.println("Decimal is " + d);
      System.out.println(Double.toHexString(d));
   }
}

In Eclipse, you can view the hex code of integer primitive Java variables in debug mode as follows: In debug perspective, "Variable" panel ⇒ Select the "menu" (inverted triangle) ⇒ Java ⇒ Java Preferences... ⇒ Primitive Display Options ⇒ Check "Display hexadecimal values (byte, short, char, int, long)".

Summary - Why Bother about Data Representation?

Integer number 1, floating-point number 1.0, character symbol '1', and string "1" are all different inside the computer memory. You need to know the difference to write good and high-performance programs.

  • In 8-bit signed integer, integer number 1 is represented as 00000001B.
  • In 8-bit unsigned integer, integer number 1 is represented as 00000001B.
  • In 16-bit signed integer, integer number 1 is represented as 00000000 00000001B.
  • In 32-bit signed integer, integer number 1 is represented as 00000000 00000000 00000000 00000001B.
  • In 32-bit floating-point representation, number 1.0 is represented as 0 01111111 0000000 00000000 00000000B, i.e., S=0, E=127, F=0.
  • In 64-bit floating-point representation, number 1.0 is represented as 0 01111111111 0000 00000000 00000000 00000000 00000000 00000000 00000000B, i.e., S=0, E=1023, F=0.
  • In 8-bit Latin-1, the character symbol '1' is represented as 00110001B (or 31H).
  • In 16-bit UCS-2, the character symbol '1' is represented as 00000000 00110001B.
  • In UTF-8, the character symbol '1' is represented as 00110001B.

If you "add" a 16-bit signed integer 1 and a Latin-1 character '1' or a string "1", you could get a surprise.
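The "surprise" is easy to reproduce in Java, where '1' is the code number 49 and "1" + 1 concatenates (a sketch; the class name is mine):

```java
public class OnePlusOne {
    public static void main(String[] args) {
        System.out.println(1 + 1);             // 2   (integer addition)
        System.out.println('1' + 1);           // 50  (char '1' is code 49, promoted to int)
        System.out.println((char) ('1' + 1));  // 2   (code 50 is the character '2')
        System.out.println("1" + 1);           // 11  (string concatenation)
    }
}
```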

Exercises (Data Representation)

For the following 16-bit codes:

0000 0000 0010 1010; 1000 0000 0010 1010;

Give their values, if they are representing:

  1. a 16-bit unsigned integer;
  2. a 16-bit signed integer;
  3. two 8-bit unsigned integers;
  4. two 8-bit signed integers;
  5. a 16-bit Unicode character;
  6. two 8-bit ISO-8859-1 characters.

ANS: (1) 42, 32810; (2) 42, -32726; (3) 0, 42; 128, 42; (4) 0, 42; -128, 42; (5) '*'; '耪'; (6) NUL, '*'; PAD, '*'.

REFERENCES & RESOURCES

  1. (Floating-Point Number Specification) IEEE 754 (1985), "IEEE Standard for Binary Floating-Point Arithmetic".
  2. (ASCII Specification) ISO/IEC 646 (1991) (or ITU-T T.50-1992), "Information technology - 7-bit coded character set for information interchange".
  3. (Latin-1 Specification) ISO/IEC 8859-1, "Information technology - 8-bit single-byte coded graphic character sets - Part 1: Latin alphabet No. 1".
  4. (Unicode Specification) ISO/IEC 10646, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)".
  5. Unicode Consortium @ http://www.unicode.org.


Source: https://www3.ntu.edu.sg/home/ehchua/programming/java/datarepresentation.html
