Character encoding

A character encoding is a code that pairs each character in a set of natural language characters (such as an alphabet or syllabary) with something else, such as a number or a sequence of electrical pulses. Common examples include Morse code, which encodes letters of the Roman alphabet as series of long and short depressions of a telegraph key, and ASCII, which encodes letters, numerals, and other symbols as integers and as 7-bit binary representations of those integers.

In some contexts (especially computer storage and communication) it makes sense to distinguish a character repertoire, which is the full set of abstract characters that a system supports, from a coded character set or character encoding, which specifies how to represent the characters of that repertoire as integer codes.

In the early days of computing, most systems used only the character repertoire of the ASCII code. This was soon seen to be inadequate, and a number of ad-hoc methods were used to extend it. The need to support multiple writing systems, including the CJK family of scripts, required a far larger number of characters and a systematic approach to character encoding in place of the earlier ad-hoc extensions.

The full repertoire of Unicode, for example, encompasses over 100,000 characters, each assigned a unique integer code in the range 0 to hexadecimal 10FFFF (a range of a little over 1.1 million integers, so not every integer in it represents a coded character). Other common repertoires include ASCII and ISO 8859-1, which are identical to the first 128 and 256 coded characters of Unicode respectively.
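The relationship between these repertoires can be checked with a minimal Python sketch. The helper name same_as_unicode is hypothetical and introduced only for illustration; the sketch relies on Python's built-in "ascii" and "latin-1" (ISO 8859-1) codecs.

```python
# Largest Unicode code point, hexadecimal 10FFFF.
MAX_CODE_POINT = 0x10FFFF


def same_as_unicode(encoding: str, count: int) -> bool:
    """Return True if byte values 0..count-1 in `encoding` decode to
    the Unicode characters with the same integer codes."""
    return all(ord(bytes([i]).decode(encoding)) == i for i in range(count))


if __name__ == "__main__":
    # Number of integers in the range 0..10FFFF (a little over 1.1 million).
    print(MAX_CODE_POINT + 1)
    # ASCII matches the first 128 coded characters of Unicode.
    print(same_as_unicode("ascii", 128))
    # ISO 8859-1 matches the first 256 coded characters of Unicode.
    print(same_as_unicode("latin-1", 256))
```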

The term character encoding is sometimes overloaded to also mean how characters are represented as a specific sequence of bits. This involves an encoding form, in which the integer code is converted to a series of integer code values that facilitate storage in a system that uses fixed bit widths. For example, integers greater than 65535 will not fit in 16 bits, so the UTF-16 encoding form mandates that these integers be represented as a surrogate pair of integers that are less than 65536 and that are not assigned to characters (e.g., hex 10000 becomes the pair D800 DC00). An encoding scheme then converts code values to bit sequences, with attention given to things like platform-dependent byte order (e.g., D800 DC00 might become the bytes 00 D8 00 DC on a little-endian architecture such as Intel x86). A character set, character map, or code page shortcuts this process by directly mapping abstract characters to specific bit patterns. Unicode Technical Report #17 explains this terminology in depth and provides further examples.
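The surrogate pair and byte order examples above can be reproduced with a short Python sketch. The function name to_surrogate_pair is hypothetical; the arithmetic follows the UTF-16 encoding form, and the byte sequences come from Python's built-in "utf-16-le" and "utf-16-be" codecs.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10000)
    print(f"{high:04X} {low:04X}")     # D800 DC00

    # Encoding scheme: the same two code values in each byte order.
    little_endian = chr(0x10000).encode("utf-16-le")
    big_endian = chr(0x10000).encode("utf-16-be")
    print(little_endian.hex(" ").upper())  # 00 D8 00 DC
    print(big_endian.hex(" ").upper())     # D8 00 DC 00
```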

Since most applications use only a small subset of Unicode, encoding schemes like UTF-8 and UTF-16, and character maps like ASCII, provide efficient ways to represent Unicode characters in computer storage or communications using short binary words. Some of these simple text encodings use data compression techniques to represent a large repertoire with a smaller number of codes.
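As a rough illustration of how these encoding schemes use short binary words, the following Python sketch counts the bytes needed for a few sample characters. The sample characters and the helper name encoded_lengths are chosen only for illustration.

```python
def encoded_lengths(ch: str) -> tuple:
    """Return the number of bytes `ch` occupies in (UTF-8, UTF-16)."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Sample characters, written as escapes so the source stays pure ASCII.
SAMPLES = {
    "A": "Latin letter in the ASCII repertoire",
    "\u00e9": "Latin letter outside ASCII (e with acute accent)",
    "\u4e2d": "CJK ideograph",
    "\U00010000": "first code point above FFFF",
}

if __name__ == "__main__":
    for ch, description in SAMPLES.items():
        utf8_len, utf16_len = encoded_lengths(ch)
        print(f"U+{ord(ch):04X} {description}: "
              f"UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Characters from the ASCII repertoire take a single byte in UTF-8, while UTF-16 uses two bytes for every code point up to FFFF and four bytes for those above it.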

See also: Chinese character encoding
