The human voice consists of sounds generated by the opening and closing of the glottis by the vocal cords, which produces a periodic waveform. This basic sound is then filtered by the nose and throat to produce differences in harmonic content in a controlled way, creating the wide variety of sounds used in speech. There is another set of sounds, known as the unvoiced and plosive sounds, which are not modified by the mouth in the same fashion.
The vocoder examines speech by finding this basic frequency, the fundamental frequency, and measuring how it changes over time as someone speaks. This results in a series of numbers representing the modified frequencies at each moment of the recording. In doing so, the vocoder dramatically reduces the amount of information needed to store speech, from a complete recording to a series of numbers. To recreate speech, the vocoder simply reverses the process, generating the fundamental frequency in an oscillator, then passing it into a modifier that changes the frequency based on the originally recorded series of numbers.
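The analysis and playback steps described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular vocoder: it assumes autocorrelation peak picking as the pitch detector, a 60 to 400 Hz search range, 400-sample frames, and a synthetic test tone in place of real speech. The names `estimate_f0`, `track_f0` and `oscillator` are hypothetical.

```python
import math

def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental of one frame by autocorrelation peak picking."""
    lag_min = int(sample_rate / f_max)   # shortest period considered
    lag_max = int(sample_rate / f_min)   # longest period considered
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def track_f0(signal, sample_rate, frame_len=400):
    """Reduce a recording to one fundamental-frequency value per frame."""
    return [estimate_f0(signal[s:s + frame_len], sample_rate)
            for s in range(0, len(signal) - frame_len + 1, frame_len)]

def oscillator(f0_track, sample_rate, frame_len=400):
    """Playback side: a sine oscillator steered by the stored numbers."""
    out, phase = [], 0.0
    for f0 in f0_track:
        for _ in range(frame_len):
            out.append(math.sin(phase))
            phase += 2.0 * math.pi * f0 / sample_rate
    return out

# Synthetic "voiced" signal (assumed test input): 100 Hz plus one harmonic.
sample_rate = 8000
voiced = [math.sin(2 * math.pi * 100 * n / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 200 * n / sample_rate)
          for n in range(1600)]

f0_track = track_f0(voiced, sample_rate)   # a handful of numbers
rebuilt = oscillator(f0_track, sample_rate)
```

Here 1600 samples are reduced to four numbers, one per frame, which is the kind of saving the paragraph describes. The rebuilt signal carries only the pitch, so it lacks the character of the original voice.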
Of course, the actual qualities of speech cannot be reproduced this easily. In addition to a single fundamental frequency, the vocal system adds a number of resonant frequencies, known as formants, that give the voice its character and quality. Without capturing these additional qualities, the vocoder will never sound "real".
To address this, most vocoder systems use what is effectively a number of vocoders, each tuned to a different frequency band (using band-pass filters). The values from these filters are stored not as raw numbers, which are all based on the original fundamental frequency, but as a series of modifications needed to turn that fundamental into the signal seen in each filter. During playback these settings are sent back into the filters and the filter outputs are added together, modified with the knowledge that speech typically varies between these frequencies in a fairly linear way. The result is recognizable speech, although somewhat "mechanical" sounding. Vocoders also often include a second system for generating unvoiced sounds, using a noise generator instead of the fundamental frequency.
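A filter-bank arrangement of this kind might look roughly like the sketch below. It follows the classical channel-vocoder simplification, in which each band stores only a slowly varying level (its envelope) that is later imposed on the same band of a carrier. The three band centers, the filter Q, the smoothing constant and the function names (`bandpass`, `envelope`, `analyze`, `synthesize`) are all assumptions made for illustration. The band-pass filter uses the widely published second-order (biquad) form.

```python
import math

def bandpass(signal, center_hz, sample_rate, q=4.0):
    """Second-order band-pass filter with unit gain at center_hz."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def envelope(signal, smoothing=0.99):
    """Rectify and smooth: the slowly varying level of one band."""
    env, level = [], 0.0
    for x in signal:
        level = smoothing * level + (1.0 - smoothing) * abs(x)
        env.append(level)
    return env

def analyze(speech, centers, sample_rate):
    """One envelope per band; these are the numbers a vocoder would store."""
    return [envelope(bandpass(speech, c, sample_rate)) for c in centers]

def synthesize(carrier, envelopes, centers, sample_rate):
    """Filter the carrier into the same bands, scale each band, and sum."""
    out = [0.0] * len(carrier)
    for c, env in zip(centers, envelopes):
        band = bandpass(carrier, c, sample_rate)
        for i, (b, e) in enumerate(zip(band, env)):
            out[i] += b * e
    return out

# Assumed test inputs: a 500 Hz tone standing in for speech, and a
# 100 Hz pulse train standing in for the oscillator.
sample_rate = 8000
centers = [500.0, 1500.0, 2500.0]
speech = [math.sin(2 * math.pi * 500 * n / sample_rate) for n in range(2000)]
carrier = [1.0 if n % 80 == 0 else 0.0 for n in range(2000)]

envelopes = analyze(speech, centers, sample_rate)
output = synthesize(carrier, envelopes, centers, sample_rate)
```

For brevity the envelopes are kept at the full sample rate. A practical system would store them at a much lower rate, since they change slowly, and would switch the carrier to a noise source for unvoiced sounds.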
Even with the need to record several frequencies, and the additional unvoiced sounds, the compression of the vocoder system is impressive. Standard speech-recording systems capture frequencies up to about 4 kHz, where most of the frequencies used in speech lie. By the Nyquist criterion this requires sampling at 8 kHz, which at 8 bits per sample amounts to 64 kbit/s of bandwidth. A vocoder, however, can provide a reasonably good simulation with about 3 kbit/s of bandwidth, roughly a 20-fold improvement.
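The figures in this comparison can be checked with a short calculation. The 8 kHz sampling rate and 8-bit sample size used below are assumed values, typical of telephone-quality PCM, that yield the 64 kbit/s figure. The 3 kbit/s vocoder rate is taken from the text.

```python
# Assumed telephone-quality PCM parameters.
sample_rate_hz = 8000      # samples per second (twice a 4 kHz upper band)
bits_per_sample = 8        # assumed sample size

pcm_bits_per_second = sample_rate_hz * bits_per_sample
vocoder_bits_per_second = 3000   # "about 3k" from the text

ratio = pcm_bits_per_second / vocoder_bits_per_second
print(pcm_bits_per_second)   # 64000
print(round(ratio, 1))       # 21.3
```

Under these assumptions the saving comes to a little over 21 times, consistent with the rounded figure of about 20.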
For musical applications, a source of musical sounds is used as the oscillator, instead of extracting the fundamental frequency. For instance, one could use the sound of a guitar as the input to the filter bank, a technique that became popular in the 1970s.