Parsing

In grammar and linguistics, parsing is the process by which a person makes sense of a sentence, usually by breaking it down into words or phrases.

Parsing used to be taught as a formal activity in schools as part of language education.

In computer science, parsing is the process of splitting up a continuous stream of characters (read from a file or keyboard input, for example) into meaningful tokens, and then building a parse tree from those tokens. The name is by analogy with the usage in grammar and linguistics.

A parser is a computer program that carries out this task.

In machine translation and natural language processing systems, human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language.

The most common use of parsers is to parse computer programming languages. These have grammars that are simple and unambiguous compared with those of human languages. The rest of this article describes the common case of parsing a language with two levels of grammar: lexical and syntactic.

The first stage is token generation, also called lexical analysis. For example, a calculator program would look at input like "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^ and 2, each of which is a meaningful symbol in the context of an arithmetic expression. The parser would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.
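
As an illustration of the token generation stage, the following is a minimal sketch in Python. The names TOKEN_PATTERN and tokenize are hypothetical, and the set of operator characters it accepts is an assumption made for this example; real lexers are usually more elaborate.

```python
import re

# Assumed token rule for this sketch: a run of digits, or a single
# operator or parenthesis character. Leading whitespace is skipped.
TOKEN_PATTERN = re.compile(r"\s*(\d+|[-+*/^()])")

def tokenize(text):
    """Split an arithmetic expression into a list of token strings."""
    tokens = []
    pos = 0
    text = text.rstrip()  # ignore trailing whitespace
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

print(tokenize("12*(3+4)^2"))
# ['12', '*', '(', '3', '+', '4', ')', '^', '2']
```

Because a digit run is matched greedily and every operator is matched on its own, fragments such as "12*" or "(3" can never come out as single tokens.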

The next stage is syntactic parsing or syntax analysis, which checks that the tokens form an allowable expression. This is usually done with reference to a context-free grammar (CFG), which recursively defines the components that can make up an expression and the order in which they must appear.
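
One way to check tokens against a context-free grammar is a recursive descent parser, in which each grammar rule becomes a function. The sketch below is only one possible approach and rests on several assumptions: the tokens arrive as a list of strings, the grammar is the small one given in the comment (with conventional operator precedence and a right-associative ^), and the parse tree is represented as nested tuples. The function name parse is hypothetical.

```python
# Assumed grammar for this sketch:
#   expr   -> term   (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> base ('^' factor)?
#   base   -> NUMBER | '(' expr ')'

def parse(tokens):
    """Return a nested-tuple parse tree, or raise SyntaxError."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():
        node = base()
        if peek() == "^":
            advance()
            node = ("^", node, factor())  # recursion makes ^ right-associative
        return node

    def base():
        token = advance()
        if token is not None and token.isdigit():
            return int(token)
        if token == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expr()
    if peek() is not None:  # leftover tokens mean the input is not a valid expr
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse(["12", "*", "(", "3", "+", "4", ")", "^", "2"]))
# ('*', 12, ('^', ('+', 3, 4), 2))
```

A token sequence that does not fit the grammar, such as 12 followed by * and nothing else, raises SyntaxError instead of producing a tree.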

The final phase is semantic parsing or analysis, which works out the implications of the expression just validated and takes the appropriate action. In the case of a calculator, the action is to evaluate the expression; a compiler, on the other hand, would generate machine code that carries out the behavior specified in the source code.
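
For the calculator case, the evaluation step can be sketched as a recursive walk over a parse tree. The example below assumes a hypothetical tree representation in which a leaf is an integer and an inner node is a tuple of (operator, left, right); the names OPERATIONS and evaluate are likewise chosen only for illustration.

```python
import operator

# Map each operator token to the function that carries it out.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}

def evaluate(node):
    """Evaluate a nested-tuple expression tree and return its value."""
    if isinstance(node, int):
        return node  # a leaf is just a number
    op, left, right = node
    return OPERATIONS[op](evaluate(left), evaluate(right))

# Assumed tree for the expression 12*(3+4)^2
tree = ("*", 12, ("^", ("+", 3, 4), 2))
print(evaluate(tree))  # 588
```

A compiler would walk the same kind of tree but emit instructions for each node rather than computing a value directly.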