MUMPS

MUMPS, or simply M, is a programming language dedicated to building and managing databases. Whereas in most systems the database is the first-class citizen and a language is added on top, under M this is inverted, the language itself is the primary object, the database a "side effect" of one of the features of the language.

M contrasts strongly with most database systems, because the system is much "lower level". For instance, whereas most database systems will include a command to find all the records matching a particular patterm, on an M system you would have to write a program to do this search and collect up the results. As you might imagine, this makes even trivial tasks much more difficult, and has led to a number of M-based programs to act as a database management system and provide these features.

For people used to traditional database applications or database management systems, M can be difficult to understand at first. This is offset by its speed and flexibility in dealing with tasks that would cause problems under the relational database model. M has been called the best-kept secret in the IT industry. Much of this secrecy seems self-imposed however: finding good introductory information on M is difficult, and the commercial side of the M market is fractured.

Table of contents

1 History
2 Description
3 The MUMPS Datastore
4 Summary of key language features
5 External links

History

MUMPS started life as the Massachusetts General Hospital Utility Multi-Programming System, developed in Octo Barnett's animal lab at Massachusetts General Hospital in Boston in 1966 & 67. Based on the then-common hierarchical database model, MUMPS added an interpreted language to standardize interacting with the data. The MUMPS team deliberately chose to write the new language with portability in mind. Another feature not widely supported in operating systems of the era was multitasking, which was also built into MUMPS itself.

The original MUMPS system was built on a spare DEC PDP-7, but it was soon ported to a PDP-15 where it lived for some time. Developed on a government grant, MUMPS was required to be released in the public domain (no longer a requirement for grants), and was soon ported to a number of other systems including the popular PDP-8 and Data General Nova minicomputers. Word of MUMPS spread mostly through the medical community, and by the early 1970s was in widespread use, often being locally modified for their own needs.

In 1972 various MUMPS users gathered in order to standardized the now fractured language, creating the MUMPS Users Group and MUMPS Development Committee. These efforts proved successful; a standard was complete by 1974, and by 1977 they had turned it into an ANSI standard. The group was later responsible for the change of the naming from MUMPS to M in 1990, after repeatedly seeing people reject the product out of hand due to its name. Over a decade later most people still refer to the product as MUMPS however.

The Veteran's Administration (today known as the United States Department of Veterans Affairs) officially adopted MUMPS as the programming language to be used to implement an integrated laboratory, pharmacy, and a patient admission, tracking and discharge system in the early 1980s. The original version, the Decentralized Hospital Computer Program (DHCP) was delivered early and under budget. DHCP has been continuously extended in the years since, and is available at no cost in source code. In order to implement DHCP the VA also wrote an intermediate layer known as FileMan in MUMPS to act as a database management system.

Today, DHCP is known as Veterans Health Information Systems and Technology Architecture (VistA). The www.hardhats.org website is the center for the international community of VistA developers and users.

Nearly the entire VA hospital system in the United States, the Indian Health Service, and major parts of the Department of Defense hospital system (Consolidated Hospital Computer System (CHCS), different than the VA's for historical reasons) all still run the system for clinical data tracking.

M also gained a following in the financial sector, in this case due to its much higher performance compared to traditional SQL based systems. Given similar hardware, multidimentional databases like M are typically about six times faster than SQL for transaction processing, making them ideal for online systems like banking. They also range from as-good to hundreds of times faster on queries, with more complex queries always favoring the multidimentional approach. They are also particularly good at looking up related information in other data stores "for free", a task that requires an expensive JOIN operation in SQL systems.

DEC became interested in the widespread use of MUMPS and decided to create their own standardized version in the 1980s, known as DSM (DEC Standard MUMPS). It quickly became the de-facto standard on DEC machines, and was later ported to their DEC Alpha-based systems running both VMS and Unix. In 1990 DSM was purchased by InterSystems, who released it on a number of platforms as OpenM (although nothing about it was "open") and/or ISM (InterSystems M). Through the 1990s InterSystems started buying up other MUMPS vendors, including the other "standard" from the IBM mainframe world, MSM (Micronetics Standard MUMPS). Since then InterSystems has increasingly distanced itself from its M history, referring to its product as Cach� and removing any mention of M from their literature.

General use of M appears to be slowly disappearing. This appears to be a result of the industry's failure to provide a clear and compelling message comparing M with traditional SQL systems. There appears to be no M for SQL Programmers type introductory information available on the Internet, and the small number of books on M are difficult to find. To be truely usable, M systems require a "higher level" layer to act as a database manager, and while M is now well standardized, there is no such standard for these higher levels, adding to the confusion.

A recent release of the industrial-quality GT.M under the GPL may help address this to some degree, by providing a single, free, target for the M community. Several database management layers are available for GT.M, and with a little effort a single suggested platform could evolve.

Description

M is typically an interpreted language, and shares basic syntax with common 1960s data processing languages, most notable those such as COBOL. Commands are listed one to a line with whitespace being important, and grouped into procedures (subroutines) in a fashion similar to most structured programming systems. Procedures are simply strings, so they can be easily stored in the underlying datastore, meaning that there is no need for "stored procedures" as there is in SQL – anything can be stored.

A typical M procedure consists of several "blocks", each block separated by a lable (known as a tag in M-speak) in the first column. Calling into the procedure with no tag results in the entire procedure being run, whereas calling with a tag skips to that point. This allows programmers to place interactive commands at the top of the procedure and then tag the actual start of the code itself, allowing the procedure to be used both interactively and as a function to be called from other code (see the GRASS article for similar examples).

One main difference between M and most other languages is that M has only a single data type, the string, which it invisibly converts into common data types such as numbers or dates. As you might expect, M includes a complete and powerful set of string manipulation commands, grouped into libraries. Automated conversion of this sort is common to many scripting languages, but is generally considered a bad thing for most languages because it can all too easily lead to mistakes that are diffcult to debug. In the case of M this would be hard to avoid however, as the underlying datastore would grow in complexity if it had to deal with different types.

The key to the M system is that all variables are automatically multi-dimensional. For instance, this command:

SET A="abc"

creates the variable A and sets its value to the string. The same variable can then be used to hold additional information:

SET A(1,2)="def"

Will place the string def into "slot" (1,2). Slots can also be designated with strings:

SET A("first_name")="Bob" SET A("last_name")="Dobbs"

making the variables useful data stores on their own. Note that this example also demonstrates another feature of M, that assignments into variables do not erase other information already there. This makes it easy to "build up" a complex variable, using several assignments.

M variables work in a similar fashion as with other programming languages, in that when the program exits, the value will be lost. M comes onto its own with its concept of globals, variables which are automatically and invisibly stored to the datastore. Globals appear as normal variables with the caret character in front of the name. Modifying the earlier example thus:

SET ^A("first_name")="Bob" SET ^A("last_name")="Dobbs"

will result in a new record being created and inserted in the datastore.

One difference between M and the traditional SQL model of a database is the "level" of the commands in the language. M is a general purpose language with a datastore, whereas SQL is a language dedicated to database functions. This might sound like a minor distinction, but it is rather important to understand it. For instance, SQL includes a search function:

SELECT * FROM user WHERE first_name like 'Bob%'

returns a list of matching records. M has no equivalent "high level" command like SELECT, instead the programmer must construct a small routine in order to collect up the matching records returned from its lower-level functions. It should also be noted that M does not include any transaction controls, all changes to globals happen instantly, and there is no logging in the basic system. Nor does M include any sort of user-based security.

For all of these reasons one of the most common M programs is a database management system, providing all of the classic ACID properties on top of a generic M implementation. FileMan is one such example. In the 1990s many of these layers were adapted to supporting SQL, turning most M systems into SQL systems. Although the user might be "fooled" (to some degree) into seeing the system as a SQL database, the system nevertheless retains the speed advantages of the M datastore (see below).

A side effect of the way M evolved is that the M system includes fairly complete support for multi-tasking, multi-user, multi-machine programming. The former two features are now commonplace on most operating systems, but the later is still not cleanly supported by most systems. To demonstrate the ease of multi-machine support, consider:

SET ^|DENVER|A("first_name")="Bob" SET ^|DENVER|A("last_name")="Dobbs"

which sets up A as before, but this time on the remote machine called "DENVER". M programs are thus trivial to distribute over many machines, a feature that is still difficult on most SQL systems. This support also made it easy to expose the same sorts of distribution in the SQL (and other) layers with ease, and it's not uncommon for M systems to be a better distributed SQL solution than a "real" SQL system.

Another use of M in more recent times has been to create object databases. By "flattening" objects into a string representation, like XML, M systems can be used to store objects. An M program then converts back and forth between the "real" objects and the string representations under it, and is able to do so much faster than similar object-relational mapping systems running over relational databases. This should be expected, if M can be used to make a SQL that's faster than SQL, making an object database that does't require conversion to and from SQL in the middle is bound to be even faster.

The MUMPS Datastore

In the relational model, datastores consist of a number of tables, each one holding records for some particular object (or "entity"). For instance, an address book application would typically contain tables for PEOPLE, ADDRESSES and PHONE_NUMBERS. The tables consist of a number of fixed-width columns holding one basic piece of data (like "first_name"), and each record is a row.

In this example any row in ADDRESSES is "for" a particular row in PEOPLE. SQL does not understand the concept of "ownership" however, and requires the user to collect this information back up. SQL supports this through procedure, using the concept of a foreign key; copying some unique bit of data found in the PEOPLE table into the ADDRESS table.

To re-create a single "real" record for the address book, the user must instruct the database to collect up the row in PEOPLE they are interested in, extract the key, and then search the ADDRESSES and PHONE_NUMBERS tables for all the rows containing this key. SQL offers a simple way to do this however:

SELECT * from PEOPLE p, ADDRESSES a, PHONE_NUNBERS n

 WHERE p.id = a.person_id

   AND p.id = n.person_id

   AND p.first_name = "Bob"

In this example the WHERE looks for all the a's and n's (addresses and phone numbers) that have the person's ID tag stored inside them, but only for p's named Bob.

This trivial example already requires three lookups in different tables to return the data, which, as you might expect, is very slow. In order to improve performance, the database administrator will place an index on heavily-searched columns, in this example the person_id columns. An index consists of a column containing the data to be found, and the record number of the matching row in the table. This is the reason that tables are fixed width, so that the database can easily figure out the physical location of the record given the location of the start of the table, the length of any row, and the number of rows to skip. Without this simplification, performance of the relational model would be unusable.

M's datastore stores only the physical locations. This means that records can be of any length, placed anywhere, and contain anything. Searching is not needed to find any record, a pointer directly to that record is easily retrieved and followed to the data in question. The physical data in an M is typically stored in a "blob" of strings, one after the other. This provides another advantage over the relational model, as empty cells do not take up room as they do in the fixed-length relational table. M databases are therefore smaller than relational ones, which is another reason for their increased performance (less disk operations).

So why doesn't the relational model do the same thing? Historical accident. At the time the difference in speed between storage and processor was much smaller than it is today, and the cost of having the CPU follow a pointer was expensive compared to the simple arithmatic needed for an index. Today the CPU's have grown many times faster than the storage, so this cost is effectively zero. This is the main reason why multidimensional datastores outperform relational ones today, something that was not true in the 1970s when the two models were in competition.

M globals are, in fact, indexes. Each node in the global contains a pointer to the data, just as an index does in the relational model. Unlike the relational model, where indexes are a special-purpose object included as a necessary evil used in some lookups, under M indexes are first-class citizens that are used for all data access. This is yet another reason for M's performance.

This makes M systems particularly well suited to looking up related data, as in the example above. The equivalent M statement would be something more akin to:

SELECT * from PEOPLE p, ADDRESSES a, PHONE_NUNBERS n

 WHERE p.first_name = "Bob"

The biggest consequence of this internal representation is that database operations are economical (in both disk space and execution time). M is extremely well suited to real world data, which is often 'sparse' (ie has missing fields). There is no penalty in storage space if a defined data value is not present. This is an extremely helpful feature in a clinical context.

M includes almost no operating system specific command syntax, very few file system interface commands, and no machine specific commands. It is thus quite portable. Additionally, database manipulation code is extremely brief. A M routine implementing a complex database interaction might be a page or two of code. The equivalent in a less high level language (C, Pascal, Fortran, ...) is likely to be an order of magnitude larger. M is a highly cost effective application programming tool.

Summary of key language features

This incomplete, informal sketch seeks to give programmers familiar with other languages a feeling for what M is like. Neither the language description and the descriptions of each feature are complete, and many significant features have been omitted for brevity. These notes reflect the language circa 1994.

Data types: one universal datatype, interpreted/converted to string, integer, or floating-point number as context requires. Like Visual BASIC "variant" type.

Booleans: In IF statements and other conditionals, any nonzero value is treated as True. a Declarations: NONE. Everything dynamically created on first reference.

Lines: important synactic entities. Multiple statements per line are idiomatic. Scope of IF and FOR is "remainder of current line."

Case sensitivity: Commands and intrinsic functions are case-insensitive. Variable names and labels are case-sensitive. No specified meaning for upper vs. lower-case and no widely accepted conventions. Percent sign (%) is legal as first character of variables and labels.

Postconditionals: SET:N<10 A="FOO" sets A to "FOO" if N is less than 10; DO:N>100 PRINTERR performs PRINTERR if N is greater than 100. Provides a conditional whose scope is less than the full line.

Arrays: created dynamically, stored sparsely as B-trees, any number of subscripts, subscripts can be strings or integers. Always automatically stored in sorted order. $ORDER and $QUERY functions allow traversal.

       for i=10000:1:12345 set sqtable(i)=i*i
       set address("Smith","Daniel")="dpbsmith@world.std.com"

Local arrays: names not beginning with caret; stored in process space; private to your process; expire when process terminates; available storage depends on partition size but is typically small (32K)
Global arrays: ^abc, ^def. Stored on disk, available to all processes, persist when process terminates. Very large globals (hundreds of megabytes) are practical and efficient. This is M's main "database" mechanism. Used instead of files for internal, machine-readable recordkeeping.
Indirection: in many contexts, @VBL can be used and effectively substitutes the contents of VBL into the statement. SET XYZ="ABC" SET @XYZ=123 sets the variable ABC to 123. SET SUBROU="REPORT" DO @SUBROU performs the subroutine named REPORT. Operational equivalent of "pointers" in other languages.
Piece function: Treats variables as broken into pieces by a separator. $PIECE(STRINGVAR,"^",3) means the "third caret-separated piece of STRINGVAR." Can appear as an assignment target. After

SET X="dpbsmith@world.std.com"

$PIECE("world.std.com",".",2) yields "std". SET $P(X,"@",1)="office" causes X to become "office@world.std.com".
Order function
Set stuff(6)="xyz",stuff(10)=26,stuff(15)=""

$Order(stuff("")) yields 6, $Order(stuff(6)) yields 10, $Order(stuff(8)) yields 10, $Order(stuff(10)) yields 15, $Order(stuff(15)) yields "".

Set i="" For Set i=$O(stuff(i)) Quit:i="" Write !,i,?10,stuff(i)

The argumentless For iterates until stopped by the Quit. Prints a table of i and stuff(i) where i is successively 6, 10, and 15.
Commands: may be abbreviated to one letter, case-insensitive.

DO XYZ call subroutine at label XYZ DO PQR(arg1,arg2,arg3) call with parameter passing ELSE stmnt1 stmnt2 stmnt3 opposite of last IF FOR stmnt1 stmnt2 stmnt3 repeat until a QUIT breaks you out FOR i=1:1:100 stmnt1 ... iteration, i=1, 2, 3, ... 100 GOTO yes, there is one IF cnd stmnt1 stmnt2 stmnt3 conditionally execute rest of line KILL vbl return vbl to "undefined" state NEW vbl1,vbl2,vbl3 stack old values, create fresh "undefined" state. Pop on QUIT. QUIT return from subroutine QUIT value return from extrinsic function READ "Prompt:",x on current I/O stream, first write "Prompt:", then read line into variable x SET a=22,name="Dan",(c,d)=0 variable assignment USE 23 switch I/O stream to device 23 WRITE !,"x=",x output to current I/O stream. ! = new line XECUTE("set a=5 do xyz") execute arbitrary data as M code

Operators: No precedence, executed left to right, parenthesize as desired. 2+3*10 yields 50.
+ - * / sum, difference, product, quotient \\ integer division, 1234\\10 yields 123 # modulo _ concatenation, "nice"_2_"use" --> "nice2use" & ! ' < > and, or, not, less, greater, equal [ string contains. "ABCD"["BC" --> 1 ] string lexically follows. "Z"]"A" --> 1 ? pattern match operator

Intrinsic (built-in) functions:
Important structural components of the language (not commonly found in other languages):

$DATA(V) tests if a V is defined (has data) or not $ORDER, $QUERY traverse arrays in sorted order $ORDER(a("abc")) value v is the next subscript, following "abc", according to the M collating sequence, such that a(v) is defined. $PIECE see above $SELECT(c1:v1,c2:v2,1:v3) if c1 is true yields v1, else if c2 is true yields v2, otherwise yields v3 $TEXT(FOO+3) returns text of source code at line FOO+3

Convenience functions similar to library functions in other languages:

$ASCII, $CHAR text-to-ASCII-code and inverse $EXTRACT(string,5,10) characters 5 through 10 of string; may be assignment target $FIND(string,find,from) substring search $FNUMBER floating point formatting $LENGTH(string) just what you think $RANDOM(100) random # in range 0 to 99 inclusive $TRANSLATE("abcd","ab","AB") character substitution; yields "ABcd"

External links

M Technology and MUMPS Language FAQ