ProXML - XML Processing Language Version 2.0

Language Reference

(April 2001)

Mark Huckvale
University College London
M.Huckvale@ucl.ac.uk

1. Introduction

ProXML is a general purpose programming language designed for processing text files marked up in the Extensible Markup Language (XML) format. Input to the ProXML interpreter is an XML file and a ProXML script. Output can be an analysis of the input file or an XML file modified according to the instructions in the script.

This document describes the ProXML language which is used to write XML processing scripts. The ProXML interpreter, called PRX, runs on WIN32 and Unix computing platforms and is available free of charge under the GNU public software licence. See the web page http://www.phon.ucl.ac.uk/project/prosynth/ for more details.

2. Overview

The ProXML language is a general purpose programming language with some special functionality for working with XML files. Its basic syntax is loosely modelled on 'C'. The particularly novel characteristics of ProXML are:

  • Data-driven function calls for each XML element type.
  • Polymorphic variables holding characters, integers, floating point numbers and strings.
  • Built-in facilities to traverse hierarchical structures.
  • Special access to XML attribute values.
  • XML validity checking during operation.
  • Facilities to add and delete parts of the XML tree.

To get an idea of what ProXML can do, consider first this simple XML file:

    <?xml version='1.0'?>
    <!DOCTYPE BOOK [
    <!ELEMENT BOOK (TITLE?, CHAPTER*) >
    <!ELEMENT CHAPTER (P*) >
    <!ELEMENT TITLE (#PCDATA) >
    <!ELEMENT P (#PCDATA) >
    <!ATTLIST CHAPTER NUM CDATA #IMPLIED >
    <!ATTLIST P ID CDATA #IMPLIED >
    ]>
    <BOOK>
    <TITLE>This is the title</TITLE>
    <CHAPTER>
    <P>This is a paragraph.</P>
    <P>This is a paragraph.</P>
    </CHAPTER>
    <CHAPTER>
    <P>This is a paragraph.</P>
    <P>This is a paragraph.</P>
    </CHAPTER>
    </BOOK>
    

Here is a simple ProXML script that demonstrates some features of the language.

    /* intro.prx - number chapters and give paragraphs a unique ID */
    BOOK {
        /* declare a counter variable */
        var cnt=1;
        /* declare a node variable */
        node n;
        /* repeat for every CHAPTER under BOOK */
        foreach n (./CHAPTER) {
            /* set NUM attribute of chapter */
            n:NUM = cnt;
            /* increase counter */
            cnt += 1;
        }
    }
    CHAPTER {
        /* declare variable containing number of paragraphs */
        var num=numchild(.);
        /* loop over paragraphs */
        for (var i=1;i<=num;i+=1)
            /* set ID attribute of P based on NUM attribute of CHAPTER */
            ./P[i]:ID = :NUM ++ "-" ++ i;
    }
    

In this code the block labelled with 'BOOK' is executed for every BOOK element in the XML file, while the block labelled 'CHAPTER' is executed for every CHAPTER element in the file. Normal variables that can hold numbers and strings are declared with the 'var' statement, while node variables which can hold nodes of the XML parse tree are declared with the 'node' statement. The 'foreach' statement executes a block of code on each of a set of nodes that match a given pattern, while the 'for' statement executes a block of code a defined number of times. Either statement can be used to walk the parse tree. The built-in function 'numchild()' returns the number of direct children of the given node.

Here is the output XML file:

    <?xml version='1.0'?>
    <!DOCTYPE BOOK [
    <!ELEMENT BOOK (TITLE?, CHAPTER*) >
    <!ELEMENT CHAPTER (P*) >
    <!ELEMENT TITLE (#PCDATA) >
    <!ELEMENT P (#PCDATA) >
    <!ATTLIST CHAPTER NUM CDATA #IMPLIED >
    <!ATTLIST P ID CDATA #IMPLIED >
    ]>
    <BOOK>
    <TITLE>This is the title</TITLE>
    <CHAPTER NUM="1">
    <P ID="1-1">This is a paragraph.</P>
    <P ID="1-2">This is a paragraph.</P>
    </CHAPTER>
    <CHAPTER NUM="2">
    <P ID="2-1">This is a paragraph.</P>
    <P ID="2-2">This is a paragraph.</P>
    </CHAPTER>
    </BOOK>
    

3. Language Elements

3.1 Procedural Structure

There are two types of procedural block in ProXML: element procedures and user-defined functions. Element procedures are blocks of code that are executed once per matching element in the XML file, and are declared as

    <element-list> {
        <statement-list>
    }
    

Where <element-list> is a list of element names separated by commas. The system-defined name '.' refers to every element, while '/' refers to the root element. For example:

    . {
        /* process every node */
    }
    / {
        /* process root node */
    }
    CHAPTER, TITLE, P {
        /* process CHAPTER, TITLE and P elements */
    }
    

Element names can appear on more than one procedure. When an input XML file is processed, all matching procedures are called once for each element occurrence, with a hierarchical context located at that node in the parse tree.

User-defined functions return either a simple value or a node reference, and are declared as:

    var <function-name> ( <dummy-argument-list> ) {
        <statement-list>
    }
    node <function-name> ( <dummy-argument-list> ) {
        <statement-list>
    }
    

Where <dummy-argument-list> is a sequence of pairs of variable type and variable name separated by commas. Variables can be of type 'var' or type 'node'. 'var' arguments are always passed by value, 'node' variables are always passed by reference. Functions may be recursive, but they must always be declared before they are used. The 'return' statement is used to specify the value to return. If a function does not contain a return statement (or falls off the end) it returns 0 or nil. Functions may also be called as procedures, in which case the return value is ignored. Here are some examples:

    var average(var v1,var v2,var v3)
    {
        return (v1+v2+v3)/3;
    }
    var numPchildren(node n)
    {
        var cnt=0;
        node m;
        foreach m (n/P) cnt += 1;
        return cnt;
    }
    node forceparent(node n)
    {
        if (n/..==nil)
            exit(1,"Node has no parent\n");
        else
            return(n/..);
    }
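
As a further sketch of a recursive function that takes a node argument, the following computes how far a node lies below the root. The function name depth is illustrative only, and the code assumes that the parent of the root node is nil, as the forceparent() example above suggests.

    /* depth of a node below the root (the root itself has depth 0) */
    var depth(node n)
    {
        if (n/..==nil)
            return 0;
        else
            return 1+depth(n/..);
    }
    P {
        /* the function is declared above, before its first use */
        output("P at depth "++depth(.)++"\n");
    }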
    

3.2 Variables and Expressions

Normal variables and expressions in ProXML are polymorphic: they automatically switch between holding characters, integers, floating-point numbers and strings as required by the context. Thus all of the following have the value 25:

    20+5
    "20"+"5"
    "5"*5
    20+"5"
    "2"++"5"
    (1*2)++(2+3)
    

The '++' operator means string concatenation. The built-in functions char(), integer(), float() and string() enforce type conversions if required, for example:

    "test"++char(65)
    

is equal to 'testA'. Illegal conversions (for example integer("X")) cause a run-time warning.
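
The following sketch shows explicit conversions used to make the intended operation unambiguous. The names are illustrative, and the values in the comments follow from the rules stated above rather than from a recorded run of the interpreter.

    /* convert.prx - explicit type conversions (illustrative sketch) */
    var run()
    {
        var s = "12";
        var sum = integer(s) + 3;            /* numeric addition: 15 */
        var cat = string(12) ++ string(3);   /* concatenation: "123" */
        output(sum ++ " " ++ cat ++ "\n");
        return 0;
    }
    var start=run();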

Normal variables are declared with the 'var' statement, of the form

    var <var-name> ( = <initial-value> ) 
        {, <var-name> (= <initial-value>) }
    

For example:

    var counter;
    var num=1;
    var n1=1,n2=2;
    

Variables can be declared at any point in advance of their first use. Variables may be either local or global. Local variables are declared inside functions and procedures; they only have scope within that procedure. Global variables are declared outside functions and are accessible to all functions. They retain their value across elements within a given XML input file, but not across files. Variables are not initialised to any value, so explicit initialisation is recommended.

Node variables are declared with the 'node' statement in an analogous way to normal variables. However node variables cannot be used in expressions and can only be assigned node values from other node variables. Also node variables can only be tested for equality and inequality. There are a number of built-in node variables:

    .   = <current node>
    ..  = <parent of current node>
    /   = <root node of current document>
    nil = <the empty node>
    

Document nodes form a strict hierarchy, and ProXML provides means for referring to nodes that dominate a given node or are children of a given node. Thus given some node n, the expressions

    n/..      = <parent of n>
    n/P       = <first child element of n called P>
    n/.       = <first child element of n>
    n/.[1]    = <first child element of n>
    n/.[2]    = <second child element of n>
    n/../.[1] = <first child of parent of n>
    n/./P     = <first grandchild element of n called P>
    n/./P[2]  = <second grandchild element of n called P>
    

In normal statements, no backtracking is performed to force a match. Thus in the last example, if the first child of n had no child nodes of element type P, the resulting expression would be 'nil' (even if the second child of n did have child nodes of element type P). To explicitly search the tree for a pattern, with backtracking on failure, you must use the 'foreach' statement or set up user functions to perform searching.
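
One way such a user function might be written is sketched below. It tries each child of n in turn until one of them has a child called P. The function name is illustrative, and the sketch assumes that an indexed step can be followed by a further step (n/.[i]/P), by analogy with the forms listed above.

    /* first grandchild of n called P, trying every child of n in turn */
    node firstgrandchildP(node n)
    {
        var i;
        var num = numchild(n);
        for (i=1;i<=num;i+=1) {
            if (n/.[i]/P != nil)
                return(n/.[i]/P);
        }
        return(nil);
    }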

Nodes can also point to text data. The children of a node can be a mixture of elements and text data. These are all counted by numchild() and indexed by the [] operator. Thus given this piece of XML:

    <P>Hello <B>Mark</B> !</P>
    

Then the node <P> has 3 children: "Hello ", <B> and " !". The second child has its own child node containing "Mark". If node n points to <P>, then we can access these pieces as:

    n/.[1]    = "Hello "
    n/.[2]    = <B>
    n/.[3]    = " !"
    

To extract the name of a piece of structure use the element() function. To extract the text string from a node pointing to text data, use the text() function. To determine whether a node is an element or text use the iselement() and istext() functions:

    if (iselement(n))
        output("n is node called "++element(n));
    else if (istext(n))
        output("n is text string "++text(n));
    else
        output("n is nil");
    

Attribute names can be used to qualify a node specification. These allow attribute values on nodes to be tested and set. The expression

    <node-specification> : <attribute-name>
    

is a normal term representing the string value of the given attribute. If the node specification is empty, the current node is taken by default. Thus given this extract of an XML file:

    <P ALIGN="LEFT">
    

The following code converts LEFT to RIGHT:

    P {
        if (:ALIGN=="LEFT") :ALIGN="RIGHT";
    }
    

Here are other examples:

    :ID     = <the ID attribute of the current node>
    .:ID    = <the ID attribute of the current node>
    ..:ID   = <the ID attribute of the parent node>
    n:ID    = <the ID attribute of node n>
    n/..:ID = <the ID attribute of the parent of node n>
    n/CHAPTER/TITLE/P:ID 
            = <the ID attribute of a specific node below n>
    

Again, remember that this last expression refers to an attribute on a single node, not all nodes that might match this specification. If you need to calculate the name of a node or an attribute, put the element name or the attribute name expression in parentheses, as in:

    var aname = "10A";
    var ename = "CHAPTER";
    var avalue = (ename):("ATT"++aname);      /* same as CHAPTER:ATT10A */
    

One- and two-dimensional arrays of variables or nodes are also supported. The declaration

    var table[10];
    

declares an array of 10 variables, indexed from 0 to 9. Access these in expressions and assignment statements using the "[]" operator:

    var i = table[5];
    table[7] = 26;
    

The declarations

    var table2[5,10];
    node ntab[12];
    node ntab2[2,3];
    

declare a two dimensional table of variables with 5 rows and 10 columns, a one dimensional table of nodes with 12 elements, and a two-dimensional table of nodes with 2 rows of 3 columns. Access these elements by arname[rowno,colno].
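
As a small sketch of two-dimensional access, the following fills a table with products and reads one cell back. The array name times is illustrative, and the code assumes that rows and columns are indexed from 0 in the same way as one-dimensional arrays.

    /* multiplication table held in a two-dimensional array */
    var times[3,4];

    / {
        var r,c;
        for (r=0;r<3;r+=1) {
            for (c=0;c<4;c+=1) {
                times[r,c] = (r+1)*(c+1);
            }
        }
        /* last row, last column */
        output("3 x 4 = " ++ times[2,3] ++ "\n");
    }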

Array bounds are checked on every access. Note, however, that an access to an element outside the limits of the array is handled silently: it does not cause an error, nor even a warning. To find the size of an array, use the sizeof() function. To modify the size of an array at run time, use the built-in resize() procedures:

    var vtab[100];
    node ntab[10,10];
    resize(vtab,1000);
    resize(ntab,5,50); 
    

The resize command preserves existing values of the array as far as possible.
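
A common use of sizeof() and resize() together is a table that grows on demand. The sketch below doubles a global array whenever it is full. The names names, nnames and addname are illustrative, and it is assumed that both built-ins can be applied to a global array from inside a function.

    /* growable list of strings (illustrative sketch) */
    var names[4];
    var nnames=0;

    var addname(var s)
    {
        /* double the allocation when the table is full */
        if (nnames >= sizeof(names))
            resize(names,2*sizeof(names));
        names[nnames] = s;
        nnames += 1;
        return nnames;
    }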

Variable arrays may be initialised to constant values with the following syntax:

    var table[10]={ "name", 1, 3.4, "hello" };
    

Note that each cell need not contain the same variable type. Two-dimensional variable arrays must be initialised as a one-dimensional sequence in row-major order (all of row 0, then all of row 1, and so on):

    var table2[2,3] = {
        1, 2, 3,     /* row 0 */
        4, 5, 6      /* row 1 */
    };
    

The initialisation of global variables occurs before any procedures are called. You can use this to create standalone programs. For example:

    /* hello world! program in ProXML */
    var run()
    {
        output("Hello World!\n");
    }
    var start=run();
    

3.3 Statement Types

The assignment statement sets a variable or an attribute to a given expression, or assigns a node variable to a new node value:

    <variable> = <expression>;
    <node> = <nodevalue>;
    

For example

    P {
        var v;     /* declare variable v */
        v = :ID;   /* set v to the current ID attribute */
        node n;    /* declare node variable n */
        n = ..;    /* set n to be the parent of the current node */
        :ID = n:ID ++ v;
                   /* set the ID attribute from parent and old value */
    }
    

Variable assignments may also take these forms which update the current value of the variable:

    <variable> += <expression>;
    <variable> *= <expression>;
    

These are the same as <variable> = <variable> + <expression>, etc.

The if statement executes a block of statements if a simple expression evaluates as true (non-zero).

    if ( <simple-expression> ) 
        <simple-or-compound-statement>
    

Simple expressions may incorporate the relational operators:

    ==    equal
    !=    not equal
    >     greater than
    >=    greater than or equal
    <     less than
    <=    less than or equal
    &&    and
    ||    or
    !     not
    

When using comparison operators take care that the expressions are of appropriate type. You may not get the expected result if "<" or ">" are applied to strings instead of numbers. For example

    CHAPTER {
        var chno = integer(:CHAPNO);    /* force CHAPNO value to integer */
        if ((1 < chno) && (chno <= 10)) :CHAPNO += 1;
    }
    

If statements can also include an 'else' block of statements which are executed if the conditional expression evaluates as false (zero).

    if ( <simple-expression> ) 
        <simple-or-compound-statement> 
    else
        <simple-or-compound-statement>
    

For example

    var factorial(var n)
    {
        if (n <= 1)
            return 1;
        else
            return n*factorial(n-1);
    }
    

The while statement executes a block of statements repetitively while a simple expression evaluates as true (non-zero).

    while ( <simple-expression> ) 
        <simple-or-compound-statement>
    

For example:

    /* find left-most descendant node */
    node n = .;
    while (numchild(n) > 0) n = n/.;
    

The for statement is a convenient means for executing a block of statements a fixed number of times.

    for (<assignment-statement-1>;<simple-expression>;
        <assignment-statement-2>) <simple-or-compound-statement>
    

Logically, this is just a compact means of writing:

    <assignment-statement-1>
    while ( <simple-expression> ) {
        <simple-or-compound-statement>
        <assignment-statement-2>
    }
    

For example:

    var i;
    for (i=1;i<=12;i+=1) 
        output(i ++ "\t" ++ i*i ++ "\n");
    for (i=1;i<=numchild(.);i+=1) {
        output(element(./.[i]));
        output("\n");
    }
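
To make the equivalence with the while statement concrete, the sketch below writes the first loop above in both forms. Wrapping the statements in a root element procedure is a choice made for this illustration only.

    / {
        var i;
        /* for form */
        for (i=1;i<=12;i+=1)
            output(i ++ "\t" ++ i*i ++ "\n");
        /* the same loop written with while */
        i = 1;
        while (i<=12) {
            output(i ++ "\t" ++ i*i ++ "\n");
            i += 1;
        }
    }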
    

The foreach statement provides a sophisticated mechanism for searching the parse tree to find all elements that match a given node specification. The format of the statement is as follows:

    foreach node ( <node-pattern> )
        <simple-or-compound-statement>
    

Where node is the name of a node variable which must have been defined previously, node-pattern is a specification of the nodes to be located by their element names, and the statement block specifies the instructions that need to be executed for each matching node. For each match, the node variable is set to the matching element and the block of statements is executed. For example:

    node n;
    foreach n (./.) output(element(n)++"\n");
    

Would produce a list of the element names of the children of the current node. Note that in the foreach statement, the underspecified node selector './.' refers to all children, not just the first one. This is different to the behaviour in the assignment statement, where

    node n;
    n = ./.;
    

sets n to be equal to the first child of the current node only.

The node pattern can contain the additional search operator '...' which means 'any number of levels down'. This allows a recursive search down through the tree. In every case, the foreach statement uses backtracking to try to exhaustively find all nodes that match the given pattern. Here is an example of backtracking and the '...' operator. Take the following extract of an XML file:

    <CHAPTER>
      <AUTHOR>Mark</AUTHOR>
      <TITLE>
        <P ID="1">The title</P>
      </TITLE>
      <P ID="2">A paragraph</P>
    </CHAPTER>
    

The following node selectors find these nodes:

    CHAPTER/P          - finds P ID=2
    CHAPTER/TITLE/P    - finds P ID=1
    CHAPTER/./P        - finds nil
    

However these node patterns, used in the foreach statement, find these nodes:

    CHAPTER/P          - finds P ID=2
    CHAPTER/TITLE/P    - finds P ID=1
    CHAPTER/./P        - finds P ID=1
    CHAPTER/.../P      - finds P ID=1 and P ID=2
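
Written as a script, the contrast between the two tables might look like the sketch below, which is run from a CHAPTER element procedure so that '.' stands for the CHAPTER node. The output messages are illustrative only.

    CHAPTER {
        node n;
        /* plain node expression: first child called P, no backtracking */
        n = ./P;
        if (n!=nil) output("first P child has ID "++n:ID++"\n");
        /* foreach: every P at any depth below this CHAPTER */
        foreach n (./.../P)
            output("found P with ID "++n:ID++"\n");
    }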
    

To search through the document parse tree to find nodes that match arbitrary expressions, use the built-in functions: ancestor(), descendant(), following(), preceding().

3.4 Built-in Variable Functions

var char(var e)
The char() function converts a variable expression into a single character.
    var letterA = char(65);
var element(node n)
The element() function returns the element name from the supplied node if it contains an element.
    output("<"++element(.)++">");
var exit(var code,var messg)
The exit() function terminates the PRX interpreter immediately, printing any message provided in the second argument on the standard error channel. The exit code returned to the operating system is taken from the first parameter.
    exit(1,"Fatal error.\n");
var float(var e)
The float() function converts a variable expression into a floating-point number. A run-time warning message is issued for illegal conversions.
    var fnum = float("1.5");
var format(var fmt,var val)
The format() function performs formatted conversions from values into strings. The first argument is a 'C' style printf format string containing at most one '%' format specification. The second argument contains the value to be converted and formatted.
    output(format("Answer=%5d\n",val));
var integer(var e)
The integer() function converts a variable expression into an integer. A run-time warning message is issued for illegal conversions.
    var inum = integer("134");
var iselement(node n)
Tests to see if the node is an element (returns 1) or is text (returns 0).
    if (iselement(./.))
        output("First child is element "++element(./.));
    else
        output("First child is text "++text(./.));
    
var istext(node n)
Tests to see if the node is text data (returns 1) or is an element (returns 0).
    if (istext(./.)) output("node has content\n");
var log10(var e)
Returns the logarithm base 10 of the expression.
    var logval = log10(val);
var newattribute(node n,var name)
The newattribute() function temporarily permits a new attribute on the given node with the given name. It does not change the document DTD, so this is only useful in suppressing the error message normally indicated when an unknown attribute is set. No attribute of the given name is actually added to the node until an assignment statement for the attribute is made.
    newattribute(.,"NEWATTR");
    .:NEWATTR = "Y";
    
var numchild(node n)
The numchild() function returns a count of the number of child nodes directly below the given node. Children can be either elements or text data.
    for (i=1;i<=numchild(.);i+=1)
        if (istext(./.[i])) output(text(./.[i]));
    
var output(var e)
The output() function outputs expressions to the standard output. Note that this output can interfere with the XML output if that too is directed to the standard output.
    output("Count = "++count);
var pow10(var e)
Returns 10 to the power of the expression (antilogarithm).
    var d1000 = pow10(3.0);
var random()
Returns a pseudo-random number in the range 0 <= x < 1. The random seed is initialised to the system time at the first call.
    var digit = integer(10*random());
var sizeof(array-name)
The sizeof() function returns the number of elements allocated to the given array. For two-dimensional arrays, the value returned is the product of the number of rows and the number of columns.
    var table[100];
    var tsize = sizeof(table);  /* returns 100 */
    
var streamclose(var stream)
Closes the specified stream. No further I/O operations are possible on the stream.
    streamclose(ip);
var streamget(var stream)
Reads a character from the specified stream. Returns -1 on end of file.
    var c = streamget(ip);
var streamgetline(var stream)
Reads a line from the specified stream. Lines are terminated by '\n' characters. Returns "" (the empty string) on end of file.
    var line = streamgetline(ip);
var streamopen(var fname,var mode)
Opens a stream to the supplied filename or URL. A mode of 0 opens the stream for reading, a mode of 1 opens the stream for writing. Only file URLs may be opened for writing. Returns 0 on error.
    var ip = streamopen("http://www.phon.ucl.ac.uk/",0);
    var ip2 = streamopen("file:test.txt",0);
    var op = streamopen("output.txt",1);
    
var streamput(var stream,var expr)
Writes a character, integer, floating-point number or string to the specified output stream. Conversions from numbers to strings are performed automatically. Use the char() function to guarantee writing a single character, or the format() function to control number formatting. Returns the number of characters actually written, or 0 on error.
    var op = streamopen("output.txt",1);
    var cnt = streamput(op,char(32));
    streamput(op,"This is a line\n");
    streamclose(op);
    
var string(var e)
The string() function converts a variable expression into a string.
    var filename = "temp"++string(count);
var text(node n)
Returns the text associated with the node n, if it contains text data.
    output("Text="++text(n)++"\n");

3.5 Built-in Node Functions

node ancestor(node start,var match())
Searches the ancestors of node start for the first node for which the supplied match() function returns a non-zero value. You supply a match function to make whatever judgements you wish on nodes. The match function is passed a single argument, a reference to a node and should return 0 or 1 depending on whether the node is unsatisfactory or satisfactory.
    var isaP(node n)
    {
    	if (element(n)=="P")
    		return(1);
    	else
    		return(0);
    }
    
    B {
    	/* find containing P element */
    	node p = ancestor(.,isaP);
    	if (p==nil) return;
    	.:HREF = p:ID;
    }
    
node appendchild(node parent,var elname)
Creates a new empty element with the name elname and appends it as the last child of the parent node. A reference to the new node is returned. If the interpreter is currently processing a document with a DTD, the name is validated against the DTD.
    node n = appendchild(.,"P");
    n:ALIGN = "LEFT";
    
node appenditem(node parent,node item)
Adds the node item (and its children, recursively) after the last current child of the parent node. A node reference to the newly appended item is returned. This is a distinct copy of the original item. Items can be constructed from newitem(), appendchild(), or from parse().
    BODY {
    	node sig=parse("<P>Author: Mark Huckvale</P>","xml","");
    	appenditem(.,sig);
    } 
    
deleteitem(node n)
Unlinks the given node from its parent and deletes it. The node referenced by n no longer exists after this function is called.
    deleteitem(sig);
node descendant(node start,var match())
Searches the descendants of node start for the first node for which the supplied match() function returns a non-zero value. You supply a match function to make whatever judgements you wish on nodes. The match function is passed a single argument, a reference to a node and should return 0 or 1 depending on whether the node is unsatisfactory or satisfactory.
    var isjustified(node n)
    {
    	if (n:ALIGN=="FULL")
    		return(1);
    	else
    		return(0);
    }
    
    CHAPTER {
    	node n = descendant(.,isjustified);
    	while (n!=nil) {
    		n:ALIGN = "LEFT";
    		n = descendant(.,isjustified);
    	}
    }
    
node following(node start,var match())
Searches the nodes (other than descendants) that come after the node start for the first node for which the supplied match() function returns a non-zero value. You supply a match function to make whatever judgements you wish on nodes. The match function is passed a single argument, a reference to a node and should return 0 or 1 depending on whether the node is unsatisfactory or satisfactory.
    var eltype;
    var issametype(node n)
    {
    	if (element(n)==eltype)
    		return(1);
    	else
    		return(0);
    }
    
    . {
    	eltype=element(.);
    	node n = following(.,issametype);
    	if (n!=nil) output("Found later "++element(.)++"\n");
    }
    
node newitem(var elname)
Creates a new empty item of name elname and returns a reference to it. If the interpreter is currently processing a document with a DTD, the name is validated against the DTD.
    node new = newitem("CHAPTER");
node parse(var str,var type,var elname)
Creates a new item by parsing XML or text passed as a string argument. If the type argument has the value "xml", then the str argument contains at most one item expressed in XML which is parsed and returned. If the type argument has the value "text" then the str argument is broken up at whitespace and each component returned as a child of a new parent element. If the elname argument is supplied then it is used to name the parent element. If the elname argument is blank then a text parse will use an element name of TEXT.
    node newP=parse("<P ALIGN='LEFT'>Hello <B>World!</B></P>","xml","");
    node tokens=parse("proxml.exe 2.0 110236 13-April-2001","text","FILE");
    node n;
    foreach n (tokens/.) output("Token='"++text(n)++"'\n");
    
node preceding(node start,var match())
Searches the nodes (other than ancestors) that come before the node start for the first node for which the supplied match() function returns a non-zero value. You supply a match function to make whatever judgements you wish on nodes. The match function is passed a single argument, a reference to a node and should return 0 or 1 depending on whether the node is unsatisfactory or satisfactory.
    var eltype;
    var issametype(node n)
    {
    	if (element(n)==eltype)
    		return(1);
    	else
    		return(0);
    }
    
    . {
    	eltype=element(.);
    	node n = preceding(.,issametype);
    	if (n!=nil) output("Found earlier "++element(.)++"\n");
    }
    
node sortchildren(node parent,var compfunc())
Re-orders the immediate child nodes of the parent node according to the decisions of a supplied comparison function. The comparison function is of the form var compfunc(node n1,node n2) and should return -1 if node n1 sorts earlier than node n2, return +1 if node n2 sorts earlier than n1, and return 0 if the order does not matter. The comparison function can use any criteria for making this decision, basing it on element names, attribute values or on children of those nodes.
    var sortbyelement(node n1,node n2)
    {
    	if (element(n1)<element(n2))
    		return(-1);
    	else if (element(n1)>element(n2))
    		return(1);
    	else
    		return(0);
    } 
    / {
    	/* sort children by element name */
    	sortchildren(.,sortbyelement);
    }
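
Because attribute values are strings, a comparison function that orders nodes by a numeric attribute should convert the values first. The sketch below assumes a hypothetical LIST element whose children are all elements carrying a numeric RANK attribute; neither name comes from this document.

    /* order the children of a LIST by numeric RANK attribute */
    var sortbyrank(node n1,node n2)
    {
        var r1 = integer(n1:RANK);   /* compare as numbers, not strings */
        var r2 = integer(n2:RANK);
        if (r1<r2)
            return(-1);
        else if (r1>r2)
            return(1);
        else
            return(0);
    }
    LIST {
        sortchildren(.,sortbyrank);
    }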
    

3.6 Miscellaneous

Constants may be assigned with the syntax const <identifier> = <value>;, as in:

    const NUMLINES=100;
    const FILENAME="out.lst";
    

Comments may be 'C'-style or 'C++' style:

    /* ignore this text */
    // ignore this text up to the end of the line
    

Include Files: Other PRX source files may be included in the script to provide a simple library mechanism with include "<filename>", as in:

    include "myfunc.prx";
    

Include files are searched for in this order: source directory, current directory, "/usr/local/include".

The assert() procedure is an aid to debugging. It takes an expression and optionally a message string and terminates the program with the message if the expression evaluates as false. When the assert fails, the program counter is printed and this can be tracked back to individual script statements using the map option of the interpreter.

    node getsecondchild(node n)
    {
    	assert(n!=nil,"get second child passed nil node");
    	assert(numchild(n)>1,"get second child invalid");
    	return(n/.[2]);
    }
    

4. Examples

Traverse entire tree printing out element names and text data (in indented format)

    /* traverse.prx -- ProXML script to traverse entire tree */
    var report(node n,var depth)
    {
        var    i,num;
    
        for (i=1;i<=depth;i+=1) output(" ");
    
        if (istext(n)) {
            output("\"" ++ text(n) ++ "\"\n");
        }
        else {
            output(element(n) ++ "\n");
            num=numchild(n);
            for (i=1;i<=num;i+=1) report(n/.[i],depth+1);
        }
    }
    
    / {
        report(.,0);
    }
    

Set the ID attribute of each node according to the number of descendants it has

    /* descount.prx -- count number of descendants of each node */
    
    var numdescendant(node n)
    {
        var num = numchild(n);
        var cnt = num;
        for (var i=1;i<=num;i+=1) cnt += numdescendant(n/.[i]);
        return(cnt);
    }
    
    . {
        :ID = numdescendant(.);
    }
    

Store and access nodes from an array:

    /* leaves - put <P> nodes into a global array and access linearly */
    
    node ntable[1000];
    var  ncount=0;
    
    / {
        node    n;
    
        /* count number of <P> descendants */
        ncount=0;
        foreach n (./.../P) ncount += 1;
    
        /* resize array to fit */
        resize(ntable,ncount);
    
        /* descend to all <P> nodes assigning indexes */
        ncount=0;
        foreach n (./.../P) {
            ntable[ncount] = n;   // save node reference
            n:ID = ncount;        // set ID attribute for later
            ncount += 1;
        }
    }
    
    P {
        var    num=:ID;
    
        /* print text in this and previous and next nodes */
        output("\nThis: /"++text(./.)++"/\n");
        if (num > 0)
            output("Prev: /"++text(ntable[num-1]/.)++"/\n");
        if (num+1 < ncount)
            output("Next: /"++text(ntable[num+1]/.)++"/\n");
    }
    

5. Running the PRX interpreter

The ProXML interpreter, PRX, has the following command line:

    prx (-I) (-o outfile|-O|-n) (-m mapfile) script.prx *.xml
    

The switches are as follows:
    -I          Print the current version number of the PRX interpreter.
    -o outfile  Send all XML output to the file outfile.
    -O          Send all XML output back into the same file it came from.
    -n          Suppress all XML output; do analysis only.
    -m mapfile  Output a mapping of program counter values to source line numbers. This is useful for finding the source lines corresponding to assert() failures and run-time warnings.

If the input XML filename is just '-', then the XML is read from the standard input channel. In this way, the PRX interpreter can function in a pipeline:

    xmlgenerate | prx script.prx - | xmldosomething
    

6. Current Limitations

Functionality to come:

  • Access to substrings.
  • Ability to call element functions on newly generated parse trees.
  • Facility to move structure from place to place in parse tree.

Restricted functionality:

  • No functionality to read or modify the DTD
  • Loads whole XML file into memory before processing
  • The 'foreach' statement is heavy on stack use: it is possibly limited to only about 1000 matches.
  • Variable names must be different from element names and attribute names.
  • There is no parameter type checking on function arguments.
  • Huge memory leaks to be plugged
  • Arrays may not be assigned as wholes, nor passed as parameters to functions


© 2001 Mark Huckvale University College London