SFS Manual for Programmers

3. Data File Processing

3.1

Archetypal Program Structure

Data Analysis Programs

The example program in section 1.1 is a data analysis program in that it takes data from one or more input items and calculates some numerical values. Programs for data analysis have an archetypal structure that is described in the following paragraphs. Programs that generate new data sets, i.e. that create output items, are described subsequently.

The archetypal structure for a data analysis program is:

revision history
manual page
program name & version
include files
manifest constants
global variables
main program:
  local variables
  decode options
  open data file
  locate input data set
  create buffer area to hold data records
  process data records
  close files
  free memory
  exit

The revision history is a record of the origination and changes to the program source, including details of any fixed bugs.

The manual page is an up-to-date nroff format manual page describing the use of the program, see section 4.1.

The program name & version are machine readable definitions of the program state, used for status and error reporting.

The include files section will have requests for the inclusion of 'sfs.h', the standard SFS data structure definitions, as well as standard headers such as 'stdio.h'.

The manifest constants consist of definitions of processing variables that affect the operation of the analysis. For example, it would include definitions of threshold values. These definitions are usually performed using the 'C' pre-processor statement '#define'.

The global variables section normally includes SFS item headers and buffer pointers, so that they are accessible to all routines.

The decode options section uses the standard routine getopt(3) to perform decoding of command-line switches, see section 4.4.

The open data file section refers to the location of the data file and its opening for access. As in the example program, the pathname of a file is found from the routine sfsfile(SFS3), while the opening of the file can be performed by sfsopen(SFS3):

fid = sfsopen(filename,"r",&head);

sfsopen() returns an integer (a file descriptor) that identifies the data file to subsequent routines. The file descriptor is a small positive integer for a successful open, and a small negative number for an error. The second argument is a request for an access mode (here "r" means reading only). The third argument is for the return of the file's main header if required.

The locate input data set component is the process of locating an input data set in the opened file. This is performed using the SFS routine sfsitem(SFS3). For example:

found = sfsitem(fid,datatype,match,&item) 

where fid is the file descriptor returned by sfsopen(), datatype is the major type indicator of the data set required, match is a character string identifying the item to be located, and item is a variable of type item_header where the item header of the located data set is to be placed. sfsitem() returns 1 on success and 0 on failure. The file descriptor fid is now located on a particular data set, and calls to read from fid will return frames from that data set. The file descriptor can be repositioned to a different data set by further calls to sfsitem(). If two data sets are required to be input simultaneously, two file descriptors are required.

The create buffer area component consists of the dynamic allocation of memory area to hold data from the located data set. This could be of a size to hold a single frame or large enough for all the frames in the data set. Dynamic memory allocation of SFS data areas is performed by sfsbuffer(SFS3). This routine takes two arguments, the item header for the data set and the size of the required buffer in frames. Thus to create a buffer to hold a single annotation:

struct an_rec *an;
an = (struct an_rec *) sfsbuffer(&anitem,1); 

or to hold an entire speech data set:

short *sp;
sp = (short *) sfsbuffer(&spitem,spitem.numframes);

In the example program, the input item location, buffer allocation and loading were all performed by the single routine getitem(SFS3), but this is not always the most memory-efficient method.

To process data records, it is necessary to locate individual frames of data in the data set. This is performed using sfsread(SFS3). This routine takes four arguments:

sfsread(fid,start,num,buff) 

where fid is the located file descriptor, start is the starting index of a frame, and num is the number of frames to be read. buff is the address of the data buffer created by sfsbuffer(). sfsread() returns the number of frames actually read, so a returned value of zero indicates an error condition (or end-of-data). Thus to read the single annotation at frame 12:

sfsread(fid,12,1,an); 

(note frame numbering starts at 0, so this is actually the 13th frame). Or to read an entire speech data set:

sfsread(fid,0,spitem.numframes,sp);

To close files, a call is made to sfsclose(SFS3).

Memory buffers may be freed using free(3).

Data Processing Programs

The archetypal structure of a program that generates a new data set from an input item in the same file extends the structure described in the last section. The main program structure would now be something like:

main program:
  local variables
  decode options
  open data file
  locate input data set
  create buffer area to hold input records
  create output item header
  open output channel to file
  create buffer area to hold output records
  process data records
  close input file
  update input file
  free memory
  exit

The open data file component would still be a call to sfsopen(), but with a request for update access (using the "w" flag). If a file is write-protected, it is useful to find this out before processing takes place.

To generate a new item in a data file, it is first necessary to create an output item header. The SFS routine sfsheader(SFS3) should be used to initialise most of the fields in an item header structure. The routine is called with the address of the output item header, and a set of parameters detailing the data type, the data format, the frame duration, the item offset, etc. The item header history field and params field must be set up after the call to sfsheader().

With the new item header initialised, it is then necessary to open an output channel to the data file where the new item is to be placed. The routine sfschannel(SFS3) takes a file name and an item header and returns a file descriptor that may be used for writing data to the file. The data is in fact buffered in a temporary area, and not written to the file until a call to sfsupdate(SFS3).

The new item header can now also be used to create a buffer area for the output records, in the same way as the buffer created for input records.

The processing of data records will now involve both the reading and writing of records. The reading of records will be performed by sfsread() using the input file descriptor opened by sfsopen() and located by sfsitem(). The writing of records is performed by sfswrite(SFS3) using the file descriptor returned by sfschannel(). sfswrite() takes three arguments: the output file descriptor, the number of frames to be written, and the address of the buffer area (as returned by sfsbuffer()) where the output data has been stored.

A processing loop might look like:

while input data available:
  read input data records
  process records
  write records to output
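
The loop above can be sketched in C as follows. So that the example stands alone, demo_read() and demo_write() are hypothetical in-memory stand-ins that imitate only the return conventions described for sfsread() and sfswrite() (the number of frames actually transferred); they are not the library routines, and the "processing" (halving each sample) is arbitrary.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 4   /* frames per block; kept tiny for the example */

/* Hypothetical in-memory stand-in for a located input data set. */
struct demo_source {
    const short *data;
    int numframes;
};

/* Stand-in reader: copies up to num frames starting at frame index
 * start and returns the number actually read (0 at end of data). */
int demo_read(const struct demo_source *src, int start, int num, short *buff)
{
    int avail = src->numframes - start;
    if (start < 0 || avail <= 0) return 0;
    if (num > avail) num = avail;
    memcpy(buff, src->data + start, (size_t)num * sizeof(short));
    return num;
}

/* Stand-in writer: appends num frames to out, returns frames written. */
int demo_write(short *out, int *outcount, int num, const short *buff)
{
    memcpy(out + *outcount, buff, (size_t)num * sizeof(short));
    *outcount += num;
    return num;
}

/* Read, process and write the data block by block.
 * Returns the total number of frames written, or -1 on a write error. */
int process_all(const struct demo_source *src, short *out)
{
    short buff[BLOCK];
    int start = 0, outcount = 0, n, i;

    while ((n = demo_read(src, start, BLOCK, buff)) > 0) {  /* read    */
        for (i = 0; i < n; i++)
            buff[i] = (short)(buff[i] / 2);                 /* process */
        if (demo_write(out, &outcount, n, buff) != n)       /* write   */
            return -1;
        start += n;
    }
    return outcount;
}
```

Working in fixed-size blocks keeps the memory requirement independent of the length of the data set; the final block is simply shorter than the rest.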

Once all the output data has been written, the input file descriptor may be closed as in the data analysis structure. The output file descriptor should not be closed with sfsclose() however. Instead, a call to the file update routine sfsupdate() will add all output items with currently opened file descriptors to the given file name. That is, sfsupdate() searches for output file descriptors opened on the file by sfschannel() and adds the data written down those channels to the data file on the disk.

The following sections will expand upon the use of the SFS library routines for processing SFS data files.

3.2

Opening/Closing Files

The routine sfsopen(SFS3) takes the name of an SFS file and returns a 'file descriptor' for use in accessing the data sets in the file. This file descriptor is used in subsequent calls to routines that locate and read data sets. sfsopen() takes three arguments: (i) the filename, (ii) a character string specifying the access mode required, and (iii) a pointer to a 'main_header' structure where the main header of the file may be returned.

The access mode string may take the following values:

    "r"  read main header, check file ok for reading, 
         return file descriptor.
    "w"  read main header, check file ok for update, 
         return file descriptor.
    "h"  write main header, return success code.
    "c"  create new file using main header supplied, 
         return success code.

The main_header pointer may be supplied as NULL for any mode, in which case default action is taken. Note that an access mode of "w" does not open the file for writing, it merely checks that the access permissions are ok for writing to the file later in the program.

Once opened, the file descriptor is associated with a particular position in the file. This position may be changed with sfsitem() or sfsnextitem(), see below. When programs require access to two independent data sets in a file, it is more efficient to duplicate an existing file descriptor rather than to open the file again. File descriptors may be duplicated with sfsdup(SFS3), which takes as its single argument an opened file descriptor.

sfsopen() and sfsdup() return negative values on error.

Thus the following code opens a file for reading and duplicates the file descriptor so that more than one data set may be accessed simultaneously:

int fid1,fid2;
/* : */
if ((fid1=sfsopen(filename,"r",NULL)) < 0)
  error("access error on '%s'",filename);
fid2=sfsdup(fid1);

To close file descriptors at the end of the program or so that file descriptors may be re-used, use sfsclose(SFS3). Note that there is a system-dependent maximum number of file descriptors that may be open at any one time.

3.3

Locating Items

Once a file descriptor has been opened on a file, it must be located on an item before any data may be read. The location of an item is usually performed by sfsitem(SFS3), which takes four arguments: (i) the file descriptor for the file to be searched, (ii) the item data type to be located (there are manifest constants declared in 'sfs.h' for SP_TYPE, FX_TYPE, etc), (iii) the item history match string, and (iv) a pointer to an item_header structure where the item header for the data set is to be returned.

The item history match string should be supplied in the format returned by itspec(SFS3); briefly the string contains (i) a decimal ASCII coding of the item subtype, e.g. "02", which locates an item with the given number, or (ii) the string "0", which locates the last item of the given type, or (iii) a history match string, e.g. "*" or "spectran(*)", in the conventions of histmatch(SFS3).

If the file is to be scanned for items that match some more complex criteria than allowed for by sfsitem(), the routine sfsnextitem(SFS3) is supplied. sfsnextitem() takes two arguments: (i) the file descriptor, and (ii) a pointer to an item_header structure. It locates the item following the one currently accessed by the file descriptor. On its first call after sfsopen(), sfsnextitem() locates the first item in the file; subsequent calls will locate the file descriptor at the second and subsequent items. To restart the scan at the first item in the file, call sfsnextitem() with a NULL pointer instead of the item_header pointer. The next call to sfsnextitem() will then locate the file descriptor on the first item in the file.

Calls to sfsitem() and sfsnextitem() can be intermixed. The routines return 1 on success and 0 on failure.
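
The scanning pattern can be sketched as below. The example does not use the SFS library: demo_item, demo_file and demo_nextitem() are hypothetical stand-ins that imitate only the behaviour described above (1 on success, 0 when no item remains, a NULL header pointer to restart the scan). The value returned for a restart is an assumption of the sketch. The selection criterion, a limit on frame duration, is an example of a test that is not among the sfsitem() match options listed in this section.

```c
#include <stddef.h>

/* Hypothetical cut-down item header; the real one is declared in 'sfs.h'. */
struct demo_item {
    int    datatype;
    int    subtype;
    double frameduration;
};

/* Hypothetical in-memory "file" with a scan position. */
struct demo_file {
    const struct demo_item *items;
    int count;
    int next;     /* index of the next item to be returned */
};

/* Stand-in for the described sfsnextitem() behaviour. */
int demo_nextitem(struct demo_file *f, struct demo_item *item)
{
    if (item == NULL) {      /* restart scan (return value assumed) */
        f->next = 0;
        return 1;
    }
    if (f->next >= f->count)
        return 0;            /* no more items */
    *item = f->items[f->next++];
    return 1;
}

/* Locate the first item of the given type whose frame duration does
 * not exceed maxdur. Returns 1 with *item filled in, or 0 if none. */
int find_item(struct demo_file *f, int datatype, double maxdur,
              struct demo_item *item)
{
    demo_nextitem(f, NULL);              /* restart at the first item */
    while (demo_nextitem(f, item)) {
        if (item->datatype == datatype && item->frameduration <= maxdur)
            return 1;
    }
    return 0;
}
```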

3.4

Data Buffering

Once a file has been opened and the appropriate item located, the routine sfsread(SFS3) may be called to read data into the program. First however it is commonly necessary to create dynamic buffer space to hold the data. It is recommended that all buffer space is allocated dynamically using the routine sfsbuffer(SFS3). There are two reasons for using sfsbuffer() over the use of static data or dynamic data allocated by malloc(3) or calloc(3): (i) by using dynamic data, the amount of data able to be processed is limited by system resources rather than by artificial limits, and (ii) items with variable size structures (structured items) require buffer areas with a pre-set internal structure for data storage.

sfsbuffer() takes two arguments: a pointer to the item header for the data set, and a count of the size of the buffer in frames. It returns a pointer to a suitably initialised buffer of sufficient size or the value NULL on error.

All buffers returned by sfsbuffer() can be considered to be arrays of the primitive structures defined for the item type. Thus the item type for speech items is 'short', and buffers are arrays of 'short':

short* speech;
speech=(short *)sfsbuffer(&spitem,1000);

This holds even when the primitive record for an item type is itself a structure. Take, for example, the definition of co_rec, the basic component of a COEFF type:

struct co_rec {
  long posn;
  long size;
  long flag;
  float mix;
  float gain;
  float *data;
};

Buffers of coefficient type are arrays of these records:

struct co_rec * cobuff;
cobuff=(struct co_rec *)sfsbuffer(&coitem,1000);

and parts of the structure can be accessed as

cobuff[0].posn, cobuff[i].data[j], etc

However, when the buffer area is of length 1 remember to first de-reference the pointer:

struct co_rec * corec;
corec=(struct co_rec *)sfsbuffer(&coitem,1);
corec->size = corec->posn + 100;
corec->data[j] = 0.0;

When data buffers are no longer required they should be discarded using free(3).
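
To illustrate what a buffer with a "pre-set internal structure" could look like, the sketch below allocates an array of records and their data areas in one block and points each data field at its own slice. This is only an assumption made for illustration: the manual does not state how sfsbuffer() lays out memory, and demo_rec and demo_buffer() are hypothetical names. The layout is, however, consistent with discarding the whole buffer with a single call to free(3).

```c
#include <stdlib.h>

/* Hypothetical record with the same shape as the co_rec shown above. */
struct demo_rec {
    long   posn;
    long   size;
    long   flag;
    float  mix;
    float  gain;
    float *data;
};

/* Allocate nframes records plus nelem floats per record in ONE block,
 * with each data pointer pre-set to its own slice of that block.
 * Returns NULL on error. One free() releases records and data alike. */
struct demo_rec *demo_buffer(int nframes, int nelem)
{
    size_t recbytes, databytes;
    struct demo_rec *buf;
    float *area;
    int i;

    if (nframes <= 0 || nelem <= 0)
        return NULL;
    recbytes  = (size_t)nframes * sizeof(struct demo_rec);
    databytes = (size_t)nframes * (size_t)nelem * sizeof(float);
    buf = (struct demo_rec *)calloc(1, recbytes + databytes);
    if (buf == NULL)
        return NULL;
    area = (float *)((char *)buf + recbytes);   /* data follows records */
    for (i = 0; i < nframes; i++)
        buf[i].data = area + (size_t)i * (size_t)nelem;
    return buf;
}
```

With such a layout, buf[i].data[j] can be used immediately after allocation, which a plain malloc(3) of the record array alone would not allow.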

3.5

Reading Data

The routine sfsread() takes four arguments: the opened and located file descriptor, a starting frame index for data transfer, the number of frames to transfer, and a pointer to the buffer where the data is to be placed. It returns the number of frames actually read. Thus the following is a prototypical piece of code for accessing an item one frame at a time:

struct item_header item;
struct co_rec * co;
int fid;
int i;
/* : */
if ((fid=sfsopen(filename,"r",NULL)) < 0)
  error("access error on '%s'",filename);
if (!sfsitem(fid,CO_TYPE,"0",&item))
  error("cannot find coefficient item in '%s'",filename);
if ((co=(struct co_rec *)sfsbuffer(&item,1))==NULL)
  error("cannot get buffer for input data",NULL);
for (i=0;sfsread(fid,i,1,co)==1;i++) {
  /* process record co */
}

Here access to a coefficient item is used as an example.

The routine calls required to load an entire data set into memory are:

fid=sfsopen(filename,"r",NULL);
sfsitem(fid,it,ty,&item);
buff=sfsbuffer(&item,item.numframes);
sfsread(fid,0,item.numframes,buff);
sfsclose(fid);

These calls occur often enough to be packaged into a special routine, called getitem(SFS3):

getitem(filename,it,ty,&item,&buff);

When supplied with a filename, an item type, an item history match string, a pointer to an item_header and the address of a buffer pointer, getitem() returns the item header of the located data set, plus the address of a buffer in memory where the data has been loaded.

3.6

Creating New Items

An item header contains many fields; the routine sfsheader(SFS3) is a convenient way to initialise all but two of them. sfsheader() takes 10 arguments detailing the most relevant item header fields, but also resets all of the fields in the header to default values. Thus sfsheader() should always be called to initialise a new item header for output.

The arguments taken by sfsheader() are:

struct item_header *item;  /* output item header */
int datatype;              /* item.datatype field */
int floating;              /* item.floating field */
int datasize;              /* item.datasize field */
int framesize;             /* item.framesize field */
double frameduration;      /* item.frameduration field */
double offset;             /* item.offset field */
int windowsize;            /* item.windowsize field */
int overlap;               /* item.overlap field */
int lxsync;                /* item.lxsync field */

The item.history and item.params fields are reset ready for initialisation by separate code.

Example calls for 'unstructured' data types are:

sfsheader(&item,SP_TYPE,0,2,1,0.0001,0.0,1,0,0);
sfsheader(&item,TX_TYPE,0,4,1,0.0001,0.0,0,0,1);
sfsheader(&item,TR_TYPE,1,4,1,0.01,0.0,1,0,0);

For 'structured' types with a fixed framesize, it is necessary to know the size of the 'header' portion of each record. This is stored in a global structure called sfsstruct[] and accessed by item type. sfsstruct[] holds the size of the fixed part of an item data record in bytes: thus sfsstruct[CO_TYPE]=20, and sfsstruct[SP_TYPE]=0. An appropriate call to sfsheader() to initialise a coefficient item header would be:

sfsheader(&item,CO_TYPE,-1,4,sfsstruct[CO_TYPE]/4+framesize,
0.001,0.0,0,0,1);
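
The framesize arithmetic in this call can be written out as a small helper. The sketch reads the expression sfsstruct[CO_TYPE]/4+framesize as counting the frame in units of the data size (here 4 bytes), which is an interpretation of the call above rather than a statement from the library documentation; structured_framesize() is a hypothetical name.

```c
/* Frame size, in units of datasize bytes, for a fixed-size structured
 * record: the fixed 'header' part of the record plus nelem data values.
 * Returns -1 if the arguments are invalid or the header is not a whole
 * number of data units. */
int structured_framesize(int headerbytes, int datasize, int nelem)
{
    if (datasize <= 0 || nelem < 0 || headerbytes % datasize != 0)
        return -1;
    return headerbytes / datasize + nelem;
}
```

For a coefficient record with the 20-byte fixed part quoted above and 10 data values of 4 bytes each, the helper gives 20/4 + 10 = 15.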

Annotation items have a variable framesize; this is indicated by a -1 value in the item.framesize field:

sfsheader(&item,AN_TYPE,-1,1,-1,0.0001,0.0,0,0,1);

The creation of an output item header is completed by the initialisation of the item.history and item.params fields described in the next section.

3.7

Item Histories

The item_header fields 'history' and 'params' must be initialised with parameters of the process and parameters of the data set respectively. Details of the program name, the input item numbers and the setting of command line options are given in the 'history' field. Other parameters of the data set such as maximum and minimum frequency are given in the 'params' field.

The format of the history field has been described in detail in section 2.2. Briefly the history field consists of:

  1. program name
  2. (optional) item type mnemonic
  3. (optional) list of input item numbers
  4. (optional) list of processing parameters

The routine sprintf(3) is a suitable method for initialising the item history, e.g.:

sprintf(opitem.history,"%s(%d.%02d;flag=%d)",
  PROGNAME,
  ipitem.datatype,ipitem.subtype,
  processflag);

or

sprintf(opitem.history,"%s/FX(%d.%02d,%d.%02d;param=%d%s)",
  PROGNAME,
  ipitem1.datatype,ipitem1.subtype,
  ipitem2.datatype,ipitem2.subtype,
  processparam,
  (flag==1)?",flagged":"");

The 'params' field can be initialised in a similar way:

sprintf(item.params,"minf=%g,maxf=%g",
  0.0,0.5/spitem.frameduration);

Be aware that the history field has a fixed maximum length of 256 bytes, while the params field has a fixed maximum length of 128 bytes.
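
Because sprintf(3) does not check the size of its destination, a long program name or parameter list could overrun these fixed-length fields. The following sketch uses the standard C99 routine snprintf(3) to format a history string within a stated limit and to report truncation. HISTORY_MAX and format_history() are hypothetical names; whether the 256 bytes quoted above include the terminating null is not stated here, so the sketch simply treats the limit as the size of the destination array.

```c
#include <stdio.h>

#define HISTORY_MAX 256   /* history field length quoted above */

/* Format "prog(type.subtype;flag=n)" into dst without overrunning it.
 * Returns 0 on success, -1 if the text would not fit in dstlen bytes
 * (including the terminating null) or on a formatting error. */
int format_history(char *dst, size_t dstlen, const char *progname,
                   int datatype, int subtype, int flag)
{
    int n = snprintf(dst, dstlen, "%s(%d.%02d;flag=%d)",
                     progname, datatype, subtype, flag);
    return (n < 0 || (size_t)n >= dstlen) ? -1 : 0;
}
```

A program could call format_history(opitem.history,HISTORY_MAX,PROGNAME,...) and abort with an error message when -1 is returned, rather than store a silently truncated history.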

3.8

Output Channels

Once an item header has been created for a new data set, it is necessary to obtain a file descriptor for storing the results of processing in a temporary file before updating the data file. A temporary file is created and a file descriptor returned by the routine sfschannel(SFS3). This routine takes two arguments: the filename of the final destination file for the data set (probably the name of the input data file), and the output item header as initialised in the last two sections. If a temporary file can be opened, sfschannel() returns an opened file descriptor:

int ofid;
/* : */
if ((ofid=sfschannel(filename,&opitem)) < 0)
  error("cannot open output channel to '%s'",filename);

Note that there is a maximum number of file descriptors allowed to be open at any one time. File descriptors returned by sfschannel() should not be closed by sfsclose(); they are automatically closed when the data set is added to the input file with sfsupdate(), see below.

When a program aborts execution with a call to error(SFS3), all temporary files created with sfschannel() are deleted.

3.9

Writing Data

With the file descriptor returned by sfschannel() it is now possible to write out processed data using the routine sfswrite(SFS3). It may be necessary however to first create a buffer to hold output records. A dynamically allocated data buffer suitable for holding SFS data records can be created by a call to sfsbuffer() in the same way as was done for creating a buffer for reading data records (section 3.4).

The routine sfswrite() takes three arguments: (i) a file descriptor returned by sfschannel(), (ii) the number of frames of data to write, and (iii) the address of a buffer where the data records are stored. It returns the number of frames actually written.

Thus a piece of code to write out coefficient records would look like:

#define NELEMENT 10        /* # data values in co_rec */
struct item_header ipitem; /* input item header */
struct item_header coitem; /* output item header */
struct co_rec * co;        /* output item data record */
int ofid; /* output file descriptor */
int i;    /* input item record counter */
/* : */
/* create output item header */
sfsheader(&coitem,CO_TYPE,-1,4,sfsstruct[CO_TYPE]/4+NELEMENT,
  ipitem.frameduration,ipitem.offset,0,0,1);
sprintf(coitem.history,"%s(%d.%02d)",
  PROGNAME,
  ipitem.datatype,ipitem.subtype);
sprintf(coitem.params,"minf=%g,maxf=%g",
  0.0,0.5/ipitem.frameduration);
/* open output channel */
ofid = sfschannel(filename,&coitem);
/* get buffer */
co = (struct co_rec *)sfsbuffer(&coitem,1);
/* process records */
for (i=0;i<ipitem.numframes;i++) {
  /* ... process input data */
  /* write output record */
  if (sfswrite(ofid,1,co) != 1)
    error("write error on output file",NULL);
}

At present it is not possible to retrieve data committed by sfswrite() to a temporary file with a call to sfsread().

3.10

Updating Files

Once all the input data has been processed, and the results written to one or more temporary files using sfschannel() and sfswrite(), the data can be added back into the input file as new data sets with a single call to sfsupdate(SFS3).

The internal operation of sfsupdate() is complex and is briefly described in the reference section. Its use, however, is very simple: it takes the name of the destination file for the items as its only parameter. The data in all temporary files opened by sfschannel() on the given filename is appended to the file. The item.numframes and item.length fields in the item headers for the data sets are set automatically.

Since there is a maximum number of opened file descriptors it may be necessary to call sfsupdate() periodically inside processing loops that generate many data sets.

The sequence of routine calls for updating a file with a data set is:

fid = sfschannel(filename,&item);
sfswrite(fid,numframes,buff);
sfsupdate(filename);

This sequence is also available in a single routine: putitem(SFS3):

putitem(filename,&item,numframes,buff);

A sister routine putNitems() may be used to save a number of output items with only one data file update.

3.11

Linking Items

Link items have the appearance of normal data sets to processing programs but in fact consist of an item header and a link header only, with the data set existing in some other file.

The creation of a link item in a file is straightforward using the routine sfswritelink(SFS3):

sfswritelink(&item,numf,&link,filename)

This routine takes an item header describing the source data set, the number of frames in the data set, a link header detailing where the source data set may be found and the file into which the linked item is to be placed. sfswritelink() uses the standard output channel mechanism. When all processing is complete, the link item can be committed to the file along with other items generated by the program using sfsupdate() as normal.

The fields in the link header must be initialised independently in the user program. Details of the fields in a link header were described in section 2.3. In the following example, the first 100ms of a speech item located in file 'srcfile' is used as the basis for a linked item in the file 'dstfile':

/* data structures */
struct item_header spitem;
struct item_header opitem;
struct link_header link;
/* .. */
/* locate input item in source file */
fid = sfsopen(srcfile,"r",NULL);
sfsitem(fid,SP_TYPE,"0",&spitem);
/* init output item header */
sfsheader(&opitem,spitem.datatype,spitem.floating,
  spitem.datasize,spitem.framesize,
  spitem.frameduration,0.0,
  spitem.windowsize,spitem.overlap,
  spitem.lxsync);
sprintf(opitem.history,"%s(%d.%02d)",
  PROGNAME,
  spitem.datatype,spitem.subtype);
/* init link header */
strcpy(link.filename,pathname(srcfile)); /* absolute pathname */
strcpy(link.filepath,"");                /* not used */
link.filetype=1;                         /* SFS file */
link.datatype=spitem.datatype;           /* speech */
link.subtype=spitem.subtype;             /* entry number */
link.offset=0;                           /* starting at 0 */
link.multiplex=0;                        /* not multiplexed */
link.linkdate=time((long *)0);           /* current time */
link.swab=0;                             /* bytes not swapped */
link.dcoffset=0;                         /* no DC offset */
link.shift=0;                            /* no binary shift */
link.machine=SFSMACHINE;                 /* machine type */
/* add link to destination file */
sfswritelink(&opitem,(int)(0.1/spitem.frameduration),
  &link,dstfile);
sfsupdate(dstfile);
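
One detail of the example deserves care: the cast (int)(0.1/spitem.frameduration) truncates towards zero, so a quotient that comes out fractionally below a whole number in binary floating point loses a frame. For instance, 0.3/0.1 evaluates to just under 3 in IEEE double arithmetic and would truncate to 2. A minimal sketch of a helper that rounds to the nearest frame instead is given below; frames_in() is a hypothetical name and not part of the SFS library.

```c
/* Number of frames in 'duration' seconds at 'frameduration' seconds
 * per frame, rounded to the nearest whole frame.
 * Returns -1 for invalid arguments. */
int frames_in(double duration, double frameduration)
{
    if (frameduration <= 0.0 || duration < 0.0)
        return -1;
    return (int)(duration / frameduration + 0.5);
}
```

The frame count in the example could then be written as frames_in(0.1,spitem.frameduration).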


© 2000 Mark Huckvale University College London