UCL Department of Phonetics and Linguistics

Introduction to Computer Programming with MATLAB

Extra Lecture : Application Development

Objectives

 

By the end of the session you should:

q       have more experience of how functions and control statements are used in the construction of programs

q       have more experience with the processes of software design, development and test

Outline

 

1. Program Specification

 

In this lecture we will develop a simple ‘in-memory’ database application for storing data 'values' indexed by 'keys'.  The specification is as follows:

  1. Data values are strings of characters
  2. Index keys are strings of characters
  3. Both values and keys can be held as fixed length strings
  4. Each value can have one or more keys
  5. Each key can point to one or more values
  6. We need a function to initialise the database
  7. We need a function to add a record consisting of a value with its keys
  8. We need a function to find all values matching a given key

 

2. Program Design

 

We will limit ourselves to simple MATLAB tables to store the data.  This means that we will need one table to hold all the different values, one table to hold all the different keys, and one table to hold the association of keys with values.  Other organisations would have been possible.

 

We will deal with the keys and the values as fixed length strings.  This means that the tables will then have a fixed number of columns.  However, we will need a function to pad the keys to the width of the key table, and the values to the width of the value table.

 

We will need these top-level functions:

 

dbinitialise(maxsizevalue,maxsizekey)

dbaddrecord(value,key,key,…)

values=dbfindrecords(key)

 

The dbinitialise() function will set up or clear the tables for the database.  To initialise these it needs to know how many columns to use to hold the largest value and the largest key.

 

The dbaddrecord() function will take a value as a string and one or more keys as strings and add them to the database.

 

The dbfindrecords() function will take a key as a string and return an array of all the matching values in the database.

 

This is how we expect it to work

 

dbinitialise(10,10);

dbaddrecord('huckvale','mark');

dbaddrecord('bloggs','fred','arthur');

dbaddrecord('smith','fred');

dbaddrecord('smith','andrew','mark');

disp(dbfindrecords('mark'));

huckvale

smith

 

Notice that these commands also test out different aspects of the functionality of the program, including the use of multiple and duplicate keys.

 

The data tables themselves need to be hidden 'inside' the functions and shared between them This suggests a set of global variables.  Each function will contain the same global data statement which will give it access to the three tables and their current sizes, as follows:

 

global dbkey, dbnumkey, dbindex, dbnumindex, dbvalue, dbnumvalue

 

Note here also that by making all the names of functions and variables start with a common prefix, we reduce the possibility of clashes with MATLAB or user names.

 

The dbinitialise() function will create the data tables and initialise the current number of values stored in the tables to zero.

 

We are ready to design the function to add a record.  We see that it comprises three more basic operations: add a value to the value table, add a key to the key table, and associate the two using the index table.  Furthermore we will only need to call the first function once, but the last two functions repeatedly for each key.  This suggests we break up this function into more basic functions.  We might write some 'pseudocode' for the idea:

 

ptr=dbaddvalue(value)

while (keyslefttoadd)

    code=dbaddkey(key)

    dbaddindex(code,ptr)

end

 

We could also break down the database retrieval function into more basic functions, like

 

code=dbfindkey(key)

ptrs=dbfindindex(code)

vals=dbvalue(ptrs)

 

In the last line we are using the list of pointers (ptrs) to retrieve all the matching values from the value table.

 

3. Implementation

 

function dbinitialise(maxsizevalue,maxsizekey)

% dbinitialise(maxsizevalue,maxsizekey) creates a new database

% for values up to size maxvaluesize, and keys up to size

% maxkeysize

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

dbkey=char(zeros(1,maxsizekey));

dbnumkey=0;

dbindex=zeros(1,2);

dbnumindex=0;

dbvalue=char(zeros(1,maxsizevalue));

dbnumvalue=0;

return;

 

We initialise the tables to a single row containing zeroes so that the correct number of columns is established.  The 'num' variables tell us how many entries there are in each table, and also where the next value can be put in each table.

 

function code=dbaddkey(key)

% code=dbaddkey(key) adds a key to the key table

% returning a numeric code

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

% pad the key

key=pad(key,size(dbkey,2));

%

% see if already in table

for i=1:dbnumkey

    if (strcmp(dbkey(i,:),key))

        code=i;

        return;

    end

end

%

% add to table

dbnumkey=dbnumkey+1;

dbkey(dbnumkey,:)=key;

code=dbnumkey;

return;

          

In this function we just run down the table checking to see if the key value is in the table already.  If it isn't we add it.

 

function p=pad(s,n)

% p=pad(s,n) pads string s with spaces to make it n char long

p=[s char(abs(' ')*ones(1,n)) ];

p=p(1:n);

 

This function pads out any string with spaces until it is a given width.  This may mean the string is truncated, too.

 

function ptr=dbaddvalue(value)

% ptr=dbaddvalue(value) adds a value to the value table

% returning a data pointer

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

% pad the record

value=pad(value,size(dbvalue,2));

%

% add to end of table

dbnumvalue=dbnumvalue+1;

dbvalue(dbnumvalue,:)=value;

ptr=dbnumvalue;

return;

 

Values are always added to the end of the table, we don't bother checking for duplicate values.

 

function dbaddindex(code,ptr)

% dbaddindex(code,ptr) links a key code to a value pointer

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

% add code, ptr pair to end of index table

dbnumindex=dbnumindex+1;

dbindex(dbnumindex,1)=code;

dbindex(dbnumindex,2)=ptr;

return;

 

Each index table row associates two numbers: the numeric code for the key in column 1, and the numeric code for the value in column 2.

 

function dbaddrecord(value,varargin)

% dbaddrecord(value,key,key,...) adds a record to the database

% indexed by each of the supplied keys

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

ptr=dbaddvalue(value);

for i=1:length(varargin)

    code=dbaddkey(char(varargin(i)));

    dbaddindex(code,ptr);

end

return;

 

This function gets the numeric pointer for the value and a numeric code for each key and adds the {code, pointer} pairs to the index table.  To cope with the variable number of input arguments, we use the 'varargin' variable – this will contain a vector of all the arguments after the first argument.  We can then take each key argument in turn, convert it to a string and add it as a key.

 

function code=dbfindkey(key)

% code=dbfindkey(key) finds code for key or returns 0 for not found

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

% pad the key

key=pad(key,size(dbkey,2));

%

% see if in table

for i=1:dbnumkey

    if (strcmp(dbkey(i,:),key))

        code=i;

        return;

    end

end

code=0;

return

 

This function runs down the key table looking for a match.  It returns the key code, or zero if the key is not found.

 

function ptrs=dbfindindex(code)

% ptrs=dbfindindex(code) finds all record pointers indexed

% by a given code

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

% select from table

ptrs=[];

for i=1:dbnumindex

    if (dbindex(i,1)==code)

        ptrs=[ptrs dbindex(i,2)];

    end;

end;

return;

 

This function runs down the index table looking for entries matching the key code.  It builds a vector of matching value pointers.

 

function values=dbfindrecords(key)

% values=dbfindrecords(key) returns all records indexed by key

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

code=dbfindkey(key);

if (code > 0)

    ptrs=dbfindindex(code);

    if (length(ptrs) > 0)

        values=dbvalue(ptrs,:);

    else

        values=[];

    end

else

    values=[];

end;

return;

 

This function looks up the given key then if found looks up the list of value pointers.  If there are some values to retrieve it gets them directly from the value table.

 

4. Testing

 

This function does a very basic test of the functionality:

 

function dbtest

dbinitialise(10,10)

dbaddrecord('huckvale','mark');

dbaddrecord('bloggs','fred','arthur');

dbaddrecord('smith','fred');

dbaddrecord('smith','andrew','mark');

disp(dbfindrecords('mark'));

 

This test estimates how quickly it works

 

function dbperf

% dbperf get some performance measures for database

%

dbinitialise(10,10);

%

% generate 1000 records

keys=char(zeros(1000,10));

vals=char(zeros(1000,10));

for i=1:1000

    keys(i,:)=sprintf('%06d%04d',fix(100000*rand(1,1)),i);

    vals(i,:)=sprintf('%06d%04d',fix(100000*rand(1,1)),i);

end;

%

% add 1000 records

tic;

for i=1:1000

    dbaddrecord(vals(i,:),keys(i,:));

end;

fprintf('Took %g seconds to add 1000 records\n',toc);

%

% look up 1000 records

tic;

for i=1:1000

    v=dbfindrecords(keys(i,:));

    if ((length(v)==0)|~strcmp(v,vals(i,:)))

        error('incorrect value returned!');

    end;

end;

fprintf('Took %g seconds to find 1000 records\n',toc);

 

The MATLAB functions 'tic' and 'toc' run an internal stopwatch.

 

5. Refinement

 

Our dbfindindex function does not make use of MATLAB's functionality for dealing efficiently with matrices.  If we use the MATLAB find() function for searching a table, the determination of the pointers that match a given code can be simplified to:

 

function ptrs=dbfindindex(code)

% ptrs=dbfindindex(code) finds all record pointers indexed

% by a given code

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

% select from table

ind=find(dbindex(:,1)==code);

ptrs=dbindex(ind,2);

return;

 

More optimisation could be achieved by keeping the index table sorted by key code value, then it would be quicker to find rows containing a particular code.

 

A significant source of inefficiency is the fact that we have to scan the key table when adding a key to see if it has been added already, and when finding a key to get the corresponding code.  If we know in advance the maximum number of different keys, we can simply store each key in a particular row in the table, then directly test to see if that key is present or not.  The idea is to build an empty table large enough to store the maximum number of keys, then compute a 'hash' value which turns a key into a row number of that table.  If that row is empty, then the key is not in the table, if that row contains the correct key then we have found the correct row, if that row contains a different key then we have a collision (two keys with the same hash code) so we scan forward in the table until we find an empty record or the correct key.

 

First we change the initialise function to add a parameter indicating the maximum number of keys:

 

function dbinitialise(maxsizevalue,maxsizekey,maxkeys)

% dbinitialise(maxsizevalue,maxsizekey,maxkeys) creates a new database

% for values up to size maxvaluesize, and up to maxkeys keys up to size

% maxkeysize

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

dbnumkey=fix(1.5*maxkeys);

dbkey=char(zeros(dbnumkey,maxsizekey));

dbindex=zeros(1,2);

dbnumindex=0;

dbvalue=char(zeros(1,maxsizevalue));

dbnumvalue=0;

return;

 

To prevent too many collisions, we actually make the hash table 50% bigger than needed.

 

To compute the hash value from the key, we multiply the character codes by a set of factors and sum them up.  The key lines are:

 

% calculate a hash value

fac=99*(1:size(dbkey,2));

code=1+rem(fac*abs(key)',dbnumkey);

 

Almost any computation would work, but ideally you want a hash computation which gives you very few collisions. Notice how we use the rem() function to ensure that the hash value is a valid row index.

 

The code to add a key to the hash table then becomes:

 

function code=dbaddkey(key)

% code=dbaddkey(key) adds a key value to the database

% returning a numeric code

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

% pad the key

key=pad(key,size(dbkey,2));

%

% calculate a hash value

fac=99*(1:size(dbkey,2));

code=1+rem(fac*abs(key)',dbnumkey);

%

% loop until found or empty

while (abs(dbkey(code,1))~=0)

    if (strcmp(dbkey(code,:),key))

        return;

    end

   code=1+rem(code,dbnumkey);

end

%

% not in table, add in

dbkey(code,:)=key;

return;

 

And the code for finding a key in the hash table becomes:

 

function code=dbfindkey(key)

% code=dbfindkey(key) finds code for key or returns 0 for not found

%

global dbkey dbnumkey dbindex dbnumindex dbvalue dbnumvalue

%

% pad the key

key=pad(key,size(dbkey,2));

%

% calculate a hash value

fac=99*(1:size(dbkey,2));

code=1+rem(fac*abs(key)',dbnumkey);

%

% loop until found or empty

while (abs(dbkey(code,1))~=0)

    if (strcmp(dbkey(code,:),key))

        return;

    end

   code=1+rem(code,dbnumkey);

end

code=0;

return;

 

Together, these improvements make the code work about 10 times faster.

Exercises

 

For these exercises, use the editor window to enter your code, and save your answers to files under your account on the central server. When you save the files, give them the file extension of ".m".  Run your programs from the command window.

1.      Write a function 'dbsave(filename)' that saves the database into the named file.

2.      Write a function 'dbload(filename)' that loads the database from the named file.

3.      (Advanced) Write a function 'dbdelete(value)' that removes a value and all its keys from the database.  Ensure that you only delete a key completely if it does not refer to any other value!

4.      (Homework) Implement the database functions in this handout, then modify them so that values can be retrieved with a partial match to the key.  For example, a key value of "ma" would return all values that have keys which start "ma…".