UCL Department of Phonetics and Linguistics

Introduction to Computer Programming with MATLAB

Lecture 6: Manipulating Text

Objectives

 

By the end of the session you should:

•       be able to write simple functions and programs that manipulate text files and tables of strings.

•       be able to re-use a number of simple programming templates for some common programming tasks.

Outline

 

1.      Writing to a text file

 

Saving the results of a computation to a file in text format requires the following steps:

a.         Open a new file, or overwrite an old file, keeping a ‘handle’ for the file.

b.        Print the values of expressions to the file, using the file handle

c.         Close the file, using the file handle

The file handle is just a variable that identifies the open file in your program.  Because each open file has its own handle, you can have several files open at the same time.

 

% open file

fid = fopen('myfile.txt','wt');     % 'wt' means "write text"

if (fid < 0)

    error('could not open file "myfile.txt"');

end;

% write some stuff to file

for i=1:100

    fprintf(fid,'Number = %3d Square = %6d\n',i,i*i);

end;

% close the file

fclose(fid);
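
When the values are already held in a matrix, the loop is not strictly needed: fprintf() re-uses its format string until it has consumed every element, working down each column in turn.  The sketch below relies on that behaviour; the file name squares_demo.txt and the variable names are made up for illustration.

```matlab
% build a 2-by-5 matrix: row 1 holds the numbers, row 2 their squares
n = 1:5;
vals = [n; n.^2];

fname = 'squares_demo.txt';          % hypothetical file name
fid = fopen(fname,'wt');
if (fid < 0)
    error('could not open file "%s"', fname);
end

% the format is applied once per column of vals, giving one line per number
fprintf(fid,'Number = %3d Square = %6d\n',vals);
fclose(fid);

type(fname);                         % show the file contents
```

Each column of vals supplies the two values for one output line, which is why the numbers and squares are stacked as rows rather than columns.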

 

2.      Reading from a text file

 

Reading results from a text file is straightforward if you just want to load the whole file into memory.  This requires the following steps:

a.         Open an existing file, keeping a ‘handle’ for the file.

b.        Read values from the file into a single array, using the file handle

c.         Close the file, using the file handle

The fscanf() function is the inverse of fprintf(): it reads values using a format string and returns them in a matrix.  The matrix is filled column by column, so you can control its 'shape' with a third argument; a size of [3,inf] puts the three values from each line into one column.

 

A = fscanf(fid,'%g %g %g\n',[3,inf])       % A has 3 rows and 1 col per line

disp(A(1,1))          % display first value on first line

disp(A(1,2))          % display first value on second line

disp(A(2,1))          % display second value on first line
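
To see the shape for yourself, the following sketch writes a small file with three numbers per line and reads it back.  The file name triples_demo.txt and the data values are invented for the example.

```matlab
fname = 'triples_demo.txt';          % hypothetical file name

% write two lines: "1 2 3" and "4 5 6"
fid = fopen(fname,'wt');
if (fid < 0)
    error('could not open file "%s"', fname);
end
fprintf(fid,'%g %g %g\n',[1 2 3; 4 5 6]');
fclose(fid);

% read them back: each line of the file becomes one column of A
fid = fopen(fname,'rt');
if (fid < 0)
    error('could not open file "%s"', fname);
end
A = fscanf(fid,'%g %g %g\n',[3,inf]);
fclose(fid);

disp(A);          % 3 rows, 2 columns
disp(A(1,2));     % first value on second line
```

Transposing A (with A') gives the more familiar layout of one row per line.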

 

Thus to read back the data we saved above:

 

% open file

fid = fopen('myfile.txt','rt');     % 'rt' means "read text"

if (fid < 0)

    error('could not open file "myfile.txt"');

end;

% read from file into table with 2 rows and 1 column per line

tab = fscanf(fid,'Number = %d Square = %d\n',[2,inf]);

% close the file

fclose(fid);

rtab = tab';          % convert to 2 columns and 1 row per line

 

Reading a table of strings is more complex, since every row of a character matrix must have the same length.  We can use the fgetl() function to get a line of text as characters, but we'll first need to find out the length of the longest string, then ensure all strings are the same length.  Here is a complete function for loading a text file as a table of fixed-length strings:

 

function tab=readtextfile(filename)

% Read a text file into a matrix with one row per input line

% and with a fixed number of columns, set by the longest line.

% Each string is padded with NUL (ASCII 0) characters

%

% open the file for reading

ip = fopen(filename,'rt');          % 'rt' means read text

if (ip < 0)

    error('could not open file');   % just abort if error

end;

% find length of longest line

maxlen=0;                           % record length of longest string
                                    % (not called max, to avoid hiding the built-in)

cnt=0;                              % record number of strings

s = fgetl(ip);                      % get a line

while (ischar(s))                   % while not end of file

    cnt = cnt+1;

    if (length(s) > maxlen)         % keep record of longest

        maxlen = length(s);

    end;

    s = fgetl(ip);                  % get next line

end;

% rewind the file to the beginning

frewind(ip);

% create an empty matrix of appropriate size

tab=char(zeros(cnt,maxlen));        % fill with ASCII zeros

% load the strings for real

cnt=0;

s = fgetl(ip);

while (ischar(s))

   cnt = cnt+1;

   tab(cnt,1:length(s)) = s;      % slot into table

    s = fgetl(ip);

end;

% close the file and return

fclose(ip);

return;

 

Here is an example of its use:

 

% write some variable length strings to a file

op = fopen('weekdays.txt','wt');

fprintf(op,'Sunday\nMonday\nTuesday\nWednesday\n');

fprintf(op,'Thursday\nFriday\nSaturday\n');

fclose(op);

% read it into memory

tab = readtextfile('weekdays.txt');

% display it

disp(tab);
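
Because the shorter rows are padded, you will usually want to strip the padding before using a single entry.  The sketch below builds a small NUL-padded table by hand (so that it runs on its own) and uses the built-in deblank(), which removes trailing whitespace and NUL characters.

```matlab
% build a 2-row table padded with ASCII zeros
tab = char(zeros(2,9));
tab(1,1:6) = 'Sunday';
tab(2,1:9) = 'Wednesday';

% pull out the first row and strip the trailing padding
name = deblank(tab(1,:));
fprintf('"%s" has %d characters\n', name, length(name));
```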

 

3.      Randomising and sorting a list

 

Assuming we have a table of values, how can we randomise the order of the entries?  A good way of achieving this is analogous to shuffling a pack of cards.  We pick two positions in the pack, then swap over the cards at those two positions.  We then just repeat this process enough times that each card is likely to be swapped at least once.

 

function rtab=randomise(tab)

% randomise the order of the rows in tab.

% columns are unaffected

[nrows,ncols]=size(tab);            % get size of input matrix

cnt = 10*nrows;                     % enough times

while (cnt > 0)

    pos1 = 1+fix(nrows*rand);       % get first random row

    pos2 = 1+fix(nrows*rand);       % get second random row

    tmp = tab(pos1,:);              % save first row

    tab(pos1,:) = tab(pos2,:);      % swap second into first

    tab(pos2,:) = tmp;              % move first into second

    cnt=cnt-1;

end;

rtab=tab;                           % return randomised table

return;
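
As an alternative to repeated swapping, MATLAB's built-in randperm() returns the numbers 1 to n in random order, and these can be used directly as row indices.  This is a different technique from the swapping function in this handout, shown only as a sketch; the table contents are made up.

```matlab
tab = char('Sunday','Monday','Tuesday','Wednesday');
nrows = size(tab,1);

order = randperm(nrows);    % e.g. [3 1 4 2], different on each run
rtab = tab(order,:);        % pick the rows in that order

disp(rtab);
```

Every row appears exactly once in rtab, because order contains each row number exactly once.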

 

Sorting a list is easy if you just want some standard alphabetic ordering.  But what if you want to choose some arbitrary ordering function?  For example, how could you sort strings when case is not important?  Here we use the ability of MATLAB to evaluate a function by name (feval()) so that we can provide the name of a function for doing the comparisons the way we want.  This function should take two rows and return -1 if the first row sorts earlier than the second, 1 if the second row sorts earlier than the first, and 0 if there is no preference.  Here is a case-independent comparison function:

 

function flag=comparenocase(str1,str2)

% compares two strings without regard to case

% returns -1, 0 or 1 if str1 sorts before, equal to or after str2.

len1=length(str1);

len2=length(str2);

for i=1:min(len1,len2)

    c1 = str1(i);

    c2 = str2(i);

    if (('a' <= c1)&&(c1 <= 'z'))

        c1 = char(abs(c1)-32);            % convert lower case to upper

    end;

    if (('a' <= c2)&&(c2 <= 'z'))

        c2 = char(abs(c2)-32);            % convert lower case to upper

    end;

    if (c1 < c2)

        flag = -1;                        % str1 sorts earlier

        return;

    elseif (c2 < c1)

        flag = 1;                         % str2 sorts earlier

        return;

    end;

end;

% strings match up to length of shorter, so

if (len1 < len2)

    flag = -1;                             % str1 sorts earlier

elseif (len2 < len1)

    flag = 1;                              % str2 sorts earlier

else

    flag = 0;                              % no preference

end;

return;
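
The subtraction of 32 works because, in ASCII, each lower-case letter has a code exactly 32 greater than its upper-case partner.  This short sketch checks that arithmetic and compares it with the built-in upper() function; the variable names are chosen for the example.

```matlab
offset = double('a') - double('A');    % distance between the two alphabets

c = 'q';
manual = char(double(c) - offset);     % convert by arithmetic
builtin = upper(c);                    % convert with the built-in function

disp(offset);
disp([manual builtin]);
```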

 

Here is a sort function that might be used with this comparison function.

 

function stab=functionsortrows(tab,funcname)

% sorts the rows of the input table using the supplied

% function name to provide an ordering on pairs of rows

[nrows,ncols]=size(tab);

for i=2:nrows                              % sort each row into place

    j = i;

    tmp = tab(j,:);                        % save row

    % compare this row with the rows above it to see where it goes

    while ((j > 1)&&(feval(funcname,tmp,tab(j-1,:))<0))

        tab(j,:) = tab(j-1,:);            % shift higher rows down

        j = j - 1;

    end;

    tab(j,:) = tmp;                        % put in ordered place

end;

stab = tab;                                % return sorted table

return;
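
If a case-independent alphabetic order is all you need, a shortcut is to sort an upper-cased copy of the table with the built-in sortrows() and use the row order it returns.  This is a sketch of that alternative, not of the function above; the table contents are invented.

```matlab
tab = char('pear','Apple','banana');      % rows are padded with spaces

% sort an upper-case copy, keeping the order in which rows were taken
[sortedupper, order] = sortrows(upper(tab));

stab = tab(order,:);                      % apply that order to the original
disp(stab);
```

The original capitalisation is preserved in stab because only the copy was converted.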

 

4.      Searching a list

 

How might we search a list of items for an item matching a specific value?  If the list is unordered, all we can do is run down the list testing each entry in turn.  This function finds the index of a row in a table that contains (anywhere) the characters in the supplied match string:

 

function idx=findstring(tab,str)

% find the row index containing a matching string

% returns 0 if the string is not found

[nrows,ncols]=size(tab);

for idx=1:nrows

    matches = strfind(tab(idx,:),str);    % start positions of str in this row

    if (~isempty(matches))

        return;

    end;

end;

idx=0;

return;
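
The same linear search can be tried out as a stand-alone script.  The sketch below uses the built-in strfind(), which returns the starting positions of a pattern within a string, or an empty result when there is no match; the table and search string are made up.

```matlab
tab = char('Sunday','Monday','Tuesday','Wednesday');
str = 'nes';

idx = 0;                                  % 0 means "not found"
for r = 1:size(tab,1)
    if ~isempty(strfind(tab(r,:),str))    % does this row contain str?
        idx = r;
        break;                            % stop at the first match
    end
end

fprintf('"%s" first found in row %d\n', str, idx);
```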

 

However, the process can be much faster if the list is sorted and we are searching for an exact match only.  A so-called binary search halves the range still to be searched at every step, so a list of n items needs only about log2(n) comparisons:

 

function idx=binarysearch(tab,val)

% returns the row index of val in sorted table tab
% (as written, the < and > tests suit a single column of numbers)

% returns 0 if val is not found

[nrows,ncols]=size(tab);

lo=1;

hi=nrows;

while (lo <= hi)

    idx = fix((lo+hi)/2);               % midpoint, rounded down to a whole row

    if (val < tab(idx,:))

        hi = idx - 1;

    elseif (val > tab(idx,:))

        lo = idx + 1;

    else

        return;

    end;

end;

idx=0;

return;
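
For a table of strings, comparing two rows with < or > gives one result per character rather than a single answer, so the rows need to be compared at the first character where they differ.  The following is a sketch of a binary search on that basis, assuming a space-padded table sorted with sortrows(); the names key, d and k are introduced for the example.

```matlab
% a sorted table of fixed-length, space-padded strings
tab = sortrows(char('Sunday','Monday','Tuesday','Wednesday', ...
                    'Thursday','Friday','Saturday'));

val = 'Thursday';
key = [val, blanks(size(tab,2)-length(val))];   % pad to the table width

lo = 1;
hi = size(tab,1);
idx = 0;                                  % 0 means "not found"
while (lo <= hi)
    mid = fix((lo+hi)/2);
    d = double(key) - double(tab(mid,:)); % per-character differences
    k = find(d ~= 0, 1);                  % first position that differs
    if isempty(k)
        idx = mid;                        % rows are identical
        break;
    elseif (d(k) < 0)
        hi = mid - 1;                     % key sorts before this row
    else
        lo = mid + 1;                     % key sorts after this row
    end
end

fprintf('"%s" is in row %d\n', val, idx);
```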

 

5. Cell Arrays

 

Many operations with text and tables of strings are made simpler in MATLAB through the use of "cell arrays".  These are a generalisation of MATLAB matrices such that cells can contain any type of object.  This allows MATLAB to manipulate tables of variable length strings.  We will not be going into cell arrays in this course.
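
Just to give the flavour, here is a very small sketch of a cell array holding strings of different lengths.  The variable names and contents are made up.

```matlab
days = {'Wednesday','Sunday','Friday'};   % each cell holds one string

n = length(days{1});                      % curly braces fetch a cell's contents
sorted = sort(days);                      % sort() accepts a cell array of strings

disp(n);
disp(sorted);
```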

Reading

 

MATLAB Online Manual: Using MATLAB: Index of Examples

Exercises

 

For these exercises, use the editor window to enter your code, and save your answers to files under your account on the central server. When you save the files, give them the file extension of ".m".  Run your programs from the command window. You may want to start by implementing the readtextfile() function from this handout.

1.      Write a program (ex61.m) to ask the user to input the name of a text file containing a list of WAV format sound files.  Play these sounds out in random order.

2.      Write a program (ex62.m) to ask the user to input the name of a text file containing a list of general knowledge TRUE/FALSE questions.  Prompt the user with each question in turn and save his/her responses to an output file.

3.      Write a program (ex63.m) to input a word list from a file and then to spell check another text file.  Treat as mis-spelled all words not in the word list, and report a list of all mis-spelled words, ensuring that each mis-spelling is reported once only.

4.      (Homework) Write a program that takes the name of a text file containing a list of WAV format sound files.  Concatenate the audio files into a single WAV file with 3 seconds of silence between them.  Be sure to check that sampling rates are compatible.