/* MWU'S : an ndfa representation */

% ----------------------------------








% ------------


/* comments in Prolog are enclosed within a pair of complex signs; the opening complex sign is a slash immediately followed by a star (/*); the closing sign is a star immediately followed by a slash (*/). The Prolog compiler simply neglects whatever is offered as a comment. Consequently, comments are used mainly to document the program. They can also be used to rem (make inoperative) a section of Prolog code; in that way, by remming and de-remming (commenting out and then decommenting) a bit of code, one can see how the program behaves when that bit of code is included or excluded */



/* comments that do not include a carriage return (one-line comments) can be introduced by the percentage sign (%). There is then no corresponding end-of-comment sign the line end signals the end of the comment */




% -----------------------


/* the style_check is a compiler directive : it tells the Prolog compiler not to complain if a named variable is used only once in the definition of a clause; such named variables can always be replaced without loss by the anonymous variable (noted with an underline : _) whose instantiations Prolog does not compute


the Prolog compiler (helped by the linker) is that part of the Prolog software that turns the users program into executable code */

:- style_check(-singleton).



/* GO STEP */




/* To make the users life easier, its a good idea to provide a go-step, that is to say to define a go predicate without arguments. This predicate should be top-level, i.e. it should call other predicates, but be called by none. The user then starts the whole procedure by simply typing the word go followed by a dot: go.


The go-step should begin by writing on the screen what the program is all about, and then ask the user to enter the relevant data, either at the keyboard or from a file, whose filename the user should then be asked to provide. In the same way, the results of the Prolog program should be stored where the user wants, i.e. in the specific output file that he chooses; if needed, this file should be created by the system */




% GO step


go :- nl,




/* nl is a predefined predicate : it puts in a (carriage return)/(new line) character, i.e. causes subsequent output to appear on a new line */

/* write is also a predefined predicate; it can take as argument a character string enclosed in single quotes */


write('A.Michiels, University of Lige'),nl,nl,

write('Input file? [stdin. or file_name.] --> '),


/* stdin is the Prolog name for the standard input stream, i.e. the keyboard */




/* read is a predefined predicate it takes as argument a Prolog term; here the term is expected to be a Prolog atom, i.e. basically a word beginning with a small letter, which here is supposed to point to a filename without extension, or to be stdin for standard keyboard input */




write('Output file? [file_name.] --> '),




/* concat is a predefined predicate taking three atomic arguments; it concatenates its second argument to its first to build a new atom (its third argument). Suppose the first arg here (the Output variable) is instantiated to test; the second arg is the quoted atom .lst; the third arg will then be test.lst, a quoted atom */






/* THE CUT (!) */

% ---------------



/* The cut (noted !) tells Prolog not to backtrack past the point where the cut is placed; here, if the next predicate fails (i.e. if the predicate start can no longer be satisfied (i.e. if all solutions for the start goal are exhausted)), Prolog will not attempt to find new solutions for concat or for any of the preceding goals involved in the solution of the call to go.


The cut can be used to tell Prolog something like this : if you have reached this point, do not try to reach it via any other way all the decisions that you have taken so far (i.e. in the solution of the parent predicate, here the go predicate) should be regarded as final.


If no cut is used, then in the case of failure Prolog will be able to backtrack to all the choice points in the solution of the current parent goal, attempting to find new ways of solving the predicates that are called as goals by the parent predicate */













% -------------


% Input is from stdin(user's terminal) or from file



% standard input

% ---------------


/* if in the call of the dealwith predicate the Input variable is instantiated to the atom stdin, input is from the keyboard, and we dont have to do anything, i.e. there is no file to be opened */


dealwith(stdin,_):- !.



% input from a file

% ------------------


/* open is a predefined Prolog predicate taking three args:

the first is a filename

the second is the mode in which the file must be handled : the atoms read and write are used to specify whether the file is an input or output file

the third argument is the file handle, the value assigned by the system to the stream; this value is used as stream identifier;


Prolog will address a file through the stream identifier in reading from and writing to the file; the programmer need not know yhe value of the variable used to store it; he can simply use the variable name and Prolog instantiation will ensure that the right stream is being addressed */








/* START */

% ---------



% standard input

% --------------



% ------------------------


/* The repeat-fail pair is used to build a repeat-fail LOOP;

such a loop will ensure that all the predicates between the repeat and the fail will be called repeatedly and endlessly

such a loop should therefore include an ESCAPE HATCH, i.e. a call which allows the program to leave the loop, by halting or aborting the system


in this particular instance it is provided by the call to the abort predicate in the first definition of process when the input sentence is the single word stop the latter will be converted to the one-element list [stop], the output stream will be closed and the program aborted, passing control to the Prolog system itself

the repeat-fail loop works in the way it does because of the definition of the two predicates building it:

repeat is a predefined predicate which behaves as if it were defined by the following two clauses:


repeat :- repeat.


i.e. every time a new called is made to repeat (because of the backtracking induced by the fail predicate at loop closure) repeat succeeds with a new invocation to itself guaranteed to succeed by virtue of the repeat fact (the second clause in the definition)

fail is also a predefined predicate, which works as if it were a predicate without any definition, i.e. without the chance of ever succeeding. It is therefore guaranteed to fail; backtracking is triggered by the failure of the fail predicate and will be allowed to involve all predicates up to the repeat one, which will always succeed, making sure that backtracking does not involve any of the preceding predicates (the ones preceding the call to the repeat predicate */




start(Outlist,_,stdin) :-




write('Key in your sentence or stop. to quit'),



% getting the sentence to be parsed from the user's terminal






% input from a file

% ------------------


start(Outlist,HandleIn,Input) :-

Input \= stdin,















% ----------


/* getsentence is discussed below; it turns the user input string into a list of atoms built out of the words making up the string; for instance it turns

the teacher bore the brunt of the cuts.









/* Note that what we close here is a stream, not a file; although the result of closing a stream is of course to close the file that was used to feed this stream, it is the stream that needs to be closed; we do this by using the predefined predicate close with as single argument the variable used to point to the stream */


/* abort is a predefined predicate aborting the users program and returning control to the Prolog reader (top-level). The Prolog reader can itself be sent back home by the use of the halt predicate. */


% stopping...



process(Sentence,Lists) :-

Sentence = [stop],




/* a simpler, but slightly less conspicuous, version of this code would put the [stop] list in the arg list of the call to process; unification then succeeds only if the arg in first position, i.e. the position of the Sentence variable, is [stop] and we no longer need the check performed by the built-in unification predicate (namely =) :


process([stop],Lists) :- close(Lists), abort. */



/* the process predicate calls the ndfa abstract machine leaving the mwu-name as a variable whose value is to be worked out by the ndfa machine itself; if ndfa succeeds, then Mwu_name will get instantiated to the name of the multi-word unit exemplified in the users string; such a name is an atom permitting the identification of the multi-word unit (bear_the_brunt, for instance, will point to the whole family of bear the brunt and its variants (carry the brunt etc.) */


/* the results worked out by the ndfa will be written to the stream identifying the users selected output file; the Output_Stream variable is passed by simple unification */



% normal processing



process(Word_List,Output_Stream) :-










% -----


/* stepping down a word list in search of a multi-word unit */


/* note that since Mwu_name is an uninstantiated variable in the call to the ndfa predicate, the latter will be allowed to explore any multi-word unit graph until it finds one that succeeds (which it then returns at the beginning of its results list), or none (in which case no results at all are returned, and the sentence is passed by silently, the repeat-fail loop ensuring that the next sentence is presented for examination */


/* ndfa tries to find an initial state in any multi-word unit graph that belongs to the data base this call will succeed with a very large number of solutions; there will be as many solutions as there are initial states belonging to mwu-graphs; the fact that we have only two is because we are looking only for two mwu families, and each of them is limited to a single initial state */


/* once we have such an initial state in such a mwu graph, we must attempt to traverse the graph, starting at the initial state and confronting the graph with the word list resulting from the transformation of the users string into a wordlist performed by getsentence */


/* if such a traversal succeeds, the top of the results list will include the mwu name and the initial state that the successful traversal started from remember that there can be more than one */


ndfa(Mwu_name,Word_List,[Mwu_name,initial(Initial_State)|Mwu]) :-









% -----------------



/* a standard move involves following an arc in the graph by way of a given transition; the transition involves a lemma which must match the word currently being traversed in the list traversal route specified by the set of transitions */


/* we ensure that we work our way through the input word list by always looking at the first element of that list and returning the remainder of the list for examination by a fresh call to the traverse predicate; in that way we ensure that the list to be looked at grows shorter by one element each time a transition succeeds; we will be allowed to leave the traversal loop only when we reach a final state in the graph and are left with an empty list as current word list the job of checking for such a condition is performed by the third definition of traverse */


/* the matching of word and graph element is accomplished by the link predicate */


/* notice that the results list consists mainly in a list of Prolog terms with the transition functor and a four-arg list, each element of which is itself a functor with a single argument; such a term gets added on top of a list whose remainder (the Lemma_List argument) will be instantiated by further calls to the traverse predicate */


/* suppose we are looking at the first word of the input list [the,exams,are,getting,nearer], namely the atom the

the variables will be instantiated as follows:


Mwu_name : bear_the_brunt
State : q0
Word : the
Word_List: [exams,are,getting,nearer]

Next_State : q0

Word : the
Lemma : any_Word
Lemma_List : left uninstantiated */









of_category(Lemma))|Lemma_List]) :-








% ---------------


/* in a silent move we step down the graph without stepping down the input list in this case, the Word_List variables instantiation is left untouched; we consequently find the Word_List variable passed untouched to the fresh (recursive) call to the traverse predicate at the close of the predicate definition */


/* the silent transition from one state to the next should of course be specified as such, i.e. as silent, within the specs for the transition itself we use the atom silent_move as last argument to the transition predicate, in the arg where we normally store the lemma to be matched by the input string word being traversed */


% silent move






silent_move)|Lemma_List]) :-










/* A predicate that receives a recursive definition (i.e. one in which the predicate being defined is called in the body of the definition) is liable to lead to trouble (endless recursive calls building up a stack which soon overflows the memory allocated to the stack, i.e. the list of goals still waiting to be called) if steps are not taken to make sure that:


-          there is a non-recursive definition that will succeed in time

-          the recursive goal is called with an argument that tends towards the argument needed by the non-recursive call


In this case the non-recursive definition of traverse calls for an empty input word list; since the recursive call is one element shorter than the goal that calls it (one element has been taken off the list, the one to be examined by the call to the link predicate), we are moving slowly but surely, i.e. one word at a time, towards the empty list as instantiation of the Word_List arg in the recursive call to the traverse predicate */

/* when we reach the end of the word list (when we are left with an empty word list), we must be in a final state, otherwise the graph we are looking at has not been completely traversed; the condition works the other way as well: we cant have hit a final state with words still to be dealt with in the input list since we take care of the non mwu constituents at either end of the mwu exemplifying string (this is the use of the any_Word atom which is specified to be matchable by anything (use of the uninstantiated variable in the relevant transition)) */


% end of recursion - final state


traverse(Mwu_name,Final_State,[],[final(Final_State)]) :-







% -------------------------


/* The set of transition predicates for a given mwu defines the graph to be traversed while matching the input string turned into a word list on the one hand and the mwu representation on the other */


/* The mwu name is just a name it is not meant to reveal the internal structure of the mwu (that is the job of the whole graph, i.e. the whole set of transition predicates for that mwu), but to be mnemonic and to allow identification of the matched mwu without having to look at the traversal itself */



/* The state names are also purely conventional any atom will do it is of course the use of the same state names in several transitions, sometimes as from-state and sometimes as to-state, that builds up the very graph */


/* the last arg of transition is an atom ; it will be the job of the link predicate to match that atom with the word being looked at in the input word list */



/* bear_the_brunt : the transitions */









transition(bear_the_brunt,q1, q2, a).



transition(bear_the_brunt,q1,q4, brunt).






/* play_havoc_with : the transitions */






















% ------------------------------------------------




/* an input word matches the condition on a transition if either


it is morphologically related, i.e. the input string word is a morphological variant of the lemma specified as condition on the transition, e.g. bearing can match bear specified as transition condition in the following transition, for instance




(whether the match takes place at all is also conditioned by the state configuration the transition specifies also that the abstract machine should be in state q0 when looking at bearing)




the input word is morphologically linked to an element that belongs to the word category list making up the category specified as required by the transition predicate e.g. causing is morphologically related to cause and cause belongs to the word list associated with the havoc-verb category, so that the input word causing can match the requirement havoc-verb specified for instance in the following transition:






the input word is anything at all and the transition condition is the atom any_Word; the can match any_Word by virtue of the following transition for instance:



(again, whether it does depends on whether the match is attempted at all here, we must be in the graph associated with play_havoc_with, looking at the when in state q4) */




link(Word,Lemma) :- morph(Word,Lemma).

link(Word,Category) :- listof(Category,Category_List),



link(_,any_Word). % anything matches any_Word




/* morphological variants */



morph(causes, cause).

morph(caused, cause).


morph(creates, create).

morph(created, create).







morph(makes, make).

morph(made, make).


morph(wreaks, wreak).

morph(wreaked, wreak).



morph(plays, play).

morph(played, play).



morph(bears, bear).

morph(bore, bear).




morph(takes, take).

morph(took, take).




morph(carries, carry).

morph(carried, carry).



morph(catches, catch).




morph(feels, feel).




morph(faces, face).





/* no morphological variation */


/* the following is a way of not having to specify for each lemma that it is also a morphological variant.

Take is a lemma, i.e. a dictionary form, but it is also the indicative present tense form, except for the third present singular form, which is takes */








/* initial and final states */














/* the category lists */








/* membership of a list - identical to pre-defined member predicate */


/* a member (here we call the predicate inlist, but it is fully equivalent to the pre-defined member predicate) of a list is either the first element of the list, or a member of the tail of the list, a condition which is checked for by means of a recursive call to member; the recursive call complies with the requirements mentioned above for such calls : by losing an element at each call, the arg tends to look more and more like the arg of the non-recursive call, namely the empty list */




inlist(Element,[_|Tail_List]):- inlist(Element,Tail_List).




/* printing a list */

% -------------------


/* when write takes two arguments, the first is the stream to which the writing is directed, the second the element to be written; here, we use write with two args to write to the output file, and with one arg to write to the default output stream, i.e. the screen; we therefore write the same things twice */


/* writelist writes each element on a separate line; what is being written here are the various elements of the results files, i.e. Prolog terms; they neednt be quoted, not being string, but well-formed Prolog terms */


/* the recursive handling is typical of lists */


writelist([H|T],Handle) :- write(Handle,H),write(H),nl,nl(Handle),












%% reads in a S from the user's terminal or from a file

%% standard in the literature


% this can be regarded as a black box, i.e. we need only know its input and output

% but we may feel uneasy about this, so why not try to understand how it works?


/* the queer thing here is that characters seem to get read in in advance of their treatment; this is called working with a look-ahead character; we can decide what to do with it once we have read in the next char ; this procedure is useful because a character, once read in, can no longer be re-placed on the input stream to get read in again */



/* input is a string of words separated by word delimiters and ending in a string delimiter

word and string delimiters belong to the parameters used by the black box

they can be changed at will according to the specifications of the language to be handled

for instance, an apostrophe should be regarded as a word delimiter for French, but not for English

dont is best handled as a single word, later rewritten as the pair do not if necessary

but jimagine is best handled directly as two words, j and imagine, j being the elided form of je


the characters are referred to in this program by means of their ascii codes an ascii table is readily to be found in computer books; one is provided as an appendix to this file */



/* note that the second argument to getsentence is the name of the input stream; in the case of standard, keyboard input, the stream name is the atom user (a Prolog convention)


get0 is a predefined predicate that reads in a single character from the input stream



getsentence(Wordlist,Handle) :- get0(Handle,Char),



/* getrest uses the character read in by get0 to start building the word list

when it reaches the end of the string the ascii code corresponding to the char read is that of a sentence delimiter, and the process stops */


/* sentence delimiters */



getrest(Handle,46,[]):-!. %% period

getrest(Handle,63,[]):-!. %% ?


/* here we specify the string (sentence) delimiters ; if we want to use the exclamation mark as an end of sentence mark, we add to the code the following line:


getrest(Handle,33,[]):-!. %% !





/* when getrest is passed a word delimiter, it restarts getsentence, storing in the first arg the list of words read in so far */


getrest(Handle,32,Wordlist) :- !,getsentence(Wordlist,Handle). % 32 is ascii for the space char

getrest(Handle,45,Wordlist) :- !,getsentence(Wordlist,Handle). % 45 is ascii for the hyphen



/* when passed a normal character (i.e. one that does not end a word or the whole string), getrest calls getletters to collect the remaining letters so as to be able to build the word.


The actual word building is done by the predefined predicate name, which turns an atom into a list of characters, or the other way round, a list of characters into an atom, depending on which of its args is instantiated when the call is made; the list of chars is presented as a list of ascii codes; for instance the following call yields true in the Prolog interpreter:


name(but,[98,117,116]). */


/* the Word is then added as first element of a list whose remainder will be built by a further call to getrest, which starts with Nextchar as look-ahead character, i.e. a character read in which will trigger the appropriate action: end of word treatment or simply character collection in order to build up a new word to be added to the current word list */









/* end of word treatment */

% -------------------------


getletters(H,46,[],46):-!. %% period


getletters(H,32,[],32):-!. %% space


getletters(H,45,[],45):-!. %% hyphen used as word delimiter


getletters(H,63,[],63):-!. %% ?


/* here we specify the word delimiters, which include the sentence delimiters; to use the apostrophe as a word delimiter we should add the following two lines:


getrest(Handle,39,Wordlist) :- !,getsentence(Wordlist,Handle).

getletters(H,39,[],39):-!. %% apostrophe */



/* character collection */

% ------------------------


/* the ascii code corresponding to the character read is stored, and the remaining chars are recursively read in by getletters; the actual reading in of a character from the input stream is the job of the get0 predefined predicate, as we have seen */












/* the transform predicate can be used to transform any character in the input string


the following call ensures that capital letters are turned into small letters; it exploits the fact that there is a constant relation between the ascii code for a capital letter and that for the corresponding small letter, namely that the ascii code of the small letter is the sum of 32 and that of the corresponding capital letter, e.g. the ascii code for C is 67 and that for c is 67+32, i.e. 99 */


%% transform(C,C1):-C>64,C<91,!,C1 is C+32. no capital letters

%% notice that this bit of code is remmed, i.e. inoperative because commented out


/* we can also use transform as a filter; an ascii code gets converted into itself if it corresponds to a printing character, i.e. its ascii code is superior to 31; if it is not, transform simply fails */


transform(C,C) :- C>31. %% chucks non-printing chars out


/* note the calls to transform in the last two definitions for the getletters predicate above;

if the character is a printable one (if the call to transform succeeds, that is), the character is added to the word being read in; Let is (most of the time vacuously) transformed into Let1, which gets added to the word by being included as first element of the [Let1|Letters] list; the cut ensures that a successful call to transform will not be attempted again with the same char


if the character is not printable, the call to the first getletters fails, because transform itself fails

the second getletters is called, in which a new character is read in, and the non-printing character is simply forgotten, i.e. nothing is done with it */


% ----------------------------------------------------------------



/* appendix : list of ascii codes

% ---------------------------------


32 33 ! 34 " 35 # 36 $ 37 % 38 & 39 ' 40 ( 41 ) 42 * 43 +

44 , 45 - 46 . 47 / 48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7

56 8 57 9 58 : 59 ; 60 < 61 = 62 > 63 ? 64 @ 65 A 66 B 67 C

68 D 69 E 70 F 71 G 72 H 73 I 74 J 75 K 76 L 77 M 78 N 79 O

80 P 81 Q 82 R 83 S 84 T 85 U 86 V 87 W 88 X 89 Y 90 Z 91 [

92 \ 93 ] 94 ^ 95 _ 96 ` 97 a 98 b 99 c 100 d 101 e 102 f 103 g

104 h 105 i 106 j 107 k 108 l 109 m 110 n 111 o 112 p 113 q 114 r 115 s

116 t 117 u 118 v 119 w 120 x 121 y 122 z 123 { 124 | 125 } 126 ~ 127 

128 129 130 131 132 133 134 135 136 137 138 139

140 141 142 143 144 145 146 147 148 149 150 151

152 153 154 155 156 157 158 159 160   161 162 163

164 165 166 167 168 169 170 171 172 173 174 175

176 177 178 179 180 181 182 183 184 185 186 187

188 189 190 191 192 193 194 195 196 197 198 199

200 201 202 203 204 205 206 207 208 209 210 211

212 213 214 215 216 217 218 219 220 221 222 223

224 225 226 227 228 229 230 231 232 233 234 235

236 237 238 239 240 241 242 243 244 245 246 247

248 249 250 251 252 253 254 255