title:Word Break

issue:Multi-value Solutions - Dec 97

author:Nathan Rector

company:Natec Systems

email:nater@northcoast.com

http:www.northcoast.com/~nater

In the last few issues I've primarily covered indexing information that is hard to work with. I've talked about creating a common format for addresses and how to index and search them. I've also talked about soundexing information so that a search can find it even when it is misspelled.

In this issue I'll cover how to index a string with more than one word in it. This is a common problem I've found when setting up inventory files. Most inventory files have a short description of an item and an expanded description. Neither of these is normally one word long, and the expanded description can run to more than one sentence.

For example, an inventory file may have a short description of "Black and Decker Vacuum" and an expanded description of "Black & Decker Vacuum, 5 HP, with hose attachment". If a user wants to display all the items that match "Black and Decker Vacuum with hose", this program helps create an index that can be searched.

Most people break a sentence down by parsing the string on spaces, but other characters, such as periods and commas, also act as delimiters in strings of words.
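To illustrate why spaces alone are not enough, here is a minimal sketch in Python (not the BASIC used in the listing). The function name `split_words` is hypothetical, and the delimiter set is an assumption modeled on the WILD.CARD string in the listing.

```python
import re

# Delimiter characters modeled on the listing's WILD.CARD string.
# The exact set is an assumption made for this illustration.
DELIMITERS = "!@#$%^&*()+=-|/\"{}[];:?<>.,~` "


def split_words(text):
    """Upper-case the text and split it on any run of delimiter characters."""
    pattern = "[" + re.escape(DELIMITERS) + "]+"
    return [piece for piece in re.split(pattern, text.upper()) if piece]


description = "Black & Decker Vacuum, 5 HP, with hose attachment"
print(description.upper().split())  # space-only split keeps "VACUUM," and "&"
print(split_words(description))     # punctuation is treated as a delimiter
```

A space-only split leaves the comma attached to "VACUUM," and keeps "&" as a word, so neither would match what a user types in a search.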

Since this program is used both in creating the index string and in creating the list of words to search for, the program needs to be able to handle abbreviations and shorthand.

The first step in the program is to find the first word. Once it has the word, it checks to see if it is an abbreviation. There are also times when a word should be ignored, for example 'a', 'the' or 'in'.
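The two checks can be sketched as follows. This is an illustrative Python version, not the author's BASIC. The name `normalize` is hypothetical, and the tables are a small sample drawn from the ABBRV and IGNORE.WORD values in the listing (the real abbreviation table is also read from a control file).

```python
# Sample tables only; the listing builds ABBRV from a control item plus
# a few Roman numerals, and IGNORE.WORD holds a longer list.
ABBREVIATIONS = {"I": "ONE", "II": "TWO", "III": "THREE", "IV": "FOUR",
                 "VI": "SIX", "NO": "NORTH"}
IGNORE_WORDS = {"STREET", "LANE", "AVENUE", "AND", "OR", "ON", "FOR",
                "TO", "OF", "THE", "A", "AN"}


def normalize(words):
    """Expand abbreviations, then drop words that should not be indexed."""
    kept = []
    for word in words:
        word = ABBREVIATIONS.get(word, word)  # expand if it is an abbreviation
        if word in IGNORE_WORDS:
            continue                          # needless word, skip it
        kept.append(word)
    return kept


print(normalize(["A", "TREE", "ON", "ELM", "STREET"]))
print(normalize(["PHASE", "II"]))
```

Because the same routine is applied to both the indexed text and the search text, it does not matter much which spelling is chosen as the full form, only that both sides end up with the same one.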

Some words are considered prefix words. The program needs to recognize them and attach them to the word that follows, for example 'De' in 'De Lay' or 'parcel' in 'parcel 5'. The program must also understand suffix words. Suffix words work like prefix words, except that they are attached to the previous word. The application that this program was originally designed for required that 'construction' be considered a suffix word.
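A rough Python sketch of the prefix and suffix idea follows. It is a simplification written for this explanation, not a translation of the listing. The name `attach_affixes` is hypothetical, and the word sets copy the PREFIX.WORDS and SUFFIX.WORDS values from the listing.

```python
PREFIX_WORDS = {"PARCEL", "PHASE", "DE", "LE", "LA", "MC"}
SUFFIX_WORDS = {"CONSTRUCTION"}


def attach_affixes(words):
    """Join prefix words to the next word and suffix words to the previous one."""
    result = []
    pending_prefix = ""
    for word in words:
        if word in PREFIX_WORDS:
            pending_prefix = word            # hold it until the next word arrives
            continue
        if pending_prefix:
            word = pending_prefix + " " + word
            pending_prefix = ""
        elif word in SUFFIX_WORDS and result:
            # Like the listing, add the combined word as a new entry and
            # leave the previous word in the list as well.
            word = result[-1] + " " + word
        result.append(word)
    return result                            # a trailing prefix word is dropped


print(attach_affixes(["DE", "LAY"]))
print(attach_affixes(["NATEC", "CONSTRUCTION"]))
```

Keeping the previous word as its own entry means a search for the bare name still finds the record, while a search that includes the suffix word matches the longer entry.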

The variables IGNORE.WORD, PREFIX.WORDS, SUFFIX.WORDS, and WILD.CARD can be changed or added to as the user sees fit without causing any problems with the rest of the code.

When working with names, initials are always hard to handle. Sometimes a user will input a name with the initials spaced out, for example "N. T. R. Inc", but other times the initials will be input without spaces, for example "NTR Inc". The program needs to work with both. It does this by assuming that any word with only one character is an initial and pasting the initials together until it finds a word that has more than one character.
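The initials rule can be sketched in a few lines of Python. This is an illustration of the rule as described above, not the author's BASIC. The name `merge_initials` is hypothetical, and flushing initials that sit at the very end of the string is an assumption of this sketch.

```python
def merge_initials(words):
    """Paste consecutive one-character words together into a single word."""
    result = []
    initials = ""
    for word in words:
        if len(word) == 1 and not word.isdigit():
            initials += word          # collect the initial, emit nothing yet
            continue
        if initials:
            result.append(initials)   # a longer word ends the run of initials
            initials = ""
        result.append(word)
    if initials:
        result.append(initials)       # assumption: keep trailing initials too
    return result


print(merge_initials(["N", "T", "R", "INC"]))
print(merge_initials(["NTR", "INC"]))
```

Both spellings of the name produce the same word list, which is what lets an index built from one form be searched with the other.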

When the program has found the word it is to use, it stores the word in a multi-value field. The multi-value field is then stored in the index file or used to search the file. The searching program uses an algorithm similar to the one in the address search routine to find the matches. I'll talk more about the search program next month.
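For readers less familiar with multi-value fields, the sketch below models one in Python as a string whose values are separated by a value mark, CHAR(253), the same character the listing equates to VM. The function names are hypothetical. Skipping duplicates mirrors the "already in list" check in the listing.

```python
VM = chr(253)  # value mark, as in EQUATE VM TO CHAR(253)


def to_multivalue(words):
    """Join words into one multi-value field, skipping duplicates."""
    unique = []
    for word in words:
        if word not in unique:
            unique.append(word)
    return VM.join(unique)


def display(field):
    """Show value marks as ']', the usual way they are printed."""
    return field.replace(VM, "]")


print(display(to_multivalue(["NATEC", "SYSTEMS"])))      # NATEC]SYSTEMS
print(display(to_multivalue(["TREE", "ELM", "TREE"])))   # TREE]ELM
```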

Here are some examples of results for this program:

Program Input: Natec Systems

Program Output: Natec]Systems

Program Input: A tree on Elm Street

Program Output: tree]elm

 

Break.word2

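SUBROUTINE BREAK.WORD2(VALUE,NAME.BREAK)
EQUATE AM TO CHAR(254), VM TO CHAR(253), SVM TO CHAR(252)
*
*
*CREATED BY NATHAN RECTOR, 11/08/96
*
*USED IN DICT ITEMS
*
*
* D O C U M E N T A T I O N
*
* This program is used to break down a sentence into words.
* This program will also convert any abbrv to full words and discard
* any needless words.
*
* VALUE is the string to break down. NAME.BREAK returns the words
* as a multi-value list.
*
*********************************************************************
*OPEN FILES
*********************************************************************
OPEN "INCONTROL" TO INCONTROL.FILE ELSE STOP 201, "INCONTROL, program 'BREAK.WORD2'"
*********************************************************************
*DEFINE VARS
*********************************************************************
READ ABBRV FROM INCONTROL.FILE, "ABBRV.CONTROL" ELSE ABBRV = ""
ABBRV<1,-1> = "I" ; ABBRV<2,-1> = "ONE"
ABBRV<1,-1> = "II" ; ABBRV<2,-1> = "TWO"
ABBRV<1,-1> = "III" ; ABBRV<2,-1> = "THREE"
ABBRV<1,-1> = "IV" ; ABBRV<2,-1> = "FOUR"
ABBRV<1,-1> = "VI" ; ABBRV<2,-1> = "SIX"
*
IGNORE.WORD = ".STREET.LANE.AVENUE.WAY.COURT.CIRCLE.PLACE.ROAD.DRIVE.MR.MRS.MS.JR.SR.APT.AND.OR.CO.INC.ON.FOR.TO.OF.MD.THE.A.AN."
PREFIX.WORDS = ".PARCEL.PHASE.DE.LE.LA.MC."
SUFFIX.WORDS = ".CONSTRUCTION."
WILD.CARD = \!@#$%^&*()+=-|/"{}[];:?<>.,~` \
*
*********************************************************************
*PROGRAMMING LOGIC
*********************************************************************
NAME = TRIM(VALUE) 'CU' ; NAME.COMPLETE = 0
NAME.SOUNDEX = "" ; NAME.BREAK = ""
NEW.NAME = "" ; LAST.DELIM = "" ; PREV.WORD = "" ; COUNT = 1
INITIALS = "" ; NO.SOUNDEX = 0
*
LOOP
UNTIL TRIM(NAME) = "" OR NAME.COMPLETE DO
   GOSUB 5000 ;* gets word
   NAME = NAME[FN.POS + 1,LEN(NAME)]
   *
   *** converts abbreviations to full words
   *
   BEGIN CASE
      CASE WORD = ""
      CASE WORD = "NO" ; WORD = "NORTH"
      CASE INDEX(VM: ABBRV<1> :VM,VM: WORD :VM,1)
         LOCATE(WORD,ABBRV,1;MV) ELSE MV = ""
         IF NOT(MV = "") THEN WORD = ABBRV<2,MV>
   END CASE
   *
   *** adds the previous word to make a new word
   *
   BEGIN CASE
      CASE WORD = ""
      CASE NUM(WORD)
         * the word is a number. do not add to initials
      CASE LEN(PREV.WORD) = 1 AND LEN(WORD) = 1
         * length of word is 1 and length of previous word is 1
         * add to the initials list, clear word
      CASE LEN(PREV.WORD) = 1 AND LEN(WORD) > 1 AND NOT(INITIALS = "")
         * Length of the previous word is 1 and the current word is
         * larger than 1 and initials var exists. Then add the
         * initials to the list of words
         SAVE.WORD = WORD ; WORD = INITIALS
         NO.SOUNDEX = 1 ; GOSUB 4000 ;* adds word to list
         WORD = SAVE.WORD ; PREV.WORD = ""
      CASE LEN(PREV.WORD) = 1 AND NOT(INDEX(PREFIX.WORDS,".": WORD :".",1)) AND INITIALS = ""
         * Length of the previous word is 1 and it is NOT in the list
         * of prefix words and there is no information in the initials
         * var. Clear the previous word
         PREV.WORD = ""
      CASE NOT(PREV.WORD = "")
         IF NUM(WORD) THEN
            CALL NUMBER.TO.ALPHA1(WORD,INT,"",0)
            WORD = INT
         END
         *
         WORD = PREV.WORD :" ": WORD
         PREV.WORD = ""
   END CASE
   *
   *** makes sure that phase and the phase number are in the same
   *** word
   *
   BEGIN CASE
      CASE WORD = ""
         * no word exists
      CASE INDEX(PREFIX.WORDS,".": WORD :".",1)
         PREV.WORD = WORD
         WORD = ""
      CASE COUNT = 1
         * nothing existing in list. do not check suffix
      CASE INDEX(SUFFIX.WORDS,".": WORD :".",1)
         WORD = NAME.BREAK<1,COUNT - 1> :" ": WORD
   END CASE
   *
   *** decides whether the word goes into the list
   *
   BEGIN CASE
      CASE WORD = ""
         * no word
      CASE NUM(WORD)
         * number... skip
      CASE INDEX(IGNORE.WORD,".": WORD :".",1)
         * skip
      CASE LEN(WORD) = 1 AND NOT(NUM(WORD))
         * skip. Only one char, so treat it as an initial
         PREV.WORD = WORD
         INITIALS = INITIALS : WORD
      CASE INDEX(VM: NAME.BREAK :VM,VM: WORD :VM,1)
         * skip. Already in list
      CASE 1
         GOSUB 4000 ;* adds word to list
   END CASE
REPEAT
900*
RETURN
*********************************************************************
*SUBROUTINE
*********************************************************************
4000*
*** adds WORD to the list of words
NAME.BREAK<1,COUNT> = WORD
NEW.NAME = NEW.NAME :" ": WORD
COUNT = COUNT + 1
RETURN
5000*
*** gets the next word from NAME. FN.POS returns how far it read
WORD = "" ; SEARCH = NAME ; STOP.DELIM.LOOP = 0
FN.POS = 0
BEGIN CASE
   CASE LAST.DELIM = \"\
      WORD = FIELD(NAME,\"\,1) ; FN.POS = COL2()
      LAST.DELIM = ""
   CASE 1
      LOOP
         * the delimiter is the first char that is not alpha or numeric
         DELIM = OCONV(SEARCH,"MC/A":VM:"MC/N")[1,1]
         IF DELIM = "" THEN
            POS = LEN(SEARCH) + 1
         END ELSE
            POS = INDEX(SEARCH,DELIM,1)
         END
         *
         NEXT.CHAR = SEARCH[POS + 1,1]
         *
         STOP.DELIM.LOOP = 1 ; CHAR = ""
         IF DELIM = "&" THEN CHAR = "&"
         * a period ends the word only when a number follows it
         BEGIN CASE
            CASE NOT(DELIM = ".")
            CASE NUM(NEXT.CHAR) AND NOT(NEXT.CHAR = "")
            CASE 1 ; STOP.DELIM.LOOP = 0
         END CASE
         *
         *** decides what to do with an 's. Check to see if it is actual
         *** 's or 's'
         *
         BEGIN CASE
            CASE NOT(DELIM = \'\)
            CASE SEARCH[1,1] = "S" AND INDEX(WILD.CARD,SEARCH[2,1],1)
               CHAR = \'S\ ; STOP.DELIM.LOOP = 1
               SEARCH = SEARCH[2,999]
               POS = POS + 1
            CASE 1
               CHAR = \\ ; STOP.DELIM.LOOP = 0
         END CASE
         *
         *** keeps words with '-' in them together
         *
         BEGIN CASE
            CASE NOT(DELIM = "-")
            CASE INDEX(" .,",NEXT.CHAR,1)
            CASE 1
               CHAR = "-" ; STOP.DELIM.LOOP = 0
         END CASE
         *
         WORD = WORD : SEARCH[1,POS - 1] : CHAR
         SEARCH = SEARCH[POS + 1,LEN(SEARCH)]
         *
         FN.POS = FN.POS + POS
      UNTIL STOP.DELIM.LOOP DO
      REPEAT
      *
      LAST.DELIM = DELIM
END CASE
RETURN
END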

073 WORD = SAVE.WORD ; PREV.WORD = ""

074 CASE LEN(PREV.WORD) = 1 AND NOT(INDEX(PREFIX.WORDS,".": WORD :".",1)) AND INITIALS = ""

075 * Length of the previous word is 1 and it is NOT is the list

076 * of prefix words and there is no information in the initials

077 * var. Clear the Previous word

078 PREV.WORD = ""

079 CASE NOT(PREV.WORD = "")

080 IF NUM(WORD) THEN

081 CALL NUMBER.TO.ALPHA1(WORD,INT,"",0)

082 WORD = INT

083 END

084 *

085 WORD = PREV.WORD :" ": WORD

086 PREV.WORD = ""

087 END CASE

088 *

089 *** makes sure that phase and the phase number are in the same

090 *** word

091 *

092 BEGIN CASE

093 CASE WORD = ""

094 * no word exists

095 CASE INDEX(PREFIX.WORDS,".": WORD :".",1)

096 PREV.WORD = WORD

097 WORD = ""

098 CASE COUNT = 1

099 * nothing existing in list. do not check sufix

100 CASE INDEX(SUFFIX.WORDS,".": WORD :".",1)

101 WORD = NAME.BREAK<1,COUNT - 1> :" ": WORD

102 END CASE

103 *

104 BEGIN CASE

105 CASE WORD = ""

106 * no word

107 CASE NUM(WORD)

108 * number... skip

109 CASE INDEX(IGNORE.WORD,".": WORD :".",1)

110 * skip

111 CASE LEN(WORD) = 1 AND NOT(NUM(WORD))

112 * skip. Only one chars

113 PREV.WORD = WORD

114 INITIALS = INITIALS : WORD

115 CASE INDEX(VM: NAME.BREAK :VM,VM: WORD :VM,1)

116 * skip. Already in list

117 CASE 1

118 GOSUB 4000 ;* adds word to list

119 END CASE

120 REPEAT

121 900*

122 RETURN

123 *********************************************************************

124 *SUBROUTINE

125 *********************************************************************

126 4000*

127 NAME.BREAK<1,COUNT> = WORD

128 NEW.NAME = NEW.NAME :" ": WORD

129 COUNT = COUNT + 1

130 RETURN

131 5000*

132 WORD = "" ; SEARCH = NAME ; STOP.DELIM.LOOP = 0

133 FN.POS = 0

134 BEGIN CASE

135 CASE LAST.DELIM = \"\

136 WORD = FIELD(NAME,\"\,1) ; FN.POS = COL2()

137 LAST.DELIM = ""

138 CASE 1

139 LOOP

140 DELIM = OCONV(SEARCH,"MC/A":VM:"MC/N")[1,1]

141 IF DELIM = "" THEN

142 POS = LEN(SEARCH) + 1

143 END ELSE

144 POS = INDEX(SEARCH,DELIM,1)

145 END

146 *

147 NEXT.CHAR = SEARCH[POS + 1,1]

148 *

149 STOP.DELIM.LOOP = 1 ; CHAR = ""

150 IF DELIM = "&" THEN CHAR = "&"

151 BEGIN CASE

152 CASE NOT(DELIM = ".")

153 CASE NUM(NEXT.CHAR) AND NOT(NEXT.CHAR = "")

154 CASE 1 ; STOP.DELIM.LOOP = 0

155 END CASE

156 *

157 *** decides what to do with an 's. Check to see if it is actual

158 *** 's or 's'

159 *

160 BEGIN CASE

161 CASE NOT(DELIM = \'\)

162 CASE SEARCH[1,1] = "S" AND INDEX(WILD.CARD,SEARCH[2,1],1)

163 CHAR = \'S\ ; STOP.DELIM.LOOP = 1

164 SEARCH = SEARCH[2,999]

165 POS = POS + 1

166 CASE 1

167 CHAR = \\ ; STOP.DELIM.LOOP = 0

168 END CASE

169 *

170 *** keeps words with '-' in them together

171 *

172 BEGIN CASE

173 CASE NOT(DELIM = "-")

174 CASE INDEX(" .,",NEXT.CHAR,1)

175 CASE 1

176 CHAR = "-" ; STOP.DELIM.LOOP = 0

177 END CASE

178 *

179 WORD = WORD : SEARCH[1,POS - 1] : CHAR

180 SEARCH = SEARCH[POS + 1,LEN(SEARCH)]

181 *

182 FN.POS = FN.POS + POS

183 UNTIL STOP.DELIM.LOOP DO

184 REPEAT

185 *

186 LAST.DELIM = DELIM

187 END CASE

188 RETURN

189 END