Indexing and Searching Address

Mutli-value Solutions - Oct '97

Nathan Rector

Natec Systems

nater@northcoast.com

http://www.northcoast.com/~nater/

Last month, I provided a routine that would break out the common parts of an address to help indexing and validation of address input. If you missed that issue, it can be found on the web site for download at http://www.northcoast.com/~nater/solutions.html.

This routine takes an address and extracts the key elements from it. Using the following example address, it creates an output that can be easily checked or indexed:

Example Address Program Output

Wilson Street, #B = Wilson;;B;;Street

100 Wilson Street, #B = Wilson;100;B;;Street

100 Wilson, Apt. b = Wilson;100;B;;

100 Wilson Street = Wilson;100;B;;Street

To the human eye, these address all look similar, but a computer has a hard time finding the similarities, let alone close matches. The routine that I created uses percentages to return matches.

The main part of the routine deals with the indexing. The indexing used stores the formatted addresses in a separate file with '.INDEX' on the end of the file name; for example, CUST.INDEX for the CUST file. The records IDs are the same as in the main file and the formatted address is stored in attribute 1.

Since the street name can be misspelled, and often is, it makes it hard to find matches by comparing exact street name. By processing the street name through a soundex routine, a misspelled word matters less. Do keep in mind that soundex routines will not cure the problem of misspelled words completely.

Once the street name has been soundexed, it is written to the index file. Then this files is Btreed for faster searching. The routine displayed here uses Advanced Pick Btree, but any type of indexing can be used.

The search routine pulls all the records that match the soundex street name being searched. After pulling all the records that match the search, it checks the other parts of the address to see if there is a match.

When comparing the street type, it includes all items found that don't have a street type as well as all items that have the matching street type. For example, if you were searching for 'Wilson Street' the following would result:

Wilson Street, #B = Wilson;;B;;Street - Match

Wilson, Apt. b = Wilson;;B;; - Match

Wilson Lane = Wilson;;;;Lane - Not match

The routine follows the same logic with address numbers. If there is no address number, then it includes the item, but if the address numbers do not match, then the item is not included. Same follows with the apartment or suite numbers.

As the routine finds matches, it creates a top to bottom list of matches. You can tell the routine to display only those items with the closest matches to the search string, or set the variable LISTTYPE = 1 to display all matches found sorted top to bottom with the best matches found at the top.

INDEX.ADDR

001 SUBROUTINE INDEX.ADDR(FILENAME,ID,ADDRESS)

002 *

003 * CREATED BY NATHAN RECTOR

004 * NATEC SYSTEMS

005 * nater@northcoast.com

006 * http://www.northcoast.com/~nater/

007 *

008 OPEN FILENAME :".INDEX" TO INDEX.FILE ELSE STOP 201, FILENAME:".INDEX"

009 *

010 CALL FORMAT.ADDR1(ADDRESS)

011 WRITE ADDRESS ON INDEX.FILE, ID

012 RETURN

013 END

TEST

001 * TEST

002 *

003 SEARCH = "100 Wilson Street, #B"

004 LISTTYPE = 0

005 CALL SEARCH.ADDRESS(SEARCH,DISPLAY,LISTYPE)

006 *

007 NUM = DCOUNT(DISPLAY,CHAR(254))

008 FOR I = 1 TO NUM

009 CRT DISPLAY<I,1>

010 NEXT I

011 END

output:

{display output}

TEST

001 * TEST

002 *

003 SEARCH = "100 Wilson Street, #B"

004 LISTTYPE = 1

005 CALL SEARCH.ADDRESS(SEARCH,DISPLAY,LISTYPE)

006 *

007 NUM = DCOUNT(DISPLAY,CHAR(254))

008 FOR I = 1 TO NUM

009 CRT DISPLAY<I,1>

010 NEXT I

011 END

output:

Wilson;;B;;Street

Wilson;;B;;

Wilson;;;;Lane

 

CUST RECORDS:

Key Address.....

1 Wilson Street, #B

2 100 Wilson Street, #B

3 100 Wilson, Apt. b

4 100 Wilson Street

5 500 Johnson Ave, #B

TEST

001 * TEST - Shows example of closest match only

002 *

003 SEARCH = "100 Wilson Street, #B"

004 FOR LISTTYPE = 0 TO 1

005 CRT "LISTTYPE= ": LISTTYPE

006 CALL SEARCH.ADDRESS(SEARCH,DISPLAY,LISTTYPE)

007 *

008 NUM = DCOUNT(DISPLAY,CHAR(254))

009 FOR I = 1 TO NUM

010 CRT DISPLAY<I,1> :" ": OCONV(DISPLAY<I,1>,"TCUST;X;;2")

011 NEXT I

012 NEXT LISTTYPE

013 END

output:

:TEST

LISTTYPE= 0

2 100 Wilson Street, #B

LISTTYPE = 1

2 100 Wilson Street, #B

3 100 Wilson, Apt. b

1 Wilson Street, #B

4 100 Wilson Street

 

INDEX.ADDR

001 SUBROUTINE INDEX.ADDR(FILENAME,ID,ADDRESS)

002 *

003 OPEN FILENAME :".INDEX" TO INDEX.FILE ELSE STOP 201, FILENAME:".INDEX"

004 *

005 CALL FORMAT.ADDR1(ADDRESS)

006 WRITE ADDRESS ON INDEX.FILE, ID

007 RETURN

008 END

 

SEARCH.ADDR1

001 SUBROUTINE SEARCH.ADDR1(FILENAME,ADDRESS,LISTTYPE,LIST)

002 *

003 * CREATED BY NATHAN RECTOR

004 * NATEC SYSTEMS

005 * nater@northcoast.com

006 * http://www.northcoast.com/~nater/

007 *

008 *

009 * D O C U M E N T A T I O N

010 *

011 * This routine is used to return a list of address that match the address

012 * being looked for. This routine uses FORMAT.ADDR1.

013 *

014 * FILENAME = the file to search off of.

015 * ADDRESS = the address being looked for

016 * LISTTYPE = 1 - return all items that have a close match

017 * 0 - returns only the items that are closest

018 * LIST<-1> = list of address found that match

019 *

020 ************************************************************************

021 *** Open Files

022 ************************************************************************

023 OPEN FILENAME,"ADDRESS.INDEX" TO BTREE.FILE ELSE STOP 201, FILENAME :",ADDRESS.INDEX"

024 ************************************************************************

025 *** Main Program

026 ************************************************************************

027 DIM SORT.LIST(5) ; MAT SORT.LIST = ""

028 LIST = ""

029 *

030 *** converts the search string into the same format that is indexed.

031 *

032 CALL FORMAT.ADDR(ADDRESS)

033 *

034 *** searches the index to find the list of records that match the

035 *** street name

036 *

037 ROOT.FILE = FILENAME :",ADDRESS.INDEX"

038 ROOT.NAME = "A1"

039 CUR.SEARCH = SOUNDEX(FIELD(ADDRESS,";",1))

040

041 ROOT ROOT.FILE, ROOT.NAME TO C.ROOT ELSE GOTO 900

042 KEY("X",C.ROOT,CUR.SEARCH,FIRST.LIST) ELSE GOTO 900

043 *

044 *** loop through list to sort into closest match lists.

045 *

046 FOR I = 1 TO I + 1 UNTIL FIRST.LIST<I> = ""

047 READV BTREE.ADDR FROM BTREE.FILE, LIST<I>, 1 ELSE

048 * Item not found. Do not add to list.

049 GOTO 800

050 END

051 *

052 LIST.NO = 1

053 *** checks address number

054 BEGIN CASE

055 CASE FIELD(ADDRESS,";",2) = ""

056 CASE FIELD(BTREE.ADDR,";",2) = ""

057 CASE FIELD(BTREE.ADDR,";",2) = FIELD(ADDRESS,";",2) ; LIST.NO = LIST.NO + 1

058 END CASE

059 *** check street type

060 BEGIN CASE

061 CASE FIELD(ADDRESS,";",3) = ""

062 CASE FIELD(BTREE.ADDR,";",3) = "" ; LIST.NO = LIST.NO + 1

063 CASE FIELD(BTREE.ADDR,";",3) = FIELD(ADDRESS,";",3) ; LIST.NO = LIST.NO + 1

064 END CASE

065 *** check appartment#

066 BEGIN CASE

067 CASE FIELD(ADDRESS,";",4) = ""

068 CASE FIELD(BTREE.ADDR,";",4) = ""

069 CASE FIELD(BTREE.ADDR,";",4) = FIELD(ADDRESS,";",4) ; LIST.NO = LIST.NO + 1

070 END CASE

071 *** check lot#

072 BEGIN CASE

073 CASE FIELD(ADDRESS,";",5) = ""

074 CASE FIELD(BTREE.ADDR,";",5) = ""

075 CASE FIELD(BTREE.ADDR,";",5) = FIELD(ADDRESS,";",5) ; LIST.NO = LIST.NO + 1

076 END CASE

077 *

078 SORT.LIST(LIST.NO)<-1> = FIRST.LIST<I>

079 800*

080 NEXT I

081 *

082 *** creates the final list

083 *

084 EXIT.FOR.I = 0

085 FOR I = 5 TO 1 UNTIL EXIT.FOR.I

086 IF NOT(SORT.LIST(I) = "") THEN

087 LIST<-1> = SORT.LIST(I)

088 IF LISTTYPE = 0 THEN EXIT.FOR.I = 1

089 END

090 NEXT I

091 900*

092 RETURN

093 END

Indexing and Searching Address

Indexing and Searching Address

Mutli-value Solutions - Oct '97

Nathan Rector

Natec Systems

nater@northcoast.com

http://www.northcoast.com/~nater/

Last month, I provided a routine that would break out the common parts of an address to help indexing and validation of address input. If you missed that issue, it can be found on the web site for download at http://www.northcoast.com/~nater/solutions.html.

This routine takes an address and extracts the key elements from it. Using the following example address, it creates an output that can be easily checked or indexed:

Example Address Program Output

Wilson Street, #B = Wilson;;B;;Street

100 Wilson Street, #B = Wilson;100;B;;Street

100 Wilson, Apt. b = Wilson;100;B;;

100 Wilson Street = Wilson;100;B;;Street

To the human eye, these address all look similar, but a computer has a hard time finding the similarities, let alone close matches. The routine that I created uses percentages to return matches.

The main part of the routine deals with the indexing. The indexing used stores the formatted addresses in a separate file with '.INDEX' on the end of the file name; for example, CUST.INDEX for the CUST file. The records IDs are the same as in the main file and the formatted address is stored in attribute 1.

Since the street name can be misspelled, and often is, it makes it hard to find matches by comparing exact street name. By processing the street name through a soundex routine, a misspelled word matters less. Do keep in mind that soundex routines will not cure the problem of misspelled words completely.

Once the street name has been soundexed, it is written to the index file. Then this files is Btreed for faster searching. The routine displayed here uses Advanced Pick Btree, but any type of indexing can be used.

The search routine pulls all the records that match the soundex street name being searched. After pulling all the records that match the search, it checks the other parts of the address to see if there is a match.

When comparing the street type, it includes all items found that don't have a street type as well as all items that have the matching street type. For example, if you were searching for 'Wilson Street' the following would result:

Wilson Street, #B = Wilson;;B;;Street - Match

Wilson, Apt. b = Wilson;;B;; - Match

Wilson Lane = Wilson;;;;Lane - Not match

The routine follows the same logic with address numbers. If there is no address number, then it includes the item, but if the address numbers do not match, then the item is not included. Same follows with the apartment or suite numbers.

As the routine finds matches, it creates a top to bottom list of matches. You can tell the routine to display only those items with the closest matches to the search string, or set the variable LISTTYPE = 1 to display all matches found sorted top to bottom with the best matches found at the top.

INDEX.ADDR

001 SUBROUTINE INDEX.ADDR(FILENAME,ID,ADDRESS)

002 *

003 * CREATED BY NATHAN RECTOR

004 * NATEC SYSTEMS

005 * nater@northcoast.com

006 * http://www.northcoast.com/~nater/

007 *

008 OPEN FILENAME :".INDEX" TO INDEX.FILE ELSE STOP 201, FILENAME:".INDEX"

009 *

010 CALL FORMAT.ADDR1(ADDRESS)

011 WRITE ADDRESS ON INDEX.FILE, ID

012 RETURN

013 END

TEST

001 * TEST

002 *

003 SEARCH = "100 Wilson Street, #B"

004 LISTTYPE = 0

005 CALL SEARCH.ADDRESS(SEARCH,DISPLAY,LISTYPE)

006 *

007 NUM = DCOUNT(DISPLAY,CHAR(254))

008 FOR I = 1 TO NUM

009 CRT DISPLAY<I,1>

010 NEXT I

011 END

output:

{display output}

TEST

001 * TEST

002 *

003 SEARCH = "100 Wilson Street, #B"

004 LISTTYPE = 1

005 CALL SEARCH.ADDRESS(SEARCH,DISPLAY,LISTYPE)

006 *

007 NUM = DCOUNT(DISPLAY,CHAR(254))

008 FOR I = 1 TO NUM

009 CRT DISPLAY<I,1>

010 NEXT I

011 END

output:

Wilson;;B;;Street

Wilson;;B;;

Wilson;;;;Lane

 

CUST RECORDS:

Key Address.....

1 Wilson Street, #B

2 100 Wilson Street, #B

3 100 Wilson, Apt. b

4 100 Wilson Street

5 500 Johnson Ave, #B

TEST

001 * TEST - Shows example of closest match only

002 *

003 SEARCH = "100 Wilson Street, #B"

004 FOR LISTTYPE = 0 TO 1

005 CRT "LISTTYPE= ": LISTTYPE

006 CALL SEARCH.ADDRESS(SEARCH,DISPLAY,LISTTYPE)

007 *

008 NUM = DCOUNT(DISPLAY,CHAR(254))

009 FOR I = 1 TO NUM

010 CRT DISPLAY<I,1> :" ": OCONV(DISPLAY<I,1>,"TCUST;X;;2")

011 NEXT I

012 NEXT LISTTYPE

013 END

output:

:TEST

LISTTYPE= 0

2 100 Wilson Street, #B

LISTTYPE = 1

2 100 Wilson Street, #B

3 100 Wilson, Apt. b

1 Wilson Street, #B

4 100 Wilson Street

 

INDEX.ADDR

001 SUBROUTINE INDEX.ADDR(FILENAME,ID,ADDRESS)

002 *

003 OPEN FILENAME :".INDEX" TO INDEX.FILE ELSE STOP 201, FILENAME:".INDEX"

004 *

005 CALL FORMAT.ADDR1(ADDRESS)

006 WRITE ADDRESS ON INDEX.FILE, ID

007 RETURN

008 END

 

SEARCH.ADDR1

001 SUBROUTINE SEARCH.ADDR1(FILENAME,ADDRESS,LISTTYPE,LIST)

002 *

003 * CREATED BY NATHAN RECTOR

004 * NATEC SYSTEMS

005 * nater@northcoast.com

006 * http://www.northcoast.com/~nater/

007 *

008 *

009 * D O C U M E N T A T I O N

010 *

011 * This routine is used to return a list of address that match the address

012 * being looked for. This routine uses FORMAT.ADDR1.

013 *

014 * FILENAME = the file to search off of.

015 * ADDRESS = the address being looked for

016 * LISTTYPE = 1 - return all items that have a close match

017 * 0 - returns only the items that are closest

018 * LIST<-1> = list of address found that match

019 *

020 ************************************************************************

021 *** Open Files

022 ************************************************************************

023 OPEN FILENAME,"ADDRESS.INDEX" TO BTREE.FILE ELSE STOP 201, FILENAME :",ADDRESS.INDEX"

024 ************************************************************************

025 *** Main Program

026 ************************************************************************

027 DIM SORT.LIST(5) ; MAT SORT.LIST = ""

028 LIST = ""

029 *

030 *** converts the search string into the same format that is indexed.

031 *

032 CALL FORMAT.ADDR(ADDRESS)

033 *

034 *** searches the index to find the list of records that match the

035 *** street name

036 *

037 ROOT.FILE = FILENAME :",ADDRESS.INDEX"

038 ROOT.NAME = "A1"

039 CUR.SEARCH = SOUNDEX(FIELD(ADDRESS,";",1))

040

041 ROOT ROOT.FILE, ROOT.NAME TO C.ROOT ELSE GOTO 900

042 KEY("X",C.ROOT,CUR.SEARCH,FIRST.LIST) ELSE GOTO 900

043 *

044 *** loop through list to sort into closest match lists.

045 *

046 FOR I = 1 TO I + 1 UNTIL FIRST.LIST<I> = ""

047 READV BTREE.ADDR FROM BTREE.FILE, LIST<I>, 1 ELSE

048 * Item not found. Do not add to list.

049 GOTO 800

050 END

051 *

052 LIST.NO = 1

053 *** checks address number

054 BEGIN CASE

055 CASE FIELD(ADDRESS,";",2) = ""

056 CASE FIELD(BTREE.ADDR,";",2) = ""

057 CASE FIELD(BTREE.ADDR,";",2) = FIELD(ADDRESS,";",2) ; LIST.NO = LIST.NO + 1

058 END CASE

059 *** check street type

060 BEGIN CASE

061 CASE FIELD(ADDRESS,";",3) = ""

062 CASE FIELD(BTREE.ADDR,";",3) = "" ; LIST.NO = LIST.NO + 1

063 CASE FIELD(BTREE.ADDR,";",3) = FIELD(ADDRESS,";",3) ; LIST.NO = LIST.NO + 1

064 END CASE

065 *** check appartment#

066 BEGIN CASE

067 CASE FIELD(ADDRESS,";",4) = ""

068 CASE FIELD(BTREE.ADDR,";",4) = ""

069 CASE FIELD(BTREE.ADDR,";",4) = FIELD(ADDRESS,";",4) ; LIST.NO = LIST.NO + 1

070 END CASE

071 *** check lot#

072 BEGIN CASE

073 CASE FIELD(ADDRESS,";",5) = ""

074 CASE FIELD(BTREE.ADDR,";",5) = ""

075 CASE FIELD(BTREE.ADDR,";",5) = FIELD(ADDRESS,";",5) ; LIST.NO = LIST.NO + 1

076 END CASE

077 *

078 SORT.LIST(LIST.NO)<-1> = FIRST.LIST<I>

079 800*

080 NEXT I

081 *

082 *** creates the final list

083 *

084 EXIT.FOR.I = 0

085 FOR I = 5 TO 1 UNTIL EXIT.FOR.I

086 IF NOT(SORT.LIST(I) = "") THEN

087 LIST<-1> = SORT.LIST(I)

088 IF LISTTYPE = 0 THEN EXIT.FOR.I = 1

089 END

090 NEXT I

091 900*

092 RETURN

093 END