Soundex Alternative

MV Solutions

Nathan Rector

Natec Systems

http://www.northcoast.com/~nater/

nater@northcoast.com

One problem that I've found in almost every database is the indexing of names. In every database I've seen, names are a key field for indexing, yet they are either not indexed or users have to live with a left-justified index.

A left-justified index matches each character of the lookup against the index, from left to right. Any match it finds is an exact match, which leaves no room for misspellings or multiple spellings of a name. This forces the user to do multiple lookups to potentially find the information they are looking for.
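To make the limitation concrete, here is a minimal Python sketch of a left-justified lookup. The `prefix_lookup` helper and the sample names are illustrative assumptions, not taken from any particular database.

```python
from bisect import bisect_left

def prefix_lookup(sorted_names, prefix):
    """Return every name in the sorted list that starts with `prefix`."""
    # Jump to the first entry that could match, then walk forward
    # until the entries stop sharing the prefix.
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break
        matches.append(name)
    return matches

# Hypothetical index contents, kept in sorted order.
names = sorted(["REED", "REID", "RECTOR", "RITA", "RUDY", "WRIGHT"])

print(prefix_lookup(names, "RE"))     # every name beginning with RE
print(prefix_lookup(names, "REED"))   # finds REED but not REID
print(prefix_lookup(names, "RIGHT"))  # finds nothing, although WRIGHT sounds the same
```

A user who types RIGHT or REED has to guess at the other spellings and run the lookup again for each one.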

Some databases have worked around the problem of misspellings by soundexing their names before indexing them. Soundexing is a way of turning text into a common format based on how a word is supposed to sound.

There are two common soundex formats: the original Census format, and the English language format, which is just an expanded version of the Census format.

These formats consist of the first letter of the word followed by a numeric representation of the rest of the letters, excluding the vowels. The Census format is limited to a total of 4 characters. The English language format is the same as the Census format, but it is not limited in length.

Soundex CODING GUIDE

1 = B P F V

2 = C S K G J Q X Z

3 = D T

4 = L

5 = M N

6 = R

Disregard the letters: A E I O U W Y H
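As a rough illustration of the coding guide above, the Census format can be sketched in Python. The function name `census_soundex` is hypothetical. The handling of repeated codes (adjacent letters with the same code count once, and H and W do not separate them) follows the usual American Soundex convention, which is an assumption because the guide does not spell it out.

```python
# Build a letter -> digit map from the coding guide.
CODES = {}
for digit, letters in {"1": "BPFV", "2": "CSKGJQXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

def census_soundex(word, length=4):
    """First letter plus digit codes, zero-padded or cut to `length` characters."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")      # disregarded letters have no code
        if code and code != prev:
            result += code
        if ch not in "HW":            # assumption: H and W do not break a run
            prev = code
    return result[:length].ljust(length, "0")

for name in ["WRITE", "WRIGHT", "RIGHT", "REED", "REID", "RUDY", "RITA", "RUTH"]:
    print(name, census_soundex(name))
```

Running it shows the weakness discussed in this article: REED, REID, RUDY, RITA and RUTH all collapse to the same code, while WRITE, WRIGHT and RIGHT each get a different one.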

Both of these formats work well, but they are limited and in most cases not accurate enough to give the user a narrow enough selection to work with. Because of this inaccuracy, I wrote an alternate soundex program that increases the accuracy to something more manageable. It is not 100% accurate, but it comes closer than the traditional soundex.

Word      Soundx   Census Soundex   US Soundex
WRITE     RT       W630             R83
Wright    Rt       W623             R3
RIGHT     RT       R230             R3
RITE      RT       R300             R83
Rodeo     Rd       R300             R3
REED      RD       R300             R3
REDWOOD   RDWD     R330             R3
RUDY      RD       R300             R3
REID      RD       R300             R3
RUTH      RTH      R300             R3
RITA      RT       R300             R3
RENTAL    RNTL     R534             R534
THOUGH    TH       T200             T1
MAIL      ML       M400             M4
RECTOR    RCTR     R236             R236
DEPUTE    DPT      D130             D13
SIERRA    SR       S600             S6
SARAH     SR       S600             S6
SARI      SR       S600             S6
SAWYER    SWR      S600             S6
SHERRI    SHR      S600             S6
CHERIE    SHR      C600             C6
CARRIE    KR       C600             K6
CARE      KR       C600             K6
PATROTE   PTRT     P363             P363
PATRIOT   PTRT     P363             P363
LETTER    LTR      L360             L36
KNOT      NT       K530             N3
NOTE      NT       N300             N3
NOT       NT       N300             N3
PHILLIS   FL       P400             F4

Instead of converting the letters to numbers, I left them as the actual letters they represent, but excluded the vowels and any punctuation. In addition, specific letter combinations are converted to other letters. For example, 'PH' is converted to 'F' and 'PSY' is converted to 'S'.

Word      Soundx
KNOT      NT
PHILLIS   FL
CHERIE    SHR
WRITE     RT

This alternative created a much more accurate list, and kept the user from having to look through a long list of items that weren't what they were looking for. It also kept the developer from having to run an additional qualifying pass after the soundex select, which slows down the display response.

SOUNDX

SUBROUTINE SOUNDX(WORD,SOUNDX)
*
* CREATED BY Nathan Rector, 10/12/96
* Natec Systems
* nater@northcoast.com
*
* D O C U M E N T A T I O N
*
* This program is used to create a soundex string. It is an
* alternative to the Census and US soundex.
*
*********************************************************************
* MAIN PROGRAM
*********************************************************************
*
* work on a copy so the caller's WORD is not changed
WORK.WORD = WORD
LENGTH = LEN(WORK.WORD)
PREV.LETTER.VALUE = ''
SOUNDX = ""
*
*** strips trailing 's'
*
IF WORK.WORD[LENGTH,1] = "S" THEN
   WORK.WORD = WORK.WORD[1,LENGTH - 1]
   LENGTH = LENGTH - 1
END
*
*** the first matching CASE wins; multi-letter matches advance I
*** past the extra letters they consume
*
FOR I = 1 TO LENGTH
   LETTER.VALUE = ""
   BEGIN CASE
      CASE WORK.WORD[I,2] = "GH"
         * silent letters
         I = I + 1
      CASE WORK.WORD[I,2] = "SH"
         LETTER.VALUE = "SH"
         I = I + 1
      CASE WORK.WORD[I,2] = "CH" AND I = 1
         LETTER.VALUE = "SH"
         I = I + 1
      CASE WORK.WORD[I,2] = "CH"
         LETTER.VALUE = "CH"
         I = I + 1
      CASE WORK.WORD[I,1] = "C" AND I = 1
         LETTER.VALUE = "K"
      CASE WORK.WORD[I,2] = "TH"
         LETTER.VALUE = "TH"
         I = I + 1
      CASE WORK.WORD[I,3] = "PSY"
         LETTER.VALUE = "S"
         I = I + 2
      CASE WORK.WORD[I,2] = "WR"
         LETTER.VALUE = "R"
         I = I + 1
      CASE WORK.WORD[I,2] = "PH"
         LETTER.VALUE = "F"
         I = I + 1
      CASE WORK.WORD[I,2] = "PS"
         LETTER.VALUE = "S"
         I = I + 1
      CASE WORK.WORD[I,2] = "KN"
         LETTER.VALUE = "N"
         I = I + 1
      CASE WORK.WORD[I,1] = "X" AND I = 1
         LETTER.VALUE = "Z"
      CASE WORK.WORD[I,2] = "PF"
         LETTER.VALUE = "F"
         I = I + 1
      CASE WORK.WORD[I,4] = "IBLE"
         LETTER.VALUE = "BL"
         I = I + 3
      CASE WORK.WORD[I,4] = "TION"
         LETTER.VALUE = "SN"
         I = I + 3
      CASE WORK.WORD[I,1] = "H" AND I = 1
         LETTER.VALUE = "H"
      CASE INDEX('HAEIOUY', WORK.WORD[I,1], 1)
         * silent letter
      CASE INDEX('BCDFGHJKLMNPQRSTVWXZ', WORK.WORD[I,1], 1)
         LETTER.VALUE = WORK.WORD[I,1]
      CASE 1
         * punctuation and anything else is dropped
         LETTER.VALUE = ""
   END CASE
   *
   * append the value unless it is empty or repeats the last one added
   BEGIN CASE
      CASE LETTER.VALUE = ""
      CASE NOT(LETTER.VALUE = PREV.LETTER.VALUE)
         SOUNDX = SOUNDX : LETTER.VALUE
         PREV.LETTER.VALUE = LETTER.VALUE
   END CASE
   *
NEXT I
RETURN
END
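For readers without a MultiValue system at hand, the following is a rough Python port of the subroutine above, offered as a sketch rather than a definitive translation. It upper-cases the input first, which is an assumption (the sample output suggests the original ran on a case-insensitive system), so it returns RT for Wright where the table shows Rt. The names `RULES`, `soundx` and `build_index` are illustrative.

```python
from collections import defaultdict

# (pattern, replacement, only_at_start_of_word), in the same order
# as the CASE statements in the BASIC listing. First match wins.
RULES = [
    ("GH", "", False), ("SH", "SH", False), ("CH", "SH", True),
    ("CH", "CH", False), ("C", "K", True), ("TH", "TH", False),
    ("PSY", "S", False), ("WR", "R", False), ("PH", "F", False),
    ("PS", "S", False), ("KN", "N", False), ("X", "Z", True),
    ("PF", "F", False), ("IBLE", "BL", False), ("TION", "SN", False),
    ("H", "H", True),
]
SILENT = "HAEIOUY"
CONSONANTS = "BCDFGHJKLMNPQRSTVWXZ"

def soundx(word):
    w = word.upper()              # assumption: compare case-insensitively
    if w.endswith("S"):           # strip a trailing S, as the listing does
        w = w[:-1]
    result, prev, i = "", "", 0
    while i < len(w):
        value, step = "", 1
        for pattern, replacement, start_only in RULES:
            if w.startswith(pattern, i) and (i == 0 or not start_only):
                value, step = replacement, len(pattern)
                break
        else:
            # No combination matched: keep plain consonants, drop the rest.
            if w[i] not in SILENT and w[i] in CONSONANTS:
                value = w[i]
        if value and value != prev:   # skip empty and repeated values
            result += value
            prev = value
        i += step
    return result

def build_index(names):
    """Group names under their soundx key, as a soundex index would."""
    index = defaultdict(list)
    for name in names:
        index[soundx(name)].append(name)
    return index

index = build_index(["WRITE", "Wright", "RIGHT", "RITE", "RITA",
                     "REED", "REID", "RUTH"])
print(index[soundx("RITE")])   # names filed under the same key as RITE
print(index[soundx("REED")])   # names filed under the same key as REED
```

With these sample names, a lookup on RITE returns the RT group without pulling in REED, REID or RUTH, which the Census code R300 would lump together.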
