The standardizer works by taking the words, phrases and abbreviations in its input and classifying them. Input strings pass through a lexical scanner that makes an initial tokenization on the basis of form, identifying ordinals, fractions, numbers, words and so on. Each token is then looked up in the lexicon and gazetteer; if found, it is given whatever definitions and standardizations correspond to the lookup key. If a string is not found, it retains its initial tokenization and input form. The various classifications of the input are what might be termed "tokenization candidates". Each tokenization candidate, as a string of input tokens ( See Input Tokens), is examined. The examination entails building a clause tree and retrieving all the rules of a permissible class (as per a transition table mapping rule types to states) from the Aho-Corasick matrix. Because Aho-Corasick finds (through failure functions) not just the rules that match a particular string of tokens but also the rules that match a suffix of that string, the tree is built backwards, from the end of the string forward. Each series of rules that subsumes the address generates a standardization candidate: an ordered string of words or phrases classified by the output tokens ( See Postal Attributes). A maximum of six standardizations is retained for the purpose of building or matching.
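The interaction of form-based tokenization and lexicon lookup can be sketched as follows. This is an illustrative reconstruction, not the PAGC implementation; the token numbers and the ambiguous entry for ST are taken from the lexicon example later in this document.

```python
from itertools import product

# Toy lexicon: lookup key -> list of (input token number, standardization).
# "ST" is genuinely ambiguous: TYPE (2) as STREET, STOPWORD (7) as SAINT.
LEXICON = {
    "ST": [(2, "STREET"), (7, "SAINT")],
    "W":  [(22, "WEST")],          # DIRECT
}

def classify(word):
    """All classifications of one input string: lexicon definitions
    if the key is found, otherwise the form-based tokenization."""
    if word in LEXICON:
        return LEXICON[word]
    if word.isdigit():
        return [(0, word)]         # NUMBER
    return [(1, word)]             # WORD

def tokenization_candidates(words):
    """Every combination of per-word classifications."""
    return list(product(*(classify(w) for w in words)))
```

For the input 3715 W ST this yields two candidates, one reading ST as a street TYPE and one reading it as the stopword SAINT; each candidate is then parsed separately against the rules.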
In applying the rules, certain assumptions are incorporated into the transition table referred to above. These are assumptions about how the clause types, each governed by a different type of rule ( See Rule Types), fit together. In particular, it is assumed that no clause will come between the house number (governed by the CIVIC_C rule type) and the other MICRO attributes (governed by the ARC_C rule type), and that the CIVIC_C type always precedes the ARC_C type. It is also assumed that extra attributes (the attributes not used for geocoding, governed by the EXTRA_C rule type) will occur either before the house number or after the MICRO attributes, and that the MICRO and MACRO ( See MICRO and MACRO) attributes can be separated. Although all of these assumptions seem reasonable for Canada and the United States, they may be completely invalid for some other parts of the world.
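As an illustration of how such a transition table might encode these assumptions, consider the following sketch. The states and transitions are an assumed reconstruction, not the actual PAGC table; only the rule-type names and numbers come from this document.

```python
# Rule type numbers, as listed under "Rule types" below.
MACRO_C, MICRO_C, ARC_C, CIVIC_C, EXTRA_C = 0, 1, 2, 3, 4

# Hypothetical transition table: parse state -> rule types permitted
# next. It encodes the assumptions that CIVIC_C precedes ARC_C with
# nothing in between, and that EXTRA_C clauses occur only before the
# house number or after the MICRO attributes.
TRANSITIONS = {
    "START":       {EXTRA_C, CIVIC_C},
    "AFTER_EXTRA": {CIVIC_C},
    "AFTER_CIVIC": {ARC_C},
    "AFTER_ARC":   {EXTRA_C, MACRO_C},
}

def permissible(state, rule_type):
    """Is a rule of this type allowed in this parse state?"""
    return rule_type in TRANSITIONS.get(state, set())
```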
The standardizer classifies its standardized output by what are called here Postal Attributes. The number associated with each class (given here as the token number) appears in the rules. The postal attributes used fall into the functional classes described in the following sections. The standardizer parses the address in both the reference and target using these attributes in order to facilitate matching. The attributes used correspond closely to the SADS classification.
Because North American road network files combine in a single record the block faces on both sides of the street, it is convenient to divide the attributes into those which are always the same for both block faces (MICRO) and those which may be different (MACRO). That is, MACRO attributes belong to the contiguous polygons rather than the arc itself. This bifurcation necessitates separate standardizations in both the build and match phases.
(token number "1"). (SADS element: "COMPLETE ADDRESS NUMBER"). This is the civic address number. Example, the 3715 in 3715 TENTH AVENUE WEST. In reference records it is associated with the blockface address ranges.
(token number "5"). (SADS element: "STREET NAME"). This is the root street name, stripped of directional or type modifiers. Example, the TENTH in 3715 WEST TENTH AVENUE.
(token number "7"). (SADS element: "STREET POST-DIRECTIONAL"). A directional modifier that follows the street name. Example, the WEST in 3715 TENTH AVENUE WEST.
(token number "2"). (SADS element: "STREET NAME PRE-DIRECTIONAL"). A directional modifier that precedes the street name. Example, the WEST in 3715 WEST TENTH AVENUE.
(token number "4"). (SADS element: "STREET PREFIX TYPE" ). A street type preceding the root street name. Example, the HIGHWAY in 3715 HIGHWAY 99.
(token number "6"). (SADS element: "STREET POST TYPE"). A street type following the root street name. Example, the AVENUE in 3715 WEST TENTH AVENUE.
(token number "3"). (combines SADS elements "STREET NAME PRE-MODIFIER" and "STREET NAME POST-MODIFIER"). Example, the OLD in 3715 OLD HIGHWAY 99.
(token number "10"). Example, "Albany".
(token number "11"). Example "NY".
(token number "12"). This attribute is not used in most reference files.
(token number "13"). (SADS elements "ZIP CODE" , "PLUS 4" ). This attribute is used for both the US Zip and the Canadian Postal Codes.
These attributes are isolated by the standardizer but not used in matching against reference addresses, which are currently assumed to be street addresses:
(token number "0"). Unparsed building identifiers and types.
(token number "14"). The BOX in BOX 3B.
(token number "15"). The 3B in BOX 3B.
(token number "8"). The RR in RR 7.
(token number "16"). The APT in APT 3B.
(token number "17"). The 3B in APT 3B.
(token number "9"). An otherwise unclassified output.
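Collecting the token numbers above into one table gives the following summary. The attribute names are the conventional PAGC identifiers where this document names them (HOUSE, STREET, SUFTYP, SUFDIR, QUALIF); the rest are assumed labels for the descriptions above.

```python
# Postal attributes keyed by token number, as enumerated above.
# Names not given explicitly in this document are assumed.
POSTAL_ATTRIBUTES = {
    0:  "BLDNG",    # unparsed building identifiers and types
    1:  "HOUSE",    # civic address number
    2:  "PREDIR",   # pre-directional
    3:  "QUALIF",   # pre/post modifier
    4:  "PRETYP",   # prefix street type
    5:  "STREET",   # root street name
    6:  "SUFTYP",   # post street type
    7:  "SUFDIR",   # post-directional
    8:  "RR",       # rural route
    9:  "UNKNWN",   # otherwise unclassified
    10: "CITY",
    11: "PROV",     # state or province
    12: "NATION",
    13: "POSTAL",   # ZIP or Canadian postal code
    14: "BOXH",     # the BOX in BOX 3B
    15: "BOXT",     # the 3B in BOX 3B
    16: "UNITH",    # the APT in APT 3B
    17: "UNITT",    # the 3B in APT 3B
}

# MACRO attributes belong to the surrounding polygons rather than
# the arc itself (see MICRO and MACRO above).
MACRO = {10, 11, 12, 13}
MICRO = {1, 2, 3, 4, 5, 6, 7}
```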
The tokenizer classifies standardizer input into the following classes. The number associated with each class appears in lexicon ( See Standardization Files) entries (which are essentially pre-classified input) and in the rules.
(13). The ampersand (&) is frequently used to abbreviate the word "and".
(9). A punctuation character.
(21). A sequence of two letters. Often used as identifiers.
(25). Fractions are sometimes used in civic numbers or unit numbers.
(23). An alphanumeric string that contains both letters and digits. Used for identifiers.
(0). A string of digits.
(15). Representations such as First or 1st. Often used in street names. Ordinals in PAGC are standardized as numbers.
(18). A single letter.
(1). A word is a string of letters of arbitrary length. A single letter can be both a SINGLE and a WORD.
(14). Words used to denote post office boxes. For example Box or PO Box.
(19). Words used to denote buildings or building complexes, usually as a prefix. For example Tower in Tower 7A.
(24). Words and abbreviations used to denote buildings or building complexes, usually as a suffix. For example, Shopping Centre.
(22). Words used to denote directions, for example North. Directions in PAGC are standardized as a full word (rather than an abbreviation).
(20). Words used to denote milepost addresses.
(6). Words and abbreviations used to denote highways and roads. For example, the Interstate in Interstate 5.
(8). Words and abbreviations used to denote rural routes. For example, RR.
(2). Words and abbreviations used to denote street types. For example, ST or AVE.
(16). Words and abbreviations used to denote internal subaddresses. For example, APT or UNIT.
(28). A 5-digit number. Identifies a ZIP Code.
(29). A 4-digit number. Identifies a ZIP4.
(27). A 3-character sequence of letter, number, letter. Identifies an FSA, the first 3 characters of a Canadian postal code.
(26). A 3-character sequence of number, letter, number. Identifies an LDU, the last 3 characters of a Canadian postal code.
STOPWORDs combine with WORDs. In rules, a string of multiple WORDs and STOPWORDs is represented by a single WORD token.
(7). A word with low lexical significance that can be omitted in parsing. For example, THE.
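The form-based side of this classification can be sketched with regular expressions. This is an illustrative reconstruction rather than the PAGC scanner, and the class labels are descriptive; only the token numbers come from the list above.

```python
import re

# (token number, class label, pattern). A token may match several
# classes; the real scanner likewise allows ambiguity (e.g. a single
# letter is both a SINGLE and a WORD).
FORM_CLASSES = [
    (25, "FRACTION",  re.compile(r"\d+/\d+")),
    (15, "ORDINAL",   re.compile(r"\d+(ST|ND|RD|TH)", re.I)),
    (28, "ZIP",       re.compile(r"\d{5}")),
    (29, "ZIP4",      re.compile(r"\d{4}")),
    (0,  "NUMBER",    re.compile(r"\d+")),
    (27, "FSA",       re.compile(r"[A-Z]\d[A-Z]", re.I)),
    (26, "LDU",       re.compile(r"\d[A-Z]\d", re.I)),
    (23, "MIXED",     re.compile(r"(?=.*[A-Z])(?=.*\d)[A-Z0-9]+", re.I)),
    (18, "SINGLE",    re.compile(r"[A-Z]", re.I)),
    (21, "DOUBLE",    re.compile(r"[A-Z]{2}", re.I)),
    (1,  "WORD",      re.compile(r"[A-Z]+", re.I)),
    (13, "AMPERSAND", re.compile(r"&")),
    (9,  "PUNCT",     re.compile(r"[^\w\s]")),
]

def scan(token):
    """Return every form class the token matches, in table order."""
    hits = [(n, label) for n, label, p in FORM_CLASSES if p.fullmatch(token)]
    return hits or [(1, "WORD")]
```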
In order to function, PAGC requires three files in addition to those identified on the command line. These files must reside where the program can find them: in the same directory as the reference shapeset, in the current working directory, or in the default installation directory. These files are:
The standardization rules are contained in the file rules.txt. It is read in at initialization and stored in such a way that the Aho-Corasick algorithm can be applied to place a string of input tokens ( See Input Tokens) in one-to-one correspondence with a string of output tokens ( See Postal Attributes).
If rules.txt is not found after the program looks first in the reference shapeset's directory, then in the current working directory, and lastly in the default installation directory, the program reports "Could not find file: rules.txt" and aborts.
The rule file consists of a list of rules, expressed as lines of space delimited integers. Each rule consists of a set of non-negative integers representing input tokens, terminated by a -1, followed by an equal number of non-negative integers representing postal attributes, terminated by a -1, followed by an integer representing a rule type, followed by an integer representing the rank of the rule. The rules are ranked from 0 (lowest) to 17 (highest).
The file is terminated by a -1.
The following is an example rule:
2 0 2 22 3 -1 5 5 6 7 3 -1 2 6
This rule maps the sequence of input tokens TYPE NUMBER TYPE DIRECT QUALIF to the output sequence STREET STREET SUFTYP SUFDIR QUALIF. The rule is an ARC_C rule of rank 6.
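A reader for one such rule line, following the layout just described, might look like this sketch (the names are descriptive, not PAGC's):

```python
def parse_rule(line):
    """Split one rule line into (input tokens, output tokens, type, rank).

    Format: input tokens, -1, the same number of output tokens, -1,
    rule type, rank (0 = lowest, 17 = highest).
    """
    nums = [int(field) for field in line.split()]
    sep = nums.index(-1)
    inputs = nums[:sep]
    rest = nums[sep + 1:]
    sep2 = rest.index(-1)
    outputs = rest[:sep2]
    rule_type, rank = rest[sep2 + 1:sep2 + 3]
    assert len(inputs) == len(outputs), "token counts must match"
    assert 0 <= rank <= 17
    return inputs, outputs, rule_type, rank
```

Applied to the example rule above, this recovers the five input tokens, the five postal attributes, rule type 2 (ARC_C) and rank 6.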
Rule types.
(rule type "0"). MACRO_C: the class of rules for parsing MACRO clauses.
(rule type "1"). MICRO_C: the class of rules for parsing full MICRO clauses (i.e. ARC_C plus CIVIC_C). These rules are not used in the build phase.
(rule type "2"). ARC_C: the class of rules for parsing MICRO clauses, excluding the HOUSE attribute.
(rule type "3"). CIVIC_C: the class of rules for parsing the HOUSE attribute.
(rule type "4"). EXTRA_C: the class of rules for parsing EXTRA attributes, i.e. attributes excluded from geocoding. These rules are not used in the build phase.
The rule file is intended to be user-modifiable ( See Changing the Rules). Entries can be deleted, added or altered on a provisional basis. The distribution form of the file is not intended to be definitive. The program, when searching for rules.txt, looks first in the reference shapeset's directory; modified versions of the file can be placed there in order to supersede other versions.
The files lexicon.csv and gazeteer.csv are read in when the standardizer is initialized. They are used to classify alphanumeric input and associate that input with (a) input tokens ( See Input Tokens) and (b) standardized representations.
If one of these files is not found after the program looks first in the reference shapeset directory, then in the current working directory, and lastly in the default installation directory, the program reports "Could not find file: FILE_NAME" and aborts.
The format of these records is that of a comma-separated (or comma-delimited) file. There are four fields, each of the first three terminated by a comma and the fourth terminated by the line end. The first field is the definition number. It should be a positive integer; it is used to reference the lookup value. The second field is the lookup key: the text of the word, abbreviation or phrase that may occur in input. The third field is the input token number, and the fourth field is the text of the standardization value. For example, the lookup key ST has these values in the lexicon: "1","ST",2,"STREET" and "2","ST",7,"SAINT". The 2 in the first entry indicates that when "ST" is standardized as "STREET", it is classified as input token 2 (TYPE). The 7 in the second entry indicates that when standardized as "SAINT", the key "ST" is classified as a STOPWORD.
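Reading such entries can be sketched with Python's csv module; the variable names are descriptive labels for the four fields just listed, and the two ST rows reproduce the example above.

```python
import csv
import io

# The two example rows for the lookup key ST, quoted as in the file.
SAMPLE = '"1","ST",2,"STREET"\n"2","ST",7,"SAINT"\n'

def read_lexicon(stream):
    """Map each lookup key to its list of (input token number,
    standardization). A key such as ST may carry several definitions;
    each one becomes a separate tokenization candidate."""
    lexicon = {}
    for def_no, key, token, standard in csv.reader(stream):
        lexicon.setdefault(key, []).append((int(token), standard))
    return lexicon
```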
These files are intended to be user-modifiable ( See Changing the Lexicons). Entries can be deleted, added or altered on a provisional basis. The distribution forms of the files are not intended to be definitive. Modified versions of these files can be placed in the reference shapeset directory in order to supersede other versions.