The Association for GIS Professionals
URISA


From Text to Geographic Coordinates: The Current State of Geocoding

 

(Version 07/31/06)

 

Daniel W. Goldberg, John P. Wilson, Craig A. Knoblock

 

ABSTRACT: This article presents a survey of the state of the art in geocoding practices through a cross-disciplinary historical review of existing literature. We explore the evolving concept of geocoding and the fundamental components of the process. Frequently encountered sources of error and uncertainty are discussed, as well as existing measures used to quantify them. An examination of common pitfalls and persistent challenges in the geocoding process is presented, and the traditional methods for overcoming them are described.

 

INTRODUCTION

 

The process of geocoding forms a fundamental component of spatial analysis in a wide variety of research disciplines and application domains (e.g. Health - [Vine et al., 1998, Boulos, 2004, Rushton et al., 2006]; Crime Analysis - [Olligschlaeger, 1998, Ratcliffe, 2001]; Political Science - [Haspel and Knotts, 2005]; Computer Science - [Hutchinson and Veenendall, 2005b, Bakshi et al., 2004]). This act of turning descriptive locational data such as a postal address or named place into an absolute geographic reference has become a critical piece of the scientific workflow. However, the geocoding of today is a far cry from the geocoding of the past. Geocoding data that used to cost $4.50 per 1000 records as recently as the mid-1980s [Krieger, 1992] quickly moved to $1.00 per record in 2003 [McElroy et al., 2003], and can now be done for free with online services (e.g. Yahoo! Inc. [2006], Locative Technologies [2006]), with far greater spatial accuracy and match rates.

 

As the availability and accuracy of reference datasets have increased over the past several decades [Dueker, 1974, Werner, 1974, Griffin et al., 1990, Higgs and Martin, 1995, Martin and Higgs, 1996, Johnson, 1998a, Martin, 1999, Boscoe et al., 2004], geocoding has undergone marked transitions to accommodate and exploit changes in both data format and user expectations. These transitions can clearly be seen in the input, output, and internal processing of the geocoding process. The input data suitable for geocoding have expanded from simple postal addresses [O’Reagan and Saalfeld, 1987] to include textual descriptions of relative locations [Levine and Kim, 1998, Davis et al., 2003, Hutchinson and Veenendall, 2005b]. The output capabilities of the geocoding process have moved from simple nominal geographic codes [Tobler, 1972, Dueker, 1974, Werner, 1974, O’Reagan and Saalfeld, 1987] to full-fledged three-dimensional (3D) geospatial entities [Beal, 2003, Lee, 2004]. Likewise, the internal processing mechanisms that produce the geographic output have moved from simple feature assignment [O’Reagan and Saalfeld, 1987] to complex interpolation algorithms using a variety of heterogeneous data sources [Bakshi et al., 2004, Hutchinson and Veenendall, 2005a,b].

 

While significantly improving the usability, reliability, and accuracy of the geocoding process, these developments have brought with them a host of issues that a potential user must recognize and be prepared to contend with. Specific issues include the assumptions made during the interpolation process [Dearwent et al., 2001, Karimi et al., 2004], the underlying accuracy of the reference dataset [Gatrell, 1989, Block, 1995, Drummond, 1995, Martin and Higgs, 1996, Chung et al., 2004], the uncertainty in the matching algorithm [O’Reagan and Saalfeld, 1987, Jaro, 1984], and the choice of areal unit geocoded to [Krieger, 1992, Geronimus et al., 1995, Geronimus and Bound, 1998, Krieger et al., 2002a, 2003]. These topics have received considerable research attention in recent years, and a great deal of literature is available. This article will survey the field of geocoding through a cross-disciplinary study of the geocoding literature, focusing foremost on the technical aspects of the process. The changing concept of geocoding will be described, and the fundamental components of the geocoder will be outlined. Potential sources of error in the geocoding process will be explored, and particularly difficult geocoding scenarios requiring further research will be highlighted. The primary contributions of this article will be to inform the reader of the state of the art in geocoding through a discussion of its evolution over time, and to warn of the difficulties that can arise in the geocoding process if users are not aware of how their decisions and assumptions can affect the geocoded results. This work should be seen as distinct from the recent work published by Rushton et al. [2006], which also offers a review of the geocoding process but is focused on its application to health research, in particular cancer studies.
Their work takes a narrow and limited view of geocoding and does not delve as deeply into the evolution or technical aspects of the geocoding process as does that presented here. As such, this paper can be seen as a more comprehensive, technically targeted, broadly visioned journey through the geocoding process and should be used as a companion article to field-specific reviews such as Rushton et al. [2006].

 

THE CONCEPT OF GEOCODING

 

Over the years, the changing availability of geographic data has forced the concept of geocoding to remain flexible and adaptive in terms of its requirements and capabilities. The increasing availability, accuracy, and reliability of digital geographic reference datasets have meant that the geocoding process has continually evolved to keep pace with the underlying datasets that facilitate its use. As such, practitioners have been pushing the boundaries of what types of information can be geocoded using different information sources from the very beginning. Early geocoding systems used by the US Census in the 1960s simply turned postal addresses and named buildings into geographical zones delineated by numerical codes [O’Reagan and Saalfeld, 1987], not the valid geographic objects such as points, lines, areas, or surfaces to which consumers of geocoded data are accustomed today. More modern attempts at geocoding have tackled the problems of assigning valid geographic codes to far more types of locational descriptions such as street intersections [Levine and Kim, 1998], enumeration districts (census delineations) [Sheehan et al., 2000], postal codes (zip codes) [Gatrell, 1989, Collins et al., 1998, Sheehan et al., 2000, Krieger et al., 2002b, Hurley et al., 2003], named geographic features [Davis et al., 2003, United Nations Economic Commission, 2005], and even free form textual descriptions of locations [Wieczorek et al., 2004, Hutchinson and Veenendall, 2005a,b].

 

These fundamental shifts in geocoding attitudes and opportunities can be traced directly to the technological advances made to the underlying reference datasets on which they are based. The early attempts at geocoding were hindered by the lack of digital geographies to use in the assignment of codes, and were limited by their use of flat text-based files. This resulted in low resolution non-geographic output, turning addresses and building names into the census block to which they belonged. The development of true digital geographies in the form of products like the US Census Bureau’s Dual Independent Map Encoding (DIME) files enabled the assignment of true geographic codes, but their structure limited the processing that could be applied to derive the output. The introduction of vector-based geographic datasets such as the US Census Bureau’s Topologically Integrated Geographic Encoding and Referencing (TIGER) [United States Census Bureau, 2006] database has enabled new generations of geocoding algorithms to approximate representations for the geographic output using interpolation-based approaches, greatly increasing the resolution of the geographic output [Dueker, 1974, O’Reagan and Saalfeld, 1987, Martin, 1998, Ratcliffe, 2001, Nicoara, 2005]. Taking this a step further, the creation of pre-compiled geocoded national address registers such as the ADDRESS-POINT [Ordnance Survey, 2006] and G-NAF [Paull, 2003] databases in the UK and Australia, respectively, has facilitated highly precise geocoding capabilities at national scales [Higgs and Martin, 1995, Martin, 1998, Ratcliffe, 2001, Churches et al., 2002, Higgs and Richards, 2002, Christen et al., 2004, Christen and Churches, 2005, Murphy and Armitage, 2005].
Further, the emergence of high resolution digital parcel and property boundary files may enable even more accurate digital geographic results to be returned [Dueker, 1974, Olligschlaeger, 1998, Dearwent et al., 2001, Ratcliffe, 2001, Rushton et al., 2006], but these developments are pushing the limits of what form the output of geocoding should take. Likewise, the development of multi-resolution gazetteers defining geographic footprints for named geographic places, such as the Alexandria Digital Library Gazetteer [Frew et al., 1998, Hill and Zheng, 1999, Hill et al., 1999, Hill, 2000], is pushing the limits of what type of geographic features can have geographic codes assigned to them [Davis et al., 2003, United Nations Economic Commission, 2005], as well as the role of the geocoder in the larger geospatial information processing context. The proliferation of diverse types of locational addressing systems throughout the world precludes a “one size fits all” geocoding strategy that will work in all cases [Fonda-Bonardi, 1994, Lind, 2001, Davis et al., 2003, Walls, 2003, United Nations Economic Commission, 2005].

 

The result of this evolution is a somewhat “fuzzy” concept of geocoding, tailored to the specific requirements and data availability of the person doing the geocoding. For example, almost everyone involved in or using geocoding today would agree that turning a postal address into a geographic point is most certainly included in the set of geocoding operations. Likewise, they would probably agree that turning a portion of the postal address such as the post code (zip code) into a geographic point or polygon is also part of the geocoding process. However, continuing this line of reasoning presents a slippery slope because a series of fundamental questions arise. What should the point returned as representative of the postal code be? Should it be the center of mass (centroid)? Should it be weighted by the population distribution? Further, if the digital boundary of the postal code is available, why not return it instead of just a single point? Questions like these are just the beginning. If the postal code can be geocoded, can the city be as well? If so, what is the difference between the geocoder returning a geographic representation of the city and the gazetteer doing the same? And if they are in fact performing the same operation, why is it commonly understood that a gazetteer can provide geographic representations for a wide variety of geographic features like rivers, mountains, and shorelines, while these are seldom thought of as candidates for the geocoding process? We can see through this discussion that the term geocoding can mean different things to different people, and their perception will be based on their primary experience with a particular geocoding tool. To some, “geocoding” is synonymous with “address matching” [e.g. Drummond, 1995, Vine et al., 1998, Bonner et al., 2003], highlighting its prevalent use of transforming postal addresses into geographic representations [Drummond, 1995, page 250].
For others, “geocoding” is understood to produce a valid geographic output, but its input is not necessarily limited to simple postal addresses [e.g. Levine and Kim, 1998], and still further distinctions can be drawn between the two terms [Johnson, 1998a, page 25]. Taken literally, geocoding means “to assign a geographic code”. This definition stems from the two root words: geo - from the Greek for Earth, and coding - defined as “applying a rule for converting a piece of information into another” (similar to that defined early on in the geocoding literature [Dueker, 1974, page 320]). Notice that this literal definition does not imply or constrain in any way the input to the geocoding system, the processes or data sources used to assign the geographic code, or even what the geographic code returned as output must be. It is precisely this relaxation of formal constraints on the geocoding process that has allowed it to mature and prosper into the many forms that we use today, and that will in turn drive the technological advances of tomorrow.

 

GEOCODING FUNDAMENTALS

 

Even with this varied notion of geocoding, it is still possible to characterize it in terms of its fundamental components: the input, output, processing algorithm, and reference dataset [Levine and Kim, 1998, Karimi et al., 2004, Yang et al., 2004, Nicoara, 2005]. The input is the locational reference the user wishes to have geographically referenced, which contains attributes capable of being matched to some datum that has been previously geographically coded. The most common data to be geocoded are postal addresses. In fact, there are very few geocoding services which geocode anything other than postal address data. The simple reason for this is that postal address data are among the most prevalent forms of information [Eichelberger, 1993], and address geocoding is cited often throughout the literature as a national health goal which will “be the basis for data linkage and analysis in the 21st century” [United States Department of Health and Human Services, 2000, goal 23-3]. Address data are how people locate, situate and navigate themselves, and are presently the easiest method by which to describe one’s location [Walls, 2003]. In the future, when all cellular phones come equipped with reliable GPS units and all homes and businesses are geographically referenced with coordinates available via wireless location-based services, the postal address may in fact become obsolete. But for the foreseeable future, the postal address will remain a critical and ubiquitous form of data throughout most forms of information processing.

 

However as previously noted, address data are not the only type of locational data that can or should be geocoded. Even the earliest geocoding systems of the US Census accounted for the geocoding of named buildings [O’Reagan and Saalfeld, 1987], but the task of associating geocodes with geographic features other than addresses is most commonly associated with the services provided by a gazetteer [Hill, 2000]. The problem with this, though, is that a gazetteer typically does not contain the functionality to generate the geocodes that it returns, instead acting as a storage mechanism after the geocodes have already been determined using other methods. As such, the geocoder is commonly employed to produce the geocodes for features in the gazetteer that are address based, emphasizing the crucial connection between the two components as part of a larger spatial query and analysis framework. This situation is displayed in figure 1 where the geocoder is shown to be one of many possible sources of footprint data for a gazetteer, with itself being composed of several data sources.

 


Figure 1 - Relationship between the gazetteer and geocoder.

 

The output is the geographically referenced code determined by the processing algorithm to represent the input. In most situations, the output is a simple geographic point, but nothing forbids it from being any valid type of geographic object. The development of detailed spatial datasets is enabling the output of increasingly detailed multidimensional geographic features, including the emergence of 3D indoor geocoding solutions [Beal, 2003, Lee, 2004].

 

The processing algorithm determines the appropriate geographic code to return for a particular input based on the values of its attributes and the values of attributes in the reference dataset. This is by far the most complicated portion of the geocoding process in which the most research has been invested. The key topics involved in the process include the standardization and normalization of the input into a format and syntax compatible with that of the reference dataset [Johnson, 1998b, Churches et al., 2002, Laender et al., 2005, Nicoara, 2005], the matching algorithm that picks the best feature in the reference dataset [Drummond, 1995, Vine et al., 1998, Davis et al., 2003, Bakshi et al., 2004], and the final geocode generation mechanism that determines what to return based on the reference feature selected as the best match [Drummond, 1995, Levine and Kim, 1998, Ratcliffe, 2001, Cayo and Talbot, 2003, Davis et al., 2003]. Figure 2 shows a schematic diagram of how a simple deterministic processing algorithm could proceed using standardization, normalization and attribute relaxation. The standardization and normalization process can vary in complexity from simple token parsing with lookup tables for standardizing abbreviations to advanced probabilistic methods using machine learning techniques such as Hidden Markov Models that can handle attribute misspellings and misplacements [O’Reagan and Saalfeld, 1987, Fulcomer et al., 1998, Churches et al., 2002, Christen et al., 2004, Yang et al., 2004, Christen and Churches, 2005, Nicoara, 2005]. In general, the key role performed in this step is to determine what each piece of the input is and to turn each into versions consistent with those in the reference dataset.
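The simplest end of this spectrum, token parsing with lookup tables, can be sketched as follows. This is a minimal illustration and not the method of any system cited above: the function name `standardize_address`, the two abbreviation tables, and the assumed token order (number, optional directional, name, optional suffix) are all hypothetical.

```python
import re

# Hypothetical lookup tables; a real system would use a much larger
# vocabulary tailored to its reference dataset.
SUFFIXES = {"street": "ST", "st": "ST", "avenue": "AVE", "ave": "AVE",
            "boulevard": "BLVD", "blvd": "BLVD", "road": "RD", "rd": "RD"}
DIRECTIONS = {"north": "N", "n": "N", "south": "S", "s": "S",
              "east": "E", "e": "E", "west": "W", "w": "W"}


def standardize_address(raw):
    """Split a raw street address into labeled, normalized attributes.

    Assumes the simple token order: number [direction] name... [suffix].
    Returns a dict with keys 'number', 'direction', 'name', 'suffix'.
    """
    # Normalize: lowercase, replace punctuation with spaces, split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    parsed = {"number": None, "direction": None, "name": None, "suffix": None}

    # A leading all-digit token is treated as the house number.
    if tokens and tokens[0].isdigit():
        parsed["number"] = int(tokens.pop(0))
    # A directional is only accepted when at least a name and one more token
    # follow it, so that "North St" keeps NORTH as the street name.
    if len(tokens) > 2 and tokens[0] in DIRECTIONS:
        parsed["direction"] = DIRECTIONS[tokens.pop(0)]
    # A trailing token found in the suffix table is the street type.
    if len(tokens) > 1 and tokens[-1] in SUFFIXES:
        parsed["suffix"] = SUFFIXES[tokens.pop()]
    # Whatever remains is taken to be the street name.
    if tokens:
        parsed["name"] = " ".join(tokens).upper()
    return parsed
```

Under these assumptions, `standardize_address("631 N. Main Street")` labels 631 as the number, N as the directional, MAIN as the name, and ST as the suffix. Misspellings and misplaced tokens are exactly what such a table-driven parser cannot handle, which is the motivation for the probabilistic methods mentioned above.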

 


Figure 2 - Schematic of deterministic address matching with attribute relaxation.

 

Once the input has been sufficiently massaged to be compatible with the reference dataset, the matching process picks the best candidate to be used to derive the final output. Tricks such as word stemming, using SOUNDEX, and relaxing the requirement of matching all attributes can be used to improve the probability of finding a match in the reference dataset [O’Reagan and Saalfeld, 1987, Drummond, 1995, Fulcomer et al., 1998, Johnson, 1998a, Levine and Kim, 1998, Gregorio et al., 1999, Boscoe et al., 2002, Churches et al., 2002, Beal, 2003, Christen et al., 2004, Yang et al., 2004, Christen and Churches, 2005, Nicoara, 2005]. Here, the issue may arise that zero, one, or more than one reference feature can be the best possible match. In the case of one match, the algorithm will use it to determine a geocode. In the case of zero, the matching algorithm may prompt the user for more information, attempt to geocode at a lower resolution with additional datasets, or try to find additional information in other datasets to enable a match [Laender et al., 2005]. Likewise, in the case of multiple matches, the algorithm may prompt the user to determine the appropriate one or consult additional datasets for more information to use in breaking the tie [Hutchinson and Veenendall, 2005a,b].
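A minimal sketch of deterministic matching with SOUNDEX and attribute relaxation is given below. The `soundex` function follows the standard American Soundex rules; everything else (the `REFERENCE` records, the attribute names, the relaxation order, and the function `find_candidates`) is hypothetical and chosen only for illustration. The street name is deliberately never relaxed in this sketch.

```python
# Standard American Soundex letter groups.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code of a word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit              # a vowel resets prev to ""
    return (code + "000")[:4]


# Made-up reference features, already standardized.
REFERENCE = [
    {"id": 1, "name": "MAIN", "suffix": "ST", "direction": None, "zip": "99999"},
    {"id": 2, "name": "MAIN", "suffix": "AVE", "direction": None, "zip": "99999"},
    {"id": 3, "name": "OAK", "suffix": "ST", "direction": None, "zip": "99999"},
]


def _agrees(query, feature, required):
    """True if the feature agrees with the query on every required attribute."""
    for attr in required:
        if attr == "name":
            if soundex(query["name"]) != soundex(feature["name"]):
                return False
        elif query.get(attr) != feature.get(attr):
            return False
    return True


def find_candidates(query, reference, relax_order=("direction", "suffix")):
    """Return (candidates, relaxed): the matching features and the list of
    attributes that had to be dropped before any feature matched."""
    required = ["name", "suffix", "direction", "zip"]
    remaining = list(relax_order)
    relaxed = []
    while True:
        candidates = [f for f in reference if _agrees(query, f, required)]
        if candidates or not remaining:
            return candidates, relaxed
        attr = remaining.pop(0)   # drop the next attribute and try again
        required.remove(attr)
        relaxed.append(attr)
```

With this data, a query for "N MAYNE ST" in zip 99999 yields exactly one candidate (feature 1) after the directional is relaxed, while a query for "MAIN RD" yields two candidates after both directional and suffix are relaxed. The caller still has to decide how to break such a tie, and how to proceed when the candidate list comes back empty.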

 

In any case, once the appropriate reference feature has been selected, the algorithm must determine the appropriate geocode for output based on the input and the reference feature. In the case of a precompiled geocoded dataset such as ADDRESS-POINT [Ordnance Survey, 2006] or G-NAF [Paull, 2003], the algorithm can simply return the existing geographic representation. However, in the case of TIGER [United States Census Bureau, 2006], the output geography must be derived based on the line segment determined to be a match. Here, interpolation algorithms deduce the appropriate output geography based on attributes of the street segment like address ranges and polarity [Drummond, 1995, Levine and Kim, 1998, Ratcliffe, 2001, Cayo and Talbot, 2003, Davis et al., 2003]. In general, these interpolation algorithms work by first identifying the correct street segment in the reference data source based on the attributes of the address to be geocoded and the attributes of the street segment (address ranges associated with both sides of the segment, street name, street suffix, etc.). Once found, the appropriate side of the street segment is ascertained using the polarity (even/odd) of the address and each of the street segment sides. The correct location along the street segment is then determined by computing where the address in question would fall as a proportion of the total address range associated with the appropriate side of the street segment. This proportion is then applied to the total length of the street segment to obtain a location along the centerline of the street, and additional parameters such as distance and direction from the street center and offset from the endpoints of the street can be introduced to further improve the accuracy [Ratcliffe, 2001, Cayo and Talbot, 2003].
Additional data sources can be consulted to obtain knowledge about the number of parcels on the street and their geographic distribution [Bakshi et al., 2004] to overcome the parcel homogeneity assumption [Dearwent et al., 2001] that all parcels within an address range truly exist and have the same dimensions. These points are illustrated in figures 3 through 6.
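The interpolation steps described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the algorithm of any particular geocoder: coordinates are taken to be planar, the street is a single straight segment, the centerline offset is applied perpendicular to the street, and the function and parameter names are hypothetical. Which side of the sample block carries the odd range is also assumed.

```python
import math


def interpolate_address(number, start, end, left_range, right_range,
                        street_offset=0.0, corner_offset=0.0):
    """Estimate an (x, y) position for a house number on a street segment.

    start, end    -- (x, y) endpoints of the centerline, in planar units
    left_range    -- (from_number, to_number) on the left side, start -> end
    right_range   -- (from_number, to_number) on the right side
    street_offset -- perpendicular distance from the centerline to the parcel
    corner_offset -- distance held back from each endpoint of the segment
    Returns None when the number fits neither side's parity and range.
    """
    # Step 1: pick the side whose parity and range contain the number.
    side = None
    for name, (lo, hi) in (("left", left_range), ("right", right_range)):
        if number % 2 == lo % 2 and min(lo, hi) <= number <= max(lo, hi):
            side, first, last = name, lo, hi
            break
    if side is None:
        return None

    # Step 2: proportion of the way through the address range.
    proportion = 0.5 if first == last else (number - first) / (last - first)

    # Step 3: apply the proportion to the usable length of the centerline.
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return None
    ux, uy = dx / length, dy / length
    corner = min(corner_offset, length / 2)
    along = corner + proportion * (length - 2 * corner)
    cx, cy = start[0] + ux * along, start[1] + uy * along

    # Step 4: step off the centerline; (-uy, ux) points to the left.
    sign = 1.0 if side == "left" else -1.0
    return (cx - sign * uy * street_offset, cy + sign * ux * street_offset)
```

Using the address ranges of the sample block (601 through 649 odd, 600 through 648 even) on a hypothetical 100-unit segment running from (0, 0) to (100, 0), address 631 falls at proportion (631 - 601) / (649 - 601) = 0.625 of the range, so the centerline location is (62.5, 0). Every address in the range is treated as existing and equally spaced, which is the assumption the additional data sources are meant to relax.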

 

Figure 3 shows the parameters for the interpolation algorithm: the street centerline offset distance and angle, the corner offset distance, and the interpolated distance to the center of the parcel. Also shown are the address ranges for each side of the segment, 601 through 649 on the odd parity side, and 600 through 648 on the even parity side. Figure 4 shows a sample block segment with the geocoded position of 631 Main Street displayed. Figure 5 displays how the parcel homogeneity assumption divides the segment into equal sized portions for all addresses within the range of the street segment, placing the geocoded point for address 631 at the wrong location (shown as ring) compared to the true location (shown as shaded ring). Figure 6 also displays the parcel homogeneity assumption, but in this case the true number of parcels on the street is known and the resulting geocoded point for address 631 is at a closer location (shown as ring) to that of the true location (shown as shaded ring).

When using area-based reference features such as postal code and parcel polygons to compute point geographies to return as output, the algorithm must calculate an appropriate centroid [Stevenson et al., 2000, Dearwent et al., 2001, Ratcliffe, 2001]. It may simply return the center of mass of the object, or it may perform more complex calculations in conjunction with other information such as population distributions across an area to determine a more representative weighted centroid [Gatrell, 1989, Durr and Froggatt, 2002].
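The two centroid choices can be contrasted with a short sketch. The first function uses the standard shoelace formula for the center of mass of a simple polygon; the second computes a weighted mean of sample points, which stands in here for a population-weighted centroid. Both function names and the planar-coordinate assumption are illustrative, and the cited studies may compute their weighted centroids differently.

```python
def polygon_centroid(vertices):
    """Center of mass of a simple (non-self-intersecting) polygon.

    vertices -- list of (x, y) tuples in order around the boundary.
    """
    twice_area = 0.0
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area.
    return (cx / (3 * twice_area), cy / (3 * twice_area))


def weighted_centroid(points):
    """Weighted mean location of (x, y, weight) samples, e.g. the
    populations of smaller units that fall inside a postal code."""
    points = list(points)
    total = sum(w for _, _, w in points)
    if total == 0:
        raise ValueError("weights sum to zero")
    x = sum(px * w for px, _, w in points) / total
    y = sum(py * w for _, py, w in points) / total
    return (x, y)
```

For a 4 by 2 rectangle with a corner at the origin, the center of mass is (2, 1). If three quarters of the weight sits at one end, for example samples (0, 0) with weight 1 and (4, 0) with weight 3, the weighted centroid moves to (3, 0), which shows how far the two definitions can diverge for the same area.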

 


Figure 3 - Sample block showing parameters of the geocoding algorithm.

 


Figure 4 - Sample address block with true parcel arrangement showing true geocoded point as ring.

 


Figure 5 - Sample address block with parcel homogeneity assumption using address range showing erroneous geocoded point as ring and true geocoded point as shaded ring.

 


Figure 6 - Sample address block with parcel homogeneity assumption using actual number of parcels showing erroneous geocoded point as ring and true geocoded point as shaded ring.

 

The reference dataset consists of the geographically coded information that can be used to derive the appropriate geographic code for an input. As noted earlier, the datasets used as geocoding reference files have changed rapidly over time and are responsible for driving new technological breakthroughs in geocoding methodologies. The early datasets of text-based lists have given way to true digital geographic datasets, and are rapidly moving toward advanced 3D representations. The underlying advances in terms of efficient storage, retrieval, and indexing have allowed these datasets to grow expansively in size, detail of resolution, and speed of access. The only constraint on these datasets is that they need to maintain attributes in a consistent fashion throughout, such that the standardization and normalization algorithms can work toward transforming the input data to be appropriate for finding a match.

 

GEOCODING ERROR

 

This broad definition of geocoding also brings with it a significant burden in the form of anticipating and/or quantifying geocoding error. Even simply defining what the error of the geocoding process is presents an arduous task. When one speaks of geocoding error, are they referring to the positional accuracy of the returned geographic object, the probability that the feature returned is the one that they wanted, or the validity of one or more assumptions used by the geocoding algorithm? Further definitions could include the error caused by the match rate, the weighting and relaxation techniques used in the standardization process, or the confidence cutoffs used during probabilistic matching. Common causes and effects of errors in each stage of the geocoding process are listed in table 1.

 

Table 1 - Common causes and effects of errors in stages of the geocoding process.
stage cause of error effect of error

 

It becomes obvious from this (far from exhaustive) list of commonly described error metrics that evaluating the error associated with a geocoded result is difficult at best, and at worst, not even taken into consideration. It is an unfortunate reality that even though a broad range of literature exists specifically geared to exposing how minor errors in geocoding accuracy can affect results based on detailed spatial models [e.g. Gatrell, 1989, Ratcliffe, 2001, Higgs and Richards, 2002, Bonner et al., 2003, Cayo and Talbot, 2003, Krieger, 2003, Krieger et al., 2005], recent research initiatives continue to employ geocoded data without regard for how the accuracy can introduce possible inconsistencies or bias into their results [Diez-Roux et al., 2001, Brody et al., 2002, Haspel and Knotts, 2005].

 

Several studies have attempted to quantify the error associated with the geocoding process, highlighting error introduction from specific aspects of the geocoding process [e.g. Davis et al., 2003, Karimi et al., 2004]. Upon evaluating a potential geocoding strategy, one should consider several key factors to determine if the outcome will meet their needs. First, what areal unit will the data be geocoded to? Will the output be to the granularity of individual postal addresses, or will it be to a larger delineation such as a census block or zip code, and will the implicit aggregation of using a larger unit have an effect on the results? This decision is a divisive topic in the geocoding literature, and studies have reached conflicting conclusions: some demonstrate that the choice of areal unit affects the outcomes, while others find that it does not [Geronimus et al., 1995, Geronimus and Bound, 1998, 1999a,b, Krieger and Gordon, 1999, Smith et al., 1999, Soobader et al., 2001, Krieger et al., 2002a, 2003, Gregorio et al., 2005]. Evaluating one’s confidence in the available scholarship will require personal judgment to determine if this could be an issue given a particular dataset and research objective.

 

Second, how accurate is the underlying data used as the reference dataset? Included in this discussion should be four concepts: spatial accuracy (how close are the features in the dataset to what is found on the ground? [Karimi et al., 2004, Wu et al., 2005]), temporal accuracy (how close are the features in this dataset to how they were at the time period of interest? [McElroy et al., 2003, Han et al., 2005]), original collection purpose (what were these data originally collected for? [Boulos, 2004]), and lineage (what processes have been applied to these data? [Veregin, 1999]). These aspects may be difficult to quantify because the accuracy measurements associated with datasets are estimates over the entire dataset, not on a per-feature basis. For example, while achieving an acceptable accuracy for short street segments in urban areas, the TIGER [United States Census Bureau, 2006] datasets most commonly used for linear interpolation geocoding in the US are known to be far less accurate for geocoding in rural areas with longer street segments [Drummond, 1995, Vine et al., 1998, Cayo and Talbot, 2003, Bonner et al., 2003, Wu et al., 2005]. The assumption of a consistent accuracy value for a dataset throughout its entire area of coverage is rarely discussed or noted as a point of contention in the determination of geocoding accuracy.

 

A third related issue arises when one considers multi-tiered geocoding approaches using multiple data sources. For example, it has been shown numerous times that geocoding match rates in rural areas are far lower than in urban areas [e.g. Gregorio et al., 1999, Kwok and Yankaskas, 2001, Boscoe et al., 2002, Bonner et al., 2003, Cayo and Talbot, 2003]. The typical approach to solving this problem involves a decision of whether to geocode to a less precise level or include additional detail from other sources to determine the correct geocode. Either choice creates a resulting dataset with varying degrees of accuracy as a function of location, a condition recently defined as “cartographic confounding” [Oliver et al., 2005] that has been alluded to many times, yet remained undefined throughout the history of geocoding research [Block, 1995, Ratcliffe, 2001, Cayo and Talbot, 2003, Nuckols et al., 2004, Ratcliffe, 2004, Gregorio et al., 2005]. A per-geocode accuracy measure is rarely maintained as a result of the geocoding process other than the level of geography matched to (e.g. census tract vs. block group), and rarely do spatial models include variables to model this phenomenon, although some researchers [Openshaw, 1989, Arbia et al., 1998, Cressie and Kornak, 2003, Gabrosek and Cressie, 2002] have begun developing models to account for it. Despite this, information describing the varying degrees of accuracy of each individual geocode is not typically represented during subsequent spatial analysis.
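One way to keep the level of geography attached to every result is sketched below. The `Geocode` record, the `geocode_tiered` function, and the idea of passing an ordered list of lookup callables are all hypothetical; the sketch only illustrates how a multi-tiered geocoder could carry the matched level forward so that later analysis can account for it.

```python
from dataclasses import dataclass


@dataclass
class Geocode:
    x: float
    y: float
    level: str  # tier that produced the point, e.g. "address" or "zip"


def geocode_tiered(query, tiers):
    """Try each tier in order and record which one produced the result.

    tiers -- sequence of (level_name, lookup) pairs ordered from most to
             least precise; lookup(query) returns (x, y) or None.
    Returns a Geocode, or None if every tier fails.
    """
    for level, lookup in tiers:
        xy = lookup(query)
        if xy is not None:
            return Geocode(xy[0], xy[1], level)
    return None
```

A rural record that misses at the address tier and falls back to a postal code centroid would then be returned with `level` set to the fallback tier rather than being indistinguishable from an address-level match.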

 

Fourth, one needs to determine if the assumptions made by the geocoding algorithm are applicable to their needs. As previously mentioned, the most common form of geocoding, linear interpolation, makes several key assumptions which can affect the level of accuracy of the results. First, it assumes that all addresses within an address range exist. Thus, when it determines the correct location for a particular address along a street segment by identifying the proportion along the segment where an address should fall, it will overestimate the number of addresses, placing the address at the wrong location. Second, it assumes a homogeneous distribution of addresses in terms of lot placement and size, known as the parcel homogeneity assumption [Dearwent et al., 2001, page 332]. This means that each lot on the street is assumed to have the same dimensions and to be oriented in the same direction, which is typically not a realistic assumption. Further, it does not take into account that the corner lot on a segment may belong to the segment in question, or to the segment that forms the corner [Bakshi et al., 2004]. While the magnitude of error introduced by these assumptions is small (on the order of half the length of the street segment [Wu et al., 2005, page 596]), it can have dramatic effects when the variable and/or relationships of interest (e.g. environmental exposure doses to pesticide - [Brody et al., 2002, Kennedy et al., 2003], air pollution - [Wu et al., 2005], or proximity to voting precincts - [Haspel and Knotts, 2005]) vary over tens or hundreds of meters, and becomes amplified as the landscape becomes more rural. Additionally, it has been shown that when geocodes are used for point in polygon operations to derive attributes from other datasets, small spatial errors in geocodes that lie along borders between the larger level features can cause serious misclassifications in combined data [Ratcliffe, 2001, Schootman et al., 2004].
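The effect of the first assumption can be illustrated numerically. The sketch below compares where the center of a lot falls, as a fraction of the segment length, when every number in the address range is assumed to be a lot versus when only a known list of existing house numbers is used. Both functions still assume equal-width lots. The function names and the list of existing addresses are invented for illustration, and the list is assumed to be ordered from the start of the segment to its end.

```python
def lot_fraction_from_range(number, first, last):
    """Fraction along the segment of the lot center when every same-parity
    number between first and last is assumed to be an equal-width lot."""
    step = 2 if last >= first else -2
    slots = abs(last - first) // 2 + 1   # potential addresses on this side
    index = (number - first) // step     # zero-based position of the lot
    return (index + 0.5) / slots


def lot_fraction_from_parcels(number, existing_numbers):
    """Fraction along the segment of the lot center when only the listed
    house numbers exist, each still assumed to occupy an equal-width lot."""
    ordered = list(existing_numbers)     # assumed ordered start -> end
    index = ordered.index(number)        # ValueError if the address is absent
    return (index + 0.5) / len(ordered)
```

For the range 601 through 649, the first function treats the odd side as 25 lots and places the center of 631 at 0.62 of the segment length. If only five lots actually exist, say 601, 611, 621, 631 and 649, the second function places it at 0.70. On a hypothetical 200 meter segment the two estimates differ by 16 meters, which is already within the range over which the exposure variables mentioned above can vary.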

 

Fifth, one needs to consider the uncertainty created by the aggregation or randomization performed on the resulting point to protect the identity of the geocoded object. This is most often the case in the geocoding of health data, where confidentiality requirements dictate that the geocode for an individual’s location be non-identifying. Research has shown that there are ways to trade off the usefulness of the data returned for spatial analysis against specific confidentiality requirements, but further work is required to quantify the effect of this in a geocoding context [Armstrong et al., 1999]. For a more thorough description of the issues involved specifically geared toward health research, refer to Boscoe et al. [2004] and Rushton et al. [2006].
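As a simple illustration of randomization, the sketch below displaces a point by a random distance in a random direction. It is only one possible form of masking, and whether it satisfies a given confidentiality requirement is not addressed here. The function name `mask_point`, the distance bounds, and the use of planar coordinates in meters are assumptions made for the example.

```python
import math
import random


def mask_point(x, y, min_dist, max_dist, rng):
    """Return (x, y) displaced by a random distance in a random direction.

    The displacement length lies between min_dist and max_dist, so the
    masked point is never closer than min_dist to the true location.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = rng.uniform(min_dist, max_dist)
    return (x + dist * math.cos(angle), y + dist * math.sin(angle))


rng = random.Random(42)  # seeded only so the example is repeatable
true_location = (1000.0, 2000.0)
masked = mask_point(*true_location, min_dist=50.0, max_dist=250.0, rng=rng)
error = math.hypot(masked[0] - true_location[0], masked[1] - true_location[1])
print(masked, error)
```

The displacement bounds make the tradeoff explicit: a larger `max_dist` gives stronger protection but adds more positional error to any subsequent spatial analysis.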

 

Finally, one needs to determine whether the intended spatial analysis can deal with uncertain geographic values. Here, one needs to make a fundamental decision as to whether probabilistic matching methods can be used, or whether the analysis is limited to strictly deterministic ones [O’Reagan and Saalfeld, 1987]. When interpreting an input query, the geocoding system must go through several steps to determine the “best” match in the reference dataset [Levine and Kim, 1998]. If the input can be matched directly to an existing geography, it can be returned immediately. However, it is more often the case that one needs to massage the input data to transform it into a format suitable for finding the best match. Locational data, and in particular postal address data, are notoriously “noisy”, meaning that the input very often contains extraneous information, missing information, or confusing non-standardization [Fulcomer et al., 1998, Ratcliffe, 2001, 2004, Murphy and Armitage, 2005, Nicoara, 2005]. In these cases, the geocoding algorithm is forced either to attempt to correct the input such that a match can be found or to return a non-match. It has been shown that with deterministic approaches such as relaxing the constraint that all attributes must match exactly and allowing partial matches with a variety of attribute weighting schemes, a higher match rate can be achieved, but at the price of accuracy. In particular, studies have found that relaxing the street name portion of an address will greatly reduce the accuracy of the geocoded results [Lixin, 1996, Bonner et al., 2003, Cayo and Talbot, 2003, Krieger, 2003, Rushton et al., 2006]. In contrast, probabilistic approaches to standardization [Jaro, 1984] have been used since very early on in the geocoding literature with much success [O’Reagan and Saalfeld, 1987] and continue to improve [Churches et al., 2002, Christen et al., 2004, Christen and Churches, 2005], but one must recognize that their results depend on the confidence level of their uncertainty measures and will in some cases be erroneous.
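The effect of relaxing exact matching through attribute weighting can be sketched as below. The attribute names, the integer weights, the threshold values, and the functions `match_score` and `best_match` are hypothetical choices made for illustration; they do not reproduce the weighting scheme of any cited study.

```python
# Hypothetical integer weights; street name is weighted most heavily.
WEIGHTS = {"number": 2, "street_name": 5, "street_type": 1, "postal_code": 2}


def match_score(query, candidate, weights=WEIGHTS):
    """Fraction of total attribute weight on which query and candidate agree."""
    total = sum(weights.values())
    matched = sum(
        w
        for attr, w in weights.items()
        if query.get(attr) is not None and query.get(attr) == candidate.get(attr)
    )
    return matched / total


def best_match(query, candidates, threshold):
    """Return (candidate, score) for the top candidate, or None below threshold."""
    if not candidates:
        return None
    score, cand = max(
        ((match_score(query, c), c) for c in candidates), key=lambda p: p[0]
    )
    return (cand, score) if score >= threshold else None


reference = [
    {"number": "123", "street_name": "MAIN", "street_type": "ST", "postal_code": "90089"},
    {"number": "123", "street_name": "MAINE", "street_type": "AVE", "postal_code": "90089"},
]
query = {"number": "123", "street_name": "MAIN", "street_type": "AVE", "postal_code": "90089"}

print(best_match(query, reference, threshold=1.0))  # exact matching only
print(best_match(query, reference, threshold=0.8))  # street type relaxed
```

Under exact matching the query is a non-match, while a threshold of 0.8 accepts the first reference record despite the differing street type. Lowering the threshold to 0.5 would also admit a candidate whose street name differs, which is the kind of relaxation the studies cited above associate with reduced accuracy.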

 

PERSISTENT GEOCODING DIFFICULTIES

 

For all of the technological advances and improvements that have been made to the geocoding process and the underlying reference datasets, the geocoding difficulties identified early on still exist. In developing countries with little GIS data infrastructure, the main roadblock to accurate geocoding is simply that reference datasets do not exist [Croner, 2003, United Nations Economic Commission, 2005]. The development of basic GIS reference datasets is hindered by slum-like areas that change frequently, contain geographic features that are not street addressable, and often lack a consistent addressing scheme [Davis, 1993, Oppong, 1999, Davis et al., 2003, United Nations Economic Commission, 2005]. Efforts are underway to remedy these situations through the development of standardized addressing systems that include facets for encouraging public participation aimed at promoting acceptance and eventual adoption, but these are costly endeavors being undertaken in areas with few economic resources to dedicate to the task [United Nations Economic Commission, 2005].

 

Even in developed countries such as the US, the existence of rural addresses and PO Boxes poses a continual headache for geocoding practitioners [Gregorio et al., 1999, Boscoe et al., 2002, Hurley et al., 2003, McElroy et al., 2003, Schootman et al., 2004, Gaffney et al., 2005, Oliver et al., 2005]. In the PO Box case, it is not possible to determine an accurate geocode because the information available about the address is simply not specific enough. The best that one can do is to geocode to a lower resolution such as a postal code centroid, but several studies have explored how this can introduce bias into the results produced with the geocoded data [Sheehan et al., 2000, Krieger et al., 2002b, Hurley et al., 2003]. Research initiatives have recently found creative ways to obtain enough specific information to produce a more accurate geocode through the use of secondary sources, including the PO Box renter’s address from the Postal Service, utility company records, and administrative records from government agencies. These tasks require human intervention and as such are quite expensive [Levine and Kim, 1998, Hurley et al., 2003, McElroy et al., 2003, Han et al., 2005].

 

The mandatory introduction of the E911 system in the US for all structures with telephones is improving geocoding by decreasing the number of rural addresses reported as address data and creating more accurate reference datasets [Johnson, 1998a, Cayo and Talbot, 2003, Levesque, 2003, Rose et al., 2004, Oliver et al., 2005], but historical data frequently used in research are not being updated, so the problem still remains. Again in this case, the geocoding practitioner is forced to obtain secondary information to identify what an appropriate city-style address would be for the location so it can successfully be geocoded. Additionally, as people move away from traditional land-line phones with the adoption of cell phone technology, the promise of E911 solving addressing issues begins to disappear.

 

A further problem, which the evolution of reference datasets may help solve, is that of sub-parcel geocoding. This case occurs when multiple structures reside on the same land parcel, as in apartment/condominium type properties and large campuses like universities and business parks, or when a single small structure is located somewhere within a much larger parcel, as on large farms. Here, geocoding to the centroid of the property may not provide sufficient accuracy for the detailed applications previously described [Gaffney et al., 2005]. However, including secondary data sources and operations such as high resolution imagery in conjunction with computer vision techniques to identify and separate buildings may help lead the way in this arena [Hutchinson and Veenendall, 2005b]. Additionally, integrating and conflating existing detailed maps of campuses [Chen et al., 2003, 2004] may enable the extraction of highly accurate polygons for building footprints, but automating this task is still an open research problem. Of course, the reliance of traditional and commonly used GIS platforms on two dimensional GIS data sources precludes highly precise geocoding of 3D structures with multiple addresses, such as multi-story buildings.
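The size of the error introduced by geocoding to a parcel centroid can be illustrated with a short calculation. The sketch below computes the area-weighted centroid of a simple polygon with the standard shoelace formula and measures its distance to a structure. The parcel dimensions and the structure location are invented for the example, and planar coordinates in meters are assumed.

```python
import math


def polygon_centroid(vertices):
    """Area-weighted centroid of a simple (non-self-intersecting) polygon."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return (cx / (3.0 * area2), cy / (3.0 * area2))


# Hypothetical 400 m x 300 m farm parcel with a house near one corner.
parcel = [(0.0, 0.0), (400.0, 0.0), (400.0, 300.0), (0.0, 300.0)]
house = (20.0, 20.0)

centroid = polygon_centroid(parcel)
error = math.hypot(centroid[0] - house[0], centroid[1] - house[1])
print(centroid, round(error, 1))
```

For this invented parcel the centroid lies more than 200 m from the house, a distance comparable to the tens or hundreds of meters over which the exposure variables discussed earlier can vary.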

CONCLUSION

 

This article has explored the state of the art in geocoding through a discussion of the path geocoding and its reference datasets have taken over the years. This work should serve as a starting point for potential geocoding projects by identifying the pitfalls and challenges that are commonly encountered. Each particular geocoding project will have its own requirements in terms of input and output data structure and format, confidentiality, cost, available tools, and technical know-how, but the survey presented here should allow a more thorough understanding of the ramifications of particular choices made during the process.

 

ACKNOWLEDGMENTS

 

This research is based upon work supported in part by the National Science Foundation under Award Number IIS-0324955, and in part by the University of Southern California Libraries. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any of the above organizations or any person connected with them.

 

REFERENCES

 

Arbia, G., D. Griffith, and R. Haining, 1998, Error Propagation Modeling in Raster GIS: Overlay Operations. International Journal of Geographical Information Science 12(2), 145–167.

 

Armstrong, M.P., G. Rushton, and D.L. Zimmerman, 1999, Geographically Masking Health Data to Preserve Confidentiality. Statistics in Medicine 18(5), 497–525.

 

Bakshi, R., C.A. Knoblock, and S. Thakkar, 2004, Exploiting Online Sources to Accurately Geocode Addresses. In D. Pfoser, I. F. Cruz, and M. Ronthaler (Eds.), ACM-GIS ’04: Proceedings of the 12th ACM International Symposium on Advances in Geographic Information Systems, Washington DC, USA, November, 2004, 194–203.

 

Beal, J.R., 2003, Contextual Geolocation, A Specialized Application for Improving Indoor Location Awareness in Wireless Local Area Networks. In T. Gibbons (Ed.), MICS2003: The 36th Annual Midwest Instruction and Computing Symposium, Duluth, Minnesota, USA, April, 2003.

 

Block, R., 1995, Geocoding of Crime Incidents Using the 1990 TIGER File: The Chicago Example. In C. R. Block, M. Dabdoub, and S. Fregly (Eds.), Crime Analysis Through Computer Mapping (Washington, DC: Police Executive Research Forum), 15.

 

Bonner, M.R., D. Han, J. Nie, P. Rogerson, J.E. Vena, and J.L. Freudenheim, 2003, Positional Accuracy of Geocoded Addresses in Epidemiologic Research. Epidemiology 14(4), 408–411.

 

Boscoe, F.P., C.L. Kielb, M.J. Schymura, and T.M. Bolani, 2002, Assessing and Improving Census Tract Completeness. Journal of Registry Management 29(4), 117–120.

 

Boscoe, F.P., M.H. Ward, and P. Reynolds, 2004, Current Practices in Spatial Analysis of Cancer Data: Data Characteristics and Data Sources for Geographic Studies of Cancer. International Journal of Health Geographics 3(28).

 

Boulos, M.N.K., 2004, Towards Evidence-Based, GIS-Driven National Spatial Health Information Infrastructure and Surveillance Services in the United Kingdom. International Journal of Health Geographics 3(1).

 

Brody, J.G., D.J. Vorhees, S.J. Melly, S.R. Swedis, P.J. Drivas, and R.A. Rudel, 2002, Using GIS and Historical Records to Reconstruct Residential Exposure to Large-Scale Pesticide Application. Journal of Exposure Analysis and Environmental Epidemiology 12(1), 64–80.

 

Cayo, M.R. and T.O. Talbot, 2003, Positional Error in Automated Geocoding of Residential Addresses. International Journal of Health Geographics 2(10).

 

Chen, C.-C., C.A. Knoblock, C. Shahabi, and S. Thakkar, 2003, Building Finder: A System to Automatically Annotate Buildings in Satellite Imagery. In P. Agouris (Ed.), NG2I ’03: Proceedings of the International Workshop on Next Generation Geospatial Information, Cambridge, MA, USA, October, 2003.

 

Chen, C.-C., C.A. Knoblock, C. Shahabi, S. Thakkar, and Y.-Y. Chiang, 2004, Automatically and Accurately Conflating Orthoimagery and Street Maps. In D. Pfoser, I. F. Cruz, and M. Ronthaler (Eds.), ACM-GIS ’04: Proceedings of the 12th ACM International Symposium on Advances in Geographic Information Systems, Washington DC, USA, November, 2004, 47–56.

 

Christen, P. and T. Churches, 2005, A Probabilistic Deduplication, Record Linkage and Geocoding System. In Proceedings of the Australian Research Council Health Data Mining Workshop (HDM05), Canberra, AU, April, 2005. In press, acrc.unisa.edu.au/groups/health/hdw2005/Christen.pdf.

 

Christen, P., T. Churches, and A. Willmore, 2004, A Probabilistic Geocoding System Based on a National Address File, Australasian Data Mining Conference, Cairns, AU, December, 2004. http://datamining.anu.edu.au/publications/2004/aus-dm2004.pdf

 

Chung, K., D.-H. Yang, and R. Bell, 2004, Health and GIS: Toward Spatial Statistical Analyses. Journal of Medical Systems 28(4), 349–360.

 

Churches, T., P. Christen, K. Lim, and J.X. Zhu, 2002, Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models. Medical Informatics and Decision Making 2(9). http://www.biomedcentral.com/content/pdf/1472-6947-2-9.pdf.

 

Collins, S.E., R.P. Haining, I.R. Bowns, D.J. Crofts, T.S. Williams, A.S. Rigby, and D.M. Hall, 1998, Errors in Postcode to Enumeration District Mapping and Their Effect on Small Area Analyses of Health Data. Journal of Public Health Medicine 20(3), 325–330.

 

Cressie, N. and J. Kornak, 2003, Spatial Statistics in the Presence of Location Error with an Application to Remote Sensing of the Environment. Statistical Science 18(4), 436–456.

 

Croner, C.M., 2003, Public Health GIS and the Internet. Annual Review of Public Health 24, 57–82.

 

Davis Jr., C. A., 1993, Address Base Creation Using Raster/Vector Integration, Proceedings of the URISA 1993 Annual Conference, Atlanta, GA, USA, 45–54.

 

Davis Jr., C.A., F.T. Fonseca, and K.A. De Vasconcelos Borges, 2003, A Flexible Addressing System for Approximate Geocoding, GeoInfo 2003: Proceedings of the Fifth Brazilian Symposium on GeoInformatics, Campos do Jordão - São Paulo, Brazil, October, 2003.

 

Dearwent, S.M., R.R. Jacobs, and J.B. Halbert, 2001, Locational Uncertainty in Georeferencing Public Health Datasets. Journal of Exposure Analysis Environmental Epidemiology 11(4), 329–334.

 

Diez-Roux, A.V., S.S. Merkin, D. Arnett, L. Chambless, M. Massing, F.J. Nieto, P. Sorlie, M. Szklo, H.A. Tyroler, and R.L. Watson, 2001, Neighborhood of Residence and Incidence of Coronary Heart Disease. New England Journal of Medicine 345(2), 99–106.

 

Drummond, W.J., 1995, Address Matching: GIS Technology for Mapping Human Activity Patterns. Journal of the American Planning Association 61(2), 240–251.

 

Dueker, K.J., 1974, Urban Geocoding. Annals of the Association of American Geographers 64(2), 318–325.

 

Durr, P.A. and A.E.A. Froggatt, 2002, How Best to Georeference Farms? A Case Study From Cornwall, England. Preventive Veterinary Medicine 56, 51–62.

 

Eichelberger, P., 1993, The Importance Of Addresses: The Locus Of GIS, Proceedings of the URISA 1993 Annual Conference, Atlanta, GA, USA, 212–213.

 

Fonda-Bonardi, P., 1994, House Numbering Systems in Los Angeles, Proceedings of the GIS/LIS ’94 Annual Conference and Exposition, Phoenix, AZ, USA, October, 1994, 322–331.

 

Frew, J., M. Freeston, N. Freitas, L.L. Hill, G. Janee, K. Lovette, R. Nideffer, T.R. Smith, and Q. Zheng, 1998, The Alexandria Digital Library Architecture. In C. Nikalaou and C. Stephanidis (Eds.), ECDL ’98: Proceedings of the 2nd European Conference on Research and Advanced Technology for Digital Libraries, Heraklion, Crete, Greece, September, 1998, Volume 1513 of Lecture Notes in Computer Science (London, UK: Springer), 61–73.

 

Fulcomer, M.C., M.M. Bastardi, H. Raza, M. Duffy, E. Dufficy, and M.M. Sass, 1998, Assessing the Accuracy of Geocoding Using Address Data from Birth Certificates: New Jersey, 1989 to 1996. In R.C. Williams, M.M. Howie, C.V. Lee and W.D. Henriques (Eds.), Proceedings of the 1998 Geographic Information Systems in Public Health Conference, San Diego, CA, USA, August, 1998, 547–560. http://www.atsdr.cdc.gov/GIS/conference98/proceedings/pdf/gisbook.pdf.

 

Gabrosek, J. and N. Cressie, 2002, The Effect on Attribute Prediction on Location Uncertainty in Spatial Data. Geographical Analysis 34, 262–285.

 

Gaffney, S.H., F.C. Curriero, P.T. Strickland, G.E. Glass, K.J. Helzlsouer, and P.N. Breysse, 2005, Influence of Geographic Location in Modeling Blood Pesticide Levels in a Community Surrounding a U.S. Environmental Protection Agency Superfund Site. Environmental Health Perspectives 113(12), 1712–1716.

 

Gatrell, A.C., 1989, On the Spatial Representation and Accuracy of Address-Based Data in the United Kingdom. International Journal of Geographical Information Systems 3(4), 335–348.

 

Geronimus, A.T. and J. Bound, 1998, Use of Census-based Aggregate Variables to Proxy for Socioeconomic Group: Evidence from National Samples. American Journal of Epidemiology 148(5), 475–486.

 

Geronimus, A.T. and J. Bound, 1999a, RE: Use of Census-based Aggregate Variables to Proxy for Socioeconomic Group: Evidence from National Samples. American Journal of Epidemiology 150(8), 894–896. Letter.

 

Geronimus, A.T. and J. Bound, 1999b, RE: Use of Census-based Aggregate Variables to Proxy for Socioeconomic Group: Evidence from National Samples. American Journal of Epidemiology 150(9), 997–999. Letter.

 

Geronimus, A.T., J. Bound, and L.J. Neidert, 1995, On the Validity of Using Census Geocode Characteristics to Proxy Individual Socioeconomic Characteristics. Technical Working Paper 189 (Cambridge, MA, USA: National Bureau of Economic Research).

 

Gregorio, D.I., E. Cromley, R. Mrozinski, and S.J. Walsh, 1999, Subject Loss in Spatial Analysis of Breast Cancer. Health & Place 5(2), 173–177.

 

Gregorio, D.I., L.M. DeChello, H. Samociuk, and M. Kulldorff, 2005, Lumping or Splitting: Seeking the Preferred Areal Unit for Health Geography Studies. International Journal of Health Geographics 4(6).

 

Griffin, D.H., J.M. Pausche, E.B. Rivers, A.L. Tillman, and J.B. Treat, 1990, Improving the Coverage of Addresses in the 1990 Census: Preliminary Results, Proceedings of the American Statistical Association Survey Research Methods Section, Anaheim, CA, USA, August, 1990, 541–546. http://www.amstat.org/sections/srms/Proceedings/papers/1990+091.pdf.

 

Han, D., P.A. Rogerson, M.R. Bonner, J. Nie, J.E. Vena, P. Muti, M. Trevisan, and J.L. Freudenheim, 2005, Assessing Spatio-Temporal Variability of Risk Surfaces Using Residential History Data in a Case Control Study of Breast Cancer. International Journal of Health Geographics 4(9).

 

Haspel, M. and H.G. Knotts, 2005, Location, Location, Location: Precinct Placement and the Costs of Voting. The Journal of Politics 67(2), 560–573.

 

Higgs, G. and D.J. Martin, 1995, The Address Data Dilemma Part 1: Is the Introduction of Address-Point the Key to Every Door in Britain?. Mapping Awareness 8, 26–28.

 

Higgs, G. and W. Richards, 2002, The Use of Geographical Information Systems in Examining Variations in Sociodemographic Profiles of Dental Practice Catchments: A Case Study of a Swansea Practice. Primary Dental Care 9(2), 63–69.

 

Hill, L.L., 2000, Core Elements of Digital Gazetteers: Placenames, Categories, and Footprints. In J.L. Borbinha and T. Baker (Eds.), ECDL ’00: Research and Advanced Technology for Digital Libraries, 4th European Conference, Lisbon, Portugal, September, 2000, Volume 1923 of Lecture Notes in Computer Science (London, UK: Springer), 280–290.

 

Hill, L.L., J. Frew, and Q. Zheng, 1999, Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library. D-Lib Magazine 5(1). www.dlib.org/dlib/january99/hill/01hill.html.

 

Hill, L.L. and Q. Zheng, 1999, Indirect Geospatial Referencing Through Place Names in the Digital Library: Alexandria Digital Library Experience with Developing and Implementing Gazetteers, Proceedings of the 62nd Annual Meeting of the American Society for Information Science, Washington, DC, USA, October-November, 1999, 57–69.

 

Hurley, S.E., T.M. Saunders, R. Nivas, A. Hertz, and P. Reynolds, 2003, Post Office Box Addresses: A Challenge for Geographic Information System-Based Studies. Epidemiology 14(4), 386–391.

 

Hutchinson, M. and B. Veenendall, 2005a, Towards a Framework for Intelligent Geocoding. In SSC 2005 Spatial Intelligence, Innovation and Praxis: The national biennial Conference of the Spatial Sciences Institute, Melbourne, AU, September, 2005.

 

Hutchinson, M. and B. Veenendall, 2005b, Towards Using Intelligence To Move From Geocoding To Geolocating, Proceedings of the 7th Annual URISA GIS in Addressing Conference, Austin, TX, USA, August, 2005.

 

Jaro, M., 1984, Record Linkage Research and the Calibration of Record Linkage Algorithms. Statistical Research Division Report Series SRD Report No. Census/SRD/RR-84/27 (Washington, DC USA: United States Census Bureau). http://www.census.gov/srd/papers/pdf/rr84-27.pdf.

 

Johnson, S. D., 1998a, Address Matching with Stand-Alone Geocoding Engines: Part 1. Business Geographics pp. 24–32.

 

Johnson, S. D., 1998b, Address Matching with Stand-Alone Geocoding Engines: Part 2. Business Geographics pp. 30–36.

 

Karimi, H.A., M. Durcik, and W. Rasdorf, 2004, Evaluation of Uncertainties Associated with Geocoding Techniques. Journal of Computer-Aided Civil and Infrastructure Engineering 19(3), 170–185.

 

Kennedy, T.C., J.G. Brody, and J.N. Gardner, 2003, Modeling Historical Environmental Exposures Using GIS: Implications for Disease Surveillance, Proceedings of the 2003 ESRI Health GIS Conference, Arlington, Virginia, USA, May, 2003. http://gis.esri.com/library/userconf/health03/papers/pap3020/p3020.htm.

 

Krieger, N., 1992, Overcoming the Absence of Socioeconomic Data in Medical Records: Validation and Application of a Census-Based Methodology. American Journal of Public Health 82(5), 703–710.

 

Krieger, N., 2003, Place, Space, and Health: GIS and Epidemiology. Epidemiology 14(4), 384–385.

 

Krieger, N., J.T. Chen, P.D. Waterman, D.H. Rehkopf, and S.V. Subramanian, 2005, Painting a Truer Picture of US Socioeconomic and Racial/Ethnic Health Inequalities: The Public Health Disparities Geocoding Project. American Journal of Public Health 95(2), 312–323.

 

Krieger, N., J.T. Chen, P.D. Waterman, M.-J. Soobader, S.V. Subramanian, and R. Carson, 2002a, Geocoding and Monitoring of US Socioeconomic Inequalities in Mortality and Cancer Incidence: Does the Choice of Area-Based Measure and Geographic Level Matter?. American Journal of Epidemiology 156(5), 471–482.

 

Krieger, N. and D. Gordon, 1999, RE: Use of Census-based Aggregate Variables to Proxy for Socioeconomic Group: Evidence from National Samples. American Journal of Epidemiology 150(8), 894–896.

 

Krieger, N., P. Waterman, J.T. Chen, M.-J. Soobader, S.V. Subramanian, and R. Carson, 2002b, Zip Code Caveat: Bias Due to Spatiotemporal Mismatches Between ZIP Codes and US Census-Defined Areas: The Public Health Disparities Geocoding Project. American Journal of Public Health 92(7), 1100–1102.

 

Krieger, N., P.D. Waterman, J.T. Chen, M.-J. Soobader, and S.V. Subramanian, 2003, Monitoring Socioeconomic Inequalities in Sexually Transmitted Infections, Tuberculosis, and Violence: Geocoding and Choice of Area-Based Socioeconomic Measures. Public Health Reports 118(3), 240–260.

 

Kwok, R.K. and B.C. Yankaskas, 2001, The Use of Census Data for Determining Race and Education as SES Indicators A Validation Study. Annals of Epidemiology 11(3), 171–177.

 

Laender, A.H.F., K.A.V. Borges, J.C.P. Carvalho, C.B. Medeiros, A.S. da Silva, and C.A. Davis Jr., 2005, Integrating Web Data and Geographic Knowledge into Spatial Databases. In Y. Manalopoulos and A.N. Papadapoulos (Eds.), Spatial Databases: Technologies, Techniques and Trends. Idea Group Inc., Chapter 2, pp. 23–47.

 

Lee, J., 2004, 3D GIS for Geo-coding Human Activity in Micro-scale Urban Environments. In M.J. Egenhofer, C. Freksa, and H.J. Miller (Eds.), Geographic Information Science: Third International Conference, GIScience 2004, College Park, MD, USA, October, 2004, 162–178.

 

Levesque, M., 2003, West Virginia Statewide Addressing and Mapping Project, Proceedings of the Fifth Annual URISA Street Smart and Address Savvy Conference, Providence, RI, USA, August, 2003. http://www.urisa.org/Street Smart Conference/2003/LevesqueM.pdf.

 

Levine, N. and K.E. Kim, 1998, The Spatial Location of Motor Vehicle Accidents: A Methodology for Geocoding Intersections. Computers, Environment, and Urban Systems 22(6), 557–576.

 

Lind, M., 2001, Developing a System of Public Addresses as a Language for Location Dependent Information. In Proceedings of the 2001 URISA Annual Conference, Long Beach, CA, USA, October, 2001. http://www.adresseprojekt.dk/files/ Develop PublicAddress urisa2001e.pdf.

 

Lixin, Y., 1996, Development and Evaluation of a Framework for Assessing the Efficiency and Accuracy of Street Address Geocoding Strategies. Ph.D. thesis, University at Albany, State University of New York - Rockefeller College of Public Affairs and Policy 1996.

Locative Technologies, 2006, Geocoder.us: A Free US Geocoder. http://geocoder.us.

 

Martin, D.J., 1998, Optimizing Census Geography: The Separation of Collection and Output Geographies. International Journal of Geographical Information Science 12(7), 673–685.

 

Martin, D.J. and G. Higgs, 1996, Georeferencing People and Places: A Comparison of Detailed Datasets. In D. Parker (Ed.), Innovations in GIS 3: Selected Papers from the Third National Conference on GIS Research UK (Gisruk), Canterbury, UK (London, UK: Taylor & Francis), 37–47.

 

Martin, D.J., 1999, Spatial Representation: The Social Scientist’s Perspective. In P.A. Longley, M.F. Goodchild, D.J. Maguire, and D.W. Rhind (Eds.), Geographical Information Systems (New York, NY, USA: Wiley), Volume 1, Second Edition, 6, 71–80.

 

McElroy, J.A., P.L. Remington, A. Trentham-Dietz, S.A. Robert, and P.A. Newcomb, 2003, Geocoding Addresses from a Large Population-Based Study: Lessons Learned. Epidemiology 14(4), 399–407.

 

Murphy, J. and R. Armitage, 2005, Merging the Modeled and Working Address Database: A Question of Dynamics and Data Quality, Proceedings of GIS Ireland 2005, Dublin, IE, October, 2005.

 

Nicoara, G., 2005, Exploring the Geocoding Process: A Municipal Case Study using Crime Data. Masters thesis, The University of Texas at Dallas, Dallas, TX, USA.

 

Nuckols, J.R., M.H. Ward, and L. Jarup, 2004, Using Geographic Information Systems for Exposure Assessment in Environmental Epidemiology Studies. Environmental Health Perspectives 112(9), 1007–1015.

 

Oliver, M.N., K.A. Matthews, M. Siadaty, F.R. Hauck, and L.W. Pickle, 2005, Geographic Bias Related to Geocoding in Epidemiologic Studies. International Journal of Health Geographics 4(29).

 

Olligschlaeger, A.M., 1998, Artificial Neural Networks and Crime Mapping. In D. Weisburd and T. McEwen (Eds.), Crime Mapping and Crime Prevention, Volume 8 of Crime Prevention Studies (Monsey, NY USA: Criminal Justice Press), 313–347.

 

Openshaw, S., 1989, Learning to Live with Errors in Spatial Databases. In M.F. Goodchild and S. Gopal (Eds.), Accuracy of Spatial Databases (Bristol, PA, USA: Taylor & Francis), 23, 263–276.

 

Oppong, J.R., 1999, Data Problems in GIS and Health, Proceedings of Health and Environment Workshop 4: Health Research Methods and Data, Turku, Finland, July, 1999. http://geog.queensu.ca/h and e/healthandenvir/Finland Workshop Papers/OPPONG.DOC.

 

Ordnance Survey, 2006, ADDRESS-POINT: Ordnance Survey’s Map Dataset of All Postal Addresses in Great Britain. http://www.ordnancesurvey.co.uk/oswebsite/products/addresspoint.

 

O’Reagan, R.T. and A. Saalfeld, 1987, Geocoding Theory and Practice at the Bureau of the Census. Statistical Research Report Census/SRD/RR-87/29 (Washington, DC USA: United States Bureau of Census).

 

Paull, D., 2003, A Geocoded National Address File for Australia: The G-NAF What, Why, Who and When? http://www.addressonline.com.au/addressonline/home/GNAF What Why Who When.pdf.

 

Ratcliffe, J.H., 2001, On the Accuracy of TIGER-Type Geocoded Address Data in Relation to Cadastral and Census Areal Units. International Journal of Geographical Information Science 15(5), 473–485.

 

Ratcliffe, J.H., 2004, Geocoding Crime and a First Estimate of a Minimum Acceptable Hit Rate. International Journal of Geographical Information Science 18(1), 61–72.

 

Rose, K.M., J.L. Wood, S. Knowles, R.A. Pollitt, E.A. Whitsel, A.V. Diez-Roux, D. Yoon, and G. Heiss, 2004, Historical Measures of Social Context in Life Course Studies: Retrospective Linkage of Addresses to Decennial Censuses. International Journal of Health Geographics 3(27).

 

Rushton, G., M. Armstrong, J. Gittler, B. Greene, C. Pavlik, M. West, and D. Zimmerman, 2006, Geocoding in Cancer Research - A Review. American Journal of Preventive Medicine 30(2), S16–S24.

 

Schootman, M., D. Jeffe, E. Kinman, G. Higgs, and J. Jackson-Thompson, 2004, Evaluating the Utility and Accuracy of a Reverse Telephone Directory to Identify the Location of Survey Respondents. Annals of Epidemiology 15(2), 160–166.

 

Sheehan, T.J., S.T. Gershman, L. MacDougal, R.A. Danley, M. Mroszczyk, A.M. Sorensen, and M. Kulldorff, 2000, Geographic Surveillance of Breast Cancer Screening by Tracts, Towns and Zip Codes. Journal of Public Health Management Practices 6, 48–57.

 

Smith, G.D., Y. Ben-Shlomo, and C. Hart, 1999, RE: Use of Census-based Aggregate Variables to Proxy for Socioeconomic Group: Evidence from National Samples. American Journal of Epidemiology 150(9), 996–997.

 

Soobader, M., F.B. LeClere, W. Hadden, and B. Maury, 2001, Using Aggregate Geographic Data to Proxy Individual Socioeconomic Status: Does Size Matter?. American Journal of Public Health 91(4), 632–636.

 

Stevenson, M.A., J. Wilesmith, J. Ryan, R. Morris, A. Lawson, D. Pfeiffer, and D. Lin, 2000, Descriptive Spatial Analysis of the Epidemic of Bovine Spongiform Encephalopathy in Great Britain to June 1997. The Veterinary Record 147(14), 379–384.

 

Tobler, W., 1972, Geocoding Theory. In Proceedings of the National Geocoding Conference, Washington D.C, USA (Washington, DC, USA: Department of Transportation), IV.1.

 

United Nations Economic Commission, 2005, A Functional Addressing System for Africa: A Discussion Paper. http://geoinfo.uneca.org/Docs/Situs Addressing background paper-Draft.pdf.

 

United States Census Bureau, 2006, Topologically Integrated Geographic Encoding and Referencing System (Washington, DC, USA: United States Census Bureau). http://www.census.gov/geo/www/tiger.

 

United States Department of Health and Human Services, 2000, Healthy People 2010: Understanding and Improving Health (Washington, DC, USA: United States Government Printing Office), Second Edition. online at http://www.healthypeople.gov/Document/html/uih/uih 2.htm.

 

Veregin, H., 1999, Data Quality Parameters. In P.A. Longley, M.F. Goodchild, D.J. Maguire, and D.W. Rhind (Eds.), Geographical Information Systems, Volume 1. New York, NY, USA: Wiley, Second Edition, Chapter 12, pp. 177–189.

 

Vine, M.F., D. Degnan, and C. Hanchette, 1998, Geographic Information Systems: Their Use in Environmental Epidemiologic Research. Journal of Environmental Health 61, 7–16.

 

Walls, M.D., 2003, Is Consistency in Address Assignment Still Needed?, Proceedings of the Fifth Annual URISA Street Smart and Address Savvy Conference, Providence, RI, USA, August, 2003. http://www.urisa.org/Street Smart Conference/2003/WallsM.pdf.

 

Werner, P.A., 1974, National Geocoding. Annals of the Association of American Geographers 64(2), 310–317.

 

Wieczorek, J., Q. Guo, and R.J. Hijmans, 2004, The Point-Radius Method for Georeferencing Locality Descriptions and Calculating Associated Uncertainty. International Journal of Geographical Information Science 18(8), 745–767.

 

Wu, J., T.H. Funk, F.W. Lurmann, and A.M. Winer, 2005, Improving Spatial Accuracy of Roadway Networks and Geocoded Addresses. Transactions in GIS 9(4), 585–601.

 

Yahoo! Inc., 2006, Yahoo! Maps Web Services - Geocoding API. http://developer.yahoo.com/maps/rest/V1/geocode.html.

Yang, D.-H., L.M. Bilaver, O. Hayes, and R. Goerge, 2004, Improving Geocoding Practices: Evaluation of Geocoding Tools. Journal of Medical Systems 28(4), 361–370.

 

ABOUT THE AUTHORS

 

Dan Goldberg is a third year Computer Science Ph.D. student working in the GIS Research Laboratory at the University of Southern California. He is a recent recipient of the 2005-2006 United States Geospatial Intelligence Foundation’s Graduate Student Scholarship, and his research interests include geographic information extraction and integration, automated approaches to building highly detailed and accurate gazetteers, and developing new methods for geocoding textual locational descriptions. The author may be contacted at the USC GIS Research Laboratory, University of Southern California, Los Angeles, CA 90089-0255. 213-740-8263. E-mail: daniel.goldberg@usc.edu.

 

John Wilson is Professor of Geography and Director of the GIS Research Laboratory at the University of Southern California. He is the founding editor of Transactions in GIS, an active participant in the UNIGIS International Network, and President of the University Consortium for Geographic Information Science. His major publications include two books (Terrain Analysis: Principles and Applications; Handbook of Geographic Information Science) along with numerous book chapters and journal articles on topics ranging from soil erosion and groundwater pollution problems to urban growth modeling and the environmental and social characteristics of place and their impacts on selected health outcomes. The author may be contacted at the Department of Geography, University of Southern California, Los Angeles, CA 90089-0255. 213-740-1908. E-mail: jpwilson@usc.edu.

 

Craig A. Knoblock is a senior project leader at the Information Sciences Institute and a research professor in computer science at the University of Southern California. He received his Ph.D. in computer science from Carnegie Mellon University. His current research interests include information integration, automated planning, machine learning, constraint reasoning, and the application of these technologies to geospatial data integration. He is currently the President of the International Conference on Automated Planning and Scheduling and a Fellow of the American Association for Artificial Intelligence.

