Geospatial Data Analysis: A Review of Theory and Methods

S. Rajagopalan, Professor & Faculty-in-charge Institutional & Finance.
Yogalakshmi Jayabal, International Institute of Information Technology, Bangalore.


A vast amount of data is generated and collected every moment, and this data often has a spatial and/or temporal aspect. The increasing rate of data generation and collection results in growing volumes and varied formats of data, and geospatial data collection is no exception. This poses challenges in storing, processing, analyzing and visualizing geospatial data. This paper discusses the big data paradigm of geospatial data and presents a taxonomy for its analysis. The existing literature is studied and discussed based on the proposed taxonomy for analysis of geospatial data.


Spatial data, also known as geospatial data, is information about any physical object on Earth that can be represented by numerical values in a geographic coordinate system. It generally represents the location, size and shape of the object. With the rise of the web, geotagged data has become more popular as well. Geotagging is the process of adding geographical identification metadata to various media such as photographs, videos, websites, SMS messages, etc. This data contains additional information beyond the latitude and longitude details.
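The core of any geotagged record is its latitude/longitude pair, and most downstream processing starts with distance computations on such pairs. As an illustrative sketch (the function name is ours), the haversine formula gives the great-circle distance between two points given in degrees:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in degrees,
    using the haversine formula on a spherical Earth model."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

The spherical model is an approximation; higher-accuracy work would use an ellipsoidal formula.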

A. Big Data Paradigm

In a 2001 research report, Laney [1] defined data growth challenges and opportunities along three dimensions: Volume (amount of data), Variety (range of data types and sources) and Velocity (speed of data in and out). This is known as the 3Vs model for describing big data. Spatial data falls under the big data paradigm because it possesses all three characteristics: Volume, Variety and Velocity.

1) Volume: The images collected from various earth-observing satellites contain rich information. As the resolution of an image increases, so does its size.

2) Variety: Geospatial data consists of three basic models: raster (e.g. satellite images), vector (encompassing points, lines and polygons) and graph (spatial networks). Multiple sources and approaches are used to collect spatial data in these three forms. Different formats are available to store spatial representations of objects:

  • GeoTIFF [a georeferenced Tag Image File Format for exchanging raster graphics (bitmap) images between application programs (.tif)]
  • IMG
  • HDF [Hierarchical Data Format, a set of file formats designed to store and organize large amounts of data (.hdf5)]
  • NetCDF [Network Common Data Form, a set of software libraries and self-describing, machine-independent data formats that support the creation, access and sharing of array-oriented scientific data (.nc)]
  • BIL [bands interleaved by line, an uncompressed file containing the actual pixel values of an image; it stores pixel information in separate bands within the file]
  • AAIGRID [the ASCII interchange format for Arc/Info Grid, an ASCII file plus sometimes an associated .prj file]
  • Vector [numerous vector formats: EPS, SVG, PDF, AI, DXF, etc.]
  • Twitter [geolocated tweets]
  • Text [geolocated text]
  • GIS Images


This heterogeneity across data sets adds to the variety component of big data.

3) Velocity: Real-time monitoring of the earth or any other object means a continuous flow of data, which requires high computing and storage capabilities. In addition to these 3Vs, spatial data also exhibits high dimensionality, high complexity and high uncertainty, as pointed out by Liu et al in [19].

B. Types of Spatial Data


It is important to consider the discreteness or continuity of the space on which the variables are measured. This classification of spatial data, by the conception of space and by the measured variables, is the first step in specifying the appropriate statistical technique or algorithm for the problem at hand. A typology of spatial data based on the conception of space is provided in [20]. The four types of spatial data are:

  • Point pattern data: a data set consisting of point locations at which events of interest, such as disease cases or incidents of crime, have occurred.
  • Field data / geostatistical data: a data set consisting of variables that are conceptually continuous and whose observations are sampled at a fixed set of points.
  • Area data: a data set consisting of observations from a fixed set of areal objects that may form a lattice, such as remotely sensed images.
  • Spatial interaction data: a data set consisting of observations on pairs of point locations or pairs of areas.



With the abundance of data, data processing and analysis technologies have been driven by data. Analysis of spatial data is both compute- and resource-intensive. A huge amount of literature is available on geospatial data and GIS systems. In this study, we use the following taxonomy to organize the available literature in this area.

A. Taxonomy for analysis

The large volume of geospatial data raises the question of efficient processing architectures or data processing methodologies for acquiring useful knowledge. Hence, in this study, the available literature is studied based on:

1) Spatial data models / infrastructures.
2) Spatial data analytics platforms / data processing frameworks/systems.
3) Algorithms for spatial data.
4) Applications of spatial data.


The available literature on spatial data models concentrates on improving spatial query performance and throughput. Wang et al [2] propose a new geospatial data model, X3DORGDM (X3D-based Oriented Relation Geospatial Data Model), which aims to meet the requirements of geovisualization. As a data model, X3DORGDM consists of three components: 1) a collection of geographic data types; 2) a collection of operating algorithms; and 3) a collection of integration and consistency rules that define the consistent geo-database, its changes of state, or both.

The kernel of X3DORGDM has been extended from CGAL, and new classes and algorithms such as the Filter Search Tree have been added.


X3D-based ORGDM has been implemented on top of several open-source packages, including PostgreSQL, the OpenGIS Simple Features Data Model, the Computational Geometry Algorithms Library (CGAL), the Geospatial Data Abstraction Library (GDAL) and PROJ. The model extends existing approaches by managing and computing spatial data in a geo-database and using an X3D data flow to aggregate hybrid spatial data. The output interface of X3D-based ORGDM handles client requests for data services, including retrieving, querying, updating and computing spatial data, and transmits the X3D/XML data flow from the Data Management layer to the Business Logic and Client Services layers. Rich semantics, fast parsing and comprehensive visualization performance make the X3D file format well suited for building geographic-visualization-based data mining and knowledge discovery applications.

Lacasta et al [3] propose a process to construct a Linked Data model of geospatial resources that allows semantic searching and browsing. There are some initiatives by the standardization bodies Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO) to standardize the way geospatial information is created, provided and transformed according to user needs. However, there are issues: geo-service creators have to manually describe and annotate their services, and publicly available geospatial catalogs have to be manually annotated for searches of spatial data to yield better results. This work proposes a methodology that combines and adapts a set of information retrieval and natural language processing techniques to the geospatial web service context. It also shows how to use these techniques to build an automatic system that can identify, classify, interrelate, and facilitate access to geospatial web services.

Chi-Ren Shyu et al [4] propose a coherent system, GeoIRIS, that allows image analysts to rapidly identify relevant imagery. GeoIRIS can answer analysts' questions such as: given a query image, show database satellite images that have similar objects and spatial relationships within a certain radius of a landmark. Their architecture consists of the following modules: feature extraction (FE), indexing structures (IS), semantic framework (SF), GeoName Server (GS), fusion and ranking (FR), and retrieval visualization (RV). They use both tile-based and object-based feature extraction. Continuous-valued features are indexed with an entropy-balanced statistical (EBS) k-dimensional (k-d) tree, and binary-valued features with an entropy-balanced bitmap (EBB) tree. The system also proposes novel approaches for information ranking, semantic modelling and advanced queries.

S. Roy et al [6] discuss the metadata issues related to Spatial Data Infrastructure and propose a three-tiered infrastructure to enhance metadata catalogue services in three aspects:

1) Incorporating various geographic information metadata elements and providing the necessary support for spatial data infrastructures;
2) Achieving interoperability between the different metadata standards essential for spatial data infrastructures;
3) Enhancing information retrieval techniques for spatial data infrastructures using disambiguated vocabularies.



A. Analytics platform

Anuradha [5] proposes a geospatial big data analytics platform that uses Hadoop, Hive and other NoSQL technologies along with relational databases to process geospatial data. The architecture is a proposal and is not available as an implementation. The proposal also describes the required functionality of the different components of the architecture. The architecture, as shown in [5], is below:

Fig. 1. Logical Architecture of Big Data Platform


B. Data processing frameworks

Klein et al [7] proposed the Physical Analytics Integrated Repository and Services (PAIRS), a scalable geospatial data analytics platform. It enables rapid data discovery by automatically updating, joining and homogenizing data layers in space and time. It supports automatic data download, data curation and scalable storage, and provides a computational platform for running physical and statistical models on the curated datasets.

Its claimed key differentiator is its multilayer query capability: searching multiple data layers and filtering on multiple search criteria. It uses HBase to store data, with an index built on latitude, longitude and timestamp. It uses open-source tools to convert data layer projections into the WGS84 coordinate system, and it manages data from multiple sources in a scalable fashion on distributed compute resources. The main component of PAIRS is its Data Integration Engine, which can download, re-project and index data. If a data format is not raster, it is rasterized and handled as a large matrix. The data integration layer of PAIRS, as shown in [7], is below:

Fig. 2. Data Integration Layer of PAIRS


Xin Chen et al [8] propose a high-performance integrated spatial big data analytics framework based on the MapReduce paradigm and present a few use cases. Their data integration mainly focuses on spatial datasets and relies on the following query types:

1) Point-in-polygon queries;
2) Cross-matching queries;
3) Nearest neighbor queries.

These kinds of queries are both data- and compute-intensive. Hence their spatial data integration extends Hadoop-GIS with scalable spatial clustering and spatial regression capabilities.
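Point-in-polygon is the simplest of these query types. A minimal ray-casting sketch (the function name is ours, not a Hadoop-GIS API) illustrates the per-point test that such frameworks parallelize across map tasks:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count crossings of a horizontal ray from
    (x, y) with the polygon's edges; an odd count means inside.
    `polygon` is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Edge straddles the ray's y-coordinate?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

Production systems pair this O(n) test with spatial indexes so that only candidate polygons near the query point are tested at all.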

Stefan Hagedorn et al [9] compare existing solutions for spatial data processing on Apache Hadoop and Apache Spark. The comparison is based on their features and on their performance in a micro-benchmark of spatial filter and join queries. They compare the Hadoop extensions Hadoop-GIS [10] and SpatialHadoop [11], and the Spark-based systems SpatialSpark [12], GeoSpark [13] and their own implementation STARK (https://github.com/dbis-ilm/stark). Table I, shown below as given in their paper, provides a feature-level comparison.

Jae-Gil Lee et al [14] present an overview of existing geospatial big data challenges and opportunities, and propose a geospatial data processing architecture that separately identifies existing and newly proposed technologies, together with a spatial online analytical processing module.

C. Systems

David Haynes et al [15] propose Terra Populus, a system providing three web applications that allow users to access, analyze and tabulate different datasets under a common platform.

1) Paragon is a prototype parallel spatial database that extends the functionality of PostgreSQL and PostGIS onto multi-node systems.
2) Terra Populus tabulator application builds dynamic queries for analyzing large population survey data.
3) Terra Explorer is an exploratory analysis tool for visualizing the spatial datasets within the repository.

Barik et al [16] propose a fog-computing-based framework called FogGIS for mining geospatial data. It is built as a prototype using the Intel Edison, an embedded microprocessor. The work claims the following contributions:

  • The FogGIS framework provides improved throughput and reduced latency for the analysis and transmission of geospatial data.
  • Different compression techniques reduce data size and thereby transmission power.

Bosch et al [17] propose a geospatial document analysis system for the VAST 2011 Mini Challenge 1. Their system lets the user interact with the data in a visual, direct and scalable fashion, offering diverse views and data management components. Wang et al [30] have proposed the TerraFly Geospatial Cloud platform, which provides comprehensive spatial analysis methods and visualization.


Depending on the type of conception of space and the measurement level, different algorithms are applied. The widely used data mining algorithms for spatial data are:

1) Spatial Auto-regressive model
2) Markov Random Field model
3) Geographically weighted regression
4) Fractal models
5) Map-Reduce algorithm for polygon retrieval
6) The General Spatial Interaction Model

Guo et al [18] propose a MapReduce-based parallel polygon retrieval algorithm, as polygon retrieval is a fundamental operation that is often computed under real-time constraints. The algorithm first hierarchically indexes the spatial terrain data using a quad-tree index, with which a significant amount of data is filtered out in the pre-processing stage based on the queried object. Algorithms 1 to 5 above are widely used when the spatial data is area data, as described in Section I-B, while algorithm 6 applies to spatial interaction data.



A. Algorithms for Area data

1) Spatial Autoregressive model: The assumption of independent observations is unlikely to be appropriate for area data because of the possibility of spatial dependence: the values observed in one areal unit depend on the values observed at nearby, neighbouring areal units.
This spatial dependence can be introduced into a model:

1. When there is spatial correlation in the dependent variable, giving the spatial lag model;
2. When there is dependence in the error terms, giving spatial error dependence [20];
3. When the spatial dependence is in the regressor variables, giving the cross-regressive model, also called the spatially lagged X (SLX) model [20].

The spatial lag models are used when there is spatial correlation in the dependent variable. These spatial correlations are motivated by theoretical arguments that emphasize the importance of neighbourhood effects. Spatial lag models are extensions of regression models of type (1),

y_i = ∑_{q=1}^{Q} β_q x_{iq} + ε_i,    (1)

where y_i is the observed dependent variable, x_{iq} are the observed explanatory variables with q = 1, ..., Q, β_q are the regression coefficients and ε_i is the error term. Spatial lag models allow observations of the dependent variable at location i to depend on observations at neighbouring locations j ≠ i.

The basic spatial lag model, the first-order spatial autoregressive (SAR) model, is of the form (2),

y_i = ρ ∑_{j=1}^{n} W_{ij} y_j + ε_i,    (2)

where the error terms ε_i are independently and identically distributed and W_{ij} is the (i, j)-th element of the spatial weight matrix W. The scalar ρ in equation (2) is a parameter to be estimated; it determines the strength of the spatial autoregressive relation between y_i and ∑_{j=1}^{n} W_{ij} y_j, a linear combination of spatially related observations.

2) Markov Random Field models: Markov Random Field models, also called conditional autoregressive (CAR) models, are a popular approach for analyzing spatial and network data. They are widely used to represent local dependency between random variables. Consider Y = (y_1, y_2, ..., y_n) and the set of full conditional distributions p(y_i | y_j, j ≠ i). The joint distribution p(y_1, y_2, ..., y_n) always determines the full conditionals p(y_i | y_j, j ≠ i). If, conversely, the full conditionals determine the joint distribution, and the local specification takes the form p(y_i | y_j, j ≠ i) = p(y_i | y_j, j ∈ ∂i), where ∂i denotes the neighbourhood of i, the model is referred to as a Markov Random Field.

But when does the set p(y_i | y_j, j ∈ ∂i) uniquely determine p(y_1, y_2, ..., y_n)? Answering this requires two important notions: the clique and the potential.

A Clique is a set of cells (equivalently indices) such that each element is a neighbour of every other element.

A potential function (or simply potential) of order k is a function of k arguments that is exchangeable in these arguments. The arguments of the potential are the values taken by the variables associated with the cells of a clique of size k.

The Hammersley-Clifford theorem says that if we have a Markov Random Field (i.e., the conditionals p(y_i | y_j, j ∈ ∂i) uniquely determine p(y_1, y_2, ..., y_n)), then the latter is a Gibbs distribution. The Geman and Geman (1984) result gives the converse: a joint Gibbs distribution yields a Markov Random Field. As a result, to sample a Markov Random Field one can sample from its associated Gibbs distribution. Kaiser, Lahiri and Nordman [22] defined a conclique as a set of locations such that no location in the set is a neighbour of any other location in the set. Any two members of a conclique may share common neighbours, but they cannot be neighbours themselves. Additionally, every singleton location can be treated as both a clique and a conclique. In the parlance of graphs, the analog of a conclique is a so-called "independent set": a set of vertices in which no two vertices share an edge. This graph terminology conflicts with the probabilistic notion of independence, whereas a "conclique" truly represents a conditionally independent set of locations in a Markov Random Field model.

The full conditional p(y_i | y_j, j ≠ i) is the conditional cdf of Y_i under the model. Kaiser, Lahiri and Nordman [22] have also defined generalized spatial residuals based on these conditional cdfs. The key property of these generalized spatial residuals is: let C_j, j = 1, ..., q, be a collection of concliques that partition the integer grid. Under the conditional model, the spatial residuals within a conclique are i.i.d. Uniform(0, 1)-distributed.

Using the conditional independence of random variables at locations within a conclique, [23] have proposed a conclique-based Gibbs sampling algorithm for sampling from a Markov Random Field. They show that it is provably fast, and that the sampler is provably geometrically ergodic (i.e., it mixes at a fast rate), which is unusual for spatial data. Kaiser, Lahiri and Nordman [22] provide a methodology for performing goodness-of-fit (GOF) tests using concliques; conclique-based Gibbs sampling allows fast approximation of the reference distribution of the GOF test statistics.
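On a regular grid with a four-nearest-neighbour structure, the two colours of a checkerboard form two concliques, so all sites of one colour can be updated simultaneously. A minimal sketch (names and the Gaussian conditional specification are our illustrative choices, not the exact model of [23]):

```python
import numpy as np

def neighbor_sum(y):
    """Sum of the four neighbours of each cell, free boundary."""
    s = np.zeros_like(y)
    s[1:, :] += y[:-1, :]
    s[:-1, :] += y[1:, :]
    s[:, 1:] += y[:, :-1]
    s[:, :-1] += y[:, 1:]
    return s

def conclique_gibbs(n=32, eta=0.2, tau=1.0, sweeps=200, seed=0):
    """Conclique-based Gibbs sampler for a Gaussian MRF whose full
    conditionals are Y_i | rest ~ N(eta * sum of neighbours, tau^2).
    The checkerboard colours are the two concliques; each colour is
    updated in a single vectorized step."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=(n, n))
    rows, cols = np.indices((n, n))
    black = (rows + cols) % 2 == 0   # conclique 1
    white = ~black                   # conclique 2
    for _ in range(sweeps):
        for mask in (black, white):
            mean = eta * neighbor_sum(y)   # recomputed after each colour
            y[mask] = mean[mask] + tau * rng.normal(size=mask.sum())
    return y
```

With |eta| < 1/4 the joint Gaussian MRF is well defined; the two-colour update is what makes the sampler fast compared with site-by-site Gibbs.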

3) Geographically Weighted Regression: Spatial nonstationarity is a condition in which a single "global" model cannot explain the relationships between some sets of variables; the model must vary over space to reflect the structure within the data. Geographically weighted regression (GWR) is a technique which attempts to capture this variation by calibrating a multiple regression model that allows different relationships to exist at different points in space. This technique is loosely based on kernel regression. GWR extends the traditional regression framework by allowing local variation in rates of change, so that the coefficients in the model, rather than being global estimates, are specific to a location i. The regression equation is then given as

y_i = a_{i0} + ∑_k a_{ik} x_{ik} + ε_i,

where a_{ik} is the value of the k-th parameter at location i. For this model it seems intuitively appealing to base estimates of a_{ik} on observations close to i. In GWR, weighting an observation in accordance with its proximity to i allows an estimate of a_{ik} to be made that meets the criterion of "closeness of calibration points". In kernel regression, y is modelled as a nonlinear function of x by weighted regression, with the weights for the i-th observation depending on the proximity of x to x_i, the estimator at location i being

â(i) = (XᵀW(i)X)⁻¹ XᵀW(i) y,

where W(i) is a diagonal matrix whose entries are the weights w_{ij}.

The typical output of this estimator is a set of parameter estimates that can be mapped in geographic space to represent the nonstationarity, or parameter "drift". The choice of weighting scheme is based on the proximity of i to the sampling locations around i. One scheme is to specify w_{ij} as a continuous function of d_{ij}, the distance between i and j. One distance-based weighting can be defined as

w_{ij} = exp(−d_{ij}² / 2h²),

with bandwidth h, so that if i is a point in space at which data are observed, the weighting of that data will be unity, and the weighting of other data will decrease according to a Gaussian curve as the distance between i and j increases. Another weighting scheme has the computationally desirable property of excluding all data points greater than some distance d from i, and also the analytically desirable property of continuity, as in the bisquare function defined by

w_{ij} = [1 − (d_{ij}/d)²]² if d_{ij} < d, and w_{ij} = 0 otherwise.

This excludes points outside radius d but tapers the weighting of points inside the radius, so that w_{ij} is a continuous and once-differentiable function for all points less than d units from i.
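Combining the Gaussian weighting with a weighted least-squares fit at each location gives a compact GWR sketch (function and variable names are ours):

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """For each location i, fit the locally weighted regression
    beta_i = (X^T W_i X)^{-1} X^T W_i y with Gaussian weights
    w_ij = exp(-d_ij^2 / (2 h^2)). Returns one coefficient row
    (intercept first) per location."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])   # add intercept column
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        XtW = Xd.T * w                      # X^T W_i via broadcasting
        betas[i] = np.linalg.solve(XtW @ Xd, XtW @ y)
    return betas
```

Mapping each column of `betas` over the study area visualizes the parameter drift discussed above; bandwidth selection (e.g. by cross-validation) is omitted here.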

In the weighting schemes so far, the weighting function once calibrated is assumed to be constant throughout the study area. However there can be places where this is not a reasonable assumption. For example, in economic applications, pricing structures may be dependent on local markets, but the notion of locality may vary regionally i.e, the geographical extent of a London market may be broader than that for Newcastle.
In these cases, a more reasonable approach to GWR might be to have a spatially variable weighting function, so that β_i is computed instead of a single β. Although this is computationally complex, the results will be informative, not only about the nature of the relationships between attributes but also about how locations interact with each other.

4) Fractal models: A fractal can be described as an entity that possesses self-similarity on all scales and a non-integer dimension. A fractal need only exhibit a similar (but not exactly the same) type of structure at all scales. Moreover, according to Mandelbrot [24], a fractal set is one for which the dimension strictly exceeds the topological dimension. This means that while a line feature has a dimension of 1 in classical geometry, it must have a dimension larger than 1 if it is to have fractal properties.

The rough description of a fractal object is the exponent in an expression of the form [25] [26]

N(r) = a r^D,

in which r is the radius, N is the number quantifying the object x under consideration at radius r, a is a constant, and D is the fractal dimension.

The self-similarity, as stated by Mandelbrot, is that each part is a reduced-size copy of the whole; i.e., the spread of any component of the system at a given point is proportional to its distance from the center. By calculating the quantity of any given component of the system as a function of the distance from the center, it should be possible to verify its fractal properties and extract its fractal dimension [27].
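The mass-radius relation N(r) = a r^D suggests a simple estimator: count points within growing radii of the center and fit the slope of log N against log r. A sketch (names ours):

```python
import numpy as np

def mass_radius_dimension(points, radii):
    """Estimate the fractal (mass-radius) dimension D in
    N(r) = a * r^D by a log-log least-squares fit, where N(r)
    counts points within distance r of the centroid."""
    center = points.mean(axis=0)
    d = np.linalg.norm(points - center, axis=1)
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope
```

For a uniformly filled planar region the estimate approaches 2, the classical dimension; genuinely fractal point sets yield non-integer slopes.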

5) Map-Reduce for Polygon retrieval: For many spatial data analysis and computational problems, polygon retrieval is a fundamental operation that is often computed under real-time constraints. Given the unprecedented growth of terrain data in volume and rate, many sequential algorithms cannot effectively meet this demand [28]. Guo et al [28] propose a MapReduce-based parallel polygon retrieval algorithm that aims to minimize the IO and CPU loads of the map and reduce tasks during spatial processing.

The terrain data is usually represented using one of two common data structures that approximate the surface: the digital elevation model (DEM) or the triangulated irregular network (TIN). Their proposed algorithm hierarchically indexes the spatial terrain data using a quad-tree index. A prefix tree built on the quad-tree index is then used to query the relationship between the terrain data and the query area in real time. The technique first divides the entire data set into several chunks of files based on quad-tree prefixes. Then, for each range query, a prefix tree organizes the set of quad-indices whose corresponding grids intersect the query area. Prior to processing a query, these indices are used to filter out unnecessary TIN data; the relationship between the TIN data and the query shape is pre-tested through the built prefix tree in the map function in order to minimize the computation.
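The quad-tree prefix idea can be sketched in a few lines (names and the quadrant-digit convention are our own, not the exact encoding of [28]): each cell at depth d is identified by a string of quadrant digits, and pre-filtering keeps only items whose key starts with a prefix intersecting the query area.

```python
def quad_key(x, y, depth, extent=(0.0, 0.0, 1.0, 1.0)):
    """Return the quad-tree prefix (a string of digits 0-3) of the
    cell containing point (x, y) at the given depth."""
    x0, y0, x1, y1 = extent
    key = ""
    for _ in range(depth):
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        q = (2 if y >= ym else 0) + (1 if x >= xm else 0)
        key += str(q)
        x0, x1 = (xm, x1) if x >= xm else (x0, xm)
        y0, y1 = (ym, y1) if y >= ym else (y0, ym)
    return key

def filter_by_prefix(index, query_prefixes):
    """Keep only items whose quad-key starts with one of the query
    prefixes; this mimics the cheap pre-filtering step that runs
    before the expensive geometric intersection tests."""
    return [item for key, item in index
            if any(key.startswith(p) for p in query_prefixes)]
```

In the MapReduce setting, the prefix decides which file chunk an item lands in, so a query only opens the chunks whose prefixes intersect its area.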

6) The General Spatial Interaction Model: Spatial interaction models are statistical models used to predict origin-destination flows. They are widely applied in geography, planning, transportation and the social sciences to predict interactions or flows related to commuting, migration, access to services, etc. Mathematically, suppose a series of observations y(i, j), i, j = 1, ..., n, is given on random variables Y(i, j), each of which corresponds to movements of people (or cars, commodities, telephone calls) between origin location i and destination location j. The Y(i, j) are assumed to be independent random variables, sampled from a specified probability distribution that depends on some mean, say μ(i, j). The statistical model then has the general form

Y(i, j) = μ(i, j) + ε(i, j),

where μ(i, j) = E[Y(i, j)] is the expected mean interaction frequency from i to j, and ε(i, j) is an error about the mean. The mean interaction frequencies between origin i and destination j are modelled as

μ(i, j) = A(i) B(j) S(i, j),

where A(i) are origin-specific factors, B(j) are destination-specific factors, and S(i, j) is a function of some measure of separation between locations i and j.

This interaction model relies on three types of factors:

1. Origin-specific factors that characterize the ability of the origin locations to produce or generate flows;

2. Destination-specific factors that represent the attractiveness of destinations;

3. Origin-destination factors that characterize the way the spatial separation of origins from destinations constrains or impedes the interaction.
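The three factors combine multiplicatively. A gravity-style sketch (names ours; the exponential separation function S(i, j) = exp(−β d_ij) is one common illustrative choice) computes the full matrix of mean flows at once:

```python
import numpy as np

def gravity_flows(origin_strength, dest_attract, dist, beta=1.0):
    """General spatial interaction model mu(i, j) = A(i) B(j) S(i, j)
    with an exponential separation function S(i, j) = exp(-beta * d_ij).
    Returns the n x n matrix of expected flows."""
    A = np.asarray(origin_strength, dtype=float)[:, None]   # origin factors
    B = np.asarray(dest_attract, dtype=float)[None, :]      # destination factors
    return A * B * np.exp(-beta * np.asarray(dist, dtype=float))
```

In calibrated models, A(i) and B(j) are estimated (e.g. by Poisson regression or iterative balancing) rather than supplied directly.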


A number of applications have been built on geospatial data. Broadly, they can be classified into the following:
1) Epidemiological data analysis
2) Geospatial data based recommender system for the location of health services
3) Seismic hazard assessment

A. Epidemiological data analysis

Wang et al [29] present a geospatial epidemiology analysis system on the TerraFly Geospatial Cloud platform [30]. They present four kinds of API algorithms for data analysis and result visualization on the TerraFly GeoCloud system: disease mapping (mortality/morbidity maps, SMR maps), disease cluster determination (spatial clustering, a HotSpot analysis tool, cluster and outlier analysis), geographic distribution measurement (mean center, median center, standard distance, distributional trends), and regression (linear regression, spatial auto-regression).

B. Geospatial data based recommender system for the location of health services

Martínez et al [32] have proposed a geospatial application that focuses on locating health care centers within a certain distance range. The proposed methodology works at three levels. The first level, the localization level, determines the user's location; the second level performs the semantic analysis, using an application ontology that describes the conceptualization of the health care services available to the user; the third level generates and displays statistics on vehicular traffic.

In the localization level, the current location of the patient is determined using a probabilistic multilateration method. This method considers the distances to the antennas within range of the mobile device and implements a calibration process. Once the system has the antenna positions and the distances between the user and the antennas, a system of equations is solved to obtain the current location of the patient. In the semantic analysis level, the user selects the type of health insurance; depending on the selection, the health care institutions and their specialties are displayed. The markers are colored according to the distance between the device's location and the hospitals, and a color scale is displayed. In the visualization level, the historical traffic data for the coverage area is analyzed and displayed on the map. The computation is based on information contained in a geospatial database, averaging the velocities at earlier dates within the coverage radius specified by the user.
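Solving the range equations can be sketched by linearization (names ours; the paper's exact calibration procedure is not reproduced): subtracting the first antenna's equation from the others turns |x − p_k|² = d_k² into a linear least-squares problem.

```python
import numpy as np

def multilaterate(antennas, distances):
    """Estimate a 2-D position from antenna coordinates and measured
    distances. Subtracting the first range equation from the others
    gives the linear system
        2 (p_k - p_0) . x = |p_k|^2 - |p_0|^2 - d_k^2 + d_0^2,
    solved here by least squares."""
    p = np.asarray(antennas, dtype=float)
    d = np.asarray(distances, dtype=float)
    A = 2 * (p[1:] - p[0])
    b = (np.sum(p[1:] ** 2, axis=1) - np.sum(p[0] ** 2)
         - d[1:] ** 2 + d[0] ** 2)
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos
```

With noisy measured distances, the least-squares solution averages out the errors across all antennas in range.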

Three different types of queries are supported by the system. The first is "General", where the user specifies only the coverage radius. The second is "By Category", where the user specifies the coverage radius and the type of health care center. The third is "By Service", where the user selects a medical specialty.
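The three query types amount to different filters over the same set of centers. A hypothetical dispatch sketch (the record layout with precomputed `distance`, `category` and `specialties` fields is our assumption, not the paper's schema):

```python
def query_centers(centers, mode, radius=None, category=None, specialty=None):
    """Dispatch the three query types: 'general' filters by coverage
    radius only, 'by_category' additionally filters by center type,
    and 'by_service' filters by medical specialty."""
    if mode == "general":
        return [c for c in centers if c["distance"] <= radius]
    if mode == "by_category":
        return [c for c in centers
                if c["distance"] <= radius and c["category"] == category]
    if mode == "by_service":
        return [c for c in centers if specialty in c["specialties"]]
    raise ValueError("unknown query mode: " + mode)
```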

C. Seismic hazard assessment

Zoran [33] presents an analysis of the Vrancea seismic zone in Romania, which lies at the junction of four tectonic blocks and is considered one of the most seismically active areas in Europe. The author identifies the steps that help in the zonation of the seismic area, which can be thought of as the factors in the data collection process.

Jaishree et al [34] have proposed a methodology for microzonation of a seismic area. Their case study is based on Chennai, Tamil Nadu (India), which has been in zone 3 since experiencing strong, frequent tremors after September 2001. They propose that a seismic microzonation map can be generated on a GIS platform using themes such as peak ground acceleration, geology, bedrock depth and lineaments.

Their proposed work involves the following steps:

1. Data collection,

2. Integration of the themes in GIS,

3. Spatial and proximity analysis,

4. Seismic zonation and

5. Data visualization.

The peak ground acceleration is obtained from the calculated attenuation relationship. Geology is an important factor for seismic study: the harder the rock, the higher the seismic wave velocity, which reduces earthquake intensity in areas of hard rock; seismic wave velocity is low in areas with fluvial deposits, so the earthquake hazard is greater in those regions. Hence a geology map of the study region is generated. The third map generated is the lineaments map, followed by the bedrock depth configuration map. The bedrock configuration of a site gives an idea of the basement topography, which in turn helps in the study of the frequencies and amplitudes of ground motions. All the generated map layers are overlaid in order to obtain the microzonation map of the study area.
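The overlay step amounts to a cell-wise weighted combination of rasterized theme layers. The grids and weights below are invented for illustration only; the actual layer weights in [34] come from the hierarchical analysis:

```python
import numpy as np

# Hypothetical 3x3 rasters, each cell rated 1 (low hazard) to 5 (high);
# the layer names follow the themes in the text, the values are made up.
layers = {
    "pga":       np.array([[5, 4, 3], [4, 3, 2], [3, 2, 1]]),
    "bedrock":   np.array([[4, 4, 3], [3, 3, 2], [2, 2, 1]]),
    "lineament": np.array([[3, 3, 2], [2, 2, 2], [1, 1, 1]]),
    "geology":   np.array([[2, 2, 2], [2, 1, 1], [1, 1, 1]]),
}
# Weights in the study's order of importance (illustrative values only).
weights = {"pga": 0.4, "bedrock": 0.3, "lineament": 0.2, "geology": 0.1}

# Cell-wise weighted overlay: one microzonation score per grid cell.
hazard = sum(weights[k] * layers[k] for k in layers)
print(hazard.round(2))
```

Each cell of `hazard` is the weighted hazard score for that grid location; classifying the scores into ranges yields the microzonation map.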

The influence of these layers was determined using the analytic hierarchy process developed by Saaty [35]. Importance was assigned in the order of peak ground acceleration, bedrock depth, lineaments and geology, respectively. The resulting values aid the decision-making process for infrastructure development in the city.
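In Saaty's analytic hierarchy process, layer weights are derived as the normalized principal eigenvector of a pairwise-comparison matrix. A minimal sketch follows; the judgment values are illustrative, not those used in [34]:

```python
import numpy as np

def ahp_weights(pairwise):
    """Derive AHP priority weights as the normalized principal
    eigenvector of a reciprocal pairwise-comparison matrix."""
    M = np.asarray(pairwise, dtype=float)
    vals, vecs = np.linalg.eig(M)
    principal = np.argmax(vals.real)          # dominant eigenvalue
    w = np.abs(vecs[:, principal].real)
    return w / w.sum()                        # normalize to sum to 1

# Hypothetical judgments on Saaty's 1-9 scale for the four themes, in the
# order PGA, bedrock depth, lineaments, geology (a_ij = i's importance over j).
M = [[1,   2,   3,   5],
     [1/2, 1,   2,   3],
     [1/3, 1/2, 1,   2],
     [1/5, 1/3, 1/2, 1]]
w = ahp_weights(M)
print(w)  # descending weights, matching the stated order of importance
```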


This study reviewed a diverse range of theories and methods for geospatial data analysis. Given the unique characteristics of spatial data, the following were identified as some of the existing open challenges:

1) Algorithms to handle spatial streaming data
2) New spatial data indexing schemes
3) Volunteered Geographic information systems
4) Mobile mapping and location based services
5) Object based data models for continuous data

The GPS (Global Positioning System) traces of public transit systems, which can be considered spatial stream data, are being stored in huge volumes in their databases. Newer indexing techniques are required to store this data and retrieve it efficiently. Such datasets may include a time series of attributes such as vehicle location, vehicle speed, fuel levels and greenhouse gas emissions.
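One simple indexing scheme of the kind called for here is a uniform grid, a coarse cousin of geohash-style indexes. The sketch below is our own illustration; the class name and cell size are arbitrary choices:

```python
import math
from collections import defaultdict

class GridIndex:
    """Minimal uniform-grid index for streaming GPS points: each point
    hashes to a lat/lon cell, so a range query scans only nearby cells
    instead of the whole stream."""

    def __init__(self, cell_deg=0.01):
        self.cell_deg = cell_deg
        self.cells = defaultdict(list)

    def _key(self, lat, lon):
        # Floor keeps cell keys consistent for negative coordinates.
        return (math.floor(lat / self.cell_deg),
                math.floor(lon / self.cell_deg))

    def insert(self, lat, lon, record):
        self.cells[self._key(lat, lon)].append((lat, lon, record))

    def query_box(self, lat_min, lat_max, lon_min, lon_max):
        """Return records whose points fall inside the bounding box."""
        i0, j0 = self._key(lat_min, lon_min)
        i1, j1 = self._key(lat_max, lon_max)
        hits = []
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                for lat, lon, rec in self.cells.get((i, j), []):
                    if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
                        hits.append(rec)
        return hits

# Streaming inserts of (vehicle position, record) pairs.
idx = GridIndex()
idx.insert(12.9716, 77.5946, "bus-101")
idx.insert(13.0827, 80.2707, "bus-202")
print(idx.query_box(12.9, 13.0, 77.5, 77.7))  # ['bus-101']
```

Production systems would instead use R-trees, geohashes or space-filling-curve keys, but the principle of pruning the search space by spatial hashing is the same.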

However, these datasets tend to grow large. Utilizing them requires the development of novel algorithms that exploit modern computing architectures and computational middleware.

More details on the issues that arise in mobility services with respect to GIS can be found in [36].

The geographic information in social media feeds represents a new type of geographic information. It is referred to as Ambient Geographic Information (AGI) [37], as it is embedded in the content of these feeds, often spread across numerous entries rather than contained within a single one, and has to be extracted from them. Nevertheless, it is of importance as it communicates real-time information about emerging issues.


In this study, we have presented a taxonomy for geospatial data and GIS systems. Based on this taxonomy, the available literature has been studied and categorized. Geospatial data is one of the major contributors to the big data paradigm, and hence research into newer storage techniques and newer systems plays an important role in the scientific community.


This research received funding from the Netherlands Organisation for Scientific Research (NWO) in the framework of the Indo-Dutch Science Industry Collaboration programme [NWO, Den Haag, PO Box 93138, NL-2509 AC The Hague, The Netherlands]. We are thankful to NWO, Royal Shell and Prof. Sebastian Meijer, the Principal Investigator of this project.


[1] Laney, Douglas, 3D Data Management: Controlling Data Volume, Velocity and Variety, Gartner, 6 February 2001

[2] Tao Wang and Xi Chen and An-ming Bao and Wei-sheng Wang, A new geospatial data model to facilitate geographic data mining and knowledge discovery, IEEE International Conference on Systems, Man and Cybernetics, 3642-3646, 2008

[3] J. Lacasta and F. J. Lopez-Pellicer and W. Renteria-Agualimpia and J. Nogueras-Iso, Improving the visibility of geospatial data on the Web, IEEE/ACM Joint Conference on Digital Libraries, 155-164, 2014

[4] Chi Ren Shyu and Matthew N. Klaric and Grant J. Scott and Adrian, S. Barb and Curt H. Davis and Kannappan Palaniappan, GeoIRIS: Geospatial Information Retrieval and Indexing System Content Mining, Semantics Modeling, and Complex Queries,IEEE Trans. Geoscience and Remote Sensing, 45, 4, 839–852, 2007

[5] A. Subramanian, Empowering geo spatial analysis with big data platform: Natural resource management, IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), 1-6, 2016

[6] S. Roy and S. Das, Spatial data infrastructures: Its metadata and analysis, 4th International Symposium on Emerging Trends and Technologies in Libraries and Information Services, 43-51, 2015

[7] Klein, Levente J. and Marianno, Fernando J. and Albrecht, Conrad M. and Freitag, Marcus and Lu, Siyuan and Hinds, Nigel and Shao, Xiaoyan and Rodriguez, Sergio Bermudez and Hamann, Hendrik F., PAIRS: A scalable geo-spatial data analytics platform, IEEE International Conference on Big Data, 1290-1298, 2015

[8] Xin Chen and Hoang Vo and Fusheng Wang, High performance integrated spatial big data analytics, 3rd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (BigSpatial'14), 11-14, 2014

[9] Stefan Hagedorn and Philipp Gotze and Kai-Uwe Sattler, Big Spatial Data Processing Frameworks: Feature and Performance Evaluation, International Conference on Extending Database Technology, 2017

[10] A. Aji, F. Wang et al., Hadoop-GIS: A High Performance Spatial Data Warehousing System over Mapreduce, VLDB, pp. 1009 -1020, 2013.

[11] A. Eldawy and M. F. Mokbel, A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data, VLDB, 2013.

[12] S. You, J. Zhang, and L. Gruenwald, Large-Scale Spatial Join Query Processing in Cloud, ICDE, 2015.

[13] J. Yu, J. Wu, and M. Sarwat, GeoSpark: A Cluster Computing Framework for Processing Large-Scale Spatial Data. SIGSPATIAL, p. 70, 2015

[14] Jae-Gil Lee and Minseo Kang, Geospatial Big Data: Challenges and Opportunities, Big Data Research, 2, 74-81, 2015

[15] David Haynes and Suprio Ray and Steven M. Manson and Ankit Soni, High performance analysis of big spatial data, IEEE International Conference on Big Data, Big Data 2015, 1953-1957, 2015

[16] Rabindra K. Barik and Harishchandra Dubey and Arun B. Samaddar and Rajan D. Gupta and Prakash K. Ray, FogGIS: Fog Computing for Geospatial Big Data Analytics, CoRR, abs/1701.02601, 2017

[17] Harald Bosch and Dennis Thom and Michael Worner and Steffen Koch and Edwin Puttmann and Dominik Jackle and Thomas Ertl, ScatterBlogs: Geo-spatial document analysis, IEEE Conference on Visual Analytics Science and Technology, 309-310, 2011

[18] Qiulei Guo and Balaji Palanisamy and Hassan A. Karimi, A MapReduce Algorithm for Polygon Retrieval in Geospatial Analysis, 8th IEEE International Conference on Cloud Computing, 901-908, 2015

[19] Zhen LIU and Huadong GUO and Changlin WANG, Considerations on Geospatial Big Data, IOP Conference Series: Earth and Environmental Science, 46, 1, 12-58, 2016

[20] Manfred M. Fischer and Jinfeng Wang, Spatial Data Analysis: Models, Methods and Techniques (Springerbriefs in Regional Science) (1st ed.), Springer Publishing Company, Incorporated, 2011

[21] Ranga Raju Vatsavai, Auroop Ganguly, Varun Chandola, Anthony Stefanidis, Scott Klasky, and Shashi Shekhar, Spatiotemporal data mining in the era of big spatial data: algorithms and applications. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (BigSpatial ’12) ACM, New York, NY, USA, 1-10, 2012

[22] Kaiser, Mark S, Soumendra N Lahiri, and Daniel J Nordman, Goodness of Fit Tests for a Class of Markov Random Field Models, The Annals of Statistics, 40(1), Institute of Mathematical Statistics, 104-130, 2012

[23] Andrea Kaplan, Mark Kaiser, Soumendra N Lahiri and Daniel Nordman, A Simple, Fast Sampler for Simulating Spatial Data and Other Markovian Data Structures, JSM, Section on Statistical Computing, 2017

[24] B. B. Mandelbrot, Fractals and chaos, the Mandelbrot set and beyond: selecta volume C. New York, N.Y., Springer, 2004.

[25] S. N. Rasband, Chaotic dynamics of nonlinear systems. New York, Wiley, 1990.

[26] P. Frankhauser, Fractal Geometry of Urban Patterns and their Morphogenesis, Discrete Dynamics in Nature and Society, vol. 2, pp. 127-145, 1998.

[27] A. Clauset, C. R. Shalizi, and M. E. J. Newman, Power-Law Distributions in Empirical Data, SIAM Review, vol. 51, no. 4, pp. 661-703, Nov. 2009.

[28] Qiulei Guo, Balaji Palanisamy, Hassan A. Karimi, A MapReduce Algorithm for Polygon Retrieval in Geospatial Analysis, CLOUD, 901- 908, 2015

[29] H. Wang, Y. Lu, Y. Guang, E. Edrosa, M. Zhang, R. Camarca, Y. Yesha, T. Lucic, N. Rishe, Epidemiological Data Analysis in TerraFly Geo-Spatial Cloud, International Conference on Machine Learning and Applications, 485-490, 2013

[30] Mingjin Zhang, Huibo Wang, Yun Lu, Tao Li, Yudong Guang, Chang Liu, Erik Edrosa, Hongtai Li, Naphtali Rishe, TerraFly GeoCloud, ACM Transactions on Intelligent Systems and Technology, vol. 6, pp. 1, 2015

[31] D. Lopez, M. Gunasekaran, B. S. Murugan, H. Kaur and K. M. Abbas, Spatial big data analytics of influenza epidemic in Vellore, India, 2014 IEEE International Conference on Big Data (Big Data), 19-24, 2014

[32] M. L. Martnez et al., Geospatial Recommender System for the Location of Health Services, 14th International Conference on Computational Science and Its Applications, Guimaraes, 200-203, 2014

[33] M. Zoran, Use of geospatial and in situ information for seismic hazard assessment in Vrancea area, Romania, Second Workshop on Use of Remote Sensing Techniques for Monitoring Volcanoes and Seismogenic Areas, Naples, 1-5, 2008

[34] S. R. Jaishree, T. Vindya and M. K. Sandhya, Micro-seismic Zonation based on Geospatial Data using GIS Technology, Proceedings of National Conference on Communication and Informatics, 93-96, 2016

[35] Thomas L. Saaty, Decision making with the analytic hierarchy process, International Journal of Services Sciences, 1, 1, 83-98, 2008

[36] S. Shekhar, V. Gunturi, M. R. Evans, and K. Yang, Spatial big-data challenges intersecting mobility and cloud computing. In Proceedings of the Eleventh ACM International Workshop on Data Engineering for Wireless and Mobile Access, MobiDE ’12, 1-6, 2012

[37] A. Stefanidis, A. Crooks, and J. Radzikowski, Harvesting ambient geospatial information from social media feeds. GeoJournal, 1-20, 2011