GeoSpark is a cluster computing system for processing large-scale spatial data. It extends Apache Spark with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) that efficiently load, process, and analyze large-scale spatial data across machines. Processing spatial data at this scale is challenging because (1) spatial data may be quite complex, e.g., the geometrical boundaries of rivers and cities, and (2) spatial and geometric operations (e.g., Overlap, Intersect, Convex Hull, Cartographic Distances) cannot be easily and efficiently expressed using regular RDD transformations and actions. GeoSpark provides APIs that let Apache Spark programmers easily develop spatial analysis programs with SRDDs, which have in-house support for geometrical and distance operations. Experiments show that GeoSpark is scalable and exhibits faster run-time performance than Hadoop-based systems in spatial analysis applications such as spatial join, spatial aggregation, spatial autocorrelation analysis, and spatial co-location pattern recognition.
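
To make the difficulty concrete, here is a minimal single-machine sketch in plain Python of one common way to organize a spatial join: bucket the points into a uniform grid, then compare each query rectangle only against the points in the cells it overlaps. This is not GeoSpark's API or implementation. The function names, the grid scheme, and the cell size are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import product


def cells_for_box(box, cell_size):
    """Return every grid cell (ix, iy) overlapped by box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    xs = range(int(xmin // cell_size), int(xmax // cell_size) + 1)
    ys = range(int(ymin // cell_size), int(ymax // cell_size) + 1)
    return product(xs, ys)


def grid_spatial_join(points, boxes, cell_size=10.0):
    """Return sorted (box_id, point_id) pairs where the point lies inside the box.

    Hypothetical helper: points maps id -> (x, y), boxes maps id -> (xmin, ymin, xmax, ymax).
    """
    # Partition step: bucket each point by the grid cell that contains it.
    buckets = defaultdict(list)
    for pid, (x, y) in points.items():
        buckets[(int(x // cell_size), int(y // cell_size))].append(pid)

    # Local join step: test a box only against points in the cells it overlaps.
    result = set()
    for bid, box in boxes.items():
        xmin, ymin, xmax, ymax = box
        for cell in cells_for_box(box, cell_size):
            for pid in buckets.get(cell, []):
                x, y = points[pid]
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    result.add((bid, pid))
    return sorted(result)


points = {"p1": (1.0, 1.0), "p2": (12.0, 3.0), "p3": (25.0, 25.0)}
boxes = {"a": (0.0, 0.0, 15.0, 5.0), "b": (20.0, 20.0, 30.0, 30.0)}
pairs = grid_spatial_join(points, boxes)
```

Even in this toy form, the join needs a spatial partitioning step and a geometric containment test, neither of which corresponds directly to a key-equality join over ordinary records.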

Database management systems have traditionally expected users to know in advance what kind of data they need to query. In many cases, users do not know exactly what data they need; instead, they prefer to explore the database. To this end, this project extends existing database systems to support recommendation as a means of data exploration. In this project, we designed RecDB, a full-fledged database system that produces data recommendations for end-users. The system incorporates state-of-the-art recommendation algorithms into the core functionality of a database query execution engine. RecDB allows its users to write SQL queries that seamlessly integrate the recommendation functionality with traditional relational operators, i.e., SELECT, PROJECT, JOIN. The system optimizes incoming recommendation queries (written in SQL) and hence provides near real-time personalized recommendations to a large number of end-users who have expressed their opinions over a large pool of data items.
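
The idea of combining recommendation with a relational selection can be illustrated with a small plain-Python sketch. It does not use RecDB's SQL syntax or its query engine. Item-based collaborative filtering with cosine similarity is assumed here only as a familiar example algorithm, and the `recommend` function, its `predicate` parameter, and the sample data are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors of the form {user: rating}."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm


def recommend(ratings, user, predicate=lambda item: True, k=3):
    """Return the top-k (item, predicted_rating) pairs for `user`.

    ratings is a list of (user, item, rating) rows; predicate plays the role
    of a selection filter applied to candidate items.
    """
    by_item = {}
    for u, i, r in ratings:
        by_item.setdefault(i, {})[u] = r
    seen = {i: r for u, i, r in ratings if u == user}

    scores = {}
    for item, vec in by_item.items():
        # Skip items the user already rated and items rejected by the filter.
        if item in seen or not predicate(item):
            continue
        num = den = 0.0
        for seen_item, r in seen.items():
            sim = cosine(vec, by_item[seen_item])
            num += sim * r
            den += abs(sim)
        if den > 0:
            scores[item] = num / den  # similarity-weighted average of the user's ratings
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


ratings = [
    ("alice", "m1", 5), ("alice", "m2", 1),
    ("bob", "m1", 5), ("bob", "m3", 4), ("bob", "m4", 1),
    ("carol", "m2", 5), ("carol", "m4", 5),
]
genres = {"m1": "action", "m2": "drama", "m3": "action", "m4": "drama"}

all_recs = recommend(ratings, "alice")
drama_recs = recommend(ratings, "alice", predicate=lambda i: genres[i] == "drama")
```

Here the filter is applied before scoring, so excluded items are never scored. Deciding where such a filter runs relative to the recommendation step is the kind of choice a query optimizer can make once recommendation is part of the query plan.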