Knowledge Cubes | Amgad Madkour

Overview

A Knowledge Cube, or KC for short, is a semantically-guided data management architecture, where data semantics influences the data management architecture rather than a predefined scheme. KC relies on semantics to define how the data is fetched, organized, stored, optimized, and queried. Knowledge cubes use RDF to store data. This allows knowledge cubes to store Linked Data from the Web of Data. Knowledge cubes envisions breaking down the centralized architecture into multiple specialized cubes, each having its own index and data store.

Github Project: Knowledge Cubes

Create Encode Data

java -cp uber-knowledgecubes-0.1.0.jar:scala-library-2.11.0.jar edu.purdue.knowledgecubes.DictionaryEncoderCLI -i src/main/resources/datasets/original/sample.nt -o /home/amadkour/kclocal/encoded.nt -l /home/amadkour/kclocal -s space

The kclocal will contain the created dictionaries, the initial data structure used by the store

Create Store

spark-submit --master local[*] --class edu.purdue.knowledgecubes.StoreCLI target/uber-knowledgecubes-0.1.0.jar -i /home/amadkour/kclocal/encoded.nt -l /home/amadkour/kclocal -f 0.01 -t roaring -d /home/amadkour/kcdb

The database kcdb directory contains the actual data and reductions for the input NT file. The following is the directory structure of the local store:

Run Query Workload

spark-submit --master local[*] --class edu.purdue.knowledgecubes.BenchmarkReductionsCLI target/uber-knowledgecubes-0.1.0.jar -l /home/amadkour/kclocal -f 0.01 -t roaring -d /home/amadkour/kcdb -q src/main/resources/queries/original

The command generates the workload reductions under the kcdb/reductions/join directory. The partitions are saved using parquet format.

Local Store Overview

$ ls
amadkour@amadkour:~/kclocal$ ls
GEFI  dbinfo.yaml  dictionary  encoded.nt  join-reductions.yaml
results-20200625115017.txt  tables.yaml

GEFI: directory represents the generalized filters created for the input datasets.
dbinfo.yaml: file lists meta-data about the store datasets.
dictionary: directory containts the string to id mappings created by the dictionary module.
join-reductions.yaml: directory contains metadata about the generated reductions.
results-20200625115017.txt: is the output file containing the query performance output when running the benchmarking modules.
tables.yaml: file lists the meta-data about the tables.

Database Directory Overview

amadkour@amadkour:~/kcdb$ ls
data  reductions

The database contains parquet formatted files that represents the original data and reductions:

data: contains the original data created based on the input NT files.
reductions: contains the workload-driven reductions created after running a query workload (e.g. after running the Benchmark CLI tool mentioned below).

Program such as spark-shell can be used to view the parquet file content:

scala> var data = spark.read.parquet("/home/amadkour/kcdb/reductions/join/13_TRPO_JOIN_13_TRPS")
data: org.apache.spark.sql.DataFrame = [s: int, p: int ... 1 more field]

scala> data.show()
+---+---+---+
|  s|  p|  o|
+---+---+---+
| 11| 13|  3|
| 11| 13|  4|
| 12| 13|  5|
|  8| 13|  1|
+---+---+---+

WORQ: Workload-Driven RDF Query Processing

KC uses a workload-driven RDF query processing technique, or WORQ for short, for filtering non-matching entries during join evaluation as early as possible to reduce the communication and computation overhead. WORQ generates a reduced sets of triples (or reductions, for short) to represent join pattern(s) of query workloads. WORQ can materialize the reductions on disk or in memory and reuses the reductions that share the same join pattern(s) to answer queries. Furthermore, these reductions are not computed beforehand, but are rather computed in an online fashion. KC also answer complex analytical queries that involve unbound properties. Based on a realization of KC on top of Spark, extensive experimentation demonstrates an order of magnitude enhancement in terms of preprocessing, storage, and query performance compared to the state-of-the-art cloud-based solutions.

Features

A spark-based API for SPARQL querying
Efficient execution of frequent workload join patterns
Materialze workload join patterns in memory or on disk
Efficiently answer unbound property queries

Usage

KC provide spark-based API for issuing RDF related operations. There are three steps necessary for running the system:

Dictionary Encoding
Store Creation
Querying

Dictionary Encoding

KC requires that the dataset be dictionary encoded. The dictionary encoding allows adding resources (subjects or objects) as integers to the filters.

java -cp target/uber-knowledgecubes-0.1.0.jar edu.purdue.knowledgecubes.DictionaryEncoderCLI -i [NT File] -o [Output File] -l [Local Path for the new store] -s space

The command generates a dictionary encoded version of the dataset. This encoded NT file is used for creating the store. KC automatically encodes and decodes SPARQL queries and the corresponding results.

Store Creation

KC provide the Store class for creation of an RDF store. The input to the store is a spark session, database path where the RDF dataset will be stored, and a local configuration path.

import org.apache.spark.sql.SparkSession

import edu.purdue.knowledgecubes.GEFI.GEFIType
import edu.purdue.knowledgecubes.GEFI.join.GEFIJoinCreator
import edu.purdue.knowledgecubes.storage.persistent.Store

val localPath = "/path/to/local/path"
val dbPath = "/path/to/db/path"
val ntPath = "/path/to/rdf/file"

val spark = SparkSession.builder
            .appName(s"KnowledgeCubes Store Creator")
            .getOrCreate()

val store = Store(spark, dbPath, localPath)
store.create(ntPath)

SPARQL Querying

KC provides a SPARQL query processor that takes as input the spark session, database path of where the RDF dataset was created, local configuration file path, a filter type, and a false postivie rate (if any).

import org.apache.spark.sql.SparkSession

import edu.purdue.knowledgecubes.queryprocessor.QueryProcessor
import edu.purdue.knowledgecubes.GEFI.GEFIType

val spark = SparkSession.builder
            .appName(s"Knowledge Cubes Query")
            .getOrCreate()

val localPath = "/path/to/local/path"
val dbPath = "/path/to/db/path"
val filterType = GEFIType.ROARING // Roaring bitmap
val falsePositiveRate = 0

val queryProcessor = QueryProcessor(spark, dbPath, localPath, filterType, falsePositiveRate)

val query =
  """
    SELECT ?GivenName ?FamilyName WHERE{
        ?p <http://yago-knowledge.org/resource/hasGivenName> ?GivenName .
        ?p <http://yago-knowledge.org/resource/hasFamilyName> ?FamilyName .
        ?p <http://yago-knowledge.org/resource/wasBornIn> ?city .
        ?p <http://yago-knowledge.org/resource/hasAcademicAdvisor> ?a .
        ?a <http://yago-knowledge.org/resource/wasBornIn> ?city .
    }
  """.stripMargin

// Returns a Spark DataFrame containing the results
val r = queryProcessor.sparql(query)

Constructing Filters

Additionaly, KC provides an API for creating additional filters. KC provides exact and approximate structures for filtering data. Currently KC supports GEFIType.BLOOM, GEFIType.ROARING, and GEFIType.BITSET.

import org.apache.spark.sql.SparkSession

import edu.purdue.knowledgecubes.GEFI.GEFIType
import edu.purdue.knowledgecubes.GEFI.join.GEFIJoinCreator
import edu.purdue.knowledgecubes.utils.Timer

val spark = SparkSession.builder
            .appName(s"KnowledgeCubes Filter Creator")
            .getOrCreate()

var localPath = "/path/to/db/path"
var dbPath = "/path/to/local/path"
var filterType = GEFIType.ROARING
var fp = 0

val filter = new GEFIJoinCreator(spark, dbPath, localPath)
filter.create(filterType, fp)

Query Execution Benchmarking

KC provides a set of benchmarking classes

BenchmarkFilteringCLI: For benchmarking the query execution when using filters
BenchamrkReductionsCLI: For benchmarking the query execution when using reductions only

Publications

Amgad Madkour, Ahmed M. Ali, Walid G. Aref, “WORQ: Workload-driven RDF Query Processing”, ISWC 2018 [Paper][Slides]
Amgad Madkour, Walid G. Aref, Ahmed M. Aly, “SPARTI: Scalable RDF Data Management Using Query-Centric Semantic Partitioning”, Semantic Big Data (SBD18) [Paper][Slides]
Amgad Madkour, Walid G. Aref, Sunil Prabhakar, Mohamed Ali, Siarhei Bykau, “TrueWeb: A Proposal for Scalable Semantically-Guided Data Management and Truth Finding in Heterogeneous Web Sources”, Semantic Big Data (SBD18) [Paper][Slides]
Amgad Madkour, Walid G. Aref, Saleh Basalamah, “Knowledge Cubes - A Proposal for Scalable and Semantically-Guided Management of Big Data”, IEEE BigData 2013 [Paper][Slides]