Introduction to RDF and SPARQL

Conversion reminder

BIOM files created with NG-Tax can also be queried when you need to extract additional information. NG-Tax BIOM files, or BIOM files obtained from other applications, can be converted to RDF with the NG-Tax conversion command:

java -jar NGTax.jar -biom2rdf -i humangut.biom -o humangut.ttl 

log4j:WARN No appenders could be found for logger (nl.wur.ssb.NGTax.App).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
humangut.biom
Converting humangut.biom to RDF format.
(75/474) >> Creating RDF database for sample name: 1927.SRS019910.SRX020516.SRR044347


Load file into a graph database

RDF files can be loaded into so-called triple stores: dedicated graph databases that are easy to set up and load data into. The tutorial can be found here. The RDF representation of the paired-end example biom file has been loaded into such a system and can be queried via the web interface or via R.

Jena-Fuseki

  • This was tested on apache-jena-fuseki-3.12.0
  • Apache jena-fuseki was downloaded from the apache website and decompressed
  • Start the fuseki server using ./fuseki start
  • The web server should start immediately and be available at http://localhost:3030
  • Go to manage datasets
  • Add a new dataset
  • Give it a name; for testing with a small file, create an in-memory database
  • Once you have created the dataset you can start uploading data
  • Click upload files and select the file you want to upload
  • Make sure the file name ends with .ttl to denote that this is an RDF Turtle file
  • Click upload now and a progress bar should appear
  • Once finished it should mention how many triples were added (e.g. Result: success. 28635 triples)
  • You can now perform queries, click the query button.
  • An example query could be: SELECT * WHERE { ?subject a ?object . } LIMIT 100
  • This should show you some results from the RDF file

Jena

  • This was tested on apache-jena-3.12.0
  • Apache jena was downloaded from the apache website and decompressed
  • A relatively small Turtle file (extension .ttl) can be queried directly with the following command: ./bin/sparql --data=/home/rdf/008F355R_rep1_70.ttl --query query.txt
  • The query.txt file contains the query to be executed, for example: SELECT * WHERE { ?subject a ?object } LIMIT 10
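
If you prefer to prepare the query file from R, the Jena steps above can be sketched as follows. This is only an illustration: the query file is written to a temporary directory, the data path is the example path from the list above, and the actual call to Jena is left commented out because it assumes you run it from the decompressed Jena directory.

```r
# Write the example query to a file that the Jena command line can read.
query <- "SELECT * WHERE { ?subject a ?object } LIMIT 10"
query_file <- file.path(tempdir(), "query.txt")
writeLines(query, query_file)

# Example path from the list above; replace it with your own Turtle file.
data_file <- "/home/rdf/008F355R_rep1_70.ttl"
args <- c(paste0("--data=", data_file), "--query", query_file)

# Show the command that would be executed.
cat("./bin/sparql", args, "\n")

# To actually run it from the Jena directory, uncomment:
# system2("./bin/sparql", args)
```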

GraphDB

  • This was tested on GraphDB 8.5.0
  • Once started it should be running by default on localhost:7200
  • On the left, click setup > repositories > create new repository
  • The only requirement is to give it a name; the rest can stay at the defaults. Then click create
  • Once created, click on the power plug icon to connect to that specific database
  • To load the RDF file, click import > RDF > Upload RDF files and select your .ttl file.
  • It should appear in the load overview and you can import it by clicking on the right import button
  • Click import again on the popup window
  • A small file should only take a few seconds to load.
  • To query the database, click on SPARQL and the default (select all) query should show up.
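
Once a dataset is loaded, querying it from R only requires the URL of the store's SPARQL endpoint. The sketch below builds that URL for a local Fuseki or GraphDB instance. The helper `local_endpoint` and the dataset/repository name `ngtax` are hypothetical, and the URL layouts assume default installations (Fuseki answering queries at `/<dataset>/sparql`, GraphDB at `/repositories/<repository>`); adjust them if your setup differs.

```r
# Hypothetical helper: build the SPARQL endpoint URL of a local triple store.
# Assumes the default ports and URL layouts of Fuseki and GraphDB.
local_endpoint <- function(store = c("fuseki", "graphdb"), name) {
  store <- match.arg(store)
  switch(store,
    fuseki  = paste0("http://localhost:3030/", name, "/sparql"),
    graphdb = paste0("http://localhost:7200/repositories/", name)
  )
}

# "ngtax" is a made-up dataset/repository name.
endpoint_fuseki  <- local_endpoint("fuseki", "ngtax")
endpoint_graphdb <- local_endpoint("graphdb", "ngtax")
print(endpoint_fuseki)
print(endpoint_graphdb)
```

Either value can then be used wherever an endpoint URL is expected in R.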


Basics to use SPARQL

# Dependencies used in this documentation
# install.packages("DT")
# install.packages("ggplot2")
# install.packages("plotly")
# install.packages("d3r")
# install.packages("dplyr")
# install.packages("stringr")
# install.packages("treemap")
# install.packages("devtools")

# sankeytreeR is installed from GitHub (only needed once)
devtools::install_github("timelyportfolio/sankeytree")

library(DT) 
library(ggplot2)
library(plotly)
library(d3r)
library(dplyr)
library(treemap)
library(sankeytreeR)
# Needed for SPARQL
# install.packages("SPARQL")
library(SPARQL) # The query library
# SPARQL endpoint where the triple store endpoint is located
endpoint = "http://nvme1.wurnet.nl:7200/repositories/NGTax2Demo"


Example queries

Analysis settings

  query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                  SELECT DISTINCT ?library ?predicate ?object
                  WHERE {
                    ?library gbol:provenance ?prov .
                    ?prov gbol:annotation ?annot  .
                    ?annot ?predicate ?object .
                  }")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)


Overview of number of samples per library

  query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                  SELECT DISTINCT ?library (COUNT(?sample) AS ?sampleCount)
                  WHERE {
                    ?library gbol:sample ?sample .
                  } GROUP BY ?library")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)


Overview of the samples in the dataset and corresponding metadata

  query <- paste0("PREFIX gbol: <http://gbol.life/0.1/>
                  SELECT ?sampleName ?totalCounts ?percentAcceptedReads ?numAcceptedOtuBeforeChimera ?numRejectedOtu ?evenness 
                  WHERE {
                    ?library gbol:sample ?sample .
                    ?sample gbol:name ?sampleName .
                    ?sample gbol:metadata/gbol:totalCounts ?totalCounts .
                    ?sample gbol:metadata/gbol:percentAcceptedReads ?percentAcceptedReads .
                    ?sample gbol:metadata/gbol:numAcceptedOtuBeforeChimera ?numAcceptedOtuBeforeChimera .
                    ?sample gbol:metadata/gbol:numRejectedOtu ?numRejectedOtu.
                    ?sample gbol:metadata/gbol:evenness ?evenness .
                  }")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)


Number of ASVs per sample

  query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                   SELECT DISTINCT ?sampleName (COUNT(?asv) AS ?count)
                   WHERE {
                     ?library gbol:sample ?sample .
                     ?sample gbol:name ?sampleName .
                     ?sample gbol:asv ?asv .
                     ?asv a gbol:ASVSet .
                   } GROUP BY ?sampleName")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)


Overview of ASV and taxonomic lineage

 query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                  SELECT ?sampleName ?clusteredReadCount ?taxonName
                  WHERE { 
                    ?lib a gbol:Library .
                    ?lib gbol:sample ?sample .
                    ?sample gbol:name ?sampleName .
                    ?sample gbol:asv ?asv .
                    ?asv a gbol:ASVSet .
                    ?asv gbol:assignedTaxon ?assignedTaxon .
                    ?asv gbol:clusteredReadCount ?clusteredReadCount .
                    ?assignedTaxon gbol:taxonName ?taxonName .
                  }")
  dfGraph <- SPARQL(endpoint, query)$results
  library(stringr)
  # split taxonName into 6 columns, one per taxonomic rank
  taxonomy = str_split_fixed(dfGraph$taxonName, ";", 6)
  dfGraph = cbind.data.frame(dfGraph$sampleName, dfGraph$clusteredReadCount, taxonomy)
  colnames(dfGraph) = c("sampleName", "counts", "Domain", "Phylum", "Class", "Order", "Family", "Genus")
  datatable(dfGraph)
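
To make the splitting step concrete without needing an endpoint, here is a base-R sketch of the same idea applied to two made-up lineage strings. The helper `split_lineage` and the example lineages are illustrative assumptions, not NG-Tax output; the point is that lineages with fewer than six ranks end up padded with empty strings, which is also how `str_split_fixed` behaves.

```r
# Hypothetical helper: split ";"-separated lineages into one column per rank.
split_lineage <- function(lineage, ranks) {
  n <- length(ranks)
  pieces <- strsplit(lineage, ";", fixed = TRUE)
  rows <- lapply(pieces, function(p) {
    if (length(p) > n) {
      # Keep any surplus text in the last column
      p <- c(p[seq_len(n - 1)], paste(p[n:length(p)], collapse = ";"))
    }
    # Pad short lineages with empty strings
    c(p, rep("", n - length(p)))
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  colnames(out) <- ranks
  out
}

ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus")
# Made-up lineages: one complete, one resolved only to class level
toy <- c("Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Blautia",
         "Bacteria;Bacteroidetes;Bacteroidia")
taxonomy_toy <- split_lineage(toy, ranks)
print(taxonomy_toy)
```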


ASV sequences

  query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                  SELECT ?fseq ?rseq ?taxonName
                  WHERE { 
                    ?lib a gbol:Library .
                    ?lib gbol:sample ?sample .
                    ?sample gbol:name ?sampleName .
                    ?sample gbol:asv ?asv .
                    ?asv a gbol:ASVSet .
                    ?asv gbol:forwardASV ?fasv .
                    ?fasv gbol:sequence ?fseq .
                    ?asv gbol:reverseASV ?rasv .
                    ?rasv gbol:sequence ?rseq .
                    ?asv gbol:assignedTaxon ?assignedTaxon .
                    ?assignedTaxon gbol:taxonName ?taxonName .
                  }")
  dfGraph2 <- SPARQL(endpoint, query)$results
  library(stringr)
  # split taxonName into 6 columns, one per taxonomic rank
  taxonomy = str_split_fixed(dfGraph2$taxonName, ";", 6)
  dfGraph2 = cbind.data.frame(dfGraph2$fseq, dfGraph2$rseq, taxonomy)
  colnames(dfGraph2) = c("fseq", "rseq", "Domain", "Phylum", "Class", "Order", "Family", "Genus")
  datatable(dfGraph2)


Chimera sequences

  query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                  SELECT ?fseq ?rseq 
                  WHERE { 
                    ?lib a gbol:Library .
                    ?lib gbol:sample ?sample .
                    ?sample gbol:name ?sampleName .
                    ?sample gbol:asv ?asv .
                    ?asv a gbol:RejectedAsChimera .
                    ?asv gbol:forwardASV ?fasv .
                    ?fasv gbol:sequence ?fseq .
                    ?asv gbol:reverseASV ?rasv .
                    ?rasv gbol:sequence ?rseq .
                  }")
  dfGraph2 <- SPARQL(endpoint, query)$results
  # keep only the forward and reverse sequence columns
  dfGraph2 = cbind.data.frame(dfGraph2$fseq, dfGraph2$rseq)
  colnames(dfGraph2) = c("fseq", "rseq")
  datatable(dfGraph2)


ASV sequences shared among samples

  query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                   SELECT DISTINCT ?fseq ?rseq (COUNT(DISTINCT(?sample)) AS ?samples)
                   WHERE {
                     ?sample a gbol:Sample .
                     ?sample gbol:asv ?asv .
                     ?asv gbol:forwardASV ?fasv .
                     ?fasv gbol:sequence ?fseq .
                     ?asv gbol:reverseASV ?rasv . 
                     ?rasv gbol:sequence ?rseq .
                   } GROUP BY ?fseq ?rseq")
  dfGraph2 <- SPARQL(endpoint, query)
  datatable(dfGraph2$results)


Number of reads per sample

  query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                  SELECT ?sampleName ?totalCounts
                  WHERE {
                    ?library gbol:sample ?sample .
                    ?sample gbol:name ?sampleName .
                    ?sample gbol:metadata/gbol:totalCounts ?totalCounts .
                  }")
  dfGraph2 <- SPARQL(endpoint, query)$results
  p <- ggplot(data=dfGraph2, aes(x=sampleName, y=totalCounts)) + 
    geom_bar(stat="identity", fill="steelblue") + 
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  p


Histogram of OTU counts

  query <- paste0("PREFIX gbol:<http://gbol.life/0.1/>
                  SELECT ?asvid ?clusteredReadCount
                  WHERE {
                    ?library gbol:sample ?sample .
                    ?sample gbol:name ?sampleName .
                    ?sample gbol:asv ?asv .
                    ?asv a gbol:ASVSet .
                    ?asv gbol:clusteredReadCount ?clusteredReadCount .
                    ?asv gbol:masterASVId ?asvid .
                  }")
  dfGraph2 <- SPARQL(endpoint, query)$results
  p <- plot_ly(x = dfGraph2$clusteredReadCount, type = "histogram")
  p  # plot_ly() already returns a plotly object, so no ggplotly() conversion is needed


Total abundances (raw reads)

*Uses the dfGraph data frame from the ASV table (Overview of ASV and taxonomic lineage)

  taxonomy = paste(sep = ";", dfGraph$Domain, dfGraph$Phylum, dfGraph$Class, dfGraph$Order, dfGraph$Family, dfGraph$Genus)
  dfGraph2 = data.frame(sampleName = as.vector(dfGraph$sampleName), counts = as.numeric(dfGraph$counts), taxonomy = taxonomy)
  #sum the counts of the same taxonomy within each sample
  dfGraph2 = aggregate(counts ~ taxonomy + sampleName, data = dfGraph2, sum)
  p <- ggplot() + 
    geom_bar(aes(y = counts, x = sampleName, fill = taxonomy), data = dfGraph2, stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    theme(legend.position="none") 
  ggplotly(p)


Total abundances (%)

*Uses the dfGraph data frame from the ASV table (Overview of ASV and taxonomic lineage)

  taxonomy = paste(sep = ";", dfGraph$Domain, dfGraph$Phylum, dfGraph$Class, dfGraph$Order, dfGraph$Family, dfGraph$Genus)
  dfGraph2 = data.frame(sampleName = as.vector(dfGraph$sampleName), counts = as.numeric(dfGraph$counts), taxonomy = taxonomy)
  #sum the counts of the same taxonomy within each sample
  dfGraph2 = aggregate(counts ~ taxonomy + sampleName, data = dfGraph2, sum)
  p <- ggplot(dfGraph2, aes(x=sampleName, y=counts, fill=taxonomy)) + 
    geom_bar(stat="identity", position = "fill") + 
    scale_y_continuous() +
    theme(legend.position="none") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  ggplotly(p)
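
The `position = "fill"` setting rescales every sample's bar so that its segments sum to 1. If you also want the percentages as numbers, the same normalisation can be computed explicitly. The sketch below does this in base R on a small made-up table; the sample names, taxonomy labels, and counts are invented for illustration.

```r
# Made-up counts: two samples, two taxa, with one taxon split over two ASVs
toy <- data.frame(
  sampleName = c("S1", "S1", "S1", "S2", "S2"),
  taxonomy   = c("A", "A", "B", "A", "B"),
  counts     = c(10, 20, 70, 5, 15)
)

# Sum the counts of the same taxonomy within each sample
summed <- aggregate(counts ~ taxonomy + sampleName, data = toy, FUN = sum)

# Divide each count by the total of its sample to obtain percentages
sample_totals <- ave(summed$counts, summed$sampleName, FUN = sum)
summed$percent <- 100 * summed$counts / sample_totals
print(summed)
```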


Presence/absence of taxonomic lineages as a heatmap

*Uses the dfGraph data frame from the ASV table (Overview of ASV and taxonomic lineage)

  dfGraph2 = data.frame(sampleName = as.vector(dfGraph$sampleName), taxonomy = as.vector(dfGraph$Genus))
  #count how often each genus occurs in each sample
  dfGraph2 <- as.data.frame(table(dfGraph2), responseName = "counts")
  #reduce the counts to presence (1) or absence (0)
  dfGraph2$counts <- as.integer(dfGraph2$counts > 0)
  p <- ggplot(dfGraph2, aes(sampleName, taxonomy)) + 
    geom_tile(aes(fill = counts), colour = "white") + 
    scale_fill_gradient(low = "white", high = "steelblue") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  p
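
The heatmap is built on a presence/absence transformation: cross-tabulate samples against genera, then reduce every non-zero count to 1. A minimal base-R sketch of that transformation on made-up data (the sample and genus values are invented for illustration):

```r
# Made-up ASV table: one row per ASV, with its sample and assigned genus
toy <- data.frame(
  sampleName = c("S1", "S1", "S2", "S2", "S2"),
  Genus      = c("Blautia", "Blautia", "Blautia", "Roseburia", "Roseburia")
)

# Cross-tabulate: number of ASVs per sample (rows) and genus (columns)
counts <- table(toy$sampleName, toy$Genus)

# Any non-zero count becomes 1 (present); zero stays 0 (absent)
presence <- (unclass(counts) > 0) * 1L
print(presence)
```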



Advanced queries

Overlapping OTUs

query <- paste0("PREFIX gbol: <http://gbol.life/0.1/>
                SELECT DISTINCT ?sampleName ?taxonName
                WHERE {
                  ?sample a gbol:Sample .
                  ?sample gbol:name ?sampleName .
                  ?sample gbol:asv ?asv .
                  ?asv a gbol:ASVSet . 
                  ?asv gbol:assignedTaxon ?assignedTaxon .
                  ?assignedTaxon gbol:taxonName ?taxonName .
                  {
                    SELECT DISTINCT ?taxonName (COUNT(DISTINCT(?sample)) AS ?sampleC)
                    WHERE {
                      ?sample a gbol:Sample .
                      ?sample gbol:asv ?asv .
                      ?asv gbol:assignedTaxon ?assignedTaxon .
                      ?assignedTaxon gbol:taxonName ?taxonName .
                    } GROUP BY ?taxonName
                    HAVING(?sampleC > 1)
                  }
                }")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)


Non-overlapping OTUs

query <- paste0("PREFIX gbol: <http://gbol.life/0.1/>
                 SELECT DISTINCT ?sampleName ?taxonName
                 WHERE {
                  ?sample a gbol:Sample .
                  ?sample gbol:name ?sampleName .
                  ?sample gbol:asv ?asv .
                  ?asv a gbol:ASVSet . 
                  ?asv gbol:assignedTaxon ?assignedTaxon .
                  ?assignedTaxon gbol:taxonName ?taxonName .
                  {
                    SELECT DISTINCT ?taxonName (COUNT(DISTINCT(?sample)) AS ?sampleC)
                    WHERE {
                      ?sample a gbol:Sample .
                      ?sample gbol:asv ?asv .
                      ?asv gbol:assignedTaxon ?assignedTaxon .
                      ?assignedTaxon gbol:taxonName ?taxonName .
                    } GROUP BY ?taxonName
                    HAVING(?sampleC = 1)
                  }
                }")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)
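
The two queries above differ only in the HAVING condition: taxa found in more than one sample versus taxa found in exactly one. As a cross-check, the same split can be made in R once the sample/taxon pairs are in a data frame. The sketch below uses made-up pairs; the column names mirror the query variables, everything else is an illustrative assumption.

```r
# Made-up sample/taxon pairs (duplicates removed, as SELECT DISTINCT would)
sample_taxa <- unique(data.frame(
  sampleName = c("S1", "S2", "S2", "S3", "S1"),
  taxonName  = c("A", "A", "B", "C", "A"),
  stringsAsFactors = FALSE
))

# Number of distinct samples in which each taxon occurs
n_samples <- tapply(sample_taxa$sampleName, sample_taxa$taxonName,
                    function(s) length(unique(s)))

# Taxa seen in more than one sample versus taxa unique to a single sample
shared_taxa <- names(n_samples)[n_samples > 1]
unique_taxa <- names(n_samples)[n_samples == 1]
shared   <- sample_taxa[sample_taxa$taxonName %in% shared_taxa, ]
unshared <- sample_taxa[sample_taxa$taxonName %in% unique_taxa, ]
print(shared)
print(unshared)
```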


Possible ASV taxonomic classification hits at genus level

query <- paste0("PREFIX gbol: <http://gbol.life/0.1/>
                 SELECT DISTINCT ?sampleName ?taxName ?ratio
                 WHERE { 
                  ?library gbol:sample ?sample .
                  ?sample gbol:name ?sampleName .
                  ?sample gbol:asv ?asv .
                  ?asv gbol:masterASVId ?asvid .
                  ?asv gbol:asvAssignment ?possibleAssignment .
                  FILTER (contains(str(?possibleAssignment), \"Level/4\"))
                  ?possibleAssignment gbol:taxon ?tax .
                  ?tax gbol:taxonName ?taxName .
                  ?possibleAssignment gbol:ratio ?ratio .
                } ORDER BY ?sampleName ?asvid DESC(?ratio)")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)


Taxonomic assignment details

Number of ASVs and the sum of the reads assigned to each taxon, per sample.

query <- paste0("PREFIX gbol: <http://gbol.life/0.1/>
                SELECT ?sampleName ?taxonName (COUNT(?asv) AS ?asvs) (SUM(?clusteredReadCount) AS ?totalCount)
                WHERE { 
                  ?library a gbol:Library .
                  ?library gbol:sample ?sample .
                  ?sample gbol:name ?sampleName .
                  ?sample gbol:asv ?asv .
                  ?asv gbol:assignedTaxon ?assignedTaxon .
                  ?asv gbol:clusteredReadCount ?clusteredReadCount .
                  ?assignedTaxon gbol:taxonName ?taxonName .
                } GROUP BY ?sampleName ?taxonName
                  ORDER BY ?sampleName DESC(?totalCount)")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)


Taxonomic classification comparison between databases

query <- paste0("PREFIX gbol: <http://gbol.life/0.1/>
                SELECT ?fseq ?rseq ?taxName ?db 
                WHERE { 
                  ?library gbol:provenance ?prov .
                  ?prov gbol:annotation ?annot .
                  ?annot gbol:refdb ?db .
                  ?library gbol:sample ?sample .
                  ?sample gbol:asv ?asv .
                  ?asv gbol:forwardASV ?fasv .
                  ?fasv gbol:sequence ?fseq .
                  ?asv gbol:reverseASV ?rasv .
                  ?rasv gbol:sequence ?rseq .
                  ?asv gbol:assignedTaxon ?tax .
                  ?tax gbol:taxonName ?taxName .
                } ORDER BY ?fseq ?rseq")
  dfGraph <- SPARQL(endpoint, query)
  datatable(dfGraph$results)

Laboratory of Systems and Synthetic Biology - Wageningen University & Research