Originally posted on the NVIDIA Developer blog at: https://developer.nvidia.com/blog/training-a-text2sparql-model-with-mk-squit-and-nemo/?ncid=so-twit-98177#cid=dl13_so-twit_en-us

Across several verticals, question answering (QA) is one of the fastest ways to deliver business value using conversational AI. Stated somewhat informally, question answering is the task of mapping a user query to an answer from a given context. This open-ended definition is best understood through a simple example:

Question: Where are the headquarters of NVIDIA?
Context: “NVIDIA Corporation (en-VID-ee-ə) is an American multinational technology company incorporated in Delaware and based in Santa Clara, California. It designs graphics processing units (GPUs) for the gaming and professional markets, as well as system on a chip units (SoCs) for the mobile computing and automotive market. Its primary GPU product line, labeled "GeForce", is in direct competition with Advanced Micro Devices’ (AMD) "Radeon" products. NVIDIA expanded its presence in the gaming industry with its handheld Shield Portable, Shield Tablet, and Shield Android TV and its cloud gaming service GeForce Now.
In addition to GPU manufacturing, NVIDIA provides parallel processing capabilities to researchers and scientists that allow them to efficiently run high-performance applications. They are deployed in supercomputing sites around the world. More recently, it has moved into the mobile computing market, where it produces Tegra mobile processors for smartphones and tablets as well as vehicle navigation and entertainment systems. In addition to AMD, its competitors include Intel and Qualcomm.”
Answer: Santa Clara, California

The example is easy to answer with a deep learning model. The context, pulled from a Wikipedia extract, is used by the model to find the span that corresponds to the answer. There is a wide array of open datasets and models for this task, the most frequently used dataset being SQuAD (Stanford Question Answering Dataset). Oftentimes, this is called "open-domain" question answering.

A great example of applying a state-of-the-art QA model to this task is already provided by NVIDIA NeMo. However, it assumes that you already have the context. There is a lot of great R&D being done on this “first phase” of collecting the contexts for a given question, such as dense passage retrieval (DPR). But what if you don’t have a dataset that looks like the material on which these open-domain retrieval models are trained (namely, Wikipedia)? What if you just have a collection of facts that you want to answer questions about?

In this post, we cover an alternative approach towards QA that instead relies on a neural machine translation model to extract answers from a knowledge graph.

Tackling QA with knowledge graphs

Knowledge graphs are flexible and dense structures that allow rich representations of connected data. For example, Wikidata, a popular knowledge graph, contains nearly 100M items.

The downside is that searching through a complex graph requires SPARQL or another graph-oriented query language. This is far removed from how users naturally ask for facts, and doubly so when they expect to use plain natural language, as they would with the question, "Where are the headquarters of NVIDIA?".

Thanks to advancements in neural machine translation models, we can consider this problem as a form of translation from natural language to query language. As we are converting specifically to the SPARQL query language, we refer to the task as Text2SPARQL.

Properly annotated data can be costly

In order to train a state-of-the-art query translation model, thousands of properly annotated, often hand-labeled, natural language to query pairs are required. This can be costly and take weeks or months to accomplish. Moreover, as a knowledge graph evolves with new content, the translation model will require additional hand-labeled examples, which slows down quick iterations in production. This challenge is what motivates many to move towards the open domain QA approach. However, if you can synthetically generate a dataset, then these problems disappear.

Introducing MK-SQuIT

To generate the dataset, why not automate as many components as possible, injecting semantic rules into the process to produce a high-quality dataset? In our recent preprint, we introduced MK-SQuIT (Synthesizing Questions using Iterative Template-filling). It is a nearly fully automated, open-source generation framework for creating synthetic English to SPARQL query pairs.

Figure 1. MK-SQuIT generation pipeline.

Our dataset generation library pulls entities and predicates from a given RDF (Resource Description Framework) knowledge base and assembles them into natural language questions. Simultaneously, a graph query is constructed using the semantics tied to the predicates in the knowledge base. We use the simple predicate-argument structure of interrogative sentence types to generate natural language questions paired with their respective SPARQL queries.

English: Who is Alberto Bueno?
SPARQL: SELECT ?end WHERE { BIND ( [ Alberto Bueno ] as ?end ) . }

English: Is the country of origin of Die Hard the country of origin of Tap?
SPARQL: ASK { [ Die Hard ] wdt:P495 ?end . [ Tap ] wdt:P495 ?end . }

English: How many native language does the dad of the exact match of Judy Garland have?
SPARQL: SELECT ( COUNT ( DISTINCT ?end ) as ?endcount ) WHERE { [ Judy Garland ] wdt:P2888 / wdt:P22 / wdt:P103 ?end . }
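
As a rough illustration of how a question and its query can be built in lockstep from one chain of predicates, here is a minimal sketch. It is not the MK-SQuIT implementation: the Prop class, the fill_single_entity function, and the single hard-coded "What is ... ?" template are hypothetical names and simplifications made for this example. Only the bracketed-entity query format and the property IDs are taken from this post.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Prop:
    """A predicate with its Wikidata property ID and an English label (hypothetical helper)."""
    pid: str
    label: str


def fill_single_entity(entity: str, chain: Sequence[Prop]) -> Tuple[str, str]:
    """Fill one hard-coded template; `chain` is ordered from the entity outward."""
    # English nests the outermost property first: "the height of the creator of X".
    phrase = entity
    for prop in chain:
        phrase = f"the {prop.label} of {phrase}"
    question = f"What is {phrase}?"

    # The SPARQL property path follows the chain in the same entity-outward order.
    path = " / ".join(f"wdt:{prop.pid}" for prop in chain)
    query = f"SELECT ?end WHERE {{ [ {entity} ] {path} ?end . }}"
    return question, query


question, query = fill_single_entity(
    "Getica", [Prop("P50", "creator"), Prop("P2048", "height")]
)
print(question)
print(query)
```

Because both strings are derived from the same chain, the question and the query cannot drift apart, which is the property that makes synthetic pairs usable as training data.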

This approach offers several key benefits:

  • The synthetic dataset generation can provide large datasets with little human input.
  • Generating datasets synthetically can guarantee certain standards of quality, variability, and query validity.
  • Generated queries match the level of syntactic and semantic rigor of the source knowledge base.
  • After setup, generation is extremely fast and can be run in a matter of minutes on a modern laptop.

We have provided a sample dataset generated from Wikidata consisting of 110K queries (100K training, 10K testing). Our code is also extensible to other natural languages and graph query languages, allowing for use with most graph databases.

The following sections provide a more detailed explanation of the entire process. To reproduce the results shown in the following sections, we recommend using the Docker container from the NGC catalog. It contains tutorial notebooks for all the steps shown in this post as well as an interactive dataset explorer presented with TensorFlow Projector. For more information about the generation process itself, see the MeetKai/MK-SQuIT GitHub repo.

Creating a dataset with MK-SQuIT

To begin, download the generation source code:

git clone https://github.com/MeetKai/MK-SQuIT.git

First, you must extract raw entity and property data from a database, such as Wikidata. An entity refers to a node and its metadata, and a property refers to relational attributes. These raw values are saved to JSON files that follow specific naming patterns: *-5k.json for entities and *-props.json for properties.

Set some variables in your terminal:

DATA_DIR="./data"  # Set data path
OUT_DIR="./out"  # Set output path
ENTITY_ID="*-5k.json"  # Glob identifier for entity data -> {domain}-5k.json
PROPERTY_ID="*-props.json"  # Glob identifier for property data -> {domain}-props.json
PREPROCESSED_ENT_ID="*-5k-preprocessed.json"  # Glob identifier for preprocessed entity data
PREPROCESSED_PROP_ID="*-props-preprocessed.json"  # Glob identifier for preprocessed property data
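
To see how these four glob identifiers partition the files in the data directory, the following minimal sketch checks each pattern against a few file names. The "chemical" domain prefix and the classify helper are assumptions made for illustration; only the patterns themselves come from the variables above.

```python
from fnmatch import fnmatchcase

# The four glob identifiers from the variables above.
PATTERNS = {
    "raw entities": "*-5k.json",
    "raw properties": "*-props.json",
    "preprocessed entities": "*-5k-preprocessed.json",
    "preprocessed properties": "*-props-preprocessed.json",
}

# Hypothetical file names for an assumed "chemical" domain.
FILES = [
    "chemical-5k.json",
    "chemical-props.json",
    "chemical-5k-preprocessed.json",
    "chemical-props-preprocessed.json",
]


def classify(filename: str) -> list:
    """Return every role whose glob pattern matches the file name."""
    return [role for role, pattern in PATTERNS.items() if fnmatchcase(filename, pattern)]


for name in FILES:
    print(f"{name}: {classify(name)}")
```

Each file matches exactly one pattern, so the raw and preprocessed files can live side by side in the same directory without being confused.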

Run a script for GET requests that saves the JSON responses:

python -m scripts.gather_wikidata --data-dir $DATA_DIR

Raw data must then be cleaned and annotated before being fed into the pipeline. Most importantly, this step converts property labels into part-of-speech tags to assist with coherent mapping within a template. For example, the property set in location would be labeled as VERB_NOUN allowing the pipeline to easily insert the property within the context of a sentence. A typing field is also added to each property in the format [domain] ->.

python -m scripts.preprocess --data-dir $DATA_DIR \
	--ent-id $ENTITY_ID \
	--prop-id $PROPERTY_ID

You should now have several files:

  • *-5k-preprocessed.json and *-props-preprocessed.json: Preprocessed entity and property values
  • pos-examples.txt: Samples of part-of-speech tags that are sorted by the number of occurrences within the data. This is an optional file used to help with template generation (if you wish to write your own templates).

The unfilled typing field must now be annotated to include [domain] -> [type]. To understand how the type should be specified, we briefly cover its purpose. When you are considering how to link properties together, consider the types of properties or entities to which they would refer. For example, "location of" and "location at" refer to a place, while "build during" and "created at" refer to a time. To a certain extent, these typing labels are subjective, but they allow the pipeline to string together much more coherent statements.
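
One way to picture why the typing labels help is to treat each annotation as a small function signature and only chain properties whose types line up. The sketch below is an assumption about how such a check could work, not the pipeline's actual logic; parse_typing, chain_is_coherent, and the type names (literary_work, person, place, time) are all hypothetical.

```python
from typing import List, Tuple


def parse_typing(typing: str) -> Tuple[str, str]:
    """Split a '[domain] -> [type]' annotation into its two parts."""
    domain, _, out_type = typing.partition("->")
    return domain.strip(), out_type.strip()


def chain_is_coherent(typings: List[str]) -> bool:
    """A chain reads coherently when each property's output type is the next one's domain."""
    parsed = [parse_typing(t) for t in typings]
    return all(prev[1] == nxt[0] for prev, nxt in zip(parsed, parsed[1:]))


# "the birthplace of the mother of the author of <book>" lines up type by type.
print(chain_is_coherent(["literary_work -> person", "person -> person", "person -> place"]))
# An author (person) cannot feed a property that expects a place.
print(chain_is_coherent(["literary_work -> person", "place -> time"]))
```

Under this view, a subjective but consistent choice of type names is enough: what matters is that properties meant to follow one another share the same label.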

After the preprocessed property values have been labeled, you can aggregate their metadata into a type-list-autogenerated.json file. This is used for the pipeline and is the last requirement before generating the dataset.

python -m scripts.generate_type_list \
	--data-dir $DATA_DIR \

At this point, you are ready to generate the dataset with the following files:

  • Entity data: *-5k-preprocessed.json
  • Property data: *-props-preprocessed.json
  • Type metadata list: type-list-autogenerated.json

Run the following command:

python -m mk_squit.generation.full_query_generator \
	--data-dir $DATA_DIR \
	--out-dir $OUT_DIR

You now have the generated dataset, with data like the following:

English: What is the height of Getica's creator?
SPARQL: SELECT ?end WHERE { [ Getica ] wdt:P50 / wdt:P2048 ?end . }

We’d also like to note that the addition of domains, such as corporations, is not difficult. Raw data must be gathered, as in the example, and some care must be taken to annotate the types. The data can then be fed to the same scripts to easily repurpose the dataset.

Fine-tuning a Text2SPARQL model with NeMo

The final step is to train an actual model to perform the translation task. Using NeMo, you can quickly fine-tune a baseline model with the added bonuses of being able to easily tune hyperparameters and effortlessly scale out to multiple GPUs.

For the example dataset, we found that using a pretrained BART model to initialize the weights produces solid predictions. Note that the syntax of our SPARQL queries is an extension of regular SPARQL. In our case, we omit entity IDs from the training set and relegate entity resolution to a post-processing step. As such, predicted queries contain entities in their natural language form rather than as an entity ID. To resolve entities, we’ve provided a simple example of fuzzy text matching at the end of this section.

For optimal generation time, the model is configured to use a greedy search. Results are then passed through minor post-processing and evaluated on BLEU and ROUGE metrics.
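
To give a feel for what a ROUGE-style score measures on query strings, here is a minimal sketch of a unigram-overlap F-measure (ROUGE-1) over whitespace tokens. It is a simplified illustration and not the scoring code behind the numbers reported in this post; the rouge1_f function and the example strings are assumptions.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure between two whitespace-tokenized strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "SELECT ?end WHERE { [ Getica ] wdt:P50 / wdt:P2048 ?end . }"
prediction = "SELECT ?end WHERE { [ Getica ] wdt:P50 ?end . }"  # drops one hop of the path
print(rouge1_f(reference, reference))
print(rouge1_f(prediction, reference))
```

Token-overlap scores reward queries that are nearly right, but a query with a single wrong property still returns the wrong answer, so these metrics are best read alongside exact-match or execution checks.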

To train and evaluate the network locally, you can run several shell scripts that download and execute tutorial code from NeMo.

Clone the MK-SQuIT repository if you haven't already:

git clone https://github.com/MeetKai/MK-SQuIT.git
cd MK-SQuIT/
pip install -r requirements.txt
cd model/

Install NeMo and download the example scripts:


[Optional] Update the train and evaluation parameters inside this script and execute it to set environment variables:


Download and reformat example data, then train:


Generate predictions and scores:


And that’s it!

Here are some predictions from the model:

Query type: Single-Entity
Question: Who is the mother of the director of Pulp Fiction?
Query: SELECT ?end WHERE { [ Pulp Fiction ] wdt:P5 / wdt:P25 ?end . }

Query type: Multi-Entity
Question: Is John Steinbeck the author of Green Eggs and Ham?
Query: ASK { BIND ( [ John Steinbeck ] as ?end ) . [ Green Eggs and Ham ] wdt:P50 ?end . }

Query type: Count
Question: How many awards does the producer of Fast and Furious have?
Query: SELECT ( COUNT ( DISTINCT ?end ) as ?endcount ) WHERE { [ Fast and Furious ] wdt:P162 / wdt:P166 ?end . }

And some common metrics across the two test datasets, test-easy and test-hard:

test-easy 0.98841 0.99581 0.99167 0.99581 0.71521
test-hard 0.59669 0.78746 0.73099 0.78164 0.48497

BART performs nearly flawlessly on the easy test set and there is an expected dip in scores for the hard set (which contains more complex logical requirements, noisy perturbations, and additional unseen slot domains). The key is then to extend the dataset with real data that comes in when you have a Text2SPARQL system responding to actual user queries.

Entity resolution

To complete the demonstration, we must apply entity resolution, a process where natural language entity names are converted to their Wikidata IDs. This can be easily performed with a fuzzy text matcher, such as rapidfuzz. As an example, a utility class is included in the MK-SQuIT repo and can be used as follows:

import requests
from mk_squit.utils.entity_resolver import EntityResolver

resolver = EntityResolver(data_dir="./data")  # Load entity data
example_prediction = "SELECT ?end WHERE { [1,1,1-trifluoro-2-chloro-2-bromoethane] wdt:P31 ?end . }"
example_query = resolver.resolve(example_prediction)
# SELECT ?end WHERE { wd:Q32921 wdt:P31 ?end . }

url = "https://query.wikidata.org/sparql"
response = requests.get(url, params={"format": "json", "query": example_query})
print(response.json())

Executing the query returns five results, which tell you that 1,1,1-trifluoro-2-chloro-2-bromoethane is the name of a chemical compound used in some medications. Keep in mind that not all queries return a result, as the data or properties may not exist within Wikidata.
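
For readers who want to see the fuzzy-matching idea without the repo's utility class, the following is a minimal sketch. It uses difflib from the Python standard library rather than rapidfuzz, and it is not the EntityResolver implementation. The LABEL_TO_ID table, the resolve_entities function, and the 0.8 cutoff are assumptions for illustration; the only real ID is Q32921, taken from the example above.

```python
import difflib
import re

# Tiny stand-in for an entity table. Only the first ID comes from this post.
LABEL_TO_ID = {
    "1,1,1-trifluoro-2-chloro-2-bromoethane": "Q32921",
    "placeholder entity": "Q0",  # made-up entry that gives the matcher a second candidate
}

# Matches "[ name ]" or "[name]" and captures the name without padding.
BRACKETED = re.compile(r"\[\s*(.+?)\s*\]")


def resolve_entities(query: str, cutoff: float = 0.8) -> str:
    """Replace each bracketed entity name with the ID of its closest known label."""

    def swap(match):
        name = match.group(1)
        close = difflib.get_close_matches(name, list(LABEL_TO_ID), n=1, cutoff=cutoff)
        # Leave the bracketed name untouched when nothing is similar enough.
        return f"wd:{LABEL_TO_ID[close[0]]}" if close else match.group(0)

    return BRACKETED.sub(swap, query)


# The predicted name is missing its final letter, as a model output might be.
prediction = "SELECT ?end WHERE { [ 1,1,1-trifluoro-2-chloro-2-bromoethan ] wdt:P31 ?end . }"
resolved = resolve_entities(prediction)
print(resolved)
```

Leaving unmatched names in brackets, rather than guessing, makes failed resolutions easy to detect before a query is sent to the endpoint.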


Synthetic natural-language-to-query datasets have much room for improvement, but we believe they have the potential to revolutionize the accessibility of knowledge graphs. We hope you’ve found this post helpful and perhaps even found some inspiration from it. Feel free to check out other things we work on at MeetKai.com or read more about us at our blog.

About the Authors

James Kaplan

James Kaplan is the co-founder and CEO of MeetKai and oversees core R&D. He previously ran a quantitative hedge fund that used GPUs to trade options. He is a drop-out from Harvey Mudd College where he pursued a BS in computer science.

Vincent Cheong

Vincent Cheong is an applied ML and backend engineer at MeetKai Inc. He received his M.S. in computer science from California State University, Long Beach and his B.S. in electrical engineering from the University of California, Los Angeles.

Benjamin A. Spiegel

Benjamin A. Spiegel was a computational linguistics intern at MeetKai during the summer of 2020. He is completing an ScB degree in computers and minds at Brown University and will graduate in 2021.