Towards Knowledge Graph Creation from Thousands of Wikis
Sven Hertling
Heiko Paulheim
Alexandra Hofmann
Samresh Perchani
Jan Portisch

While popular knowledge graphs such as DBpedia and YAGO are built from Wikipedia, Wikifarms like Fandom contain Wikis for specific topics, which are often complementary to the information contained in Wikipedia, and thus to DBpedia and YAGO. Extracting these Wikis with the DBpedia extraction framework is possible, but results in many isolated knowledge graphs. In this paper, we show how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis. We perform entity resolution and schema matching, and show that the resulting large-scale knowledge graph is complementary to DBpedia.

Linked Open Data Endpoint

We provide a Linked Data endpoint using dereferenceable URIs. To browse the LOD endpoint, start, e.g., from the concept Harry Potter.
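
Dereferenceable URIs are commonly resolved via HTTP content negotiation. The following is a minimal sketch of that pattern, assuming the endpoint honours an `Accept` header for Turtle. The function names and the URI `http://example.org/resource/Harry_Potter` are placeholders for illustration, not the actual DBkWik address.

```python
from urllib.request import Request, urlopen


def build_dereference_request(resource_uri: str, mime_type: str = "text/turtle") -> Request:
    """Build an HTTP GET request that asks for an RDF serialization of a resource."""
    return Request(resource_uri, headers={"Accept": mime_type})


def fetch_rdf(resource_uri: str, mime_type: str = "text/turtle") -> str:
    """Dereference the URI and return the response body as text (needs network access)."""
    request = build_dereference_request(resource_uri, mime_type)
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


# Placeholder URI (assumption): replace with a real resource URI of the endpoint.
example_request = build_dereference_request("http://example.org/resource/Harry_Potter")
```

Calling `fetch_rdf` with a real resource URI would return the description of that resource, provided the server supports the requested media type.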

SPARQL Endpoint

The SPARQL endpoint is available at /sparql.
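
As an illustration, a query can be sent to such an endpoint following the standard SPARQL 1.1 Protocol (HTTP GET with a `query` parameter). This is a sketch only: the host `http://example.org` is a placeholder, the helper names are hypothetical, and it is assumed that the endpoint returns results as `application/sparql-results+json`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# A generic query that works on any RDF dataset; it makes no assumption about the schema.
EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(endpoint_url: str, query: str) -> Request:
    """Encode the query as a GET parameter and ask for JSON results."""
    url = endpoint_url + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_select(endpoint_url: str, query: str) -> list:
    """Execute a SELECT query and return the list of result bindings (needs network access)."""
    with urlopen(build_sparql_request(endpoint_url, query), timeout=60) as response:
        results = json.load(response)
    return results["results"]["bindings"]


# Placeholder host (assumption): replace with the host that serves /sparql.
example_request = build_sparql_request("http://example.org/sparql", EXAMPLE_QUERY)
```

Each binding returned by `run_select` maps a variable name such as `s` to a dictionary with `type` and `value` keys, as defined by the SPARQL JSON results format.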

Dataset Description

The VOID file is located at
The dataset is also described at datahub with the name dbkwik. The prefix is also dbkwik.

The whole approach is shown below:

[Figure: overview of the whole DBkWik approach]

The distribution of topics, hubs and languages of all wikis contained in this endpoint:

Dataset Statistics

The following table shows some basic statistics of the overall dataset:

Typed instances  1,372,971
Avg. indegree    0.703
Avg. outdegree   8.169
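
The page does not state how the degree figures were computed. The sketch below shows one plausible definition, stated here as an assumption: the outdegree of an instance counts all triples in which it is the subject (including literal-valued ones), and the indegree counts triples in which it appears as a resource-valued object. The function name and the tuple layout are hypothetical.

```python
from collections import Counter


def average_degrees(triples, instances):
    """Return (avg_indegree, avg_outdegree) over the given instances.

    triples: iterable of (subject, predicate, object, object_is_resource).
    instances: collection of node identifiers to average over.
    """
    outdegree = Counter()
    indegree = Counter()
    for subject, _predicate, obj, object_is_resource in triples:
        outdegree[subject] += 1          # every statement counts as an outgoing edge
        if object_is_resource:
            indegree[obj] += 1           # only links to resources count as incoming edges
    n = len(instances)
    if n == 0:
        return 0.0, 0.0
    avg_in = sum(indegree[i] for i in instances) / n
    avg_out = sum(outdegree[i] for i in instances) / n
    return avg_in, avg_out


# Toy data for illustration only, not taken from the dataset.
toy_triples = [
    ("a", "knows", "b", True),
    ("a", "name", "Alice", False),
    ("b", "name", "Bob", False),
]
toy_result = average_degrees(toy_triples, {"a", "b"})
```

Under this definition, an average outdegree well above the average indegree would indicate that most statements are literal-valued or point to nodes outside the set of instances.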

Data Dumps

The following versions of the dataset are available:

Date        Version  No. of input wikis  Release notes
2018-04-01  1.1      12,840              Introduces data fusion and lightweight schema induction (download)
2018-01-31  1.0      12,840              First version of the dataset (download)
2017-07-21  -        248                 Proof of concept (download)

Crowdsourcing results

We have crowdsourced two gold standards, one for the mapping between DBkWik and DBpedia, one for matching instances inside DBkWik.

Survey template for interwiki mapping (preview - source)
The resulting gold standards are provided in the alignment format (see the Alignment API):
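
For readers unfamiliar with the alignment format, the following sketch reads correspondences from an RDF/XML alignment file using only the Python standard library. The sample document and its example.org URIs are made up for illustration. The parser matches elements by local name, which is a simplification chosen here so that it does not depend on the exact namespace spelling.

```python
import xml.etree.ElementTree as ET

RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"

# Hypothetical sample in the alignment format; the URIs are placeholders.
SAMPLE = """<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Alignment>
    <map>
      <Cell>
        <entity1 rdf:resource="http://example.org/a/Harry_Potter"/>
        <entity2 rdf:resource="http://example.org/b/Harry_Potter"/>
        <relation>=</relation>
        <measure>1.0</measure>
      </Cell>
    </map>
  </Alignment>
</rdf:RDF>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_alignment(xml_text: str) -> list:
    """Return a list of (entity1, entity2, relation, confidence) tuples."""
    cells = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "Cell":
            continue
        fields = {}
        for child in element:
            name = local_name(child.tag)
            if name in ("entity1", "entity2"):
                fields[name] = child.get(RDF_RESOURCE)
            elif name in ("relation", "measure"):
                fields[name] = (child.text or "").strip()
        measure = float(fields["measure"]) if fields.get("measure") else None
        cells.append((fields.get("entity1"), fields.get("entity2"),
                      fields.get("relation"), measure))
    return cells


correspondences = parse_alignment(SAMPLE)
```

The resulting tuples can be compared against the output of a matcher, for example as sets of `(entity1, entity2)` pairs, to compute precision and recall.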

Code Repository

The code repository with all results is hosted on GitHub:

This dataset uses material from multiple wikis at FANDOM and is licensed under the Creative Commons Attribution-Share Alike License. See also the license page at Fandom.