2. RDF Rule Mining
• RDFS/OWL rules are given as schema.
• Schema-free datasets might have an implicit schema.
• Why “mining”? Because rules are not visible in data.
2
4. Motivation/2
Markov Logic Networks.
Given a collection of first-order
statements (evidence) and a set
of weighted first-order rules,
build an undirected weighted
graph where nodes are
statements and edges indicate
dependency.
4
7. Horn Clauses
A Horn clause is a clause (a disjunction of literals patterns)
with at most one positive, i.e. unnegated, literal pattern.
p ∨ ¬q1 ∨ ¬q2 ∨ ... ∨ ¬qn
p ← q1 ∧ q2 ∧ ... ∧ qn
head body
p(x,y)
7
8. Confidence score (weight)
The confidence score of a rule is defined as the rate of
the occurrences of head and body together
over
the occurrences of the body.
p ← q1 ∧ q2 ∧ ... ∧ qn
head body
8
9. The HORN CONCERTO approach
P(A | B) =
P(A∩ B)
P(B)
Bayes’ Theorem
p ← q1 ∧ q2 ∧ ... ∧ qn
Horn Clause
Event-based confidence score
P(p |
!
q) =
P(p ∩
!
q)
P(
!
q)
≈
p ∧
!
q{ }
!
q{ }
9
10. Rules with
p ∧
!
q{ }
!
q{ }
SELECT ?p (COUNT(*) AS ?c) WHERE {
?x ?p ?y .
?x <target_q> ?y .
FILTER(?p != <target_q> )
} GROUP BY ?p
SELECT ?q (COUNT(*) AS ?c) WHERE {
[] ?q []
} GROUP BY ?q
!
q = 1
p(x, y) ← q(x, y)
10
11. Rules with
p ∧
!
q{ }
!
q{ }
SELECT ?q ?r (COUNT(*) AS ?c) WHERE {
?x ?q ?z .
?z ?r ?y .
?x <target_p> ?y
} GROUP BY ?q ?r
SELECT (COUNT(*) AS ?c) WHERE {
?x <target_q> ?z .
?z <target_r> ?y
}
!
q = 2
p(x, y) ← q(x, z) ∧ r(z, y)
11
12. Optimizations
• Select only top N properties.
• Order by descending score and prune when it’s lower
than a threshold T.
• Cache scores in-memory, as there might exist p1, p2
such that: pi(x,y) ← q(?,?), r(?,?).
• Parallelize algorithm by rule type.
12
13. Evaluation Setup
• 8 CPUs, 32 GB RAM, Ubuntu 16.04
• Scalability study
DBpedia Person (7 million triples)
DBpedia 2016-04 (397 million triples)
FUTURE
• Rule effectiveness for link prediction
FB15k (592 thousand triples)
WN18 (151 thousand triples)
• Rule quality (human judgment?)
13
14. Preliminary results – DBpedia Person
Runtime (s) # Rules Used RAM (GB)
AMIE+ > 10 days 6,337 4
Ontological PF > 3 hours > 1,000 4
HORN CONCERTO
single-thread
1.7 hours 3,125
client: 0.2
server: 1.0
14
16. Discussion
• AMIE+
Cons: indexes the graph in-memory.
• Ontological PF
Pros: Very fast
Cons: Relies on schema data (types domain, range)
• Horn Concerto
Pros: Works with SPARQL endpoint, fast also single-threaded,
may be able to overperform Ontological PF in RDF datasets with
available schema.
16