Knowledge Graph Models
At the foundation of any knowledge graph is the principle of first applying a graph abstraction to data, resulting in an initial data graph. We now discuss a selection of graph-structured data models that are commonly used in practice to represent data graphs. We then discuss the primitives that form the basis of graph query languages used to interrogate such data graphs.
RDF.
A standardised data model based on directed edge-labelled graphs is the Resource Description Framework (RDF), which has been recommended by the W3C. The RDF model defines different types of nodes: Internationalized Resource Identifiers (IRIs), which allow for global identification of entities on the Web; literals, which allow for representing strings (with or without language tags) and other datatype values (integers, dates, etc.); and blank nodes, which are anonymous nodes that are not assigned an identifier (for example, rather than creating internal identifiers in RDF, we have the option to use blank nodes).
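As a rough illustration of the three node types, the sketch below models RDF terms as small Python classes and a graph as a set of triples. It uses no RDF library; the class names, the `http://example.org/` IRIs, and the data are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IRI:
    value: str  # globally unique identifier


@dataclass(frozen=True)
class Literal:
    lexical: str                    # the written form of the value
    datatype: Optional[str] = None  # datatype IRI, e.g. an XSD type
    language: Optional[str] = None  # language tag for strings


@dataclass(frozen=True)
class BlankNode:
    label: str  # only meaningful inside this graph, not a global identifier


EX = "http://example.org/"                  # hypothetical namespace
XSD = "http://www.w3.org/2001/XMLSchema#"

# A blank node stands for "Rosa's residence" without minting an IRI for it.
residence = BlankNode("b0")

triples = {
    (IRI(EX + "Rosa"), IRI(EX + "name"), Literal("Rosa", language="en")),
    (IRI(EX + "Rosa"), IRI(EX + "residence"), residence),
    (residence, IRI(EX + "city"), IRI(EX + "Berlin")),
    (residence, IRI(EX + "since"), Literal("2020", datatype=XSD + "integer")),
}


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}


cities = objects(residence, IRI(EX + "city"))
```

Here `cities` contains the single IRI for Berlin, reached through the anonymous residence node.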
Property graphs.
Property graphs were introduced to provide additional flexibility when modelling more complex relations. A property graph allows a set of property–value pairs and a label to be associated with both nodes and edges. Property graphs are most prominently used in popular graph databases, such as Neo4j. In choosing between graph models, it is important to note that property graphs can be translated to/from directed edge-labelled graphs and/or graph datasets without loss of information. In summary, directed edge-labelled graphs offer a more minimal model, while property graphs offer a more flexible one. Often the choice of model will be secondary to other practical factors, such as the implementations available for each model.
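To make the translation claim more concrete, here is a minimal sketch of one possible encoding of a property graph as plain triples, in which each edge becomes a node of its own. The identifiers (`n1`, `e1`), the `prop:` prefix, and the `label`/`source`/`target` predicate names are assumptions made for this example, not a standard mapping.

```python
# A tiny property graph: nodes and edges both carry a label and properties.
nodes = {
    "n1": {"label": "Person", "props": {"name": "Rosa"}},
    "n2": {"label": "Place", "props": {"city": "Berlin", "country": "DE"}},
}
edges = {
    "e1": {"label": "LIVES_IN", "source": "n1", "target": "n2",
           "props": {"since": 2020}},
}


def to_triples(nodes, edges):
    """Encode the property graph as (subject, predicate, object) triples."""
    triples = set()
    for node_id, node in nodes.items():
        triples.add((node_id, "label", node["label"]))
        for key, value in node["props"].items():
            # The prefix keeps property keys apart from structural predicates.
            triples.add((node_id, "prop:" + key, value))
    for edge_id, edge in edges.items():
        # The edge is turned into a node so that its properties have a subject.
        triples.add((edge_id, "label", edge["label"]))
        triples.add((edge_id, "source", edge["source"]))
        triples.add((edge_id, "target", edge["target"]))
        for key, value in edge["props"].items():
            triples.add((edge_id, "prop:" + key, value))
    return triples


triples = to_triples(nodes, edges)
```

Every label, endpoint, and property of the original graph appears as a triple, which is what makes the reverse translation possible under this encoding.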
The Cypher Query Language
For most knowledge graph developers (and even some users), the majority of interactions with the database will be via the Cypher query language. Cypher is a declarative, pattern-matching query language, originally created by Neo4j but now implemented by several other systems. At the time of writing, it is being standardized by the International Organization for Standardization (ISO) as GQL (Graph Query Language, informally “SQL for Graphs”).
- Creating Data in a Knowledge Graph.
Example 1. Using Cypher’s CREATE keyword to insert a subgraph
CREATE (:Person {name:'Rosa'})-[:LIVES_IN {since:2020}]->(:Place {city:'Berlin', country:'DE'})
In Example 1 the graph structure is very clear. Nodes are represented by parentheses (), which look like circles in ASCII art. Next, there are labels for the nodes like :Place and :Person, which indicate the role those nodes play in the graph (places and people in this case). Nodes can have zero or more of these labels.
N.B. When storing data, every relationship must have a direction. The direction makes the model unambiguous: it records that Rosa lives in Berlin, rather than Berlin living in Rosa.
- Avoiding Duplicates When Enriching a Knowledge Graph.
Sometimes you do want to CREATE new records, but often you don’t want duplicates in your data. Because CREATE always inserts new records, you’ll use Cypher’s MERGE keyword instead, which inserts records only if the entirety of the supplied pattern does not already exist.
Example 2. Create a graph where Karl lives in London.
MERGE (:Person {name:'Karl', age:64})-[:LIVES_IN {since:1980}]->(:Place {city:'London', country:'UK'})
Since there are no matching records in the database, this MERGE acts like a CREATE, and as you would expect, two nodes and a single relationship are persisted in the knowledge graph.
Example 3. Now you can see what happens when you MERGE Fred, who also happens to live in London. Suppose you type:
MERGE (:Person {name:'Fred'})-[:LIVES_IN]->(:Place {city:'London', country:'UK'})
You might expect the database to create a new node to represent Fred and connect it via a new LIVES_IN relationship to the existing node representing London. But, surprisingly, that is not what happens. MERGE is subtle. Its semantics are a mix of MATCH and CREATE insofar as it will either match a whole pattern or create new records that match the pattern in its entirety. It will never partially MATCH and partially CREATE a pattern. Because no existing path matches the whole Fred-LIVES_IN-London pattern, MERGE creates all of it: a new Fred node, a new LIVES_IN relationship, and a second London node.
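The all-or-nothing behaviour of MERGE can be imitated with a toy in-memory graph. The sketch below is only a simplified model, not how any database implements MERGE: it handles a single node-relationship-node pattern, gives each node one label, and every function name in it is hypothetical.

```python
nodes = []  # each node is {"label": str, "props": dict}
rels = []   # each relationship is (start_index, type, props, end_index)


def props_match(wanted, actual):
    """True if every wanted property is present with the same value."""
    return all(actual.get(key) == value for key, value in wanted.items())


def node_matches(pattern, node):
    label, props = pattern
    return node["label"] == label and props_match(props, node["props"])


def merge_path(start, rel_type, rel_props, end):
    """Match the whole pattern, or create the whole pattern. Never a mix."""
    for s, t, p, e in rels:
        if (t == rel_type and props_match(rel_props, p)
                and node_matches(start, nodes[s])
                and node_matches(end, nodes[e])):
            return "matched"
    # No complete match: create both nodes and the relationship from scratch.
    nodes.append({"label": start[0], "props": dict(start[1])})
    nodes.append({"label": end[0], "props": dict(end[1])})
    rels.append((len(nodes) - 2, rel_type, dict(rel_props), len(nodes) - 1))
    return "created"


london = ("Place", {"city": "London", "country": "UK"})
karl = ("Person", {"name": "Karl", "age": 64})
fred = ("Person", {"name": "Fred"})

first = merge_path(karl, "LIVES_IN", {"since": 1980}, london)   # created
second = merge_path(fred, "LIVES_IN", {}, london)               # created
third = merge_path(karl, "LIVES_IN", {"since": 1980}, london)   # matched

london_nodes = [n for n in nodes if node_matches(london, n)]
```

After the first two calls the toy graph holds two London nodes, because the Fred pattern did not match as a whole. Repeating Karl's pattern matches the existing path and changes nothing.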
Clearly, there should be only one London, UK, and many people can live there. Also clearly, other places called London (e.g., London, Ontario, in Canada) exist and should be allowed. You can constrain the data model to support the existence of a single London, UK, node after which any updates that try to create additional identical nodes will be (cleanly and safely) rejected.
Example 4. Create a constraint using
CREATE CONSTRAINT no_duplicate_cities FOR (p:Place) REQUIRE (p.country, p.city) IS NODE KEY
This statement declares that Place nodes need a unique composite key composed from a combination of city and country properties.
Now that you have a constraint in place, you can go back to thinking about how you’d like to connect both Karl and Fred to the node representing London, UK. To do so, you have to decompose this into three distinct steps:
- Create or find a node representing London.
- Create or find a node representing Karl and connect it to the node representing London.
- Create or find a node representing Fred and connect it to the node representing London.
Example 5. Multiline MERGE query
// Create or match a node representing London, UK, and bind it to variable "london"
MERGE (london:Place {city:'London', country:'UK'})
// Create or match a node representing Fred, and bind it to variable "fred"
MERGE (fred:Person {name:'Fred'})
// Create or match a LIVES_IN relationship between the fred and london nodes
MERGE (fred)-[:LIVES_IN]->(london)
// Create or match a node representing Karl, and bind it to variable "karl"
MERGE (karl:Person {name:'Karl'})
// Create or match a LIVES_IN relationship between the karl and london nodes
MERGE (karl)-[:LIVES_IN]->(london)
The MERGE statements in Example 5 are executed as part of a single transaction and so will be applied (or rejected) atomically. Moreover, because of the uniqueness constraint, the database won’t accept duplicate Place nodes and will abort any transactions that try to create them.
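The effect of merging node by node, rather than a whole path at once, can be sketched with a small in-memory model. This is a simplified illustration with hypothetical helper names. It does not model transactions or the constraint itself, only the "create or find" behaviour of each step.

```python
nodes = []    # each node is {"label": str, "props": dict}
rels = set()  # each relationship is (start_index, type, end_index)


def merge_node(label, **props):
    """Return the index of a matching node, creating the node if needed."""
    for index, node in enumerate(nodes):
        if node["label"] == label and all(
                node["props"].get(key) == value for key, value in props.items()):
            return index
    nodes.append({"label": label, "props": props})
    return len(nodes) - 1


def merge_rel(start, rel_type, end):
    """Create the relationship unless an identical one already exists."""
    rels.add((start, rel_type, end))


def connect_residents():
    london = merge_node("Place", city="London", country="UK")
    fred = merge_node("Person", name="Fred")
    merge_rel(fred, "LIVES_IN", london)
    karl = merge_node("Person", name="Karl")
    merge_rel(karl, "LIVES_IN", london)


connect_residents()
connect_residents()  # running the same steps again adds nothing

places = [n for n in nodes if n["label"] == "Place"]
```

Because London is merged on its own first, both people end up connected to the same Place node, and repeating the steps leaves the toy graph unchanged.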
Of course, since you are interacting with a database, you can update and enrich the data in your knowledge graph after it has been created.
Example 6. Add a property that sets Rosa’s date of birth.
MATCH (p:Person)
WHERE p.name = 'Rosa'
SET p.dob = 19841203
Equally, you can remove properties (without deleting the associated node or relationship):
Example 7. Remove the property that holds Rosa’s date of birth.
MATCH (p:Person)
WHERE p.name = 'Rosa'
REMOVE p.dob
REMOVE can also be used to strip a label from a node:
MATCH (p:Person)
WHERE p.name = 'Rosa'
REMOVE p:Person
- Graph Local Queries.
Example 8. Who lives in Berlin?
MATCH (p:Person)-[:LIVES_IN]->(:Place {city:'Berlin', country:'DE'}) RETURN (p)
The query in Example 8 is quite readable even if you’re new to Cypher. It starts with MATCH, which tells the database that you want to find patterns. The pattern of interest has some parts that you know exist in the knowledge graph and leaves some parts for the query to find. Specifically, the pattern (:Place {city:'Berlin', country:'DE'}) will match the node representing Berlin, and -[:LIVES_IN]-> will match incoming LIVES_IN relationships to that Berlin node. The relationship part of the pattern is loose, since it specifies no properties, but it will still match correctly in your knowledge graph as it stands. The (p:Person) part of the pattern will match any Person nodes in the graph and bind those matches to variable p for later use. Taken together, the pattern asks the database to find every Person node that has an outgoing LIVES_IN relationship to the specific Place node representing Berlin. Finally, RETURN (p) returns the Person nodes matched by this pattern, which results in just the node representing Rosa.
Example 9. Naive friends of friends
MATCH (:Person {name:'Rosa'})-[:FRIEND*2..2]->(fof:Person) RETURN (fof)
In Example 9 you start with the node representing Rosa and then look for any Person nodes that are connected to her via two outgoing FRIEND relationships. You can use the variable-length path syntax *2..2 in this example to specify path lengths from 2 to 2 (that is, exactly length 2). Less compactly, you could have written the pattern in full to the same effect: (:Person {name:'Rosa'})-[:FRIEND]->(:Person)-[:FRIEND]->(fof:Person). These are equivalent, though the shorter version is preferred since it’s more readable.
Since Rosa is a friend of Karl, and Karl is a friend of Rosa, there is a depth-two path that matches your -[:FRIEND*2..2]-> pattern, from Rosa to Karl and back to Rosa! To avoid including Rosa, you augment your MATCH clause with a WHERE predicate to constrain the search pattern, as shown in Example 10.
Example 10. Correctly finding friends of friends
MATCH (rosa:Person {name:'Rosa'})-[:FRIEND*2..2]->(fof:Person) WHERE rosa <> fof RETURN (fof)
The WHERE rosa <> fof predicate enriches the pattern. Now it only matches when the node representing Rosa is not the same as the node matched, avoiding the Rosa-Karl-Rosa problem you saw earlier.
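The Rosa-Karl-Rosa problem is easy to reproduce outside the database. The following sketch walks two outgoing FRIEND hops over a small adjacency list, once without and once with the inequality check. The friendship data is hypothetical: the text only states that Rosa and Karl are friends of each other, and Fred is added here as an assumed friend of Karl so that there is something to find.

```python
# Outgoing FRIEND relationships (hypothetical example data).
friends = {
    "Rosa": ["Karl"],
    "Karl": ["Rosa", "Fred"],
    "Fred": [],
}


def friends_of_friends(person, exclude_self):
    """Follow exactly two outgoing FRIEND hops starting from person."""
    result = []
    for friend in friends.get(person, []):
        for fof in friends.get(friend, []):
            # Plays the role of the WHERE rosa <> fof predicate.
            if exclude_self and fof == person:
                continue
            result.append(fof)
    return result


naive = friends_of_friends("Rosa", exclude_self=False)  # ['Rosa', 'Fred']
fixed = friends_of_friends("Rosa", exclude_self=True)   # ['Fred']
```

The naive traversal returns Rosa as her own friend of a friend, while the filtered one keeps only Fred.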
N.B. There are more predicates you can apply using WHERE, including Boolean operations, string matching, path patterns, list operations, property checks, and more, such as:
- WHERE n.name STARTS WITH 'Ka'
- WHERE n.name CONTAINS 'os'
- WHERE NOT n.name ENDS WITH 'y'
- WHERE NOT (n)-[:KNOWS]->(:Person {name:'Karl'})
- WHERE n.name IN ['Rosa', 'Karl'] AND (n)-[:LIVES_IN]->(:Place {city:'Berlin'})
Example 11. Correctly finding friends and friends of friends of someone who lives in Berlin
MATCH (:Place {city:'Berlin'})<-[:LIVES_IN]-(p:Person)<-[:FRIEND*1..2]-(f:Person) WHERE f <> p RETURN f
- Graph Global Queries.
But what if you want to query the whole graph, as is often the case with knowledge graphs? These queries are called graph global. You might want to ask the simple question, “Which are the most popular cities to live in?” In this case, your query pattern isn’t bound to any specific node but instead must consider all cities and their populations.
Example 12. Finding the most popular cities to live in
MATCH (p:Place)<-[l:LIVES_IN]-(:Person) RETURN p AS place, count(l) AS rels ORDER BY rels DESC
You ask the database to match any pattern that has a Place node with an incoming LIVES_IN relationship from any Person node. Whenever the pattern is matched, you can access variables bound to the matched nodes and relationships. Place nodes are bound to the variable p and LIVES_IN relationships are bound to the variable l, so they can be accessed elsewhere in the query. At the end of the query, you see RETURN p AS place, count(l) AS rels ORDER BY rels DESC. This takes more unpacking but will be familiar to anyone with experience with SQL. First, RETURN p AS place means that any pattern matches bound to p (Place nodes) will be returned, but instead of being called p in the results (too pithy), they will be called place. Along with place, the query returns the number of LIVES_IN relationships attached to each Place node using count(l) AS rels, where l is bound to the LIVES_IN relationships matched by the pattern and the count is given the friendly name rels. Since you want popular cities, you’ll need to impose an ordering on the results using ORDER BY rels DESC, which orders the results by the value of rels in descending order (highest first).
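The grouping and ordering performed by this query can be mimicked in a few lines of Python. The sketch below is a rough analogy only: it assumes the residents introduced in the earlier examples (Rosa in Berlin, Karl and Fred in London) and represents each LIVES_IN relationship as a (person, place) pair.

```python
from collections import Counter

# One pair per LIVES_IN relationship: (person, place).
lives_in = [
    ("Rosa", "Berlin"),
    ("Karl", "London"),
    ("Fred", "London"),
]

# Group by place and count relationships, like count(l) grouped by p.
rels_per_place = Counter(place for _person, place in lives_in)

# Order by count, highest first, like ORDER BY rels DESC.
ranking = rels_per_place.most_common()  # [('London', 2), ('Berlin', 1)]
```

As in the Cypher query, no starting node is fixed: every relationship in the data contributes to the result.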