At the foundation of any knowledge graph is the principle of first applying a graph abstraction to data, resulting in an initial data graph. We now discuss a selection of graph-structured data models that are commonly used in practice to represent data graphs. We then discuss the primitives that form the basis of graph query languages used to interrogate such data graphs.

RDF. 

A standardised data model based on directed edge-labelled graphs is the Resource Description Framework (RDF), which has been recommended by the W3C. The RDF model defines different types of nodes, including Internationalized Resource Identifiers (IRIs), which allow for global identification of entities on the Web; literals, which allow for representing strings (with or without language tags) and other datatype values (integers, dates, etc.); and blank nodes, which are anonymous nodes that are not assigned an identifier (for example, rather than creating an internal identifier for a node in RDF, we have the option of using a blank node).

Property graphs

Property graphs were introduced to provide additional flexibility when modelling more complex relations. A property graph allows a set of property–value pairs and a label to be associated with both nodes and edges. Property graphs are most prominently used in popular graph databases, such as Neo4j. In choosing between graph models, it is important to note that property graphs can be translated to/from directed edge-labelled graphs and/or graph datasets without loss of information. In summary, directed edge-labelled graphs offer a more minimal model, while property graphs offer a more flexible one. Often the choice of model will be secondary to other practical factors, such as the implementations available for different models.

The Cypher Query Language

For most knowledge graph developers (and even some users), the majority of interactions with the database will be via the Cypher query language. Cypher is a declarative, pattern-matching query language, originally created by Neo4j but now implemented by several other systems. At the time of writing, it is being standardized by the International Organization for Standardization (ISO) as GQL (Graph Query Language, informally “SQL for Graphs”).

- Creating Data in a Knowledge Graph.

Example 1. Using Cypher’s CREATE keyword to insert a subgraph

CREATE (:Person {name:'Rosa'})-[:LIVES_IN {since:2020}]->(:Place {city:'Berlin', country:'DE'})

In Example 1 the graph structure is very clear. Nodes are represented by parentheses (), which look like circles in ASCII art. Next, there are labels for the nodes like :Place and :Person, which indicate the role those nodes play in the graph (places and people in this case). Nodes can have zero or more of these labels. The relationship is written as an arrow, -[:LIVES_IN]->, with its type in square brackets, and properties such as name and since appear as key-value pairs inside braces {}.

N.B. When storing data, the direction of a relationship is mandatory since it creates a correct, high-fidelity model, disambiguating that Rosa lives in Berlin, rather than Berlin living in Rosa.
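As a quick check on Example 1, the subgraph can be read back with a MATCH query (MATCH is covered in more detail later in this section). This is a minimal sketch that assumes Example 1 has been run; the variable names p, r and pl and the column aliases are arbitrary choices. The second query leaves the arrowhead off, which Cypher permits when reading data but not when creating it.

```cypher
// Read back the subgraph from Example 1, following the stored direction
MATCH (p:Person {name:'Rosa'})-[r:LIVES_IN]->(pl:Place)
RETURN p.name AS person, r.since AS since, pl.city AS city, pl.country AS country;

// Direction may be omitted in a read pattern; this matches LIVES_IN in either direction
MATCH (p:Person {name:'Rosa'})-[:LIVES_IN]-(pl:Place)
RETURN pl.city AS city;
```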

- Avoiding Duplicates When Enriching a Knowledge Graph.

Sometimes you do want to CREATE new records, but often you don’t want duplicates in your data. Since CREATE always inserts new records, you’ll use Cypher’s MERGE keyword instead, which inserts records only if the entirety of the supplied pattern does not already exist.

Example 2. Creating a graph where Karl lives in London

MERGE (:Person {name:'Karl', age:64})-[:LIVES_IN {since:1980}]->(:Place {city:'London', country:'UK'})

Since no matching records exist in the database yet, this MERGE acts like a CREATE, and as you would expect, two nodes and a single relationship are persisted in the knowledge graph.

Example 3. Now you can see what happens when you MERGE Fred, who also happens to live in London. If you type

MERGE (:Person {name:'Fred'})-[:LIVES_IN]->(:Place {city:'London', country:'UK'})

you might expect the database to create a new node to represent Fred and connect it via a new LIVES_IN relationship to the existing node representing London. But, surprisingly, that is not what happens. MERGE is subtle. Its semantics are a mix of MATCH and CREATE insofar as it will either match whole patterns or create new records that match the pattern in its entirety. It will never partially MATCH and partially CREATE a pattern. Since no Fred node exists yet, the pattern as a whole fails to match, so MERGE creates all of it: a new Fred node, a new LIVES_IN relationship, and a second node representing London.
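The duplication can be observed directly. The following is a minimal sketch, assuming Examples 2 and 3 were run as written against a database that has no uniqueness constraint on Place nodes; under those assumptions the first query would be expected to report two London nodes. The aliases london_nodes, person and place_id are illustrative, and elementId() assumes Neo4j 5 (older versions offer id() instead).

```cypher
// Count the nodes that represent London, UK
MATCH (pl:Place {city:'London', country:'UK'})
RETURN count(pl) AS london_nodes;

// Show which person is attached to which London node
MATCH (p:Person)-[:LIVES_IN]->(pl:Place {city:'London', country:'UK'})
RETURN p.name AS person, elementId(pl) AS place_id;
```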

Clearly, there should be only one London, UK, and many people can live there. Also clearly, other places called London (e.g., London, Ontario, in Canada) exist and should be allowed. You can constrain the data model to support the existence of a single London, UK, node, after which any updates that try to create additional identical nodes will be (cleanly and safely) rejected.

Example 4. Creating a constraint with CREATE CONSTRAINT

CREATE CONSTRAINT no_duplicate_cities FOR (p:Place) REQUIRE (p.country, p.city) IS NODE KEY

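This statement declares that Place nodes need a unique composite key composed from a combination of city and country properties: every Place node must have both properties, and no two Place nodes may share the same combination of values.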
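A practical wrinkle: if the duplicate London node from Example 3 is still in the database, creating the constraint would be expected to fail until the duplicate is removed. The sketch below shows one possible clean-up, followed by two checks to run once the constraint exists. It assumes nothing else is attached to Fred or to the duplicate node, and that the database edition supports node key constraints (in Neo4j these are, to our knowledge, an Enterprise Edition feature).

```cypher
// Run before Example 4 if Example 3 was executed: removes Fred and the duplicate
// London node he points to (assumes nothing else is attached to either node)
MATCH (fred:Person {name:'Fred'})-[:LIVES_IN]->(dup:Place {city:'London', country:'UK'})
DETACH DELETE fred, dup;

// After Example 4: list the constraints known to the database
SHOW CONSTRAINTS;

// After Example 4: this CREATE is expected to be rejected, because a Place node
// with the same (country, city) key already exists
CREATE (:Place {city:'London', country:'UK'});
```

Fred is recreated by the multiline MERGE in Example 5, so deleting him here loses nothing.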

Now that you have a constraint in place, you can go back to thinking about how you’d like to connect both Karl and Fred to the node representing London, UK. To do so, you have to decompose this into three distinct steps:

  1. Create or find a node representing London.
  2. Create or find a node representing Karl and connect it to the node representing London.
  3. Create or find a node representing Fred and connect it to the node representing London.

Example 5. Multiline MERGE query

MERGE (london:Place {city:'London', country:'UK'})  // Creates or matches a node to represent London, UK, and binds it to variable "london"

MERGE (fred:Person {name:'Fred'})  // Creates or matches a node to represent Fred and binds it to variable "fred"

MERGE (fred)-[:LIVES_IN]->(london)  // Creates or matches a LIVES_IN relationship between the fred and london nodes

MERGE (karl:Person {name:'Karl'})  // Creates or matches a node to represent Karl and binds it to variable "karl"

MERGE (karl)-[:LIVES_IN]->(london)  // Creates or matches a LIVES_IN relationship between the karl and london nodes

The MERGE statements in Example 5 are executed as part of a single transaction and so will be applied (or rejected) atomically. Moreover, because of the uniqueness constraint, the database won’t accept duplicate Place nodes and will abort any transactions that try to create them.
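MERGE can also set properties differently depending on whether it created or matched a record, using its ON CREATE SET and ON MATCH SET subclauses. The following is a sketch in the style of Example 5; the lastChecked property is hypothetical and only there to make the matched case visible.

```cypher
MERGE (london:Place {city:'London', country:'UK'})
MERGE (karl:Person {name:'Karl'})
MERGE (karl)-[r:LIVES_IN]->(london)
  ON CREATE SET r.since = 1980          // applied only if the relationship is newly created
  ON MATCH SET r.lastChecked = date()   // hypothetical property, applied only if it already existed
```

Keeping properties such as since out of the MERGE pattern itself means the relationship is matched on its endpoints and type alone, so a differing property value does not cause a second relationship to be created.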

Of course, since you are interacting with a database, you can update and enrich the data in your knowledge graph after it has been created.  

Example 6. Adding a property that records Rosa’s date of birth

MATCH (p:Person)

WHERE p.name = 'Rosa'

SET p.dob = 19841203

Equally, you can remove properties (without deleting the associated node or relationship):

Example 7. Removing the property that records Rosa’s date of birth

MATCH (p:Person)

WHERE p.name = 'Rosa'

REMOVE p.dob

REMOVE can also be used to strip a label from a node:

MATCH (p:Person)

WHERE p.name = 'Rosa'

REMOVE p:Person
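Note that once the Person label has been stripped, patterns such as (p:Person) no longer match Rosa, which affects the later examples in this section. Labels are added back with SET. A minimal sketch, assuming Rosa is the only node with name 'Rosa':

```cypher
// Find Rosa by property alone, since she no longer carries the Person label
MATCH (p {name:'Rosa'})
SET p:Person
RETURN labels(p) AS labels;
```

Matching without a label has to consider every node in the graph, which is fine for a small example but worth avoiding on large datasets.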

- Graph Local Queries.

Example 8. Who lives in Berlin?

MATCH (p:Person)-[:LIVES_IN]->(:Place {city:'Berlin', country:'DE'})

RETURN (p)

The query in Example 8 is quite readable even if you’re new to Cypher. It starts with MATCH, which tells the database that you want to find patterns. The pattern of interest has some parts that you know exist in the knowledge graph and leaves some parts for the query to find. Specifically, the pattern (:Place {city:'Berlin', country:'DE'}) will match the node representing Berlin, and -[:LIVES_IN]-> will match incoming LIVES_IN relationships to that Berlin node. This is a loose pattern since no relationship properties are specified, but it will still match correctly in your knowledge graph as it stands. The (p:Person) part of the pattern will match any Person nodes in the graph and bind those matches to variable p for later use. When taken together, the pattern asks the database to find any matching patterns where any Person node has an outgoing LIVES_IN relationship to the specific Place node representing Berlin. Finally, the query returns any Person nodes that have been matched as part of this pattern by RETURN (p), which results in just the node representing Rosa.


Example 9. Naive friends of friends

MATCH (:Person {name:'Rosa'})-[:FRIEND*2..2]->(fof:Person)

RETURN (fof)

In Example 9 you start with the node representing Rosa and then look for any Person nodes that are connected to her via two outgoing FRIEND relationships. You can use the variable-length path syntax *2..2 in this example to specify path lengths from 2 to 2 (that is, exactly length 2). Less compactly, you could have written the pattern in full to the same effect: (:Person)-[:FRIEND]->(:Person)-[:FRIEND]->(:Person). These are equivalent, though the shorter version is preferred since it’s more readable.

Since Rosa is a friend of Karl, and Karl is a friend of Rosa, there is a depth-two path that matches your -[:FRIEND*2..2]-> pattern, from Rosa to Karl and back to Rosa! To avoid including Rosa, you augment your MATCH clause with a WHERE predicate to constrain the search pattern, as shown in Example 10.

Example 10. Correctly finding friends of friends

MATCH (rosa:Person {name:'Rosa'})-[:FRIEND*2..2]->(fof:Person)

WHERE rosa <> fof

RETURN (fof)

The WHERE rosa <> fof predicate enriches the pattern. Now it only matches when the node representing Rosa is not the same as the node matched, avoiding the Rosa-Karl-Rosa problem you saw earlier.
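The earlier examples create people and places but no FRIEND relationships, so Examples 9 and 10 need some data to run against. The following sketch adds it, assuming exactly one Person node each for Rosa, Karl and Fred; the Karl-to-Fred friendship is a hypothetical addition for illustration.

```cypher
MATCH (rosa:Person {name:'Rosa'}), (karl:Person {name:'Karl'}), (fred:Person {name:'Fred'})
MERGE (rosa)-[:FRIEND]->(karl)
MERGE (karl)-[:FRIEND]->(rosa)
MERGE (karl)-[:FRIEND]->(fred);   // hypothetical friendship, added for illustration
```

With this data, Example 9 would be expected to return both Rosa (via Rosa-Karl-Rosa) and Fred (via Rosa-Karl-Fred), while Example 10 would return only Fred.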

N.B. There are more predicates you can apply using WHERE, including Boolean operations, string matching, path patterns, list operations, property checks, and more, such as:

  • WHERE n.name STARTS WITH 'Ka'
  • WHERE n.name CONTAINS 'os'
  • WHERE NOT n.name ENDS WITH 'y'
  • WHERE NOT (n)-[:KNOWS]->(:Person {name:'Karl'})
  • WHERE n.name IN ['Rosa', 'Karl'] AND (n)-[:LIVES_IN]->(:Place {city:'Berlin'})
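These predicates are fragments; each needs a MATCH clause that binds the variable it tests. A minimal sketch of a complete query combining several of them, assuming the Person and Place data from the earlier examples (the alias name is arbitrary):

```cypher
MATCH (n:Person)
WHERE (n.name STARTS WITH 'Ka' OR n.name CONTAINS 'os')
  AND NOT (n)-[:LIVES_IN]->(:Place {city:'Berlin'})
RETURN n.name AS name
ORDER BY name;
```

The parentheses around the OR matter: AND binds more tightly than OR, so without them the Berlin exclusion would apply only to the CONTAINS test.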

Example 11. Correctly finding friends and friends of friends of someone who  lives in Berlin 

MATCH (:Place {city:'Berlin'})<-[:LIVES_IN]-(p:Person)<-[:FRIEND*1..2]-(f:Person)

WHERE f <> p

RETURN f

- Graph Global Queries.

But what if you want to query the whole graph, as is often the case with knowledge graphs? These queries are called graph global. You might want to ask the simple question, “Which are the most popular cities to live in?” In this case, your query pattern isn’t bound to any specific node but instead must consider all cities and their populations.

Example 12

MATCH (p:Place)<-[l:LIVES_IN]-(:Person)

RETURN p AS place, count(l) AS rels ORDER BY rels DESC

You ask the database to match any pattern that has a Place node with an incoming LIVES_IN relationship from any Person node. Whenever the pattern is matched, you can access variables bound to the matched nodes and relationships: Place nodes are bound to the variable p and LIVES_IN relationships to the variable l, where they can be accessed elsewhere in the query. Next, you see RETURN p AS place, count(l) AS rels ORDER BY rels DESC. This takes more unpacking but will be familiar to anyone with experience with SQL. First, RETURN p AS place means that any pattern matches bound to p (Place nodes) will be returned, but instead of being called p in the results (too pithy), results will be called place. Along with place, the query returns the number of LIVES_IN relationships attached to each Place node with count(l) AS rels, where l is matched from the pattern (specifically LIVES_IN relationships) and the count is given the friendly name rels. Because count is an aggregating function, the results are implicitly grouped by place, much like GROUP BY in SQL. Since you want popular cities, you’ll need to impose an ordering on results using ORDER BY rels DESC, which says to order the results on the value of the variable rels and return them in descending order (highest first).
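In practice you often want readable columns and only the top few results rather than whole nodes for every city. A small variation on Example 12, sketched under the assumption that Place nodes carry the city and country properties used earlier; the alias residents and the cut-off of 3 are arbitrary choices.

```cypher
MATCH (p:Place)<-[l:LIVES_IN]-(:Person)
RETURN p.city AS city, p.country AS country, count(l) AS residents
ORDER BY residents DESC
LIMIT 3;
```

Here the implicit grouping key is the pair (city, country), so London, UK and a London in another country would be counted separately.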


Last modified: Saturday, 22 June 2024, 10:27