05 Link Prediction


Improving our KG.

Now that we have a knowledge graph generated by our LLM, how do we know that we are not missing facts from our database? This is a difficult problem in the study of knowledge graphs, with a long history, and in this notebook we will go over one way that you can tackle this problem. We will also discuss some other techniques that you can try.

Let's start off, as we always do: with some imports.

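A minimal sketch of what that import cell plausibly contains, limited here to NumPy and the standard library (the actual notebook would also pull in PyKEEN and its training utilities):

```python
# Core numerics for the embedding math in this notebook.
import numpy as np

# Standard-library helpers for reading triplets and grouping them by source file.
import csv
from collections import defaultdict

# A seeded generator so the training sketches below are reproducible.
rng = np.random.default_rng(0)
```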

A first, familiar step.

As with the notebooks before this one, one of the first things we need to do is read in the same set of triplets we keep using. Let's go ahead and do that! Over the course of this notebook we will cast these triplets into a couple of different formats so they work with different tools, as explained in the code comments.

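As a sketch of that step, here is one way the triplets can be cast into the two formats most tools expect: a plain label array, and an integer-encoded array with entity/relation vocabularies. The triplets below are illustrative stand-ins; the notebook reads its own set from disk.

```python
import numpy as np

# In the notebook these triplets are read from a file; here we inline a tiny
# sample so the sketch is self-contained (the names are illustrative only).
triplets = [
    ("NVIDIA", "develops", "TensorRT"),
    ("TensorRT", "accelerates", "LLM inference"),
    ("NVIDIA", "develops", "Triton Inference Server"),
    ("Triton Inference Server", "serves", "LLM inference"),
]

# Format 1: a plain (n, 3) array of labels.
triples_arr = np.array(triplets, dtype=object)

# Format 2: integer-encoded triples plus vocabularies, the form most
# embedding libraries (e.g. PyKEEN) build internally.
entities = sorted({t[0] for t in triplets} | {t[2] for t in triplets})
relations = sorted({t[1] for t in triplets})
ent2id = {e: i for i, e in enumerate(entities)}
rel2id = {r: i for i, r in enumerate(relations)}
encoded = np.array([(ent2id[h], rel2id[r], ent2id[t]) for h, r, t in triplets])

print(encoded.shape, len(entities), len(relations))
```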

Learning a thing or two about our graph.

Now that we have the triplets read in, we can train a model using PyKEEN. First, let's explain what model we are training. In the world of knowledge graph completion there are many ways to perform edge prediction. For simple knowledge graphs you might use nonnegative matrix factorization (NMF) to predict unlabeled edges based on structural properties, for instance. More complicated knowledge graphs require more sophisticated methods, like neural networks. These models tend to work by taking all the nodes and edges in our graph and learning embeddings for them, such that for any edge (u, relation, v) we have a corresponding set of embeddings (e(u), e(relation), e(v)) where e(u) + e(relation) ≈ e(v). At least, that is the idea; in reality, we end up with some approximation of it. Then, if we have two nodes u, v and we want to predict what an edge between them might be, we can cycle through all the different relations we have embedded and see where the prior equation holds most closely! This is what the TransE model does, and it does so without the need to train an expensive GNN.

It should be noted that there are a ton of variants of this, including TransR and TranSparse, but we will keep it simple! Let's go ahead and train our model!

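The notebook trains TransE through PyKEEN; to make the idea concrete, here is a minimal NumPy-only sketch of the same margin-based training loop on a toy graph. The toy triples, dimensions, and hyperparameters are all assumptions for illustration, not the notebook's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy integer-encoded triples (head, relation, tail); in the notebook these
# come from the triplets loaded earlier.
triples = np.array([(0, 0, 1), (1, 1, 2), (0, 0, 3), (3, 2, 2)])
n_ent, n_rel, dim = 4, 3, 16

# TransE learns embeddings so that e(head) + e(relation) ≈ e(tail).
E = rng.normal(scale=0.1, size=(n_ent, dim))  # entity embeddings
R = rng.normal(scale=0.1, size=(n_rel, dim))  # relation embeddings

def mean_true_distance():
    """Average ||e(h) + e(r) - e(t)|| over the known triples."""
    return float(np.mean([np.linalg.norm(E[h] + R[r] - E[t]) for h, r, t in triples]))

before = mean_true_distance()
lr, margin = 0.05, 1.0
for _ in range(500):
    for h, r, t in triples:
        t_neg = rng.integers(n_ent)              # corrupt the tail at random
        pos = E[h] + R[r] - E[t]
        neg = E[h] + R[r] - E[t_neg]
        # Margin ranking loss: only update when the corrupted triple is not
        # at least `margin` farther away than the true one.
        if np.linalg.norm(pos) + margin > np.linalg.norm(neg):
            g_pos = pos / (np.linalg.norm(pos) + 1e-9)
            g_neg = neg / (np.linalg.norm(neg) + 1e-9)
            E[h] -= lr * (g_pos - g_neg)
            R[r] -= lr * (g_pos - g_neg)
            E[t] += lr * g_pos
            E[t_neg] -= lr * g_neg
after = mean_true_distance()
print(f"mean distance on known triples: {before:.3f} -> {after:.3f}")
```

The negative sampling (corrupting tails at random) is what forces true triples to score better than arbitrary ones, rather than all distances simply collapsing to zero.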

An unfortunate truth.

Now we may think that we are mostly done. "Aha, now that my model is trained, I can just check every possible edge to see which ones satisfy the above equation!" you may have thought. And you would be right, if we had a very small graph. Unfortunately, that is a lot of edges and vector comparisons to perform. With the right hardware we could certainly brute-force our particular graph, but not graphs in general: an all-to-all prediction with TransE, or any embedding-based link-prediction model, is prohibitively expensive. Instead, we limit ourselves to subgraphs and perform the comparison there.

There are a lot of ways to obtain sensible subgraphs to perform prediction on. One popular method is to perform a shallow breadth-first search (BFS) around a node and only consider links within the result. Another is to collect a set of paths between nodes using A* or Dijkstra's algorithm and use the induced subgraph. In our case we are going to layer some techniques: we will use NMF to suggest structurally similar links, and then use TransE to fill in what those links' values "should" be. To reiterate, though, this is not the best way of doing this; it is just one way of doing this. We recommend testing alternate subgraph-retrieval techniques for your specific KG and your specific use case.

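To illustrate the structural half of the pipeline, here is a NumPy-only sketch of the degree-normalized NMF step on a toy adjacency matrix. The multiplicative-update NMF below stands in for whatever library implementation the notebook uses, and the graph is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Undirected toy adjacency over 5 nodes (the notebook builds this from the
# KG's entities and edges).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (3, 4)]
n = 5
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Symmetric normalization A_hat = D^{-1/2} A D^{-1/2}; without it the NMF
# reconstruction concentrates suggested links around high-degree nodes.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1.0)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

# Rank-k NMF via multiplicative updates: A_hat ≈ W @ H with W, H >= 0.
k, eps = 2, 1e-9
W = rng.random((n, k))
H = rng.random((k, n))
err_before = np.linalg.norm(A_hat - W @ H)
for _ in range(500):
    H *= (W.T @ A_hat) / (W.T @ W @ H + eps)
    W *= (A_hat @ H.T) / (W @ H @ H.T + eps)
err_after = np.linalg.norm(A_hat - W @ H)

# High reconstruction scores on pairs that are *not* already edges are the
# structurally suggested links we hand off to TransE for labeling.
scores = W @ H
candidates = sorted(
    ((scores[u, v], u, v) for u in range(n) for v in range(u + 1, n) if A[u, v] == 0),
    reverse=True,
)
print("top suggested link:", candidates[0][1:])
```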

Something to note about the previous code is that we normalize the adjacency matrix by the inverse square root of each node's degree when performing NMF. There is a long history behind this choice of adjacency normalization, but the practical reason we perform it is to discourage bad behavior from NMF: without normalization, the predicted edges cluster around high-degree nodes, and we would rather surface more varied structure.

Now with that out of the way, we are really close! We just have to see which edges are most likely. Let's write up one more function to do relationship prediction on this edge set and see what we end up getting as output.

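The relationship-prediction function itself reduces to a nearest-relation search under the TransE equation. A sketch, with hand-built embeddings chosen so the answer is obvious; real embeddings would of course come from the trained model.

```python
import numpy as np

def predict_relation(E, R, u, v):
    """Return the relation id minimizing ||e(u) + e(r) - e(v)||, i.e. the
    relation whose translation best explains going from u to v."""
    dists = np.linalg.norm(E[u] + R - E[v], axis=1)
    return int(np.argmin(dists))

# Hand-built embeddings for the sketch: entity 1 sits exactly at entity 0
# translated by relation 2, so relation 2 should be predicted.
E = np.array([[0.0, 0.0], [0.5, 0.5]])
R = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(predict_relation(E, R, 0, 1))  # → 2
```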

Maybe one more thing.

Okay, so we tried out the TransE model, where the value of the tail is given by simple vector addition. And I know we said that we wouldn't do this, but let's quickly train a TransR model on the same set of data and see what relations that model suggests. The TransR model is slightly more sophisticated: instead of e(u) + e(relation) ≈ e(v), we simultaneously learn a per-relation projection matrix M_r such that M_r e(u) + e(relation) ≈ M_r e(v). This can be thought of loosely in the following way. Each relationship in the TransR model has a "native" vector space, and before performing prediction we need to project our entity embeddings into that space. This gives the model more flexibility at the cost of extra optimization expense. The TransR model is generally better at handling KGs with lots of complicated relations.

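To make the projection idea concrete, here is a NumPy sketch of TransR scoring with randomly initialized parameters. Training is omitted here; it would tune E, R, and the projection matrices jointly with the same margin loss as TransE, and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, d_ent, d_rel = 4, 3, 8, 6

# TransR keeps entity and relation spaces separate: each relation r owns a
# projection matrix M[r] mapping entities into its own d_rel-dimensional space.
E = rng.normal(scale=0.1, size=(n_ent, d_ent))
R = rng.normal(scale=0.1, size=(n_rel, d_rel))
M = rng.normal(scale=0.1, size=(n_rel, d_rel, d_ent))

def transr_score(h, r, t):
    """Negative distance in relation r's space: project both entities with
    M[r], then check how well R[r] translates one onto the other."""
    h_proj = M[r] @ E[h]
    t_proj = M[r] @ E[t]
    return -float(np.linalg.norm(h_proj + R[r] - t_proj))
```

Note the extra cost relative to TransE: every score now involves two matrix-vector products, and every relation carries d_rel × d_ent additional parameters.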

Awesome, our model is trained. Now let's do some prediction with it!

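Prediction with TransR is the same nearest-relation search as before, just performed in each relation's projected space. A deterministic sketch with hand-built values (identity projections, so the geometry is easy to follow; these are not real training outputs):

```python
import numpy as np

# Two entities, two relations, identity projections for clarity.
E = np.array([[0.0, 0.0], [1.0, 0.0]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])
M = np.stack([np.eye(2), np.eye(2)])  # per-relation TransR projections

def rank_relations(u, v):
    """Relations sorted from most to least plausible for the pair (u, v)."""
    dists = [np.linalg.norm(M[r] @ E[u] + R[r] - M[r] @ E[v]) for r in range(len(R))]
    return list(np.argsort(dists))

print(rank_relations(0, 1))  # relation 0 translates entity 0 exactly onto entity 1
```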

As can be seen, this model gives different results. But how good are these results? That depends on our end goals. There are two different paradigms when discussing knowledge graph completion and fact assessment, generally called the open-world and closed-world assumptions. Roughly speaking, in open-world KG completion we assume that there are facts outside of our known dataset that may be relevant to our knowledge graph. This is the realm of "fact discovery", and it can be used to help guide things like research efforts, or prediction. In closed-world KG completion, we only want to include facts that we can reason about directly from our data. This is generally more expensive to enforce, and requires extra evaluation of the proposed triplets, either by hand or by machine using some metric or an LLM. As an example, we may care about enforcing a closed-world assumption in the realm of legal data, or documentation.

Given how broad open-world assumptions can be, we will not speak much to that kind of knowledge graph completion. Instead let us briefly talk about one way we can look to confirm our facts.

Assessing our predictions.

There are a few different ways we can assess our suggested edges under a closed-world assumption. This is a tough task, since in general we do not have a reliable ground truth. Let us consider what is available to us: our initial data, our knowledge graph, and our suggested edges. From these we need to judge whether suggested edges are sensible. One way we could do this is to trace each fact (RDF triple) from our knowledge graph back to our dataset and cite specific text. This can then be evaluated by a person, or by using an LLM as a judge.

This seems like a good path! However, how do we know what text we should look at? Simply put, we don't. To ensure that we are not missing the key documents we would have to check over our whole dataset for each edge. But if we are okay with some approximation we can limit our options if we are clever. We know where each edge in our KG came from, and can trace it back to a file. What if we pursued the following flow from data ingestion to assessment?

  1. Ingest triplets from our dataset and track which files each triplet comes from.
  2. Turn this into a knowledge graph where each edge has its parent file as metadata.
  3. Train a TransE or TransR model, as described above, on the entire graph.
  4. Sample subgraphs of the KG for smaller sets of documents.
  5. Perform link prediction and relationship prediction on each subgraph.
  6. Evaluate the suggested relationships with respect to the text in the parent documents only.

The above workflow allows us to search for citations in a much smaller corpus of text, while still allowing us to include information from several documents. So, let's start that process. We already have steps 1-3 done, so let's start at step 4!

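Step 4 can be sketched as a simple group-by on the parent-file metadata each edge carries. The records below are illustrative; in the notebook they come from ingestion, and the filenames are hypothetical.

```python
# Each triplet carries the file it was extracted from (per step 2 of the
# workflow above). These sample records are illustrative only.
triplets = [
    ("NVIDIA", "develops", "TensorRT", "doc_a.txt"),
    ("TensorRT", "accelerates", "inference", "doc_a.txt"),
    ("NVIDIA", "develops", "Triton", "doc_b.txt"),
    ("Triton", "serves", "models", "doc_c.txt"),
]

def subgraph_for_docs(triplets, docs):
    """Keep only edges whose parent file is in `docs`, and return the
    induced node set alongside the surviving (h, r, t) triples."""
    kept = [(h, r, t) for h, r, t, src in triplets if src in docs]
    nodes = {h for h, _, _ in kept} | {t for _, _, t in kept}
    return nodes, kept

nodes, kept = subgraph_for_docs(triplets, {"doc_a.txt", "doc_b.txt"})
print(sorted(nodes))
```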

Cool! We have our subgraph, and the triplets therein. Let's run our TransR model!

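Running the model over the subgraph amounts to scoring every missing (head, relation, tail) combination among the subgraph's entities only, which is what keeps the comparison tractable. A sketch with illustrative embeddings standing in for the trained model's:

```python
import numpy as np

# Illustrative "trained" embeddings (in the notebook these come from the
# model fit earlier); entity ids 0-3, relation ids 0-1.
E = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])

subgraph_nodes = {0, 1, 2}   # entities from the document-limited subgraph
existing = {(0, 0, 1)}       # edges the subgraph already contains

# Score every missing (head, relation, tail) combination inside the subgraph.
suggestions = []
for u in subgraph_nodes:
    for v in subgraph_nodes:
        if u == v:
            continue
        for r in range(len(R)):
            if (u, r, v) in existing:
                continue
            dist = np.linalg.norm(E[u] + R[r] - E[v])
            suggestions.append((dist, u, r, v))
suggestions.sort()
print(suggestions[0])  # lowest-distance, i.e. most plausible, suggested edge
```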

Now that we have suggested edges for our subgraph, we can go ahead and evaluate whether the given triplets seem reasonable given the context. Again, we can do this manually, or by using an LLM as a judge.
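If we go the LLM-as-judge route, the core artifact is just a prompt that pairs each proposed triple with the text of its parent documents. A hypothetical sketch; the prompt wording and the SUPPORTED/UNSUPPORTED labels are assumptions for illustration, not from the notebook.

```python
def build_judge_prompt(triple, context):
    """Assemble a verification prompt for a proposed triple and the text of
    its parent documents (wording here is an illustrative choice)."""
    h, r, t = triple
    return (
        "You are verifying facts extracted from the documents below.\n\n"
        f"Documents:\n{context}\n\n"
        f"Proposed fact: ({h}, {r}, {t})\n"
        "Reply SUPPORTED or UNSUPPORTED, quoting the supporting sentence if one exists."
    )

prompt = build_judge_prompt(
    ("TensorRT", "accelerates", "inference"),
    "TensorRT is a library for accelerating inference on NVIDIA GPUs.",
)
print(prompt.splitlines()[0])
```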

Wrapping up.

You now know how to perform edge prediction on your LLM-generated knowledge graph in a way that is both computationally feasible and incorporates structural as well as embedding information. This comes from using NMF together with TransE/TransR, ensuring that each suggested edge has been recommended by both. Remember that this is only one way to do this, and we encourage you to try out any number of the suggestions we made during this notebook.