Written by Dexter Lowe, Software Engineer at G-Research.
On Monday 17th November, many of us from G-Research were excited to hear from Amy Hodler – one of the leading lights in the field of graph algorithms and their role in AI – at the Royal College of Physicians.
Amy was visiting us from America, where she works with Relational AI to bridge the sometimes painful divide between graph data and useful applications.
Unsurprisingly, connections are a key concept in graph databases, a critical piece of knowledge for many systems, and at the forefront of Amy’s talk, which ended with a plea for the audience to improve our ‘betweenness centrality’.
But before we got to that, I learnt a lot, came away with a reading list as long as my arm, and took an impish delight in discovering that I can start using the term 'pizza nodes'.
My interest in graphs
For me, as with most computer scientists I imagine, my first exposure to graphs was the Seven Bridges of Königsberg – an example of applied mathematics credited with effectively creating the field of graph theory.
At the time I found this interesting, but a few months later I was even more intrigued when I saw a speaker showing off a graph of beer!
The speaker represented the beers in a graph structure before revealing how that graph could help you find similar beers that you might like. The algorithm showed what connected two of his favourite drinks, allowing him to discover new beers that might appeal based on their common characteristics.
It was here that I really got to grips with my first graph algorithm: a simple shortest path. I was intrigued by how easy it was to consume the information and, in turn, discover other interesting facts about a subject that I hadn't even considered before.
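To make that concrete, here's a minimal sketch of the idea in Python with networkx. The beers and characteristics are invented for illustration – this isn't the speaker's actual dataset.

```python
# A minimal sketch of the beer-graph idea, assuming networkx.
# The beers and characteristics below are invented for illustration.
import networkx as nx

G = nx.Graph()

# Beers connect to the characteristics they exhibit.
G.add_edges_from([
    ("Imaginary Stout", "roasted malt"),
    ("Imaginary Stout", "full body"),
    ("Hypothetical Porter", "roasted malt"),
    ("Hypothetical Porter", "chocolate notes"),
    ("Made-up Milk Stout", "chocolate notes"),
    ("Made-up Milk Stout", "sweet finish"),
])

# The shortest path between two favourites runs through the
# characteristics they share, surfacing the beers in between.
path = nx.shortest_path(G, "Imaginary Stout", "Made-up Milk Stout")
print(" -> ".join(path))
# Imaginary Stout -> roasted malt -> Hypothetical Porter
#   -> chocolate notes -> Made-up Milk Stout
```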
In graph form, you can explore the structure of your data in a simpler way compared to looking at the results of a SQL query, because when you find a big common node, it becomes instantly apparent.
Standard data techniques are useful for getting you what you asked for, but graphs can help you find things you never even realised could be related.
Graphs are everywhere
As you start to look into graphs, you realise almost everything is a graph in some form: the streets of a city, the complete transitive list of everything that goes into a can of Coke, the migratory flight paths of birds.
Even things we might not think of as graphs often are. Time is a good example because it can be modelled as a tree with a whole host of extra useful links (like “next_month” or “last_day”) so you never again have to wonder how many days hath September.
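As a hedged sketch of what that modelling looks like (the relationship names echo the talk's "next_month" idea but are otherwise my own invention):

```python
# A hedged sketch of time as a graph, assuming networkx; the
# relationship names (e.g. "next_month") are my own invention.
import networkx as nx

T = nx.DiGraph()
months = ["2024-01", "2024-02", "2024-03"]

# Each month hangs off its year, tree-style...
for m in months:
    T.add_edge("2024", m, rel="has_month")

# ...plus the extra links that make navigation trivial.
for a, b in zip(months, months[1:]):
    T.add_edge(a, b, rel="next_month")

# "What comes after February?" becomes a one-hop traversal.
after_feb = [b for _, b, d in T.out_edges("2024-02", data=True)
             if d["rel"] == "next_month"]
print(after_feb)  # ['2024-03']
```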
Small self-plug: you can learn more about this in my recent talk at Devoxx.
Average is a lie
Graphs are everywhere and so are averages, but it’s the latter that we grow up intimately aware of.
In almost any domain there will be outliers, whether that’s the tall colleague that has to stoop to get through the door or that one road that is so much busier than any others near it.
And while the average price of bread can let you know if something is overpriced, if you build a door for the average person, you’ll find that a lot of people hit their head. So how does this relate to graphs?
We, as data consumers, like averages. We like to discard all the outliers and just look at the average.
But in a graph, it’s not quite that straightforward because structure and connections are key.
Conceptually, an average node in a graph would be one with an average number of connections with a quick drop off either side. This describes what is known as a random network.
But a quote Amy shared, from Albert-László Barabási, threw the random network model into doubt.
“No network in nature that we know of would be described by the random network model”.
So by assuming our data will fit the random network model (or, worse, coercing it to), we are not reflecting reality, and that will likely harm our ability to make effective decisions from our data.
Instead, most real-world graphs follow a rough power-law distribution: the majority of nodes have very few connections, while a very small number in the long tail have a vastly outsized count.
Anyone who has worked with graph databases before has almost certainly come across a supernode (or 'pizza node', according to Amy).
Their presence, along with that of the many sparsely connected nodes, is something I was aware of, but this was the first time it had really been driven home for me. The drastic contrast of the bell curve superimposed over the power-law graph showed just how different the two mental models are.
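You can reproduce that contrast in a few lines of networkx, comparing an Erdős–Rényi random network with a Barabási–Albert scale-free one (a sketch of my own, not material from the talk):

```python
# A sketch contrasting the two mental models with networkx: an
# Erdős–Rényi random network versus a Barabási–Albert scale-free one.
import networkx as nx

n = 10_000
random_net = nx.gnp_random_graph(n, p=6 / n, seed=42)   # ~6 links each
scale_free = nx.barabasi_albert_graph(n, m=3, seed=42)  # ~6 links each

for name, g in [("random", random_net), ("scale-free", scale_free)]:
    degrees = [d for _, d in g.degree()]
    print(f"{name:>10}: mean degree {sum(degrees) / n:.1f}, "
          f"max degree {max(degrees)}")

# Both average roughly six connections, but the scale-free maximum is
# an order of magnitude larger: the supernodes the average hides.
```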
The value that graphs provide versus a random network model is therefore clear in principle. I have, however, learned from my own experience that graphs are quite hard to work with at scale, especially when you throw time into the mix. Having a graph with billions of nodes is great but it often just looks like a big ball of mud. That simple process of clicking around to explore data is somewhat ruined every time you accidentally find the America node, which is so hugely interconnected that it overshadows almost everything else in most graphs it appears in.
So we know that graphs are hugely valuable in highlighting the connections and the structure of our data in a way that is often hidden when we flatten things into more traditional databases. But how can we take advantage of that? As a human clicking around the beer graph, I can gain more interesting insights just by poking the graph than I could with a big table, but past a certain point it’s easy to get lost. So how do we continue to get value from graphs at scale?
Graph algorithms, take the stage.
The algorithms
Amy embarked upon a whirlwind tour of some of the most useful general graph algorithms applied to the real world.
We heard about the Closeness Centrality algorithm and how it can be applied in a security analysis to find the parts of a system most enticing to an adversary.
We also returned to Amy’s self-confessed favourite algorithm – the Betweenness Centrality algorithm – and how it can identify critical links which might allow you to limit the damage of an intrusion with a quick cut.
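As a rough illustration of both (a toy topology of my own invention, not an example from the talk), each is a one-liner in networkx:

```python
# A toy network of my own: two clusters joined by a single bastion host.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("web1", "web2"), ("web2", "web3"), ("web1", "web3"),
    ("web2", "bastion"), ("bastion", "db1"),
    ("db1", "db2"), ("db2", "db3"), ("db1", "db3"),
])

closeness = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)

# The bastion scores highest on both: the machine most "central" to an
# adversary, and the link to cut to contain an intrusion.
print(max(closeness, key=closeness.get))      # bastion
print(max(betweenness, key=betweenness.get))  # bastion
```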
I was particularly struck by another quote here, this time from John Lambert; “Defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.”
Perhaps the most interesting example she raised was the spectre of financial contagion, the idea that troubles in one economy or business can spill into connected entities causing a cascade. This effect can be seen at an international level right down to local communities, and it’s happened many times over throughout history.
Amy showed how you can apply a series of algorithms to analyse (and hopefully prevent) contagion. Even PageRank – which, it turns out, has uses beyond uncovering the best cat memes – can help find critical nodes whose loss might have a wide-ranging impact on financial systems.
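A hedged sketch of that last idea, with invented institutions and exposures standing in for real financial data:

```python
# A hedged sketch: PageRank over a made-up graph of interbank
# exposures. Institutions and weights are invented for illustration.
import networkx as nx

exposures = nx.DiGraph()
exposures.add_weighted_edges_from([
    ("BankA", "ClearingHouse", 9.0),
    ("BankB", "ClearingHouse", 8.0),
    ("BankC", "ClearingHouse", 7.0),
    ("ClearingHouse", "BankD", 3.0),
    ("BankD", "BankE", 1.0),
])

# Nodes that many weighty edges point at rank highly: the entities
# whose failure would ripple furthest through the system.
ranks = nx.pagerank(exposures, weight="weight")
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node:>14}: {score:.3f}")
```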
For me, the most thought-provoking aspect of the talk was the discussion of how graph algorithms can help predict data which is real but missing from analyses. These algorithms, like Common Neighbours, are often used to predict future links based on structural similarity.
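Common Neighbours is simple enough to sketch in a few lines; the little social graph below is invented for illustration:

```python
# A minimal Common Neighbours sketch with networkx; the social graph
# below is invented for illustration.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("ada", "grace"), ("ada", "alan"),
                  ("grace", "alan"), ("grace", "edsger"),
                  ("alan", "edsger")])

# Score every missing edge by how many neighbours its endpoints share;
# high scores are links that may yet form, or already exist unrecorded.
candidates = [(u, v, len(list(nx.common_neighbors(G, u, v))))
              for u, v in nx.non_edges(G)]
for u, v, score in sorted(candidates, key=lambda t: -t[2]):
    print(u, v, score)  # ada edsger 2 (they share grace and alan)
```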
Assuming that things will converge, you can be ready for potential new links forming, but it’s interesting that given our information is never complete, these “predictions” might actually already have come true. The graph is almost filling in the unknown unknowns for you based on the structure of the graph itself. This ability to find and capitalise on unknown unknowns is what really draws me to graph technologies.
The key takeaway from this section for me was that a larger graph will have a lot of sub-structures. Many graph algorithms therefore capitalise on the structures of the graph to pick out the important areas of interest, identifying the parts to examine more closely or to trigger other analyses. The superpower of graphs is their ability to understand structure and the algorithms can help us humans grok that structure without needing to try and understand every connection ourselves.
Unknown unknowns and ML with graphs
This final section of the talk went into a little more depth on graph embeddings. This is a term that I’ve heard thrown around a lot recently, often as a ‘magic bullet’ that lets you do machine learning on graphs.
Amy took it back a step and explained how an embedding can flatten the graph structure into a form that traditional AI techniques can work with.
This is really interesting, as it makes a graph just another input to your ML pipeline. Looked at from this perspective, it's actually a lot more important than it first seems.
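To ground the idea, here's a hand-rolled sketch of one of the simplest embeddings, a spectral one built from the graph Laplacian. It's my own minimal example rather than the learned embeddings (like node2vec) Amy was describing, but it shows the same flattening trick:

```python
# A hand-rolled spectral embedding: my own minimal example of the
# flattening idea, not the learned embeddings Amy described.
import networkx as nx
import numpy as np

G = nx.karate_club_graph()  # a classic small social network

# Eigenvectors of the normalised Laplacian give each node coordinates
# in a low-dimensional space that reflect the graph's structure.
L = nx.normalized_laplacian_matrix(G).toarray()
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Skip the trivial first eigenvector; keep the next two as 2-D features.
embedding = eigenvectors[:, 1:3]
print(embedding.shape)  # (34, 2): one plain numeric row per node,
                        # ready for any off-the-shelf ML model
```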
Most forays into graph technologies stall at some point as companies realise they have a graph that, to use Amy’s description, is like an unwieldy extra appendage that nobody really knows how to use.
So how do you take that to production and bridge the gap between human insight and analysis at scale? Well, graph embeddings are a key part of this. They can help more companies take advantage of their graph datasets by using a machine learning process to learn that predictive graph structure.
Closing thoughts
To close the talk Amy left us with some practical advice from her considerable experience working with graphs, including strategies for dealing with those pizza nodes, as well as things that you maybe shouldn’t model in a graph.
After the talk there were some lively questions, one of which led Amy to discuss the future of graph technology, touching on graph native learning and causal AI. Graphs let us express a wonderfully diverse world of connected data but understanding the WHY is often hard.
At G-Research, understanding connections, why things happen, and how to predict them is key to our business, so recent breakthroughs in the causal field have definitely filled up my reading list, starting with The Book of Why.
The questions and discussion of developments in graph technologies and their applications continued late into the night, giving me lots to ponder and more to read.
But I’m afraid I don’t think I’ll ever be able to call those outsized entities pizza nodes with a straight face.