
Distinguished Speaker Series: Insights from Amy Hodler, Average is a Lie

15 November 2022
  • Software Engineering

Written by Dexter Lowe, Software Engineer at G-Research.

On Monday 17th November, many of us from G-Research were excited to hear from Amy Hodler – one of the leading lights in the field of graph algorithms and their role in AI – at the Royal College of Physicians.

Amy was visiting us from America, where she works with Relational AI to bridge the sometimes painful divide between graph data and useful applications.

Unsurprisingly, connections were at the forefront of Amy’s talk: they are the key concept in graph databases and critical knowledge for many systems, and the talk ended with a plea for the audience to improve our ‘betweenness centrality’.

But before we got to that, I learnt a lot, came away with a reading list as long as my arm, and took a small amount of impish delight in discovering that I can start using the term ‘pizza nodes’.

My interest in graphs

For me, as for most computer scientists I imagine, my first exposure to graphs was the Seven Bridges of Königsberg – the problem whose solution by Leonhard Euler is credited with effectively creating the field of graph theory.

At the time I found this interesting, but a few months later I was even more intrigued when I saw a speaker showing off a graph of beer!

The speaker represented the beers in a graph structure before revealing how that graph could help you find similar beers that you might like. The algorithm showed what connected two of his favourite drinks, allowing him to discover new beers that might appeal based on their common characteristics.

It was here that I really got to grips with my first graph algorithm: a simple shortest path. I was intrigued by how easy it was to consume the information and, in turn, discover other interesting facts about a subject that I hadn’t even considered before.
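Here’s a minimal sketch of that idea in Python with networkx. The beers and their characteristics are invented, but the mechanics are the same: beers link to what describes them, and the shortest path between two beers surfaces what they share.

```python
import networkx as nx

G = nx.Graph()
# Beers link to their characteristics rather than directly to each other.
G.add_edges_from([
    ("Citra IPA", "citrus hops"),
    ("Citra IPA", "high bitterness"),
    ("Tropical Pale", "citrus hops"),
    ("Tropical Pale", "low bitterness"),
    ("Oat Stout", "roasted malt"),
])

# The shortest path between two beers runs through what they have in common.
print(nx.shortest_path(G, "Citra IPA", "Tropical Pale"))
# ['Citra IPA', 'citrus hops', 'Tropical Pale']
```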

In graph form, you can explore the structure of your data far more easily than by reading the results of a SQL query: a big common node, for example, is instantly apparent.

Standard data techniques are useful for getting you what you asked for, but graphs can help you find things you never even realised could be related.

Graphs are everywhere

As you start to look into graphs, you realise almost everything is a graph in some form: the streets of a city, the complete transitive list of everything that goes into a can of Coke, the migratory flight paths of birds.

Even things we might not think of as graphs often are. Time is a good example because it can be modelled as a tree with a whole host of extra useful links (like “next_month” or “last_day”) so you never again have to wonder how many days hath September.
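As a rough sketch of the idea (the node naming here is illustrative rather than the exact model from my talk), you can build a small time tree in networkx and answer that September question with a simple count:

```python
import calendar

import networkx as nx

# A toy time tree for 2022: year -> months -> days, plus next_month links.
G = nx.DiGraph()
for m in range(1, 13):
    G.add_edge(2022, (2022, m), label="has_month")
    for d in range(1, calendar.monthrange(2022, m)[1] + 1):
        G.add_edge((2022, m), (2022, m, d), label="has_day")
for m in range(1, 12):
    G.add_edge((2022, m), (2022, m + 1), label="next_month")

# "How many days hath September" is now just counting day children.
days_in_september = [n for n in G.successors((2022, 9)) if len(n) == 3]
print(len(days_in_september))  # 30
```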

Small self-plug: you can learn more about this in my recent talk at Devoxx.

Average is a lie

Graphs are everywhere and so are averages, but it’s the latter that we grow up intimately aware of.

In almost any domain there will be outliers, whether that’s the tall colleague who has to stoop to get through the door or that one road that is so much busier than any others near it.

And while the average price of bread can let you know if something is overpriced, if you build a door for the average person, you’ll find that a lot of people hit their head. So how does this relate to graphs?

We, as data consumers, like averages. We like to discard all the outliers and just look at the average.

But in a graph, it’s not quite that straightforward because structure and connections are key.

Conceptually, an average node in a graph would be one with an average number of connections, with a quick drop-off either side. This describes what is known as a random network.

But a quote Amy shared, from Albert-László Barabási, threw the random network model into doubt.

“No network in nature that we know of would be described by the random network model”.

So by assuming our data will fit the random network model (or, worse, coercing it to), we are not reflecting reality, and that will likely impair our ability to make effective decisions from our data.

Instead, most graphs follow a rough power-law distribution: the majority of nodes have very few connections, while a long tail ends with a very small number of nodes that have a vastly outsized share of them.
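You can see this difference with a quick networkx experiment, comparing an Erdős–Rényi random network with a Barabási–Albert preferential-attachment network of the same size (the sizes and parameters are arbitrary):

```python
import networkx as nx

n = 10_000
random_net = nx.gnp_random_graph(n, p=6 / n, seed=42)   # random network
scale_free = nx.barabasi_albert_graph(n, m=3, seed=42)  # power-law-ish

for name, g in [("random", random_net), ("scale-free", scale_free)]:
    degrees = [d for _, d in g.degree()]
    print(f"{name}: mean degree {sum(degrees) / n:.1f}, max degree {max(degrees)}")

# Typically the random network's max degree is only a few times its mean,
# while the scale-free network's max is tens of times the mean: the supernodes.
```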

Anyone who has worked with graph databases before has almost certainly come across a supernode (or ‘pizza node’, according to Amy).

Their presence, like that of the many nodes with few connections, was something I was aware of, but this was the first time it had really been driven home for me. The drastic contrast of the bell curve superimposed over the power-law graph really showed how different the two mental models are.

The value of treating your data as a graph, rather than assuming a random network, is therefore clear in principle. I have, however, learned from my own experience that graphs are quite hard to work with at scale, especially when you throw time into the mix. Having a graph with billions of nodes is great, but it often just looks like a big ball of mud. That simple process of clicking around to explore data is somewhat ruined every time you accidentally find the America node, which is so hugely interconnected that it overshadows almost everything else in most graphs it appears in.

So we know that graphs are hugely valuable in highlighting the connections and the structure of our data in a way that is often hidden when we flatten things into more traditional databases. But how can we take advantage of that? As a human clicking around the beer graph, I can gain more interesting insights just by poking the graph than I could with a big table, but past a certain point it’s easy to get lost. So how do we continue to get value from graphs at scale?

Graph algorithms, take the stage.

The algorithms

Amy embarked upon a whirlwind tour of some of the most useful general graph algorithms applied to the real world.

We heard about the Closeness Centrality algorithm and how it can be applied in a security analysis to find the parts of a system most enticing to an adversary.

We also returned to Amy’s self-confessed favourite algorithm – the Betweenness Centrality algorithm – and how it can identify critical links which might allow you to limit the damage of an intrusion with a quick cut.
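A tiny sketch of both centralities, on an invented graph of two clusters joined by a single bridge node. The bridge dominates the betweenness ranking, and removing it is exactly that ‘quick cut’:

```python
import networkx as nx

# Two clusters joined only through "D": the bridge an attacker must cross.
G = nx.Graph([
    ("A", "B"), ("A", "C"), ("B", "C"),   # cluster one
    ("C", "D"), ("D", "E"),               # D is the sole bridge
    ("E", "F"), ("E", "G"), ("F", "G"),   # cluster two
])

betweenness = nx.betweenness_centrality(G)
print(max(betweenness, key=betweenness.get))  # 'D': every cross-cluster
                                              # shortest path runs through it

closeness = nx.closeness_centrality(G)
print(max(closeness, key=closeness.get))      # also 'D' in this tiny graph:
                                              # it reaches everything fastest

# The "quick cut": removing the bridge splits the network in two.
G.remove_node("D")
print(nx.number_connected_components(G))      # 2
```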

I was particularly struck by another quote here, this time from John Lambert: “Defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.”

Perhaps the most interesting example she raised was the spectre of financial contagion: the idea that troubles in one economy or business can spill into connected entities, causing a cascade. This effect can be seen everywhere from the international level right down to local communities, and it has happened many times throughout history.

Amy showed how you can apply a series of algorithms to analyse (and hopefully prevent) contagion. Even PageRank – which, it turns out, has uses other than uncovering the best cat memes – can help find critical nodes, the loss of which might have wide-ranging impact on financial systems.
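As a hedged illustration of the idea (not Amy’s actual example), here’s PageRank over a toy network of financial dependencies; the upstream nodes that everything ultimately points at rise to the top:

```python
import networkx as nx

# Toy financial network: edges point at the entity being depended upon.
G = nx.DiGraph([
    ("bank_A", "clearing_house"),
    ("bank_B", "clearing_house"),
    ("bank_C", "clearing_house"),
    ("bank_A", "bank_B"),
    ("clearing_house", "central_bank"),
])

ranks = nx.pagerank(G, alpha=0.85)
print(sorted(ranks, key=ranks.get, reverse=True)[:2])
# ['central_bank', 'clearing_house'] -- the nodes whose failure ripples widest
```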

For me, the most thought-provoking aspect of the talk was the discussion of how graph algorithms can help predict data which is real but missing from analyses. These algorithms, like Common Neighbours, are often used to predict future links based on structural similarity.

Assuming that things will converge, you can be ready for potential new links forming. But it’s interesting that, since our information is never complete, these “predictions” might actually already have come true: the graph is almost filling in the unknown unknowns for you based on its own structure. This ability to find and capitalise on unknown unknowns is what really draws me to graph technologies.
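Here’s a minimal Common Neighbours sketch on an invented fund-and-supplier graph; a high count for an unconnected pair hints at a link the data simply doesn’t show yet:

```python
import networkx as nx

# Two funds that share every supplier but have no recorded link between them.
G = nx.Graph([
    ("fund_A", "supplier_1"), ("fund_A", "supplier_2"), ("fund_A", "supplier_3"),
    ("fund_B", "supplier_1"), ("fund_B", "supplier_2"), ("fund_B", "supplier_3"),
    ("fund_C", "supplier_3"),
])

shared = sorted(nx.common_neighbors(G, "fund_A", "fund_B"))
print(len(shared), shared)
# 3 ['supplier_1', 'supplier_2', 'supplier_3'] -- a strong hint that a real
# relationship exists (or will form) even though the data doesn't record one
```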

The key takeaway from this section for me was that a larger graph will have a lot of sub-structures. Many graph algorithms therefore capitalise on the structures of the graph to pick out the important areas of interest, identifying the parts to examine more closely or to trigger other analyses. The superpower of graphs is their ability to understand structure and the algorithms can help us humans grok that structure without needing to try and understand every connection ourselves.

Unknown unknowns and ML with graphs

This final section of the talk went into a little more depth on graph embeddings. This is a term that I’ve heard thrown around a lot recently, often as a ‘magic bullet’ that lets you do machine learning on graphs.

Amy took it back a step and explained how an embedding can flatten the graph structure into a form that traditional AI techniques can work with.

This is really interesting, as it makes a graph just another input to your ML pipeline. Looked at from this perspective, it’s actually a lot more important than it first seems.

Most forays into graph technologies stall at some point as companies realise they have a graph that, to use Amy’s description, is like an unwieldy extra appendage that nobody really knows how to use.

So how do you take that to production and bridge the gap between human insight and analysis at scale? Well, graph embeddings are a key part of this. They can help more companies take advantage of their graph datasets by using a machine learning process to learn that predictive graph structure.
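To make that concrete, here’s one simple flavour of graph embedding: a matrix-factorisation sketch using an SVD of the adjacency matrix, rather than the random-walk methods Amy covered. The dataset and dimensionality are arbitrary:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()   # a classic small social network
A = nx.to_numpy_array(G)     # adjacency matrix, shape (34, 34)

# Truncated SVD of the adjacency matrix: each node becomes a d-dimensional
# vector whose geometry reflects that node's structural role in the graph.
d = 8
U, S, _ = np.linalg.svd(A)
embeddings = U[:, :d] * np.sqrt(S[:d])

print(embeddings.shape)      # (34, 8): one flat feature vector per node,
                             # ready to feed into any traditional ML pipeline
```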

Closing thoughts

To close the talk Amy left us with some practical advice from her considerable experience working with graphs, including strategies for dealing with those pizza nodes, as well as things that you maybe shouldn’t model in a graph.

After the talk there were some lively questions, one of which led Amy to discuss the future of graph technology, touching on graph native learning and causal AI. Graphs let us express a wonderfully diverse world of connected data but understanding the WHY is often hard.

At G-Research, understanding connections, why things happen, and how to predict them is key to our business, so recent breakthroughs in the causal field have definitely filled up my reading list, starting with The Book of Why.

The questions and discussion of developments in graph technologies and their applications continued late into the night, giving me lots to ponder and more to read.

But I’m afraid I don’t think I’ll ever be able to call those outsized entities pizza nodes with a straight face.


Watch Amy's talk

Have you ever heard the phrase, “the smartest person in the room is the room”? I want to thank everybody for joining us today, because I feel like we have a really smart room. I also want to thank G-Research, our host, for bringing us together in such a wonderful location and giving us some time to share a few stories and to connect.

Connections are something I’ve been interested in and passionate about for a long time, because for me they really are about understanding and analysing our bigger world, and that’s what graph analytics is about. I’ve been fascinated with graph analytics for years because it helps us see obscure structures and reveals hidden meanings we otherwise couldn’t see. And I believe it’s critical, especially right now as our world becomes more complicated and more interdependent, that we look at the world and our complex problems as a network, through a lens where we can see how things interact.

In case you’re not familiar with graphs: a long, long time ago, in a galaxy not too far away, somebody named Euler created graph theory to solve a network problem. We see these networks all over the place, from communications to economics to social networks. That social picture is actually Eurovision voting from a couple of years ago. Circles or dots are what we usually use to represent nodes, which are the entities, and how the entities are related are called relationships, usually represented as lines of some sort. If you think about nodes as nouns and relationships as verbs, there are a lot of very interesting and meaningful sentences we can construct, and a lot of non-trivial analysis we can do.

One of the research areas I find really fascinating is voting: how we can predict whether people will vote, and help promote voting. In some of the research I’ve seen, where they try to predict whether somebody is going to vote from all sorts of factors, they found that somebody’s address, where they went to school and their socioeconomic status were not as predictive as their network. So here we have our nodes and our relationships again. If the orange dot is an individual and I’m trying to predict whether they’re going to vote, it’s actually more important who their friends are. Think of the yellow dots as their friends: they know three people in their network, and it’s more important to know whether the people in their network are going to vote, in particular the friends of their friends, and even the friends of their friends of their friends. More than demographic information, the network around somebody, even people they don’t know, can be more predictive of whether I, or you, will vote.

Now, this is fascinating in itself, but in doing some research and preparing to talk to you, I tried to figure out whether there’s anything in the UK that’s more predictive, anything local.
And I was really quite shocked to find that there’s something about dogs. This hasn’t happened in the US, so I hadn’t seen it anywhere else, and I probably spent about two hours on the internet, I don’t know if that was a good use of time or not, looking at pictures of dogs at polling stations. I thought, well, this is really interesting, and then I started to see it spread, even to the US, and even cats are getting in on it as well.

That started me thinking about spreading: how can influence spread? Doing a little research on that, there actually is research out there about voting and the spread of influence. It shows that if your network, friends and connections again, isn’t very strongly connected, cascading influence doesn’t happen. That’s probably not that surprising: there’s no organisation, no way to reinforce behaviour. But what really surprised me is that when connectivity is very high, you also don’t get a cascading spread of influence. The thinking is that groups with super-high connectivity are insular: they’re less likely to have connections outside the group and less open to new ideas, so there’s no change to spread. What I found really astounding is that there’s this weird zone, I’ll call it the Goldilocks zone, at about 50% transitivity, where roughly half of your friends know each other, where influence is most likely to cascade both within the network and externally. Some of these insights are recorded in the book Connected, by James Fowler, but there’s a lot of research out there, and what it finds in general is that the relationships and structures within a network are extremely predictive, in fact the most predictive single elements of behaviour, of change and of how networks evolve.

So we know relationships are really strong signals. But we can’t use them if we can’t see them, if we’re not looking for them, and a lot of data science ignores them, either because we don’t know to look for them or because it’s just too difficult to extract them from legacy systems.

If we look at how people normally approach network analysis, they come with assumptions about networks. With the number of relationships on the x-axis and the number of nodes on the y-axis, you get a quasi bell curve: we assume the average node has an average number of relationships. Maybe I have some friends with fewer and some with more, but in general we can average it out. This average distribution is called a random network, and the interesting thing about random networks is that they don’t really exist in nature. When I look at a financial network, when I look at traffic, I don’t see this kind of average. What you do see is shape. This is a power-law distribution, where most nodes have very few connections and a few have a great many. The web is like this: a few webpages have a lot of connections, but most have very few. And you see a lot of quasi power laws as well, not just strict power laws.
The point I want to reinforce is that “average” is really just a concept. There are a couple of really great books here: The End of Average has some great stories, and so does Fooled by Randomness. How is there such a thing as an average person, half female, half male, so tall, so big? Those things are just concepts, and if that’s what we’re looking at, we’re missing the bigger meaning; we’re missing how reality is normally formed.

With graphs, we can see what’s real, what the structure is really about. They can help us uncover structure we couldn’t otherwise see and find entities that might be big control points or have a lot of influence in whatever we’re analysing. And this is what we see out in the real world. This image is from Pulsar, from a story about how things spread virally in a social network, and what it shows is that reality is uneven: we’ve got big hubs, and we’ve got small communities forming as well. If we take an average approach, it kind of lies to us; it hides all of this under the covers. And that matters, because the shape of our data has meaning.

We talked a little about random networks, which have a very flat, almost linear kind of distribution. Small-world networks are seen a lot in social networks, which is why I used one for the voting example: we have connectivity to those near us, businesses are connected to businesses near them, but we’re never too far from the other side of the network. And in a scale-free network you see a hub-and-spoke structure repeated over multiple scales. Imagine what that could mean if you’re trying to distribute something through a network like that: certain control points pop out immediately. So it’s important to be able to see these shapes so that we can react to them.

Graph analytics does see these shapes. It can help us understand them and infer meaning from them: is this pattern important in my network, or is it just noise? Can I predict what’s going to happen next? Network evolution is highly dependent on structure, so by understanding the structure you can better understand how the network might evolve, and you can find those control points, so if we need to make a change or push something in a certain direction, graph analytics can uncover the important points. It’s really about asking better questions.

When we talk about graph analytics, we’re usually talking about graph queries or graph algorithms. Queries are fairly simple statements where we know exactly what we’re looking for: “how many people two hops out from Amy voted?” could be a query. It’s something we can write ourselves and read as humans.
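As a toy version of that two-hop question (the friendship graph and voting data are invented):

```python
import networkx as nx

G = nx.Graph([("Amy", "Ben"), ("Ben", "Cara"), ("Cara", "Dan"), ("Amy", "Eve")])
voted = {"Ben": True, "Cara": True, "Dan": False, "Eve": False}

# Everyone within two hops of Amy, then count how many of them voted.
nearby = nx.single_source_shortest_path_length(G, "Amy", cutoff=2)
print(sum(voted.get(n, False) for n in nearby if n != "Amy"))  # 2
```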
With graph algorithms, as opposed to a local question, you’re asking a larger question, maybe doing a global analysis, trying to understand your network better and then make predictions. These are often pre-coded, pre-formulated algorithms, but the output is still human-readable, or can be consumed by a machine learning system. And then we have graph embeddings, where we try to represent the topology of our graph faithfully, but in a flattened form, usually for machine learning purposes. Those are machine-learned and machine-read: they’re not something a human can look at and understand directly.

There are many graph algorithms, hundreds of them, and they’re very powerful, but we commonly see a few categories. Pathfinding: how do I find the best route from A to F? Community detection: how are things grouped together by relationships, and can I classify by those groupings? Centrality, which is all about importance: which are the important points in my network? Flow: how can things move through my network, and what’s its capacity? Think of scheduling or traffic. Then there are similarity algorithms, which are set comparisons: which are the most similar nodes in my network? And link prediction: is a link going to form in the future, or is there one I can’t see between two nodes, a hidden connection? And, of course, the embeddings and machine learning we just talked about.

So I really feel that graph analytics is a superpower for connected data, and as our data becomes more and more connected, a superpower for modern data in general. We see graph analytics used in many use cases; I’ve put up just a few here. What I find really fascinating is that these use cases look different on the surface: predicting churn probably looks really different from evaluating your portfolio. But under the hood the data structures can be very similar, which means we can use the same algorithms on different problems and get some very interesting results. I’m going to talk about just a couple of the use cases and algorithms to give you a taste of what’s possible, but do remember they span all sorts of use cases: if I talk about one in one area, it doesn’t mean you can’t use it in another.

The first one I think is interesting is supply chain. It’s a more intuitive introduction to graphs, because we can all think about how things are connected, and I’d say in this last year we’ve probably all learned more about supply chains than in the previous ten. We’ve felt how unusual, obscure connections have caused weird ripple effects, how something that was true one day was suddenly false the next. Things changed very quickly for us all, and I think right now people are trying to remake their supply chains, both out of necessity and because we’ve seen how brittle they are and where the issues lie. So a good introduction is, very simply: what’s the best route from A to C? Should I go directly there, or should I go through B?
Well, if you look at the numbers here, and I’ve decided that the weights on the relationships matter, then no: the most efficient way is to go from A to D to C. You can use weights to evaluate all sorts of things, maybe cost, maybe time. I actually found an example where they give the top five best paths by distance plus risk, which I thought was fascinating for a supply chain, to be thinking about the risk of transport, but it makes sense. And once you’re doing that, you can throw anything into the weights; I was thinking it would be interesting to see compliance scores on routes as well, especially for international shipping.

So let’s say we have our top five optimal paths and we’re happy with that. The next thing we’re probably interested in, in a supply-chain scenario, is: where are my vulnerabilities, where are the breaking points? We want to ask whether a particular vendor or supplier in the network has outsized influence: if they change their packaging, or go out of business, does everything break? Betweenness centrality, which happens to be my favourite algorithm, looks at these vulnerabilities by measuring the percentage of shortest paths that go through a node. It’s really quite good at finding control points, bottlenecks and choke points, and you can imagine that if some of these critical bridges were to go away, you might actually break your graph and have supply-chain issues.

So let’s say we’ve done that with the big purple nodes here: we’ve found our most critical points, things are going well, but we’ve decided these are potential splits in our supply chain, so we want to find some alternatives. Similarity algorithms compare nodes based on some kind of set of values. I took this example from a McKinsey report where a medtech company was having trouble sourcing medical speaker equipment during the coronavirus period, and what they looked for was vendors with similar capabilities that they didn’t normally think of that way. Anything of the same colour has a higher similarity score. So for the speaker parts they were having trouble sourcing, say that’s the purplish blue, they now have other purplish-blue vendors they can go to as alternative sources, either to deal with an issue today or to be prepared for something that might come up tomorrow.
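To give a flavour of that kind of similarity comparison (the McKinsey example doesn’t specify the algorithm, so this sketch uses Jaccard similarity over an invented vendor-capability graph):

```python
import networkx as nx

# Vendors linked to the capabilities they offer.
G = nx.Graph([
    ("vendor_A", "speaker parts"), ("vendor_A", "plastics"),
    ("vendor_B", "speaker parts"), ("vendor_B", "plastics"), ("vendor_B", "cabling"),
    ("vendor_C", "cabling"),
])

# Jaccard similarity: shared capabilities / combined capabilities.
pairs = [("vendor_A", "vendor_B"), ("vendor_A", "vendor_C")]
for u, v, score in nx.jaccard_coefficient(G, pairs):
    print(u, v, round(score, 2))
# vendor_A vendor_B 0.67 -- a plausible alternative source
# vendor_A vendor_C 0.0
```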
Another area that reminds me a lot of supply chain is data, and whether we can trust it. I was really surprised, when I was doing a little research, to see the quote that 75% of executives aren’t confident in their data. I don’t know how you make good business decisions if you’re not comfortable with your data. How do you know which policies to support? How do you know you can trust your AI if you don’t trust the data underneath it? That’s where data lineage starts to come in.

If we don’t know where our data came from, how it was processed, and what may have been changed or cleansed along the way, how do we trust the data we have? Data lineage is all about where this data came from, what it is, how it was manipulated, and by whom. The diagram in the background is taken from a real customer’s cloud data warehouse, showing movement, because data lineage is often about the flow of data: from original sources and tables, through the middle where it becomes views and changes to views, to the actual end reports in red at the far end. You can see it’s quite hard to interpret, but you can make out some patterns, some strands, some things moving in different directions. And this is just one little area of data for this company. Can you imagine trying to do that for all of your data? And what if you wanted to understand not just how it flows, but who actually made a change, or the versioning of those changes?

I took the graph image at the bottom from Octopai, because I thought it was a really nice representation: they represent the process points themselves as nodes, going through ETL to cleansing to the transformations you would have to do, with the green nodes at the end being the end reports. It’s a very graph-like process representation of the same thing shown in the background, which is just the data flowing itself. I think it’s actually a little easier to see what’s going on, because you get a feeling for the flow.

Looking at your data this way is foundational for good master data management, and for compliance. Say you’re under regulatory compliance and have to report data at a certain point: that can be hard to see in the raw flow, but if you can map it to a graph structure, you can do versioning, tracking and reporting. It’s also really good for data ops. Think about all the data we feed from different sources into intelligent data apps: how do we know that revenue reported in one place is the same number as revenue reported in another? I had an interesting lunch today where we talked about this very problem: how do you know that the fact I have is the actual, true fact? What it means depends on how you counted it. Using data lineage principles, you can start to manage your data better and increase the efficiency of your data ops. And of course it’s also important for trust, whether you’re talking about trusting the data itself, or about responsible AI and trusting your machine learning.

An interesting side note: I’m working with a customer that’s using a graph as a machine learning feature store. They treat the features themselves as data elements, and then they can do data lineage on those as well: where did this feature come from? Who changed it? Is this feature too similar to another feature?
Are we double counting features within our machine learning? And how do you automate all of that so people can grab the right features quicker?

Data management and data processing are another area graph analytics can assist with. The first example here is label propagation, an algorithm where each node looks around and grabs the labels of its neighbours, in a kind of iterative voting. It’s really good for fast, initial community building. If you have a lot of data coming in from different sources that isn’t well structured or well labelled, but you do have relationships, especially weighted relationships, label propagation works really fast and gives you a good first cut of communities or classifications.

Another community detection algorithm is weakly connected components. This is a fun algorithm; you see it used for data processing in all sorts of places. What’s interesting is that it will show you, like the red circle here, these weird floating islands of data: nodes connected to each other but to nobody else. That’s an unusual thing. If we were talking about fraud, I would say that’s likely fraud, especially money laundering. From a data-processing standpoint, it says this group is disjoint and its members belong together. Maybe they’re ambiguous pseudo-duplicates, actually the same data point, and we can collapse them. Maybe they’re synonyms of each other, if we’re thinking about NLP. Maybe they’re self-referential because they’re really part of a larger hierarchy or group we ought to know about. Again, this is an algorithm that runs really fast, scales nearly linearly, and can help us cleanse our data and move on.

And finally in data, because we have so much personal data, we’re always thinking about privacy and where our exposure may be. I pulled this example from somebody who is using graphs to track sensitive information. The little blue box in the middle is a customer name. They ran a single-source shortest path to all the reports where the customer name is used, and then another shortest path to all the access points to those reports, the different portals and analyst groups. With just this small graph, they can really quickly say: OK, maybe I’m comfortable with this exposure, or, if I’m not, I know exactly where to cut, one of these red boxes, or both, if the exposure isn’t appropriate.
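As a rough illustration of those two data-processing algorithms on a toy graph (note that label propagation is randomised, so its communities can vary between runs):

```python
import networkx as nx
from networkx.algorithms import community

G = nx.Graph([
    ("a", "b"), ("b", "c"), ("a", "c"),   # one tight cluster
    ("d", "e"), ("e", "f"), ("d", "f"),   # another tight cluster
    ("c", "d"),                           # a single weak bridge
    ("x", "y"),                           # a floating island
])

# Fast first-pass communities (randomised, so results can vary run to run).
communities = community.label_propagation_communities(G)
print([sorted(c) for c in communities])

# Connected components surface the island: candidate duplicates or fraud.
islands = [c for c in nx.connected_components(G) if len(c) < 3]
print(islands)  # [{'x', 'y'}]
```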
Another area that I think is on a lot of our minds right now is financial contagion. And I have to say, only 53% of Europeans are worried about inflation; I would have thought that was higher. But anyhow, a lot of us are worried about contagion, and the interesting thing, when I looked at research into graphs and contagion, is that sometime after 2008, after the great recession and the austerity that followed, people started to shift what they used graphs for in a contagion sense. It used to be that we thought about financial contagion as global, something that would cross borders. After 2008, people started to think about it more locally: in my state, in my country, within a particular sector of the economy I care about. And I find that really fascinating: questions like “am I overinvested in the transport industry?”, understood from a contagion standpoint, asking whether there are things that will ripple through just the area I care about.

If you’re thinking that way, one of the first things you want to know is: are there connections in my data I can’t see? Are there connections among my suppliers I can’t see? The algorithm common neighbours, a link prediction algorithm, looks at all the neighbours a pair of nodes has in common and estimates that if you have a lot of neighbours in common, you probably have a connection. The way I like to think about it is social: if you and I have a lot of friends in common, we’ll probably be introduced at some point, or we already know each other and it’s just not in the data. Nobody has perfect data, so that definitely happens. In a financial situation, perhaps you’re working with or invested in multiple businesses, say in transport again, and you wonder whether there are connections between some of those vendors that imply a relationship you should be concerned about, maybe too tight a relationship. I’ll also say that with link prediction you’re thinking not just about future links, which are very important, but also about links that exist now that you can’t see.

The other thing we want to think about in a financial sense is broadest impact: which node, supplier, business or regulation has the broadest impact on my business? There’s probably no better-known algorithm for this than PageRank. PageRank infers impact, or influence, based on who you’re connected to, who they’re connected to, and who they’re connected to in turn. The way to think about it: if I’m an individual contributor at a large company and I have a lot of friends who are individual contributors, I have influence over those friends; we share time and communications. But say you now find out that I’m the CEO’s neighbour, and she and I go golfing at the weekend: you’ll think, OK, Amy has a little more influence here. That’s the kind of influence PageRank uncovers, things that are obscure on the surface but, because of relationships, have a larger impact. And in a financial system, think not just about suppliers but about regulations: a lot of regulations point to other regulations, to guidelines, to yet more regulations.
Regulation can be very nested and very recursive, so which regulation change is going to have the biggest impact on you? That’s something PageRank can help you understand: within your network, certain regulations or guidelines have a larger impact. We also see PageRank used a lot in machine learning in general, across different industries, which I think is another reason we come across it so often.

So finally, let’s think about security. I love the quote: “defenders think in lists, attackers think in graphs”. That’s John Lambert, from about 2015, so quite some time ago, and I think he does a really great job of explaining the thinking behind it. IT management often thinks about assets in classes: user systems, server and infrastructure systems, routers, customer systems, back-office systems, and we manage them by classification. But in reality, many different teams touch these systems, and they form a lattice over them: your help desk, your backup team, your infrastructure team, your security team. It’s the intersection points in that lattice that attackers look at, and they form the basis of an attack surface. Attackers pivot around the lattice looking for a little inroad; then the first thing they do is create relationships, then they get another inroad, and they create more relationships, until they’re embedded in influential relationships in the network. If you can use a graph to understand and see that, you can start to ask: what do I need to protect? Where do I need to act first? Which system is important but actually has no influence, so I’m OK letting it go?

For thinking about where to react first, the algorithm closeness centrality shows which node has the fastest path to reach every other node; it looks at the shortest paths to all other nodes. I’ve brought in betweenness centrality here as well, because people sometimes confuse the two: betweenness centrality is about bridges, closeness centrality is about fastest reach. So if I’m worried about my entire system and there’s an infiltration, I might react first at the node with the highest closeness centrality, because I want to keep it from reaching all the other points. But say I have something really critical over here that I want to protect above everything else: with betweenness I know where to cut. Get rid of that bridge and it breaks your graph in half, which might be a bad thing in a supply chain but a good thing in a compromised network. Another reason to look at closeness centrality is that you can also estimate speed of traversal: there are a couple of examples where, by weighting the relationships, people were able to estimate the speed of interaction and the speed of dissemination of information.
The two examples I know of were a telco IT network and, which I think is fascinating, a criminal network: thinking about the spread of information through a criminal network.

The other thing we need to think about in security situations is false positives and false negatives. A lot of false positives can be really expensive to deal with: they’re noise your security team has to go and investigate, and while they’re doing that, other things might happen. But if you simply deny service to anything that looks like a positive, you’ll make your customers unhappy. So it’s really a balancing act, and community detection can be a great start for clustering, say, suspicious behaviour, but you also need a way to balance it. That’s what I like modularity-based community detection for. Louvain modularity is probably the best known. It groups not just by any kind of relationship, but by maximising an overall value that you choose. Your overall value might be: in a banking system, a lot of transactions and high clustering bank-to-bank is totally normal, don’t worry about it, but the same connectivity between individuals and banks might not be normal, and you might want to flag it as an issue. So modularity gives you a really nice way to start tuning that. You can tune against an average, especially if you’re just getting started, or against some kind of random variable.

What modularity algorithms do is increase the clustering over time. If you look at the greenish cluster at the bottom, in the next iteration it would absorb the smaller group, and it continues until it reaches the maximum you’ve set. Maybe in this example it shouldn’t actually have grabbed those yellow dots, but because modularity algorithms keep their iterations, you can tune for sensitivity too: first you tune for clustering strength, then for sensitivity, and a lot of people don’t take the final iteration but some iteration in the middle, precisely as a way to tune that sensitivity. The other thing is that Louvain modularity, while really well known and well liked, tends to be very greedy, so it will absorb and dominate small groups. There’s a newer algorithm called Leiden that is supposed to be a little kinder to smaller groups, so again, another opportunity to tune what you’re doing.
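A minimal Louvain sketch, here on the classic karate club network; louvain_communities needs a reasonably recent networkx, and the Leiden refinement lives in separate packages:

```python
import networkx as nx
from networkx.algorithms import community

# Louvain arrived in networkx around version 2.8.
G = nx.karate_club_graph()
communities = community.louvain_communities(G, seed=7)

print(len(communities))                     # typically 4 groups here
print(sorted(len(c) for c in communities))  # their sizes
```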
So what do you do when you don’t know what you’re looking for? Say you’re trying to make predictions and you want as much predictive lift as possible, but you’re not sure what’s predictive anymore; you’ve run out of ideas. This is where graph embeddings come in really well. The way I intuitively think of graph embeddings is as taking all of that structure, smashing it down, and creating a vector that can be used in machine learning, either directly, or as output for some statistical analysis.

The first thing I usually get asked is: why did I have to start with a graph? I went from a flat format, you told me I needed a graph, and now you’re flattening it again. Why not just stay flat? Well, think about one of the security examples. If I wanted to know, two hops out from an access point, how many infiltrations there were in the last day, we could probably do that by hand. How many instances of flagged bad behaviour might not be easy, but I could probably still hand-calculate it. But what if I want to know, three hops out, how many bad actors or flagged behaviours there were over the last two weeks, and, by the way, I also want the attributes of each system that became vulnerable, and the path that was taken before each vulnerability event? That just becomes intractable; you can’t do it in normal systems. What graphs give us is a very scalable way to represent very complex data and its structure faithfully, and embeddings let us push that to existing machine learning systems, where a lot of work has already been done, with control over our dimensionality.

Now, there are a couple of different types of graph embedding. Probably the best known right now is node2vec, which is a random-walk method, and this is how they actually work, as opposed to my “smash it down” picture. A random walk does just that: it randomly walks around your graph picking up the things you told it you wanted. Maybe you said: I want to know about nodes, or about nodes and relationships, or about nodes plus their attributes. You tell it what to pick up, it does a random sampling, and it encodes that into the flatter format. Another embedding approach is random projections, which is a little different from sampling: it’s a random projection of the data into a space that then gives us the embeddings. The interesting thing about random projections is that they’re really fast compared to random walks, but you can’t read them; you can’t tell what’s going on, so they’re very hard to work with if you’re not starting from a known approach. What I find interesting is that we’ve seen customers use the two together: use a random walk, node2vec, to understand and tune what you’re looking for, then use a random projection that matches it in production, because it’s so much faster.

Now, the third way, which I haven’t seen used a lot yet, and I would love to talk to anybody who might be using it, is a learned function, and in particular I’m thinking of the GraphSAGE algorithm. It actually learns a function to represent your data. That’s like the difference between trying to calculate PageRank for my dataset by hand and just running a formula.
And the other exciting thing is that since we’re learning a function, if we add new data we can classify it more easily, because we’ve learned the structure. The thing I’m really hoping somebody out there has tested is this: my belief is that because you’re learning a function, you’re learning a representation of the structure, and that’s probably less likely to change as quickly as the data points themselves. Think about data drift: the real world changes and your machine learning predictions become very poor. But if I’m learning structure, and structure doesn’t change as quickly, then potentially your model drift is much less significant. I don’t know that; it’s just a postulate. If anybody wants to test it, I would love to hear about the results.

So finally, some practical advice about working with graphs. First off, configuration matters. Algorithms are usually created with a purpose, so they have default settings and assumptions about the structure of the data itself. For example, PageRank, the most famous graph algorithm, assumes a power-law distribution; if you’re using it on something that isn’t power-law, you probably want to tune the damping factor. There’s no out-of-the-box freebie that works for everything. The other thing, especially with the centrality algorithms, is terminology assumptions: what the algorithm’s creators assumed was “important” in their domain may not match your definition of important, so understand what an algorithm was created for and whether it’s actually useful for you.

Next, tuning for consistency. A lot of graph algorithms originally came from the academic space, where they ran on very clean, small datasets. Our data is very big and very messy, as most of you probably know, so you can see unusual things, like getting a different result every time you run an algorithm, and that can keep you from putting it into production. But there are things you can do, like seeding, so that you get more consistent results, and enforcing particular ways of breaking ties.

You also have to worry about skewing your machine learning results. Graph algorithms and analytics are really great for adding features to machine learning, but just be aware that you will always have class imbalance, always, because there will always be more relationships that don’t exist than relationships that do. Graphs are sparse: our world is highly connected, but relative to all possible links, the actual links are sparse. So you just need to make sure you balance that, by adding or removing data, or both.

The other thing to be careful of is data leakage. Graphs are all about relationships; they’re very connected. So if you just arbitrarily break your graph and use part of it for training and part for testing, some relationship information from one part has potentially already leaked into the other. If you’re doing this on your own, by hand, and you’re not sure how to partition your data for the test/train split, use time: do a date-based split for your train and test sets. That’s just one of the tricks you can use.
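A bare-bones sketch of that date-based split, with invented edge timestamps:

```python
# Invented edge list: (source, target, ISO timestamp).
edges = [
    ("a", "b", "2022-01-10"), ("b", "c", "2022-03-02"),
    ("a", "c", "2022-06-15"), ("c", "d", "2022-09-30"),
    ("b", "d", "2022-11-01"),
]

# Train on everything before the cutoff, test on everything after, so no
# structural information from the future leaks into training.
cutoff = "2022-07-01"
train = [(u, v) for u, v, ts in edges if ts < cutoff]
test = [(u, v) for u, v, ts in edges if ts >= cutoff]
print(len(train), len(test))  # 3 2
```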
And that kind of brings us to time. I’ve talked to several people, today and over the last few weeks, about time and other difficult data types. In graphs, time can be difficult to model. You don’t want to just throw everything in; think about what queries you’re going to run, what kinds of algorithms and questions, and then model your time appropriately. There’s a really great talk on this by Dexter Lowe, who’s down here in the front, that you can look up: search for Devoxx and time travelling with graphs for a couple of different approaches to handling time in a graph. There are tricks you can use here too, such as only putting timestamps on things that actually occur: don’t try to timestamp things for which there was no event. Geospatial is another area that can sometimes be a little tricky, but again, that’s just about thinking a little ahead to what you need.

The other thing we hear a lot about is scale and performance. If you remember nothing else from this talk, take this away if you’re working with graphs: do less work. You do not need to graph everything in your dataset and run every algorithm on it. Doing less work is, in general, always a good idea. People ask: how big is big enough? Can I scale it to all my data? This is often a perception issue, because there’s a habit in big data analytics of throwing everything into the new system, hitting a button, and expecting good stuff to come out. Graphs just aren’t built that way, because graph analytics in particular is computationally very intensive, and you probably don’t need to do that anyway: you’re probably really only looking at a subset of the information. You also want to think about using the metadata, because if you’re really trying to understand the relationships, you often don’t need as much data; the metadata can be enough.

There are also some tricks and specialised platforms you can use. I was talking to a few people earlier about partitioning: it can be difficult to partition graph workloads arbitrarily or generically, but you can partition them and scale out if you have data that’s highly correlated and naturally grouped together. Say you’re looking at international data and you put all the French data in one partition, because it’s all highly related; that’s one approach. You just have to be careful, because if you later look at the information another way, that may change how you want to cluster it. No free lunch there, but definitely look at your modelling and how the data is actually going to be used. There’s no one generalised approach.
The other thing, with graph analytics being computationally expensive, is the “do less work” point again. If you don’t need to analyse the whole graph, work on a subgraph. Do a projection: if you only need to look at certain manufacturers, or certain countries of data, project that out and work on it. The same kind of idea applies to using aggregation and voting on results. And then there are algorithm choices: a lot of algorithms in their original academic form are extremely greedy and just won’t run well on big data, but most of them have heuristics that let you cut corners and get around the more purist approach.

And finally, real data is really lumpy. That’s partly why we were seeing those power-law distributions. You can get things like super nodes; that’s really not unusual if you’re just pouring everything in. Actually, I was told that in the UK you call them pizza nodes, because everything is on top. Does that make more sense? I’d never heard that before, so I’ll use it. Anyhow, you have super nodes, or pizza nodes, and if you have the US in your dataset, you’re going to have all these connections to the US node, and it’s either going to cause your query to crash or not complete, or it’s going to run terribly slowly. You can break those down: if you’ve got the US, break it down into states. The other thing is, if you have something like high transaction volumes, do you really need every transaction to be its own relationship? Probably not. You could collapse them and weight the relationships, or collapse them by day and weight those. Again, some tricks you can use there.

OK, the last bit is just a little personal, practical advice. We’ve all heard the phrase “it’s not what you know, it’s who you know”, but I hope after these last few minutes you also know that it’s who you know and where they, and you, are on the graph. There are actually studies showing that network structure is highly predictive of pay and promotions, which I find fascinating. They call the people in these blue nodes “organisational misfits”, which I just love; now it’s a good thing to be a misfit. Does anybody want to shout out what’s common here, where these two nodes have high betweenness centrality? They’re our bridges to other groups. If you imagine that in a work setting, these are the people you go to when you need help, because they can find somebody else who can help you out. These people have a tendency to move across different organisations and change jobs, and there’s nothing wrong with that; it’s actually good for your pay. The surprising thing for me was that it’s not just them: the people around them also have higher pay. So knowing people with high betweenness centrality is also good.

So your job tonight is to increase your betweenness centrality. That means you have to meet a few people you don’t already know, and then you have to connect with them later. Not just meet them tonight: get on LinkedIn, make the connection, say Amy said to, and then connect with them again at some point in the future. Thank you for your time and your attention.
If you want to talk all things graph, I love talking all things graph, so just get hold of me and feel free to find me online. Thank you.
