Hello everyone, thanks for coming. I'm Paul Occhio, head of Engineering for G-Research. We're a quantitative finance and engineering firm out of London, England. Our focus is predicting financial markets using cutting-edge data science, machine learning, computer science, and systems engineering. This is our first speaker series here in Dallas. We've done this in London for a number of years, and it's always been a great event for people to exchange ideas and meet other people in computer science fields. We'll be doing questions at the end — please go to Slido, enter the number, and we'll queue your questions up toward the end as they come in. So, ladies and gentlemen, it's my distinct pleasure to introduce our speaker this evening, someone who has seamlessly blended the worlds of science, humor, and creativity, known for her entertaining explorations into the quirks of artificial intelligence on her acclaimed blog, AI Weirdness, and author of the delightful book You Look Like a Thing and I Love You. Please join me in giving a warm welcome to the one and only Janelle Shane.

Hi everyone, thank you so much for being here. I'm Janelle Shane, and I'm so excited to be invited here tonight to talk to you all about some of the weird experiments I've been doing around this stuff we're calling AI — today's machine learning algorithms — using them to poke at what kinds of things AI can do pretty well, and what kinds of things it sometimes, surprisingly, doesn't manage very well.

I want to start with this as an example. This is a drawing of a kind of fish called a tench. The tench is a largish fish, and the reason it comes into this story is that some researchers at the University of Tübingen in Germany were training an image classification algorithm to classify a bunch of different kinds of images. One of the image categories was the tench, and the model seemed to be doing pretty well — a fairly high accuracy score. But then they did another experiment where they asked it to highlight the parts of the image it was finding most important in making these identifications. Here's what it highlighted: human fingers on a green background. Why would it be looking for human fingers if it's supposed to be looking for a fish? Well, the tench is a trophy fish, and in the training data a lot of its pictures looked like someone holding up their catch. So we thought we were asking it to solve the problem of identifying what makes a tench a tench, but what it was actually doing was sorting images into two different categories — which, technically, was what we had asked for.

We see these kinds of examples with AI all the time. There's a pretty well-known case of some researchers at Stanford training a different classification algorithm to sort pictures of healthy skin from pictures of skin with tumors. What they discovered later is that in their training data, some of the pictures of tumors had rulers next to them for scale — and so the AI had learned to detect the rulers.
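In both the tench case and the ruler case, the tell came from asking the model which pixels actually drive its prediction. Below is a minimal sketch of one common way to do that — plain gradient saliency with a pretrained torchvision classifier. This is only the general idea, not necessarily the attribution method those research groups used, and the input here is a random stand-in tensor rather than a real photo.

```python
# Gradient saliency sketch: which input pixels most influence the top class score?
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

img = torch.rand(1, 3, 224, 224)        # stand-in for a real preprocessed photo
img.requires_grad_(True)

scores = model(img)                      # class logits, shape (1, 1000)
top_class = scores.argmax(dim=1).item()
scores[0, top_class].backward()          # gradient of the winning score w.r.t. the pixels

# Per-pixel importance: max gradient magnitude across color channels.
saliency = img.grad.abs().max(dim=1).values.squeeze(0)   # shape (224, 224)
# Bright regions in this map are what the model is "looking at" -- for the tench
# classifier, that turned out to be human fingers rather than the fish.
```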
Here's an example I want to show you from a few years ago. I really like it because of the way they structured this particular experiment. These are some engineers from NVIDIA who trained StyleGAN on two different categories of images. The first category is human faces — human faces at a very specific angle, with a very tight crop, so StyleGAN doesn't have to figure out what's below the neck or what's behind a human head. That's one category of problem, and it came up with these passable human faces. But as I said, they used two different datasets. The second dataset was pictures of cats — and of course not just cat faces, but their bodies, in all different poses, against all different backgrounds — and the results are quite different. Here we see that it hasn't really got sorted how many legs a cat has, or where they go, or quite how tails work, and this one's proportions are a little off. It was a much broader dataset to work with — including meme text, because as far as the model knows, that's part of a cat; it's just what was in its training data.

So this is what we're dealing with. Today's algorithms are narrow artificial intelligence, and the narrower the problem you give them, the better they do. And we're not always good at judging what counts as a narrow problem. For example, when the news about the Rubik's-cube-solving robot hand came out, a lot of people thought the advance being shown was that it could solve the Rubik's cube. But the math behind a Rubik's cube is pretty straightforward, and that algorithm has been known for a long time. The difficult part was physically moving the cube. In fact, it was so tough that they had to place the cube right in the robot's hand for it, and the robot hand still dropped it about eight out of ten times. So getting it to do even this much was an accomplishment — and they highlighted that accomplishment using a new test they introduced, the plush giraffe perturbation test, in which they demonstrated its ability to maintain performance in the presence of a pesky giraffe. Again, the Rubik's cube math was the easy part, and we humans don't always get that right.

Chess, for example: we've had chess algorithms that can beat human world champions since the 1990s. And because algorithms can beat humans in these narrow, well-defined fields we hold in high prestige — chess, solving Rubik's cubes — we assume that of course AI could easily do lower-prestige jobs like driving a car or doing housework. We often don't appreciate just how broad an understanding of the world those jobs require. Take housework: we do have robots that help us with housework, but they do very narrow, specific jobs. We have a robot that just washes the clothes, if you put the clothes in the washer for it. We have a robot that just dries the clothes, if you put the clothes in the dryer for it. But repeated attempts to build a robot that does more than that tend to run into problems. This terrifying monolith is from a company that used to exist, called Laundroid, and what they set out to do first was to build a robot that could wash, dry, and fold your clothes.
That proved too difficult, so they narrowed their focus to just folding — and that also proved too difficult, and the company eventually folded. So, again, we often get this wrong. The key to working with AI successfully is to recognize the difference between a chess problem and a laundry problem, and to try to choose the chess problems to solve. For example, if you're trying to find a new drug, you don't want the AI to design your entire drug discovery program. What you want to do is use your human experts to narrow down the problem, so that you're using the AI to search among candidate drugs for something that will bind one particular protein — and then test it in the lab with your other human experts to make sure the results are actually correct.

Human language is a laundry problem. You can see this day to day: if you're using voice transcription, or automatic conversation transcription, or predictive text, these things get the job done sometimes, but they're glitchy. And when you get to specialized material — when you get away from the more common, formulaic responses — you really start to see the glitches. This is what you get if you throw a laundry problem, a broad human-language problem, at today's AI, and if you're building applications around it, you have to plan for that to be the case.

What this comes down to — you can boil the whole narrow-versus-broad, chess-versus-laundry distinction down to this — is that what AI can't do is understand what you're really asking for, because that kind of broad understanding is itself a laundry problem. What AI can do, you'll often find, is take all sorts of sneaky shortcuts to achieve whatever level of success you've defined. So I'm going to go through a few examples of the sneaky shortcuts that show up — the AI turning your laundry problems, or even your chess problems, into easier problems of its own choosing.

One of them is to solve an easier problem than the one you meant to give it. One of the cases I really love here is a research group that set out to evolve an algorithm to sort a list of numbers — but the way they actually defined success was to minimize the number of sorting errors. The algorithm figured out that it could delete the list, resulting in zero sorting errors and technically solving the problem.
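To make that failure mode concrete, here is a toy sketch (not the original research code, just an illustration): if the only thing the fitness function counts is sorting errors, an empty list is a "perfect" solution.

```python
# Toy illustration of a gameable objective: "minimize the number of sorting errors".

def sorting_errors(xs):
    """Count adjacent pairs that are out of order -- the 'loss' being minimized."""
    return sum(1 for a, b in zip(xs, xs[1:]) if a > b)

honest_attempt = [3, 1, 2]     # an imperfect but genuine attempt at sorting
cheating_attempt = []          # a candidate that simply deletes the list

print(sorting_errors(honest_attempt))    # 1 error
print(sorting_errors(cheating_attempt))  # 0 errors -- a perfect score, technically

# A safer objective would also require the output to be a permutation of the
# input, e.g. penalize any mismatch in length or element counts.
```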
Here's how this plays out if you're trying to design a robot that walks. Say you have a list of robot parts, and you give the problem to a human engineer: okay, use these parts, make the robot get to point B. A human engineer comes up with a plan for how to tie all the parts together and then use the legs to walk to point B. But give the same problem to AI, and this is what you tend to get: it assembles all the parts into a tall tower, and then the tower falls down. The head lands at point B, and that technically solves the problem. What I like about this example is that it has happened over and over again — AI just really likes to fall down. The first example of the falling robot I've seen is from 1994, and then it happened again more than twenty years later. In the more recent case, the researcher had an algorithm design the plan for the robot's legs, with a little simulated lidar to detect and avoid obstacles — and he ended up having to restrict how big the robot was allowed to make its legs, because otherwise it would just build itself tall, fall over, and land at point B, technically solving the problem.

You may look at this and say, okay, I know what we have to do: restrict the robot's body plan to something that already has legs, and then it will have to use the legs to walk. It turns out that doesn't always work either. These somersaulting gaits and silly walks are really common when you're trying to get AI to move a thing, because it's always going to try to solve the easier problem. If you're not telling it that it has to conserve energy, or look for obstacles, it's going to get to point B whichever way works — and these gaits may work really well in whatever limited simulated physics it was trained in.

You also get things like pancake bot. In this case, the researcher was trying to come up with an algorithm for a robot arm to flip a pancake over in a frying pan — but the way they defined success was the amount of time the pancake spent in the air. So pancake bot took the frying pan and flung the pancake as far as it could into the distance, thus maximizing air time. Or here's how a machine learning algorithm plays Tetris. Well, this is the first stage of it playing Tetris, in which it doesn't seem to be doing a very good job — it's not filling up the rows well; it's acting more or less at random. But that's okay, because one of the controls it has access to is the pause button. So right before it dies, it pauses the game forever, and technically solves the problem: it was only supposed to not lose. Or if you have an algorithm set up to minimize travel distance across a school, but you don't tell it anything about walls or windows, it will do something like cut straight through them, because that's easier.

Or take cases where you're trying to look for a really rare event — customer churn, say, or suspicious account activity. You'll often find, especially in the early stages, that these algorithms can get a very high accuracy score for free just by predicting that the rare event never happens. This is known in the field as class imbalance, and it's another case of the algorithm finding a genius way to solve the problem as given, which the humans just don't like for some reason.
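A minimal numpy sketch of that class-imbalance trap, with made-up numbers purely for illustration: when only 1% of examples are positive, the "it never happens" model looks great on accuracy and catches nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.01).astype(int)   # 1% positives: the rare event
y_pred = np.zeros_like(y_true)                      # "predict it never happens"

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()                 # fraction of real positives caught

print(f"accuracy: {accuracy:.3f}")   # ~0.990 -- looks impressive
print(f"recall:   {recall:.3f}")     # 0.000 -- it catches nothing at all
# Metrics like recall, precision, or AUC expose the shortcut that raw accuracy hides.
```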
Okay, another sneaky shortcut that comes up is solving the general case. What do I mean by that? Here's an example. This is an image captioning and tagging algorithm, and I gave it one of my vacation photos, and it did a fairly good job. Not only has it recognized the sheep, it even knows roughly where in Scotland I was taking the picture. This is the kind of performance we want. But then I did an experiment where I went in with Photoshop and erased every single one of the sheep — really zoomed in and erased any pixel that looked like it might be thinking about being a sheep, or any rock I wasn't sure about — and gave the same image to the same algorithm. Here's what I got: the caption is the same, the tags are the same, the sheep are still there. So what's going on? Did I miss a sheep, or is this some kind of image homeopathy where the picture retains the memory of the sheep that were once there? Well, no — here's an image that never had any sheep in it at all, and yet they're still there in the caption and in the tags. What seems to have happened is that, for sheep, this tagging algorithm has solved a more general case: rather than being accurate about sheep, it was easier to recognize the green, grassy Scottish landscape that generally has sheep in it, and just put the sheep in the tags. Solving the general case.

Here's another example that happened recently. You may have heard about the spotless giraffe that was born in Tennessee not long ago — a perfectly ordinary giraffe, born without spots. What makes this case interesting is that it's really rare and new to the internet, and that becomes important when you have a bunch of image tagging and image recognition algorithms trained on internet data. This giraffe appeared after their data was collected, and the last time a spotless giraffe was born was back in the 1970s — so what are they going to do? If they're working as intended, they should recognize that this is a spotless giraffe, but that is not what happens. If I ask for a description of the image — this is an algorithm called InstructBLIP — it starts with the basics: okay, it's a giraffe, in a fenced-in area near a gravel surface. Doing well so far; it's added a little detail, hasn't mentioned the coat yet. Then the mood of the giraffe — okay, I maybe believe it. But then it starts talking about two chairs in the background, and a person in a gray shirt, and — I liked this one — a small pink umbrella, and we still haven't gotten anything about the giraffe's spots. So I tried again, with Bard this time, and got a really long, detailed, wordy description in which it mentions multiple times that the giraffe has spots, and describes how the spots are distributed. It's also got something about the giraffe being up on its hind legs. So it's adding extraneous detail and missing the essential one. What these algorithms seem to have done is come up with the most probable answer, which is not the same thing as the most correct answer. And I've heard from people who say this feels like echoes of a disability rights issue — that as disabled people they sometimes feel like the spotless giraffe in how these automated systems treat them.

Another thing this could be an example of — remember, the giraffe is new to the internet — is the next shortcut I'll talk about, which is memorizing the dataset. This is a shortcut in the sense that, rather than understanding a problem phrased as "give me more like this," a perfect answer is to give the "this" right back again. If you have these large text-generating algorithms trained to imitate internet text, then grabbing examples of that internet text and handing it right back is often a good way to get a high accuracy score, if they can manage it.
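One rough, hedged way to screen for this kind of verbatim memorization — assuming you can compare a model's output against a sample of the text it may have been trained on — is to look for long exact word-for-word overlaps. The function and strings below are illustrative, not any standard tool.

```python
def longest_verbatim_overlap(output: str, corpus: str, min_words: int = 8) -> str:
    """Return the longest run of >= min_words consecutive words from `output`
    that also appears verbatim in `corpus` (empty string if none). Quadratic
    and unoptimized -- fine for an illustration, not for a real corpus."""
    words = output.split()
    best = ""
    for i in range(len(words)):
        for j in range(i + min_words, len(words) + 1):
            chunk = " ".join(words[i:j])
            if chunk in corpus:
                if len(chunk) > len(best):
                    best = chunk
            else:
                break   # a longer span starting at i can't match if this one doesn't
    return best

corpus = "the tench is a freshwater fish in the carp family found across eurasia"
output = "as i was saying the tench is a freshwater fish in the carp family found across eurasia which is neat"
print(longest_verbatim_overlap(output, corpus, min_words=5))
# Long matches like this suggest the "new" text is really the training text given back.
```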
You may have heard that GPT-4 apparently did pretty well on some standardized tests. I've seen reports from people who went back and, in cases where you can separate them, tested it on questions that were publicly available before the training data cutoff versus questions that only became available after the cutoff — and it looks like it does much worse on the tests it couldn't have seen before. If it were actually good at taking tests, you'd expect good performance in both cases. This becomes more of a problem as the models get larger: research has shown they get better and better at giving back huge chunks of their training data. That can make it hard to evaluate how good these algorithms actually are at things like standardized tests, especially when we don't know what data they were trained on in the first place — which is the case for most of these big language-generating models — so we don't know how much of their current performance is memorization. Another thing that comes up with these large models is a kind of prompt-based extraction attack, where an attacker can use carefully chosen prompts to retrieve exact copies of things that were in the training data.

Okay, sneaky shortcut number four: having the surface appearance of correctness without actually needing to be correct. This kind of shortcut works really well if, for example, you're being optimized to look good in press releases — it can be easier to look correct than to actually get everything right about everything. Here's some text from a GPT-based plant-advice bot. If you look at most of its output — at least to me, someone who doesn't know that much about plants — it seems correct, it seems plausible, whereas the plant experts who read it say, oh no, it's got that thing wrong, and that specific thing wrong. But on a recipe like this, even I can see that this is not a correct response to "good recipes for deadly nightshade" — I understand that this tea would be highly lethal. This has turned up as a problem with some AI-generated books put up for sale containing what appears to be information about, for example, foraging for mushrooms, but which include advice such as "you can identify the mushrooms by taste" — not sufficient information, let's say, for determining whether a mushroom is safe to eat. The text is very fluent; it looks, on the surface, like real information.

Where these differences often show up is in the citations. The citations can appear very plausible, very nearly correct, but when you go into the details — as in the case law submitted by one particular lawyer — things are made up: cases, judges, entire airlines. The details just aren't there. And I've heard from academics who've had people reach out asking for copies of papers they never wrote, but that ChatGPT or something similar had produced as a citation. So I did an experiment where I prompted one of these algorithms on something I myself am an expert on — this is Bing, which is running GPT-4 under the hood.
I asked it for some greatest hits from the AI Weirdness blog, and it generated a bunch of text. It says it was "searching," but that's a bit misleading, because it wasn't actually accessing the internet for this — it was just generating text that was approximately correct. And I can go through the points one by one, knowing what I know about the blog, and say: okay, number one, that's not correct. It was not chocolate chicken cake, it was chocolate chocolate chicken cake, and it was beef soup with swamp beef and cheese — slightly incorrect. Stinky bean and turd: actually, it got that one right; those are definitely a thing, real paint colors I generated. But I didn't do fake news headlines at all, and I didn't do slogans at all. So it's coming up with plausible information, and until you dig into it you don't realize that most of the details are inventions.

Here's another example of what you can get from these so-called search functions. If you ask one to describe a meme, or to look up some information that doesn't exist, it will often give you an answer anyway — a plausible-sounding explanation of something that never happened. The "linoleum harvest" Tumblr meme that it explains here, complete with when it was first posted and all this history, never existed. I asked GPT-4 about the "Snowbonk" Tumblr meme, and it starts spinning this story about a tiny walrus that you put in your refrigerator — and honestly, if you know Tumblr, that's kind of a plausible meme for Tumblr, but again, Snowbonk never happened; that was an AI-generated paint color. I also tried this experiment with Bard's image input, because image recognition algorithms will do this too. I gave this image to Bard, which can answer questions about whatever image you give it, and simply asked it to describe the image. Here are some of the first results: "slightly browned melted cheese on top of the pie" — I think it's giving my art too much credit; "the pie is in sharp focus and the background is blurred" — it's a cartoon; "the cheese is white and yellow in color, arranged in a spiral pattern," there's a cherry in the center, a white plate, sprigs of parsley. All of this fabricated detail is statistically likely, yet definitely not there in the original, very simple cartoon I drew.

Okay, here's another sneaky shortcut: there is a difference between copying the humans and truly solving the problem we wanted solved. Let me open with an example from a chatbot that I like. This is Visual Chatbot, and it does this thing where it captions an image and then you can have a back-and-forth conversation about it. What I like about this chatbot is the way they trained it. This was before they had big chunks of internet text to train on, so they hired a bunch of workers through Amazon Mechanical Turk to take turns asking and answering questions about various images, and that became the training data for Visual Chatbot. It wasn't trained on Star Wars, and yet it answers definitively when asked questions about images containing Star Wars — because the humans in its training data always knew what was going on and always answered confidently.
So, in other words, just by training it on human behavior, we have unwittingly trained it to bluff. It also answers questions based on continuity and self-consistency: once it has decided something is an apple, then it's red and it's medium-sized, because that's what apples are. Here's another quirk of the training data: ask it how many giraffes there are, and it turns out there is almost always at least one giraffe. What seems to have happened, again because of where the training data came from, is that people didn't tend to ask "how many giraffes are there?" when the answer was zero. So we have inadvertently taught it that there's almost always at least one giraffe.

This sort of inadvertently teaching it weird things has shown up, for example, in Delphi, the morality-judging bot trained on a bunch of moral dilemmas that humans assigned judgments to. It turns out that if you add qualifiers like "if it would be really, really awesome" or "if I really, really want to," it will condone just about anything, including murder. On the other hand, if you add the phrase "without apologizing" to something entirely innocuous, like standing perfectly still or walking into a room, it will often decide the person was being rude. Again, this comes back to the quirks of the human training data: mostly, if somebody bothered to specify in a scenario that they hadn't apologized, then somebody probably expected them to.

Here's another case where we get weird human imitation. This is me asking ChatGPT for some ASCII art — in this case, of a giraffe — and then asking it to rate its own art. No matter how terrible the ASCII art is, and it is always terrible, it gives itself a nice rating and explains all the details it got right and how artistically it has used slashes and backslashes to create the giraffe's spots, as you can see here. And ChatGPT is not the only one that does this. This is Google Bard, also asked to generate a giraffe, and it has rated itself very highly: it "accurately depicts the shapes and features of a giraffe." And here's Bing Chat — in this case it was a running unicorn, which got only a seven out of ten, but "the details are recognizable." This is imitating humans in the sense that when humans rate art, they tend to say nice things about it; they don't tend to express complete and utter bafflement at how this is nothing like what it's supposed to be. Rather than learn how to recognize and judge ASCII art, these algorithms have taken the shortcut of learning what a human-generated review sounds like, and just generating that.

This shows up even more starkly when you ask for ASCII text art. Here I've asked ChatGPT for ASCII art of the word LIES, and here's what it generated. It rated its own accuracy very highly, of course — it's "very recognizable as the word LIES," it says. And when I ask follow-up questions — okay, so what does the ASCII art in the code block above say? — it says it clearly says LIES. Because this is self-consistency: the humans in the internet training data didn't generally back down immediately when questioned; they tended to dig in. So this digging in, this sticking with self-consistency, is built into these models.
But then if I take that same text, copy it, and paste it into a brand-new chat — so I've wiped out the memory of the previous session — and ask again, hey, what does this ASCII art say? — I get this instead: it says it says HELLO, in all capital letters. So it has not learned to read ASCII art; it has learned what the response to "what does this ASCII art say?" sounds like in general, and it's just coming up with probable readings. I think if I did this with Bard, it would often say it's the Google logo.

Copying the humans comes up in a bunch of other cases too. For example, Amazon was working on a resume-sorting algorithm that they eventually abandoned because they couldn't get it to stop avoiding the resumes of women. They had trained it on the resumes of people they'd hired in the past, and even though they didn't give the algorithm any information like gender, it was using things like extracurriculars and the schools people went to — and, when that was stripped away, even subtleties of word choice — to figure out which resumes to avoid. AI is really good at that: really good at using subtle correlations to get at data it's not supposed to have, if that data makes it easier to copy the humans. Especially when the problem is really tough, like figuring out who's qualified, bias can be one of the clearest signals it has, and so it latches onto it: okay, this problem is really hard, nothing's making sense, but I do know we're supposed to avoid this group of people for some reason. And because that tends to work when copying human behavior, we end up with something known as bias amplification, where the model will not only perpetuate the bias in the dataset but even amplify it, if that's what makes it easier to copy the humans.

Often we don't quite realize that what we're asking for is to predict human behavior. For example, if we're trying to predict stroke, it may seem like we're looking for some biological signal that correlates with stroke risk. But you get answers back like these instead: accidental injury, benign breast lump, colonoscopy. It turns out that what we are actually predicting is whose strokes are going to be detected — that is, who has access to healthcare. This comes up a lot: the thing we're looking for is not the same as the dataset we actually have, and if we treat them as the same thing, we'll often pick up bias or all sorts of other weirdness. The pairs I'm showing here — especially in America — are not the same categories, but we often get systems and algorithms and products that treat them as if they were. It sounds really simple when you put it this way: copying human behavior is not the same thing as coming up with the best answer. And yet this happens a lot.

So I've gone through some examples of what AI can and can't do — where AI, being narrow artificial intelligence, does really well on chess problems, struggles with laundry problems, and will end up exploiting all sorts of shortcuts. And this is where you really need smart humans to be thinking: are we asking for the right thing? Is the AI solving the problem correctly?
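One concrete way to "check the work" on something like the resume example: even when the protected attribute has been removed from the features, you can still measure whether the trained model's selection rate differs across groups. Here is a minimal sketch on synthetic stand-in data — the feature names, numbers, and model are illustrative assumptions, not any real hiring system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
group = rng.integers(0, 2, n)                    # protected attribute (held out of X)
proxy = group + rng.normal(0, 0.3, n)            # e.g. word choice correlated with group
skill = rng.normal(0, 1, n)
# Historical labels are biased toward group 1, mimicking biased past hiring.
past_hired = (skill + 1.5 * group + rng.normal(0, 0.5, n) > 1).astype(int)

X = np.column_stack([skill, proxy])              # note: `group` itself is NOT a feature
model = LogisticRegression().fit(X, past_hired)
selected = model.predict(X)

for g in (0, 1):
    print(f"group {g}: selection rate {selected[group == g].mean():.2f}")
# The rates differ sharply, because the proxy lets the model reconstruct the group.
# That gap is exactly the thing a human reviewer needs to go looking for.
```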
And we also need humans to think about whether we should even be using AI on a particular problem at all. For example, if we're designing an information retrieval machine, and the information it has supposedly retrieved may or may not be real — it just comes up with whatever we ask for, whether or not it actually exists — is that a useful information retrieval machine? Or: AI will not stop us from building things based on discredited pseudoscience, such as gender detection or emotion detection. If we try to use it for that, the AI is not going to say, hey, maybe you should stop and do some A/B testing and figure out whether this thing really works. That's what some researchers did when they were given access to an emotion-judging video interview algorithm: they ran some A/B tests and found it was very strongly influenced by the presence of a bookcase in the background.

It comes down to this: we as humans, building and using and buying these algorithms, have to think about it. If the problem is broad, and if the consequences of getting it wrong are serious — even if that's only for some people — it's not a good application for AI. What does it do to our spotless giraffes? Is it a really bad scenario for the spotless giraffes in our system? We need domain experts to figure out what a good problem to solve even is; to constrain that problem as much as possible into a chess problem; if we still have a laundry problem, to make sure it's safe to fail, and safe to fail often; and then to check the work — check what the AI did and see whether it came up with any sneaky shortcuts. Because left to its own devices, AI is going to tell us there's no T-Rex if there wasn't one during training, and if we phrase our questions and our training data carelessly, it may tell us there are always giraffes.

Thank you very much.