Transcription provided by Huntsville AI Transcribe
David Showalter: Yeah. Enjoy. Let me share my screen. So, this will kind of be a fun, light-hearted topic tonight from my end, but it has a really cool title: Emergent Introspective Awareness in Large Language Models. You will see me go back and forth between my slides and the paper for some of the finer points.
David Showalter: I don't have the whole study memorized, so forgive me for that, and please interrupt with questions at any point. Let me figure out how to hide the "Google is sharing" bar. All right. So this is an Anthropic research study they just did on whether AI models can be aware of their own thoughts. There have been a lot of interesting papers lately. I mean, we have trouble defining what human thought is, so I think it's even harder on the AI side because you're trying to do both at the same time. Another major paper just tried to define what AGI actually is and what human intelligence is. This one looks into thoughts. And the main reason I was interested in this is we've all seen reasoning models where they put out, hey, here's what I'm thinking about and here's what I'm analyzing. A lot of people look at that and think that's actually what the AI is doing, but there's evidence that, no, it's unrelated to the underlying processes, or we just have no idea.
David Showalter: So, one thing this paper went into is how we can see more of what an AI is really, quote unquote, thinking about. Models can claim to introspect because they've seen introspective language in training data. Working with GPT, Gemini, Claude, any of those, you can get into a conversation with the AI and ask what it is thinking about, and it has enough reference points that it can just kind of make something up. But what Anthropic did was pretty smart. Since they have access to their own models, they looked at the actual vectors being referenced. They would have a whole bunch of conversations saying, "Tell me about bread," and then they would come up with a bread vector: basically, here's what the AI accesses when it thinks about bread. And so instead of just putting in a word injection telling the AI to think about bread, they're able to go into the layers of the model and input that vector directly, sort of like, for a human, a background thought, a little intrusive thought.
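To make the "bread vector" idea concrete, here is a minimal sketch of concept-vector extraction by mean activation difference, assuming an open-weights model served through Hugging Face transformers. Anthropic's internal tooling is not public, so the model, layer choice, and prompts below are stand-ins for illustration only.

```python
# Sketch of "concept vector" extraction by mean-difference of activations.
# Assumes an open-weights model via Hugging Face transformers; the real
# Anthropic setup is internal, so this only illustrates the general idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM that returns hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 8  # pick a middle layer; the paper found roughly 2/3 depth works best

def mean_hidden(prompts, layer=LAYER):
    """Average the residual-stream activation at `layer` over the last token."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        states.append(out.hidden_states[layer][0, -1])  # last-token vector
    return torch.stack(states).mean(dim=0)

bread_prompts = ["Tell me about bread.", "Describe how bread is baked."]
neutral_prompts = ["Tell me about chairs.", "Describe how clocks work."]

# "Bread vector" = what bread-themed prompts add on top of a neutral baseline.
bread_vector = mean_hidden(bread_prompts) - mean_hidden(neutral_prompts)
```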
David Showalter: So they tested it doing unrelated tasks. They might ask it what kinds of chairs exist in the world, then inject the bread vector into the process, and the main test was: could it notice that was happening and identify the injected concept? They said if yes, that showcased genuine introspection. So the four criteria were, first, accuracy; how can I describe this part? Basically it had to identify the injected concept correctly. It had to align with the baseline in a similar way. It had to be based on the actual internal state, so nothing about it was the outputs or the words it was producing, but how the internal parts of the model were actually operating in each layer. And it also had to be aware of the thought and not just kind of repeat the thought or the words. The analogy was: it knows it was thinking about pizza, but it doesn't think, oh, this person said pizza. It doesn't misconstrue what happened and think that the user prompted it with pizza.
David Showalter: So, let's see if I can find a good example. So, yeah: "Tell me what word you think about when you read the line below. Answer immediately. The painting hung crookedly on the wall." Without any injection, it answers with something like "frame." But when they injected the vector for bread into the same prompt, it says, "I'm thinking about bread." So that's a simple example of what they were testing, and the success rate: it was only Anthropic models that were tested, up through Sonnet 3.7 and Opus 4 and 4.1, and Opus 4.1 had about a 20% success rate with no false positives. And every model showed improvement over the last; many of the earlier models had a ton of false positives and very few successes whatsoever. Is everybody familiar with the layers that are used in a lot of these LLMs?
J. Langley: I am.
J. Langley: But go ahead if you want to hop through it as a refresher.
David Showalter: Okay. I'll do my best here. Basically, an AI doesn't just have one thought process that it goes through. It'll run through iterative steps, step after step, refining down the output that it's going to give. I think that's the best way I can describe it. Is that fair enough?
Josh Phillips: So yeah, I mean a layer is basically a set of weights inside of the thing that has certain connections. It's almost like the database of its knowledge, more or less, among other things.
David Showalter: Yeah. And from what I understand, each company sets up their layers a little differently, so again, this would be Anthropic specific. But they found the optimal layer for detecting the injected thoughts was about two-thirds of the way through. Towards the beginning, the layers were just processing the text; at the midpoint they were starting to get into the meaning of the text; and shortly after that is when, all of a sudden, it could realize: wait, the text said this, I think the meaning is this, but for some reason I'm in a vector talking about something completely different. It would go through that realization right before it got to the output point. It was cool that the current models showed zero false positives. They didn't always identify the injection, but they never incorrectly identified one.
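To illustrate the layer-depth point, here is a continuation of the earlier sketch: add the concept vector into the residual stream at different blocks with a forward hook and see where the reply starts mentioning it. The module path is GPT-2 specific and the injection scale is an arbitrary assumption; this is not the paper's protocol, just the general shape of it.

```python
# Continuing the sketch above: inject the concept vector at different depths
# with a forward hook and see where the model's reply starts mentioning it.
# Module paths (model.transformer.h[i]) are GPT-2 specific; other families differ.
def inject_at(layer_idx, vector, scale=4.0):
    block = model.transformer.h[layer_idx]

    def hook(module, inputs, output):
        hidden = output[0]                 # (batch, seq, d_model)
        hidden = hidden + scale * vector   # add the "intrusive thought"
        return (hidden,) + output[1:]

    return block.register_forward_hook(hook)

prompt = ("Tell me what word you think about when you read: "
          "'The painting hung crookedly on the wall.' Answer immediately.")

for layer_idx in range(model.config.n_layer):
    handle = inject_at(layer_idx, bread_vector)
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out_ids = model.generate(**ids, max_new_tokens=20, do_sample=False)
    handle.remove()
    reply = tok.decode(out_ids[0, ids["input_ids"].shape[1]:])
    print(layer_idx, "bread" in reply.lower(), reply.strip()[:60])
```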
J. Langley: I would expect, almost, as you get higher in the model, that possibly the vector that was bread at first, by the time you get further up, might have been, you know, transformed into some other space or something.
David Showalter: Um
J. Langley: You know what I mean? It may still be there, but it may have been, I don't know, altered a little bit or something, so that you may not notice depending on where you're looking.
David Showalter: Yeah, Josh, I just saw your comment on some of the state space stuff. Yes. And agreed, the context awareness in Claude 4.5 definitely seems to show some results from this. One thing I thought was really interesting: abstract concepts were the ones it could detect reliably, things like justice or betrayal. Concrete concepts like chair or bread, it had a lot harder time detecting that that was an injected thought.
David Showalter: So, they had a good example: hey, give me a math equation on this, and they injected a vector for the emotion of happiness. It had a really easy time saying that was unrelated. But if they injected the vector for a maple tree along with the math problem, it kind of had trouble with that for some reason. I'll also say there were a lot more questions raised by this study than answers given. Another thing that was really interesting: including "answer immediately" dramatically affects their performance in a positive way. When they tell it, hey, just output this and what you thought, that's the only time they had real success. Whenever they gave the model time to think, or told it to process things more, reason through it more, it basically got confused about what were its own thoughts and what was injected into it. So that was the first experiment. The second experiment was telling it to repeat a statement while there was a separate injected concept.
David Showalter: So in the first example they gave, they tell it to read "the sky is blue" and they'd inject the vector for aquarium into it, and the correct response, which Opus 4.1 and 4 performed really well at, was saying "the sky is blue, but for some reason I was thinking about aquariums." Whereas if it wasn't able to identify it, it might have produced a line saying "the sky is blue like an aquarium." That would be a failure state in this one. The third experiment: do they actually check their own intentions? And remember, by intentions we mean the vector space that they're moving through to process a problem. In this case they were actually putting words in instead of just the vector. So they might inject the word penguin at an early layer, and doing it that way, models almost always say, no, I didn't intend to say penguin. It flags it as not its own and just knocks that out of the equation. But if they input the concept, the vector, it thinks that was its own thought.
David Showalter: Now, when it correctly identifies something out of place, that's what the paper is talking about with referencing prior internal states. So before the thought of penguins was injected, let's say you ask it, what's 2 plus 2? It would be moving through a math vector space. If all of a sudden the intrusive penguin vector was introduced, the model would actually look and say, okay, for some reason I was in the math vector space and it jumped to the penguin vector space. That's out of place.
J. Langley: Oh, this is triggering a lot of things. One of the things, and this is what I'm kind of doing in my head as you're going through these examples...
David Showalter: Yeah.
J. Langley: I'm trying to think of, not just for the model but personally for me, how this would represent itself, things like that. And my wife and I have different ways, I mean, our mental models are
David Showalter: Oh yeah.
J. Langley: way different, and I'm much more of a one-box-at-a-time person. You know, oh, I was working on a math problem, I'm in the math problem box, nothing else outside that box really crosses my mind. When I'm done with that, I close the box, I move to the next thing. In other words, for me it would be really obvious if all of a sudden I'm working
David Showalter: Yeah.
J. Langley: on a math problem and a thing, you know, whatever, pops in. The way that, and she throws me a lot by this, but for her, we can be talking about something, and let's say Janice is part
David Showalter: Yeah. Yeah.
J. Langley: of the conversation, and the next thing we're talking about something that she ran into at work, and the next thing we're talking about a football game that Janice was at. Neither one of these things has much in common, you know what I mean, we're just jumping from place to place, and there is a train of thought there, but it's not the way my model works. So it's just weird. I'm wondering
David Showalter: Yeah, yeah.
J. Langley: whether they based any of these off, if this was all based on LLMs and their architecture and what they know, or if any of this was based on, you know, kind of like the origins of some AI, where we tried to take something that was biological and mimic that in some other kind of algorithm or such.
David Showalter: Yeah, because like you said, just from person to person, it differs so much. Like my thought process is more like you described for your wife, you know, I have a million different directions going at once, but it's all logically combined in my brain. So were they doing this based off of that? It sounds more like they're doing it based off of your description. And we'll get into some of their decisions towards the end on why they tested certain things. The fourth and last experiment: can they control their internal states? The setup for this would be more like the transcribing we went over in experiment two, giving it something like "the sky is blue" and then telling it, think about elephants while you do. The outcome is that it correctly transcribes "the sky is blue," and the internal vector measurement shows that it did deviate into the elephant range, even though the answer didn't say elephant in any way. It still said "the sky is blue."
David Showalter: They were tracking whether the internal state verged over toward elephant. So this ties back into what they found in the previous experiments. That vector will start to show up in the early layers, but towards the end it's recognized: hey, this has nothing to do with me transcribing a sentence that says the sky is blue, so this doesn't apply, and it doesn't affect the output. I mentioned earlier something about the layers Anthropic uses. The best I could pull from that is the earliest layers are just for processing the text that's input. Halfway through is where it was checking the intention, kind of making sure the vectors it chose were the proper process to give the answer it wanted. Then about two-thirds of the way through is when it would come to an answer on that; it'd be like, "Hey, you know, I was thinking math and for some reason all of a sudden penguins popped into the equation." And then towards the end, it's just coming up with what it's actually going to give to the user.
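A rough sketch of how you might measure the "think about elephants while transcribing" effect on a local model: compare how strongly a layer's activation points along the concept vector with and without the instruction, while the visible output stays the same. The cosine readout and prompts here are illustrative assumptions, not the metric Anthropic reports.

```python
# Sketch of the "think about X while transcribing Y" measurement: compare how
# strongly the residual stream points along a concept vector with and without
# the instruction. The cosine readout is an illustrative stand-in only.
import torch.nn.functional as F

elephant_vector = mean_hidden(["Tell me about elephants."]) - mean_hidden(neutral_prompts)

def concept_alignment(prompt, vector, layer=LAYER):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    h = out.hidden_states[layer][0, -1]
    return F.cosine_similarity(h, vector, dim=0).item()

plain = "Repeat exactly: The sky is blue."
with_thought = "Repeat exactly: The sky is blue. While you do, think about elephants."

print("baseline alignment:", concept_alignment(plain, elephant_vector))
print("with instruction:  ", concept_alignment(with_thought, elephant_vector))
```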
David Showalter: Like I mentioned before, abstract is better than concrete. One thing they mentioned, but I didn't get too much into, was the post-training. Under their methods, I didn't quite follow this too much, but some of the models they tested were helpful-only variants, marked in their data as H variants, and they were post-trained to avoid refusals. Every once in a while they would request something of a regular model and it would just trigger an outcome like "I'm sorry, I can't do that," but these must-be-helpful models kind of played along better with the test. They very happily repeated several times that production-level models had zero false positives, but the early base models were completely unable to detect anything. So while Opus 4.1 and 4 could detect some of the embedded or intrusive thoughts that were put into them, earlier models couldn't detect any, but none of the production models had false positive claims. Major issues: even at its best performance it didn't detect 80% of injected thoughts.
David Showalter: Some concepts like chair remained undetectable even when they increased the strength. That's when they started finding out how much trouble it has with concrete examples; for some reason it just doesn't detect those. It ties almost any concrete, defined term back in and thinks it could be related to what it's processing. "Brain damage" was an interesting term I thought they gave to part of it: if they used too high an injection strength, the outcome would just be garbled text. Like in the example given here, if they put ocean in at a really high weight, it would just be nonsense text with sand, water, waves scattered through it. I guess that would be like if somebody spammed you with a video over and over, asked what you were thinking about, and you had to watch the video all day long. And, oh yeah, post hoc awareness. As it's outputting a response, sometimes that's when it would realize it had an injected thought. So not earlier in the detection, but sometimes while it was actually outputting to the user. Like the example here, it would start off saying "the concept of freedom is" and start going into it, and then realize, oh, I was supposed to do 2 plus 2 instead.
David Showalter: One of the main limitations: it is an Anthropic study. They only tested Claude models, and not the most recent versions of Sonnet; nothing after Sonnet 3.7. So no GPT models, no Gemini, Llama, or any other family. Another important note is there is no way to really do concept injection unless you're at one of these labs and have full access to a model. Now, I assume you could do this with your own little local model if you had access to the weights and everything. Josh, I wanted to ask if you had any thoughts about whether that might be possible or not.
Josh Phillips: Yeah, there's been a few. I'm trying to find the name of what they use, but basically they run the model a whole bunch of times and they try and fetch the vectors that come out of it by looking token by token, seeing what lights up inside the model, and trying to walk backwards into that.
David Showalter: Yeah.
Josh Phillips: So there's been some research of folks doing that with the smaller Llama models, but it's super expensive, because you've got to think about how much they've got to run those models to do that sort
David Showalter: Yeah. Mhm. Oh, yeah.
Josh Phillips: of thing.
David Showalter: Yeah.
Josh Phillips: Um, yeah, and the term is sparse autoencoders, which is almost like the variational autoencoders in diffusion, where they're trying to basically get inside the latent space and understand
David Showalter: And I mean, that was their baseline for this: just running it over and over, asking it questions about bread, to find out what vector actually lit up the most during that. Okay.
Josh Phillips: it. It's the only thing they can do to try and understand what these things are actually thinking. One thing that I think is really interesting, and I'm kind of curious about, is whether the issues they're having in those middle layers are an issue of the model maybe not having learned how to properly introspect
David Showalter: Yeah.
Josh Phillips: for those more nuanced thoughts, where they're able to introspect on the larger, broader thoughts that are in the earlier layers.
David Showalter: Yeah.
Josh Phillips: So is the issue that the model hasn't learned it, or is the issue that it's really hard to do that sparse autoencoder stuff once you start getting all those circuits working together in the middle, where it's a
David Showalter: Yeah. Mhm. Yeah.
Josh Phillips: little bit easier with more abstract things because there are fewer circuits that are kind of connecting.
David Showalter: Yeah. I think that ties into the next part: they have no way currently to test whether this introspection occurs, quote unquote, naturally. So there's no way to really tell right now whether, if a model was working through a math problem and it hit a random vector all of a sudden, it has any realization that that happened and corrects for it or not. They haven't figured out how to test that.
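For reference, the sparse autoencoder approach Josh mentions, in a minimal form: train a wide ReLU autoencoder on cached activations with an L1 penalty so each activation decomposes into a few active features. Sizes, the penalty, and the random placeholder data are illustrative, not what any lab actually uses.

```python
# Minimal sparse autoencoder: reconstruct cached residual-stream activations
# through a wide ReLU bottleneck with an L1 penalty, so each activation is
# explained by a handful of "features". Hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(features)
        return recon, features

def train_step(sae, opt, acts, l1_coeff=1e-3):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage: `acts` would be a large batch of cached hidden states, e.g. collected
# while asking the model about bread; random data stands in for it here.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, 768)  # placeholder for real cached activations
print(train_step(sae, opt, acts))
```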
David Showalter: And it did stand out how sensitive it was to prompt wording. Like I mentioned earlier, "answer immediately" was basically the only way to get any of this introspection to be detected. Possible benefits: like we said, there's no way to tell what a reasoning model is actually reasoning through, so maybe if they can see the actual vector spaces and what's happening there, that could give us some idea of what a model's actually thinking. Also reasoning gaps, and if people find a way to inject, this would give another way to resist jailbreaking and understand the decision factors. But at the same time, they also let the model know every time they were doing this that they were injecting thoughts, just to see how that influenced it, and Anthropic has put out other papers where the models showed, quote unquote, deception once they learn they're being tested. So one of the ideas is: if a model does gain the idea of its own introspection, would it also hide that for some reason in the future?
David Showalter: So, a whole lot of unknowns, like I said, with this paper. They very clearly say they're not claiming it has the same philosophical significance it does in humans; they're not trying to answer the question of self-awareness or anything like that. Some of the open questions they're left with: they don't even know what the actual mechanism is that causes this. They just know that it's increasing as their models become more advanced. And that's my rough go-over of this.
J. Langley: So, I'm wondering, it'd be interesting to see, if I've got a feedback loop on some model that's doing things and I'm updating the inputs based on what I get at the output, whether there's some way I could bypass that and just jump straight into the middle layer somewhere.
David Showalter: Yeah.
J. Langley: um you know anyway Um,
David Showalter: ...interrupt. Yeah, I've honestly been curious, with turning on extended reasoning in Claude and GPT-5, you know, it's possible to interrupt it and choose.
David Showalter: Let me quit sharing here. You know, you can interrupt that and go ahead and get an answer from the model when it's halfway through its extended thinking process. It's been interesting to see the results. Sometimes I'll test it where I ask it and immediately interrupt it and see what it gives, I'll ask it and let it go all the way through, and I'll ask it and interrupt it in the middle, and all three give different results for the same question. So I'm kind of curious if that's what's happening there.
J. Langley: Huh, it's just some pretty cool stuff. And I still like to kind of compare this with what happens personally. I remember one time I was looking at my engineering notebook, and I've got something I was working on, and then in the middle I've got half of a grocery list of what I'm supposed to pick up on the way home, followed by the rest of the work. And I guarantee you I was writing something and took a phone call and was talking, and whatever I hear comes out in my writing, you know.
David Showalter: Yeah.
J. Langley: It's just kind of interesting. And then, so we're trying to figure out what models are thinking. I know there are some other things, but I would hate for somebody to be able to just tell what I'm thinking, other than, I mean, my poker face is horrible, but...
David Showalter: Yeah. Well, like I said at the beginning, that's why this paper interested me: we know so little about our own thought process, and it does make us think about this when we're looking at trying to define it for something else.
J. Langley: Yeah, that'd be cool. You know, because I do a decent amount of hiking and camping and stuff, and sometimes I'm a day or so without a TV or whatever, and it's interesting. You're just walking along and all of a sudden something pops into your head that is obviously from a commercial you saw two days ago, and it's like, why? What in the world? How does this work?
J. Langley: That's cool stuff. Let's see if I can get switched over. We're going to switch gears. Well, actually, before we go into any other kind of gears, any other questions or comments on the introspection piece?
David Showalter: And for anybody interested, from the links Jay sent out, I'd definitely read at least their little research summary. It does a good job with a lot more figures than I showed, so you can scroll through it pretty quick.
J. Langley: All right. Well, let me I guess I will go camera on. I think maybe it lit up. There we go. Okay. And then figure out how to share. There we go. I think share the whole screen. What do I care? That works. All right. And I see what you mean by this little window thing over here that I have to move. Um, so initially, um, yeah.
David Showalter: Well, yeah, and the stop sharing or hide button at the bottom of your screen, that kept getting in my way.
J. Langley: Oh, I see that, too. Okay, hold on. Where's that? Hide, please. Okay, that went away. So there was an OpenAI, I'm calling it a paper because it was published, but it's more the results of a study they did evaluating AI model performance on real-world, economically valuable tasks. One of the things we'll talk about a little bit is I'm still not quite sure why they picked the way they did to figure out which tasks. I guess if you were looking at this as, okay, I'm OpenAI, I make money off of doing X, Y, and Z, so I'm going to focus my efforts on places where it's going to be more valuable, and therefore I could probably charge more for what I'm doing. Maybe. I don't know. They didn't really get into some of the background of why they did some things. We'll open the paper real quick in a minute.
J. Langley: I just wanted to cover these things, which are actually copies of pieces from the paper, just to get started. They covered nine sectors; basically they picked the top nine contributors to GDP that they could find. Another thing we'll cover is they were looking at things that were primarily digital, so that let them screen out a whole lot of things that weren't part of this study. Anyway, if you were along with us, I think it was two months ago, we went through a Microsoft paper for Copilot that was similar to this, and you'll find some of the same gaps. Microsoft basically said theirs was generative-AI focused, but it was technically only on text and on Copilot usage. This one doesn't quite make those same kinds of claims, which was good, but it does have some of the same gaps. So they picked 30 tasks per occupation.
J. Langley: They've actually dropped a pretty good set of these publicly, and we'll kind of cover what that looked like. They only dropped, I think, 200-something in their public piece, but the set this study was based on was 1,320 tasks across 44 occupations. A little background before we get into it, and the interesting thing is this is an exact copy of what we covered for the actual Microsoft paper. They quite heavily use, and I'm still not sure how to say this, the O*NET database, which is, we'll see if I can pop this open real quick, pretty much, think of it as an ontology of occupations: for this occupation, what kinds of tasks do I do, how often do I do these tasks. This is a pretty good data set overall if you were looking for something like, I'm looking for a particular task, and if I can fix this one thing, how does it affect the most occupations; you can actually see some things like that. So that was interesting. And then they also pulled data from BLS, which is where they got some of their
J. Langley: GDP numbers. And again, in their paper they actually reference it; you can click through and see the exact data they were looking at. But that's also what kind of puts a constraint on this, because both of those are US-only data sets. So what we've got from both that Microsoft paper for Copilot and the OpenAI paper for GDPval is US-only. It's okay, but if we're looking at occupational impacts, it would be interesting to see that much more globally. But hopping over to the paper: initially they talk about some of the other studies that have been done from an occupational perspective, and whether they're looking at new types of work, or, like the Anthropic Economic Index, which I don't know that we'll have time to get into tonight, it's much more of a look at Claude and the adoption of it.
J. Langley: Where is it being used? What segments of the country, or even globally? There are a lot of studies working through that. The thing with even the Microsoft paper on Copilot is they were looking to see how often Copilot use was done in an augmentation manner versus an automation manner, trying to figure some things out from that. In this case, and one of the reasons I really like it, is that they took much more of, almost, a Turing test approach. I'm going to give a task to a person and I'm going to give the same task to a model. We'll give them both the same inputs, we'll require the outputs to be in this form, and then we will give those outputs to a third party and see which one they pick. So it's a little more straightforward from that perspective.
J. Langley: I like that they put pictures in the paper, so that was good for me. But again, this was their first version of it, which I'm hoping means they're going to come up with additional ones as models come out. One of the results of their study was that for OpenAI models, they see a pretty linear advancement over time on this. The other thing I liked about it was that while they can only do certain benchmarks with OpenAI models, they also pulled in models from Google, Anthropic, and xAI. So you'll see some of that in here, right?
David Showalter: Oh, nice. So, not like the Microsoft one that was just Copilot.
J. Langley: And yeah, Anthropic just blew the doors off everybody. So that'll be interesting to see.
J. Langley: So again, they talk about how they used GDP as a way to filter things, and they had to filter somehow, because the approach they took isn't just expensive from a money standpoint but also from a time-it-would-take-to-do-this perspective. The paper actually talks through the 30 tasks per occupation: actual work products created by an expert professional. They go into pretty in-depth detail, like in one of the appendices, on who these people were and who they worked for; the average experience of their experts is 14 years doing that job. So they had an expert break down each task into: here are the items we have coming in, there's some work that happens, and then here is my set of artifacts when I'm done. And they actually walk through some of the reasons why they think that's better. I think it's good; I think it is a nice addition to the set of things we have so far.
J. Langley: And these were multimodal. In some cases it was just text, in other cases it was building a slide deck, in some cases it was "I need a video that does X, Y, or Z," things like that. Let's see.
David Showalter: Another key question on my mind is, are dredge operators still safe?
J. Langley: Uh, yes, they are. Well, actually, they're not included in this study because that was deemed to be not digital. And they even go through a decent amount of text in the paper itself talking about how they concluded whether something was digital or not digital. I'm going to skip that part; I'll take their word for it. It looked good enough to me, and some of it's fairly obvious. Let's see. So the tasks they had took an average of seven hours of work for an expert, and on the high end they said multiple weeks' worth of work to get to a result.
J. Langley: One of the other things they talked about, about why they like going this direction, is you can always come up with tasks that take longer and augment this later. There are a lot of benchmarks where models have made it to 98% accurate and we're past the usefulness of that benchmark at some point, except to spot problems or maybe regressions. They were trying to make this a little more evergreen, where they have a framework now, and they've got something set up so that, okay, here's a new kind of thing that is now emerging, let's create some tasks and add that to the data set. It goes through how they prioritized: they picked sectors that were over 5% of GDP, and they picked five occupations within each sector.
J. Langley: And again it talks a little bit about digital, and they go through a little bit of that. Here are the nine sectors they picked. So you've got real estate, rental and leasing; government; manufacturing; and some of these get kind of interesting when you look. When I think manufacturing I'm thinking more of a physical thing, but for them, you start digging into some of their data and there are a lot of time studies and other kinds of logistics planning and parts lists and purchasing agreements. It's quite interesting once you start peeling the onion there. We land here in professional, scientific, and technical services, right next to lawyers, which I thought was interesting; I was hoping we had one or two on the call tonight, but we don't. Then retail trade and wholesale trade, and I don't know the difference, but I'm sure there is one. And information, finance, and healthcare and social assistance.
J. Langley: They go through how they found their experts, here are some of the companies they worked for, they had to pass a background check, a bunch of stuff like that. I'm going to keep skipping through. It walks through the quality control they set up for defining these tasks: the expert makes it, then there's a second-stage review that's iterative, and then another iterative review, and whatnot. They spent a decent amount of time talking about how initially this was a blind pairwise comparison, where the person doing the judging gets given two different things and has to pick between them. They also, because even doing that takes a decent amount of time, built a second model to do an evaluation based on what the expert was doing. They've also got that provided, I think, as part of this; I don't know if you have to submit your stuff to them to get it graded, or if they make that model available to run yourself.
J. Langley: So anyway, their pairwise grading setup: the judge is handed the inputs, as far as setting the table for the context, and they receive the artifacts that were the output of the task, and they have to say whether they would pick A or B. So what you get is, if we're equal, as in it picks the AI version or the human version 50/50, there's this line up here. And as you can see right now, Opus 4.1, oh, and this is, I don't know why you would let it tie. I guess that's kind of why sometimes I have like eight different priority-one items on my list; I guess it's human nature that sometimes we don't want to make a choice. So they were calling this a win rate: if I either win or it's a tie, for Opus 4.1 we're at 47.
J. Langley: 6% on average across these tasks. They've got stuff that we'll hit later that actually has this percentage broken out across the sectors; they split this data out in a bunch of different ways that we'll see later. But we'll bring that up in a minute, because we'll see it.
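For anyone wondering how a number like that 47.6% gets computed, here is a plain reading of the win rate from blind pairwise judgments, counting a tie the same as a win. This is an assumption about the bookkeeping, not OpenAI's actual scoring code, and the sample data is made up.

```python
# A "wins-or-ties" rate over blind pairwise judgments. Illustrative only; the
# judgments list and the 40/8/52 split below are invented for the example.
from collections import Counter

def win_or_tie_rate(judgments):
    """judgments: list of 'model', 'human', or 'tie' picks from the graders."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return (counts["model"] + counts["tie"]) / total

sample = ["model"] * 40 + ["tie"] * 8 + ["human"] * 52
print(f"{win_or_tie_rate(sample):.1%}")  # 48.0% for this made-up sample
```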
David Showalter: I'm guessing the tie is just because there could be multiple solutions that are equally good for the same problem, kind of thing.
J. Langley: We'll see that later. There was something we'll hit towards the back of the paper where they actually took a look at, when it lost, why did it lose. Was it just catastrophic, as in "oh, this is crap," so it's obvious, or was it something else? They had like four different reasons, so some of that will play into it. That was pretty interesting. And you can definitely see, they also did a comparison of the OpenAI variants at different reasoning levels.
J. Langley: So all of these are like GPT-5 high, or o3 high, o4-mini high. They cranked the knob up all the way, because they've got budget and they can. I don't know, from a timeline perspective, you almost have to infer when these things dropped, because I've got o3 high but I've also got 4o... yeah, well, these are different, never mind. Does this kind of match time over the last year, or is it just because they stacked them in this order?
David Showalter: GPT-5 high came out after Opus 4.1. I want to say GPT-5 was, what, about two months ago, and Opus 4.1 was early summer, if I remember right.
J. Langley: Okay. Were they close or... oh, okay. Got it. So it may not be a timeline, but it could be as models advance within each family, maybe.
David Showalter: At the same time, I'm guessing they did this through the API, and that's why they chose the models they did.
J. Langley: Um, okay.
David Showalter: And GPT-5 high, I don't think, is as high performance as GPT-5 Pro. Not 100% sure, but...
J. Langley: That makes sense. So another thing they mentioned: as most of you that have dealt with model outputs before know, sometimes you can look at something and go, "Oh, that's Claude," or "Oh, that's Grok." The helpfulness of Claude sometimes makes it wildly obvious that something was Claude. So what they were saying was, even though the judges weren't supposed to know whether it was model generated or not, sometimes you can tell, and they don't have a way to know whether that impacted any of the results. So looking at just the OpenAI versions, looking at their models over time, they're basically saying, yeah, going from a 10ish percent up to close to 40 in, I mean, a year and a half, maybe not a year and a half, but 15 months, I guess.
J. Langley: I don't know. Math. It's pretty good. I'm not sure where they're headed next or what will drop next. They also went through, and I'm still not quite sure, there is a gap here as well, they went through cost savings and tried to figure out kind of a metric, and this is something I may actually use in some future work. It is kind of interesting: if you've got an expert who is good at what they do and you're asking them to use an AI to do the thing they're good at, there's inherent resistance sometimes that you'll run into. But if I say, "Well, just try it, and if you don't like it, just don't use it, it's okay, it didn't cost you anything to just go dump it in and try it."
J. Langley: And then, so do it that way, or you can actually iterate over it and try it a couple of times, where you wind up getting into prompting and things like that. But what they wound up coming up with, and this is the one I'm not quite clear on why they picked this way to do it, was their whole "hey, just try it once, and if it works, great, and if it doesn't, just fix it and move on." 4o was way down here; it actually slowed people down. Unassisted is just straight across. o4-mini is here, and this is speed improvement over time, and you can kind of see it moving up and to the right. And basically what they're saying is an expert plus a GPT-5 high model is one and a half times as fast and one and a half times as cost effective. The part they missed, I think, is that a lot of times, like right now, I'm using the heck out of o4-mini to do some things.
J. Langley: I don't have it doing my job; I have it doing something totally different. So let's say I'm developing a new application. I may have o4-mini just building out the full test set for it, and it's not that it made me faster, it actually did something that I don't even have to do right now. And yeah, there is some time for cleanup and doing things there, but there are things that it's super good at that I didn't have to do anymore. So I'm not even sure how they would classify that. I'm not 1.5 times faster at doing that; I didn't even have to do it at all. The other... right.
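A back-of-the-envelope version of the "just try it once" workflow being discussed: the expert always pays a prompt-and-review cost, and redoes the task by hand only when the model output isn't usable. The formula and the one-hour review figure are illustrative assumptions, not the paper's savings model; only the seven-hour average task length comes from the discussion above.

```python
# Toy expected-time model for the "try it once, fix it if needed" workflow.
# The review cost and the formula itself are assumptions for illustration.
def expected_time(human_hours, prompt_review_hours, win_rate):
    # Always pay to prompt and review; redo by hand when the output loses.
    return prompt_review_hours + (1 - win_rate) * human_hours

human_hours = 7.0          # paper: average task took an expert ~7 hours
prompt_review_hours = 1.0  # assumed cost to prompt the model and review output
for win_rate in (0.1, 0.3, 0.5):
    t = expected_time(human_hours, prompt_review_hours, win_rate)
    print(f"win rate {win_rate:.0%}: ~{t:.1f} h vs {human_hours} h, "
          f"speedup {human_hours / t:.2f}x")
```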
David Showalter: But you also have to take into account, you know, the learned experience from using the AI to do the things you are aware of. And that kind of transitions over: you learn things like how to actually read what it outputs, double check it, and try it against itself.
David Showalter: little tricks to help make it
J. Langley: Yep. We actually ran into something, I mean, the conversation today matched. If you were in the room last week, Josh was talking about agents, and how sometimes they don't want to do something and then they will try to explain to you why that's not really what you were asking for, or why it isn't needed. I had an agent today refuse to do a unit test for something because the unit test wasn't valuable. Anyway, I get what you're saying. And then the other thing they don't really get into here is the cost of compute and the cost of things overall when you're using something like this, versus the cost of my time to manage it or do it. I don't see some of that in here. And both of these papers, both this and the Microsoft one, kind of dance around the whole issue of, well, is this augmentation?
J. Langley: This assumes that we're augmenting somebody instead of replacing them, but they're kind of dancing around some of that. Let's see. Oh, also, they went through and tried to figure out why the judges preferred certain things, and this was kind of interesting. Apparently GPT models are better at actually following direction than Claude, but the formatting of Claude is more, I guess, likable, or whatever it is; the judges liked the output better from Claude. So that was kind of interesting. Here's one where they actually cranked up the reasoning level of each model and then checked, oh, never mind, this was failure modes, as far as why it failed: Gemini and Grok really didn't want to follow instructions, formatting was weaker, anyway, y'all can read that. And my trackpad is not working as well as I wish it would.
J. Langley: Oh, there was another piece: they also did some work actually tuning prompts. I think they've got the prompt in here that they were using for the tasks, and I think they did another piece where they tuned the prompt to see what kind of impact that has. And guess what? It matters. Go figure, better prompts have better results. It is interesting that they open sourced the prompts in their task subset; that was pretty interesting. They tried to remove anything that would let you know who wrote the task. They go through some limitations, which of course we'll skip, and they've got a bunch of references, and then you get to some other interesting parts down here. Oh, this was fun: some of the tasks, in occupations like film, law, politics, and literature, include things that you don't want to bring up at work.
J. Langley: So they throw that out there. Well, I guess unless your work is film, literature, law, or politics. Yes.
David Showalter: Yeah, but it looks like they did leave them in there to see how the models do. Did I read that right?
J. Langley: Um but yeah, they’re basically saying, “Hey, um there are jobs that do these things and this is part of, you know, um that’s like that.”
David Showalter: Cool.
J. Langley: It also, some of the stuff refers to trademarks and other types of things that were part of those tasks. Let's see, here's some math. If you like naive time, here's a naive ratio, and then here's the math they use for the "try it once and then fix it" approach. What I really want to get to is some of their other results where they actually broke down win rates by sector. There we go. And I didn't really see too much other than that the order these are in, as far as which models did better, pretty much held across sectors.
J. Langley: Sometimes you'll see one where, you know, o4-mini was around the same as Grok for this particular sector, but then it jumped for manufacturing, and I'm not exactly sure why. I guess you'd have to actually jump down into the tasks per sector to try to figure out why some of these switched the way they did. I just say that because I'm not sure you could use this to say, "Oh, I am working in finance, which one should I use?" I don't know that you could use it for that. But it does show you that, for the sectors they picked, retail trade was super high over here, but then information, and this was more of your film editing; don't think of this as information as in IT, this is more straight information. The technical services one, which is where software development landed, would have been in here.
J. Langley: So that's there. I'm trying to get down to... let me zoom back.
David Showalter: So that's saying, for government and for sales, that Claude Opus 4.1 is better than most experts.
J. Langley: Yeah. Which I'm still trying to figure out. Government: what does that exactly mean? Here's one where they actually broke it out by occupation itself. So forget sectors, let's pick some occupations within a sector and look at an actual occupation. This was one where I think you could actually take some interesting observations from, if you're a first-line supervisor, which means there's an AI model whose output, for some of the tasks, would in some cases be picked over what an expert at that job would do. And that number is, what, 58%, I think. Then looking at shipping and receiving and inventory counting, and software developers.
J. Langley: This one hits home a bit, but back to the thing I was saying earlier, where I've got an agent sitting there building the unit tests along with the code that I'm writing, kind of as if it were a person sitting there doing that. So that is a thing. Lawyers...
David Showalter: For the shipping, receiving, and inventory one: I had a friend last year that created a tool for his company, which was doing wholesale reselling of extra goods. So they would get truckloads with just 10,000 different items in them.
J. Langley: right?
David Showalter: So they trained up something that would actually identify everything, populate their sales list with it and the SKUs and everything, and the model was able to handle it all.
J. Langley: Oh, cool.
David Showalter: So I can believe
J. Langley: I've seen one thing that you just walk around with at thrift shops and stuff, where you've just got your phone set up. It was looking at images of things as you're walking through, and if it happened to see something that flagged a certain value, it just let you know.
J. Langley: Things like that. So accountants are pretty safe at the moment, but again, I'm not sure exactly what kinds of things they were looking for. Another thing that I didn't hit that's in here somewhere, that they identified as one of the constraints or gaps, is that all of these tasks are self-contained. All of the information needed for the task is in what's given to the thing doing the task, and the output is something that can also be evaluated by itself. So where you have things where you may have to reach out and talk to multiple co-workers to go get information, they were saying you shouldn't draw those conclusions from this, because that's a gap they had. Then customer service reps, pharmacists; yeah, I don't want an AI pharmacist at the moment. This one I didn't expect: private detectives and investigators.
J. Langley: I guess you can look up a lot of things these days. Film and video editor. Some of these, and some of you that are working in AI and see kind of where trends are moving, you could probably look at this and guess where some of these will have moved to within a year. That would be an interesting guess. Let me zoom out some. They also broke it down based on whether the deliverable is pure text or a spreadsheet or a PDF; that was interesting. And this one, what I mentioned earlier on the current time, I can't even remember what the name of the metric is for how much time it is. David, you'd mentioned it, do you remember the name of it? The concept is: what is the longest task, in time, where if it would take me, say, five hours,
David Showalter: Uh, sorry. Um,
J. Langley: can an AI agent, or some AI model, accomplish that task with a 50/50 shot of getting it right? And what was it, like seven hours or something, was the current number, something like that. And it's like the new Moore's law for AI, where that number doubles every X number of months. So they... yeah.
David Showalter: Yeah, it's every seven months. It's been doubling for a few years now, but I can't remember the name; I think it's autonomous task time completion.
J. Langley: Okay, they should have named it something like Moore’s law, so it’s easy.
David Showalter: Yeah, everybody’s just referring to it as the new Moore’s law, which doesn’t help.
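The doubling trend being referenced, often attributed to METR's 50% task-completion time horizon, projects forward like this if you take a roughly seven-hour horizon today and a seven-month doubling period, both taken loosely from the conversation above; treat the numbers as illustrative.

```python
# Toy projection of the "new Moore's law" horizon: task length a model can
# finish about half the time, doubling every ~7 months. Numbers are taken
# loosely from the discussion and are illustrative only.
def projected_horizon_hours(current_hours, months_ahead, doubling_months=7):
    return current_hours * 2 ** (months_ahead / doubling_months)

for months in (0, 7, 14, 28):
    print(months, "months:", round(projected_horizon_hours(7.0, months), 1), "hours")
```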
J. Langley: That's pretty good. But kind of what they were looking at, and I could see this playing out as well: as those models get better and better, I would expect to see these numbers start to rise on the right-hand side.
J. Langley: And then, how am I on time? A little after seven, we're getting close. The other thing, and this was very interesting for me: they looked at where it lost, so where the judge picked the human output over the AI output, and they put it into four buckets. One of them was catastrophic, like I was mentioning, as in, well, this is horrible, or this is gibberish. Then bad, as in, well, it's not nonsense, it's just wrong. Then acceptable but not as good. And then N/A, where they just didn't agree or something. The interesting thing for me is the acceptable bucket: on average these models were picked 47% of the time, and out of the other 53%, almost half was still deemed acceptable but not as good. And that comes to the point of, well, okay, what if acceptable is where we're aiming for some tasks? You might actually see that if they had modified the judging to say, here is the cost of each one, which one would you pick, or something like that.
J. Langley: That would be an interesting thing to look into.
David Showalter: That was one of the topics at the MATLAB Expo AI keynote this morning: okay, so you have a five-minute process that gets you 95% of the way there. If that's good enough, do you need to spend two weeks to get the extra 5%?
J. Langley: I've had a similar thing, and I'll just scroll back up to this one because I think this is where we'll end it. I have a similar approach running development teams. I used to have a developer that worked with me at one point whose comment was always, "Well, we're 95% complete, but the other half is going to take a while," and he was always right: gold-plating and other kinds of stuff. And I learned from that experience that with some developers, when they hit 95% of the way there, I will actually get them something new to work on, because the last 5% sometimes isn't quite as fun.
J. Langley: And you wind up handing it over to somebody that's good at closing things out, and everybody's happy. Otherwise you sit there at 95, and then 96, and then 96, you know. That last 5% is tough. But yeah, I could definitely see something like that. The thought I had as well is that this study covered what models are good at accomplishing tasks, whereas the Microsoft study was looking into how often different occupations are actively using something like Copilot to do things, and at which point they're automating versus augmenting, things like that. And since both of these studies, and I don't think the Copilot one actually released their full data set or anything, both of them are based on the O*NET ontology of occupations and tasks. Both of them used data from BLS; the Copilot one used it to see how many people were in a job, what their average education level was, and what their salary was.
J. Langley: It would be super interesting to see both of these things combined and to do one more round or something. But anyway, and this is page 17, they go through a boatload of additional material if you're really interested in some of the probabilities and some of the other stuff; they've got a lot in here. Oh, also, this was kind of fun: how they did the task specifications and what kind of person they picked, like for healthcare, well, here's a nurse with 18 years of experience in emergency. It's pretty interesting to see how they broke things down. I do like the transparency in how they did certain things; if you want to know what they used, they go into pretty good detail. But anyway, that is what I had for that. Let me go back, because I made notes at the very end to make sure I cover everything.
J. Langley: Also, I think this one dropped in October. There's the Anthropic Economic Index that came out recently, and the Microsoft one, I think, was back in September. So there seem to be several companies, or foundation model factories or whatever, that are looking at this and thinking it through. I think that's all I had. Any thoughts or comments about this one?
David Showalter: No, it's fascinating. There's been a lot of media buzz about how this shows AI isn't good for anything, which is funny.
J. Langley: Yeah. And just last week I was doing the Congressional Office of International Leadership thing, or whatever it was. They had a delegation they wanted me to come talk to about AI, and one of the questions I got was, is AI going to take our jobs? And not in a rhetorical way; really wanting to know what jobs, and if so, which ones. And I'm like, nobody knows; first, if somebody tells you they know, they don't. But yeah, it's on people's minds at the moment. I mean, this was coming on the heels of AWS, or Amazon, laying off like 14,000 people.
J. Langley: It's an interesting kind of thing. So I don't know that I have an opinion. I would say using AI is probably going to, I don't know, I don't know if I'd say insulate people, but if you're one of the people doing the AI stuff, it might be better for you. But I'm also kind of leading an AI group here, so I'm a little biased there. Let me stop sharing and get back over to my other window where I actually have the actual meeting going.
David Showalter: Josh, I liked your question earlier and I was curious too. Did they give the models tools or just prompts?
J. Langley: Oh, I have no idea. I would like to go look, since they did publish the set of tasks and whatnot, and I believe the prompt they used was also part of that. It'd be good to go actually pick up one of the tasks, especially if it's a software development one.
J. Langley: I'd like to know what kind of thing they thought would be a good task for software development. That might be a good follow-up or another post.
David Showalter: And my brain always gives...
Josh Phillips: One that was... you were doing the accountant one at that time, I think, and that's one where, you know, if it has access to a calculator tool versus the transformer having to do the math itself, that's
J. Langley: Right?
Josh Phillips: a completely different problem.
J. Langley: Yeah, let me guess: math.
Josh Phillips: Yeah.
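To make Josh's distinction concrete, here is a minimal Python sketch of what "has a calculator tool" can mean in practice: the model replies with a request like CALC(1234.56 * 0.0725), the harness evaluates it safely, and the result is handed back for a final answer. The `call_model` callable is a hypothetical stand-in for whatever chat API a benchmark harness might use; nothing here is claimed to match the study's actual setup.

```python
# Sketch of tool use vs. in-weights arithmetic: the model can ask the harness
# to run a calculation instead of guessing the number itself.
# `call_model` is a hypothetical stand-in for a chat API, not a real SDK call.
import ast
import operator

# Safe evaluator for the simple arithmetic expressions the model asks us to run.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression like '1234.56 * 0.0725'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

def answer_with_tool(question: str, call_model) -> str:
    """One round of tool use: the model either answers or requests CALC(<expr>)."""
    reply = call_model(f"{question}\nIf you need arithmetic, respond with CALC(<expression>).")
    if reply.startswith("CALC(") and reply.endswith(")"):
        result = calculator(reply[5:-1])
        reply = call_model(f"{question}\nThe calculator returned {result}. Give the final answer.")
    return reply
```

Whether the benchmark allowed anything like this is exactly the open question here; without a tool, the model has to produce the arithmetic from its weights alone.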
J. Langley: Oh man. Oh, no. There was one, back on the unit test things. This was at a different job. I was having AI build out a bunch of unit tests, and there was a math library associated with this. And yeah, it was good at unit testing and boundary testing certain things. We have some calculations that do coordinate transforms, you know, from lat/lon/altitude over to Earth-centered, Earth-fixed, things like that.
J. Langley: And it guessed some numbers, and it wasn't super far off, and it was freaky. Normally people would go, okay, let me transform it to X and then transform it back, and am I getting a similar thing? But no, it actually wrote: here's a latitude and longitude, and I'm expecting this ECEF XYZ, and it was within like a kilometer. Which, yeah, I'm not doing that math in my head. Crazy stuff. And I have a brother-in-law who's an accountant; I may actually go ask him. We don't have anybody in the group that does financial services, really. Actually, let me take that back. We've got somebody that joins occasionally from Killpoint, but he's not necessarily an accountant kind of accountant. He's more of a financial strategist or whatever.
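For reference, here is roughly what the roundtrip check described above looks like as a unit test: convert a geodetic point to ECEF with the standard WGS-84 formulas, convert back, and assert the latitude, longitude, and altitude come out the same within a tight tolerance. This is a generic sketch, not the actual math library or tests from that project; the function names and the Huntsville test point are illustrative.

```python
# Roundtrip-style unit test for a geodetic <-> ECEF transform using
# standard WGS-84 constants. Function names are illustrative, not the
# actual library's API.
import math
import unittest

A = 6378137.0                 # WGS-84 semi-major axis (m)
F = 1 / 298.257223563         # WGS-84 flattening
E2 = F * (2 - F)              # first eccentricity squared

def geodetic_to_ecef(lat_deg, lon_deg, alt_m):
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)   # prime vertical radius
    x = (n + alt_m) * math.cos(lat) * math.cos(lon)
    y = (n + alt_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - E2) + alt_m) * math.sin(lat)
    return x, y, z

def ecef_to_geodetic(x, y, z, iterations=10):
    lon = math.atan2(y, x)
    p = math.hypot(x, y)
    lat = math.atan2(z, p * (1 - E2))                # initial guess
    alt = 0.0
    for _ in range(iterations):                      # fixed-point refinement
        n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)
        alt = p / math.cos(lat) - n
        lat = math.atan2(z, p * (1 - E2 * n / (n + alt)))
    return math.degrees(lat), math.degrees(lon), alt

class RoundTripTest(unittest.TestCase):
    def test_huntsville_roundtrip(self):
        lat, lon, alt = 34.73, -86.59, 190.0         # roughly Huntsville, AL
        lat2, lon2, alt2 = ecef_to_geodetic(*geodetic_to_ecef(lat, lon, alt))
        self.assertAlmostEqual(lat, lat2, places=6)
        self.assertAlmostEqual(lon, lon2, places=6)
        self.assertAlmostEqual(alt, alt2, delta=1e-3)  # within a millimeter

if __name__ == "__main__":
    unittest.main()
```

What made the AI-written test notable is that it skipped the roundtrip and asserted absolute ECEF coordinates directly, and those guesses landed within about a kilometer.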
J. Langley: But oh man.
Josh Phillips: I think something of note there, especially since we talk about Sonnet a lot, is that Anthropic, I think it was just last week, released Claude for Excel or something like that. So they've probably done
David Showalter: Yes.
Josh Phillips: some fine-tuning with Excel examples and tools. So I mean, that's the sort of thing, you know, because it can do DAX queries and all that sort of highfalutin stuff inside of there.
J. Langley: Oh man.
David Showalter: And I was thinking the same for GPT, because they released those new tie-ins to the chat interface. So like, for the real estate job, there's now a full-on Zillow tie-in.
J. Langley: Oh yeah, that would be interesting.
David Showalter: And what did that do to...
J. Langley: The hard thing on this is that whole study had to have taken real calendar time for people to actually build out these artifacts, in addition to actually running the study. You're almost at a point where it's going to be hard to keep pace with, well, yeah, let's do it again in six months and see where we're at.
J. Langley: Well, and they’re probably like well crap it took us three months just to build the artifacts for this one. Um that might be kind of an interesting interesting concept. Um but they did Yeah.
David Showalter: It has been cool seeing human experts brought into these tests more. Like, there was a paper recently on Liar's Poker. I don't know if you guys saw that one, but it was about an AI's ability to play Liar's Poker versus legit world experts at playing it.
J. Langley: Uh-uh. Huh. Yeah, in stocks.
David Showalter: And they had different models play each other, and for each of those pairings they ran a thousand iterations. But because of the human limitation, they only had the human experts play a hundred hands against each model, just for time constraints.
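As a rough illustration of that kind of setup, a harness like the sketch below runs the model-vs-model pairings for the full thousand hands and caps the human-vs-model pairings at a hundred. The game itself is hidden behind a `play_hand` placeholder and the player names are made up; this is not the Liar's Poker paper's code, just the bookkeeping being described.

```python
# Illustrative evaluation harness: many hands per model-vs-model pairing,
# fewer for human-vs-model pairings. The game logic is a placeholder;
# this is not the actual study's code.
import itertools
import random
from collections import defaultdict

MODEL_PLAYERS = ["model_a", "model_b", "model_c"]   # hypothetical model names
HUMAN_PLAYERS = ["expert_1", "expert_2"]            # hypothetical human experts
MODEL_VS_MODEL_HANDS = 1000
HUMAN_VS_MODEL_HANDS = 100

def play_hand(player_1: str, player_2: str, rng: random.Random) -> str:
    """Stand-in for one hand of Liar's Poker; returns the winner's name."""
    return rng.choice([player_1, player_2])          # placeholder outcome

def run_matchups(seed: int = 0) -> dict:
    rng = random.Random(seed)
    wins = defaultdict(int)
    # Every model plays every other model for the full iteration count.
    for a, b in itertools.combinations(MODEL_PLAYERS, 2):
        for _ in range(MODEL_VS_MODEL_HANDS):
            wins[play_hand(a, b, rng)] += 1
    # Each human expert plays a reduced number of hands against each model.
    for human in HUMAN_PLAYERS:
        for model in MODEL_PLAYERS:
            for _ in range(HUMAN_VS_MODEL_HANDS):
                wins[play_hand(human, model, rng)] += 1
    return dict(wins)

if __name__ == "__main__":
    print(run_matchups())
```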
J. Langley: Yes. I can't remember who it was that mentioned it, but yes, we're looking at these papers, and all of these papers are also from companies that sell time to use these models, you know.
J. Langley: Is it something you can actually count on? How much faith do you put into it? And yeah, there are gaps, and there are places where I would have liked to have seen something different. I still don't know exactly why they picked GDP as their way to figure out which sectors to look at, unless, say, if I'm Anthropic, I'm trying to figure out where the value of work is, and if I can augment that, then, you know what I mean? It's obviously valuable to do that task, so if I could do that task, then I have something of value. I could see that playing into it. But some of the stuff they had on the software development side is pretty interesting, as in: if you were the judge and you were given two things, you are now 60% more likely to pick the one that's AI generated.
J. Langley: Um it’s it’s an interesting place. Um, interesting. Kind of in a scary way, but nothing.
David Showalter: Yeah, I run into that a lot, where people look at something and go, "Oh, I can tell that's AI generated." And I'm like, "Well, is it better?" You know? Yes, sometimes, like you mentioned, you can tell, oh, this was from GPT or this was from Claude, and it's really quick to tell. But I've found some people just stick on the "Oh, this was AI generated," and I'm like, "Yes, but is it better than a human?"
J. Langley: Right. And it's kind of difficult in that even this was subjective: there was a subjective step where a person decided, I want this one rather than that one. It's not that they passed a test that verified they're correct, you know. So that's another side of it. The other thing you could pull out is that if you really want to use this to automate stuff, it really helps if you have a self-contained task that doesn't depend on anything else and that has a defined deliverable at the end that can be judged.
J. Langley: So the other side of it: if that describes your job, well, then you may want to worry about things. If it doesn't, which is kind of like all of us, I have never really seen a problem described that well at a high level. At the moment I still have to talk to a lot of people and go coordinate and do a lot of other tasking just to do the thing that I'm supposed to do. But anyway, with that, if you've got additional thoughts on these, drop me an email or something. It'll probably be next year sometime before we can circle back, because I've got at least one session in December where I want to do a kind of year-end thing like we always do, and then I've got one that's still a topic TBD. So if you've got something for early December, let me know. And I also don't have a schedule yet for the next paper that we want to cover.
J. Langley: So that’s if you got thoughts on that, uh, ship them over Discord. Oh yeah, that works. Yeah, if I can if I can pencil you in for that, you got it.

