Simon Willison’s Weblog

Subscribe

566 items tagged “ai”

2024

Understand errors and warnings better with Gemini (via) As part of Google's Gemini-in-everything strategy, Chrome DevTools now includes an opt-in feature for passing error messages in the JavaScript console to Gemini for an explanation, via a lightbulb icon.

Amusingly, this documentation page includes a warning about prompt injection:

Many of LLM applications are susceptible to a form of abuse known as prompt injection. This feature is no different. It is possible to trick the LLM into accepting instructions that are not intended by the developers.

They include a screenshot of a harmless example, but I'd be interested in hearing if anyone has a theoretical attack that could actually cause real damage here. # 17th May 2024, 10:10 pm

I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it.

If a departing employee declines to sign the document, or if they violate it, they can lose all vested equity they earned during their time at the company, which is likely worth millions of dollars.

Kelsey Piper # 17th May 2024, 7:11 pm

But where the company once limited itself to gathering low-hanging fruit along the lines of “what time is the super bowl,” on Tuesday executives showcased generative AI tools that will someday plan an entire anniversary dinner, or cross-country-move, or trip abroad. A quarter-century into its existence, a company that once proudly served as an entry point to a web that it nourished with traffic and advertising revenue has begun to abstract that all away into an input for its large language models.

Casey Newton # 15th May 2024, 10:23 pm

PaliGemma model README (via) One of the more over-looked announcements from Google I/O yesterday was PaliGemma, an openly licensed VLM (Vision Language Model) in the Gemma family of models.

The model accepts an image and a text prompt. It outputs text, but that text can include special tokens representing regions on the image. This means it can return both bounding boxes and fuzzier segment outlines of detected objects, behavior that can be triggered using a prompt such as "segment puffins".

You can try it out on Hugging Face.

It's a 3B model, making it feasible to run on consumer hardware. # 15th May 2024, 9:16 pm

ChatGPT in “4o” mode is not running the new features yet

Monday’s OpenAI announcement of their new GPT-4o model included some intriguing new features:

[... 865 words]

If we want LLMs to be less hype and more of a building block for creating useful everyday tools for people, AI companies’ shift away from scaling and AGI dreams to acting like regular product companies that focus on cost and customer value proposition is a welcome development.

Arvind Narayanan # 15th May 2024, 4:25 pm

But unlike the phone system, we can’t separate an LLM’s data from its commands. One of the enormously powerful features of an LLM is that the data affects the code. We want the system to modify its operation when it gets new training data. We want it to change the way it works based on the commands we give it. The fact that LLMs self-modify based on their input data is a feature, not a bug. And it’s the very thing that enables prompt injection.

Bruce Schneier # 15th May 2024, 1:34 pm

Context caching for Google Gemini (via) Another new Gemini feature announced today. Long context models enable answering questions against large chunks of text, but the price of those long prompts can be prohibitive—$3.50/million for Gemini Pro 1.5 up to 128,000 tokens and $7/million beyond that.

Context caching offers a price optimization, where the long prefix prompt can be reused between requests, halving the cost per prompt but at an additional cost of $4.50 / 1 million tokens per hour to keep that context cache warm.

Given that hourly extra charge this isn’t a default optimization for all cases, but certain high traffic applications might be able to save quite a bit on their longer prompt systems.

It will be interesting to see if other vendors such as OpenAI and Anthropic offer a similar optimization in the future. # 14th May 2024, 8:42 pm

llm-gemini 0.1a4. A new release of my llm-gemini plugin adding support for the Gemini 1.5 Flash model that was revealed this morning at Google I/O.

I'm excited about this new model because of its low price. Flash is $0.35 per 1 million tokens for prompts up to 128K token and $0.70 per 1 million tokens for longer prompts - up to a million tokens now and potentially two million at some point in the future. That's 1/10th of the price of Gemini Pro 1.5, cheaper than GPT 3.5 ($0.50/million) and only a little more expensive than Claude 3 Haiku ($0.35/million). # 14th May 2024, 8:32 pm

How developers are using Gemini 1.5 Pro’s 1 million token context window. I got to be a talking head for a few seconds in an intro video for today's Google I/O keynote, talking about how I used Gemini Pro 1.5 to index my bookshelf (and with a cameo from my squirrel nutcracker). I'm at 1m25s.

(Or at 10m6s in the full video of the keynote) # 14th May 2024, 8:27 pm

Why your voice assistant might be sexist (via) Given OpenAI's demo yesterday of a vocal chat assistant with a flirty, giggly female voice - and the new ability to be interrupted! - it's worth revisiting this piece by Chris Baraniuk from June 2022 about gender dynamics in voice assistants. Includes a link to this example of a synthesized non-binary voice. # 14th May 2024, 4:16 pm

LLM 0.14, with support for GPT-4o. It's been a while since the last LLM release. This one adds support for OpenAI's new model:

llm -m gpt-4o "fascinate me"

Also a new llm logs -r (or --response) option for getting back just the response from your last prompt, without wrapping it in Markdown that includes the prompt.

Plus nine new plugins since 0.13! # 13th May 2024, 9 pm

Hello GPT-4o. OpenAI announced a new model today: GPT-4o, where the o stands for "omni".

It looks like this is the gpt2-chatbot we've been seeing in the Chat Arena the past few weeks.

GPT-4o doesn't seem to be a huge leap ahead of GPT-4 in terms of "intelligence" - whatever that might mean - but it has a bunch of interesting new characteristics.

First, it's multi-modal across text, images and audio as well. The audio demos from this morning's launch were extremely impressive.

ChatGPT's previous voice mode worked by passing audio through a speech-to-text model, then an LLM, then a text-to-speech for the output. GPT-4o does everything with the one model, reducing latency to the point where it can act as a live interpreter between people speaking in two different languages. It also has the ability to interpret tone of voice, and has much more control over the voice and intonation it uses in response.

It's very science fiction, and has hints of uncanny valley. I can't wait to try it out - it should be rolling out to the various OpenAI apps "in the coming weeks".

Meanwhile the new model itself is already available for text and image inputs via the API and in the Playground interface, as model ID "gpt-4o" or "gpt-4o-2024-05-13". My first impressions are that it feels notably faster than gpt-4-turbo.

This announcement post also includes examples of image output from the new model. It looks like they may have taken big steps forward in two key areas of image generation: output of text (the "Poetic typography" examples) and maintaining consistent characters across multiple prompts (the "Character design - Geary the robot" example).

The size of the vocabulary of the tokenizer - effectively the number of unique integers used to represent text - has increased to ~200,000 from ~100,000 for GPT-4 and GPT-3:5. Inputs in Gujarati use 4.4x fewer tokens, Japanese uses 1.4x fewer, Spanish uses 1.1x fewer. Previously languages other than English paid a material penalty in terms of how much text could fit into a prompt, it's good to see that effect being reduced.

Also notable: the price. OpenAI claim a 50% price reduction compared to GPT-4 Turbo. Conveniently, gpt-4o costs exactly 10x gpt-3.5: 4o is $5/million input tokens and $15/million output tokens. 3.5 is $0.50/million input tokens and $1.50/million output tokens.

(I was a little surprised not to see a price decrease there to better compete with the less expensive Claude 3 Haiku.)

The price drop is particularly notable because OpenAI are promising to make this model available to free ChatGPT users as well - the first time they've directly name their "best" model available to non-paying customers.

Tucked away right at the end of the post:

We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.

I'm looking forward to learning more about these video capabilities, which were hinted at by some of the live demos in this morning's presentation. # 13th May 2024, 7:09 pm

I’m no developer, but I got the AI part working in about an hour.

What took longer was the other stuff: identifying the problem, designing and building the UI, setting up the templating, routes and data architecture.

It reminded me that, in order to capitalise on the potential of AI technologies, we need to really invest in the other stuff too, especially data infrastructure.

It would be ironic, and a huge shame, if AI hype sucked all the investment out of those things.

Tim Paul # 13th May 2024, 2:35 pm

GPUs Go Brrr (via) Fascinating, detailed low-level notes on how to get the most out of NVIDIA's H100 GPUs (currently selling for around $40,000 a piece) from the research team at Stanford who created FlashAttention, among other things.

The swizzled memory layouts are flat-out incorrectly documented, which took considerable time for us to figure out.

# 13th May 2024, 4:08 am

experimental-phi3-webgpu (via) Run Microsoft’s excellent Phi-3 model directly in your browser, using WebGPU so didn’t work in Firefox for me, just in Chrome.

It fetches around 2.1GB of data into the browser cache on first run, but then gave me decent quality responses to my prompts running at an impressive 21 tokens a second (M2, 64GB).

I think Phi-3 is the highest quality model of this size, so it’s a really good fit for running in a browser like this. # 9th May 2024, 10:21 pm

Slop is the new name for unwanted AI-generated content

I saw this tweet yesterday from @deepfates, and I am very on board with this:

[... 301 words]

OpenAI Model Spec, May 2024 edition (via) New from OpenAI, a detailed specification describing how they want their models to behave in both ChatGPT and the OpenAI API.

“It includes a set of core objectives, as well as guidance on how to deal with conflicting objectives or instructions.”

The document acts as guidelines for the reinforcement learning from human feedback (RLHF) process, and in the future may be used directly to help train models.

It includes some principles that clearly relate to prompt injection: “In some cases, the user and developer will provide conflicting instructions; in such cases, the developer message should take precedence”. # 8th May 2024, 6:15 pm

Towards universal version control with Patchwork (via) Geoffrey Litt has been working with Ink & Switch exploring UI patterns for applying version control to different kinds of applications, with the goal of developing a set of conceptual primitives that can bring branching and version tracking to interfaces beyond just Git-style version control.

Geoffrey observes that basic version control is already a metaphor in a lot of software—the undo stack in Photoshop or suggestion mode in Google Docs are two examples.

Extending that is a great way to interact with AI tools as well—allowing for editorial bots that can suggest their own changes for you to accept, for example. # 8th May 2024, 1:44 am

gpt2-chatbot confirmed as OpenAI (via) The mysterious gpt2-chatbot model that showed up in the LMSYS arena a few days ago was suspected to be a testing preview of a new OpenAI model. This has now been confirmed, thanks to a 429 rate limit error message that exposes details from the underlying OpenAI API platform.

The model has been renamed to im-also-a-good-gpt-chatbot and is now only randomly available in "Arena (battle)" mode, not via "Direct Chat". # 8th May 2024, 12:33 am

Deterministic Quoting: Making LLMs Safe for Healthcare (via) Matt Yeung introduces Deterministic Quoting, a technique to help reduce the risk of hallucinations while working with LLMs. The key idea is to have parts of the output that are copied directly from relevant source documents, with a different visual treatment to help indicate that they are exact quotes, not generated output.

The AI chooses which section of source material to quote, but the retrieval of that text is a traditional non-AI database lookup. That’s the only way to guarantee that an LLM has not transformed text: don’t send it through the LLM in the first place.

The LLM may still pick misleading quotes or include hallucinated details in the accompanying text, but this is still a useful improvement.

The implementation is straight-forward: retrieved chunks include a unique reference, and the LLM is instructed to include those references as part of its replies. Matt's posts include examples of the prompts they are using for this. # 7th May 2024, 7:08 pm

Watching in real time as “slop” becomes a term of art. the way that “spam” became the term for unwanted emails, “slop” is going in the dictionary as the term for unwanted AI generated content

@deepfates # 7th May 2024, 3:59 pm

OpenAI cookbook: How to get token usage data for streamed chat completion response (via) New feature in the OpenAI streaming API that I've been wanting for a long time: you can now set stream_options={"include_usage": True} to get back a "usage" block at the end of the stream showing how many input and output tokens were used.

This means you can now accurately account for the total cost of each streaming API call. Previously this information was only an available for non-streaming responses. # 7th May 2024, 2:46 am

I believe these things:
1. If you use generative tools to produce or modify your images, you have abandoned photointegrity.
2. That’s not always wrong. Sometimes you need an image of a space battle or a Triceratops family or whatever.
3. What is always wrong is using this stuff without disclosing it.

Tim Bray # 4th May 2024, 4:26 pm

I used to have this singular focus on students writing code that they submit, and then I run test cases on the code to determine what their grade is. This is such a narrow view of what it means to be a software engineer, and I just felt that with generative AI, I’ve managed to overcome that restrictive view.

It’s an opportunity for me to assess their learning process of the whole software development [life cycle]—not just code. And I feel like my courses have opened up more and they’re much broader than they used to be. I can make students work on larger and more advanced projects.

Daniel Zingaro # 3rd May 2024, 6:17 pm

AI is the most anthropomorphized technology in history, starting with the name—intelligence—and plenty of other words thrown around the field: learning, neural, vision, attention, bias, hallucination. These references only make sense to us because they are hallmarks of being human. [...]

There is something kind of pathological going on here. One of the most exciting advances in computer science ever achieved, with so many promising uses, and we can’t think beyond the most obvious, least useful application? What, because we want to see ourselves in this technology? [...]

Anthropomorphizing AI not only misleads, but suggests we are on equal footing with, even subservient to, this technology, and there’s nothing we can do about it.

Zach Seward # 2nd May 2024, 7:44 pm

Llama 3 prompt formats (via) I’m often frustrated at how thin the documentation around the prompt format required by an LLM can be.

Llama 3 turns out to be the best example I’ve seen yet of clear prompt format documentation. Every model needs documentation this good! # 1st May 2024, 6:32 pm

We collaborate with open-source and commercial model providers to bring their unreleased models to community for preview testing.

Model providers can test their unreleased models anonymously, meaning the models’ names will be anonymized. A model is considered unreleased if its weights are neither open, nor available via a public API or service.

LMSYS # 30th April 2024, 8:35 pm

My notes on gpt2-chatbot. There's a new, unlabeled and undocumented model on the LMSYS Chatbot Arena today called gpt2-chatbot. It's been giving some impressive responses - you can prompt it directly in the Direct Chat tab by selecting it from the big model dropdown menu.

It looks like a stealth new model preview. It's giving answers that are comparable to GPT-4 Turbo and in some cases better - my own experiments lead me to think it may have more "knowledge" baked into it, as ego prompts ("Who is Simon Willison?") and questions about things like lists of speakers at DjangoCon over the years seem to hallucinate less and return more specific details than before.

The lack of transparency here is both entertaining and infuriating. Lots of people are performing a parallel distributed "vibe check" and sharing results with each other, but it's annoying that even the most basic questions (What even IS this thing? Can it do RAG? What's its context length?) remain unanswered so far.

The system prompt appears to be the following - but system prompts just influence how the model behaves, they aren't guaranteed to contain truthful information:

You are ChatGPT, a large language model trained
by OpenAI, based on the GPT-4 architecture.

Knowledge cutoff: 2023-11
Current date: 2024-04-29

Image input capabilities: Enabled
Personality: v2

My best guess is that this is a preview of some kind of OpenAI "GPT 4.5" release. I don't think it's a big enough jump in quality to be a GPT-5.

Update: LMSYS do document their policy on using anonymized model names for tests of unreleased models.

Update May 7th: The model has been confirmed as belonging to OpenAI thanks to an error message that leaked details of the underlying API platform. # 29th April 2024, 8:45 pm

The creator of a model can not ensure that a model is never used to do something harmful – any more so that the developer of a web browser, calculator, or word processor could. Placing liability on the creators of general purpose tools like these mean that, in practice, such tools can not be created at all, except by big businesses with well funded legal teams.

[...]

Instead of regulating the development of AI models, the focus should be on regulating their applications, particularly those that pose high risks to public safety and security. Regulate the use of AI in high-risk areas such as healthcare, criminal justice, and critical infrastructure, where the potential for harm is greatest, would ensure accountability for harmful use, whilst allowing for the continued advancement of AI technology.

Jeremy Howard # 29th April 2024, 4:04 pm