on-device #008

The web browser is such an amazing concept, from the very first moment I used one right up until today. You type in a URL, something runs on the server, and the result is displayed in the browser. It got me hooked in ’95, and keeps me hooked today. Back in the day, Perl was my jam. I loved to create login systems, e-commerce sites, email hosting providers. You name it, I could get the server to mash out HTML like nobody’s business.

When I was 17, I started a company called PCBware in the UK with two of my friends, Chris and Ben. We were directors of a company and we felt like we could do anything… we even had a domain name! And what we thought was an awesome company name. It wasn’t until someone said it sounded like you should be scared of PCs that we had some second thoughts.

Earliest image I can find of PCBware’s homepage in 1998

As we got more confident building software on the web and wrestling with Apache to automate the creation of websites, we morphed it into a hosting company that let businesses create a site quickly and have a presence online. I’ve got fond memories of that time; the company (at least when I was there) was not successful, but it was a time when I could just focus on building… We timed it just right to jump on the web.

The web had this very clear separation: logic ran on the server and was presented in the UI. We’d played around with JavaScript, JScript and DHTML, but the majority of what we did on the client was rollover effects on images and some basic UI validation. All I wanted, though, was to be able to do more in the browser. Many of our potential customers wanted the page never to refresh, but many of the fundamental technologies and best practices just didn’t exist at the time. XMLHttpRequest… that was still two years away!

While I was “running” the company, I was also at university (hello to my John Moores friends!) and the email system they had was Outlook. I didn’t have Outlook on my personal machine, but I did have a browser. I remember logging into it for the first time and boop! an email just appeared in my inbox… I didn’t refresh. What is this magic!?

There have been more seminal moments for me on the web since: Gmail. Google Maps. Multi-user Google Docs. That early-2000s period changed what we expected of the web, and the combination of on-server and on-device execution, along with the expectation of local interactivity in the browser, has not stopped since. Today we are able to store data sandboxed inside the origin, run applications completely offline, install them onto the device, access hardware on the user’s device, and now run “machine code” via WASM and GPU shaders.

This progression is interesting to me because the story is less about the raw technology and more about the change it enables in the ecosystem. Yes, the web runtime wasn’t as fast as some app experiences, so you might never ship your AAA game on the web, but device performance was never a reason not to go on-device for the vast majority of experiences, and we’ve now got all manner of experiences running in the browser, all the way up to Photoshop. The SLICE principles are what make the web the best platform to deploy on.

As I noted in transition, I’d tracked on-device AI for some time because of projects like MediaPipe and TensorFlow - it was incredibly neat to see real-time image segmentation and the like - but it wasn’t until ChatGPT launched that I dusted off my old machine-learning books and got back into thinking about what might be possible in the browser. I went off and explored the ecosystem. I built a button detector using TFLite… and it was incredibly exciting: a new class of apps available to people, all inside the browser. I realised that we’re at a new tipping point in capability for the browser and what it can enable for people.

The web has always been an amazing ground for experimentation; I think that’s due to the comparative ease of getting something running and sharing it with people. AI-based experiments are no different, and there are so many different ways to think about what on-device can mean, for example:

  1. Using a framework like TFLite, ONNX or Transformers.js to load a custom model and run it in the browser, either via WASM or WebGPU (see the sketch after this list).
  2. The in-development WebNN API that will in theory let you load and execute models in a standardised way against the hardware that the user has.
  3. The experimental prompt-based APIs in Chrome that are now multi-modal and let you have instant access to models like Gemini Nano or Phi.
  4. The built-in, use-case-based APIs like Chrome’s (Summarize, Write, Rewrite) that handle common tasks without you having to download a model or even think about AI at all.
  5. Accessing a local server that hosts models, like Ollama.
  6. Accessing OS-provided models (like what Apple just announced)
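
To make the first option concrete, here’s a minimal sketch using Transformers.js (the exact package name and the pipeline’s default model are whatever the library currently ships, so treat the specifics as illustrative): the weights are fetched on first use, cached by the browser, and inference then runs entirely on-device via WASM or WebGPU.

```js
// Minimal sketch: load a small model with Transformers.js and run it
// on-device. The first call downloads and caches the weights; after that,
// inference needs no network at all.
import { pipeline } from '@huggingface/transformers';

const classifier = await pipeline('sentiment-analysis');

const result = await classifier('I love doing more directly in the browser!');
console.log(result); // e.g. [{ label: 'POSITIVE', score: 0.99 }]
```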

If I look at this list and go bottom to top, it seems to me that every operating system will come with a framework to load and run models, likely preferring its own models by default. The ease of development with these local OS models, combined with the models “just being available”, will put pressure on tools like Ollama, which I suspect are in the process of being sherlocked (sher_llm_ocked?). The question for me is: will these OS foundational systems allow for model choice, either by the developer or the user? Given the lock-in we’ve seen in the past, I suspect they won’t, and I can’t see a world where a regular person will install model-picker middleware.

If the browser has built-in APIs there’s a similar tension. The browser can provide its own model, which is what we see happening now, or, if the OS has one, it might be able to defer to that. I suspect each browser vendor will want you to use their model.

If you want true customization, or to do things that aren’t built in, then you are going to have to ship your own models to the client and either use browser-provided APIs like WebNN or bring your own runtime (with WASM). This clearly has the most flexibility, and a different set of trade-offs. You have to download the model for each site (we don’t have a large-object caching model for the web) and hope that there are APIs for hardware acceleration. I expect what we will see develop here is that people will pioneer new models and capabilities, and once these become common use-cases they will then be built into the browser or the operating system.
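
On the per-site download cost, here is a minimal sketch (the model URL and cache name are hypothetical) of keeping the weights in the origin’s Cache API store, so repeat visits to the same site at least don’t re-download them. It does nothing for the cross-site duplication described above.

```js
// Minimal sketch: keep large model weights in the origin's Cache API store so
// repeat visits don't re-download them. The URL and cache name are
// placeholders; a different origin still has to fetch its own copy.
const MODEL_URL = '/models/my-model.onnx'; // hypothetical

async function getModelWeights() {
  const cache = await caches.open('model-weights-v1');
  let response = await cache.match(MODEL_URL);
  if (!response) {
    response = await fetch(MODEL_URL);
    // Clone before caching: a Response body can only be consumed once.
    await cache.put(MODEL_URL, response.clone());
  }
  return response.arrayBuffer();
}
```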

As I explored these different classes of inference running on-device, and why you (a developer) would want them to be on-device, the answers tend to lean into the same areas that we’ve been talking about for years on the web in relation to locally running experiences:

  • They work offline. No connection required. Once you have the model you can just use it.
  • They are private. The data never leaves the device, so you don’t have to worry about it being sent to a server and a company using your data to train their models (whether you believe their license or not).
  • They can be used in real-time scenarios where the round trip to a server would be too slow. For example, you can use on-device inference to do real-time image processing, speech recognition, or even text generation.

But after speaking to a lot of businesses and developers, there is something that I’ve not really heard before for most on-device scenarios: cost! Running models on-device can lower the cost for the business running the site. Specifically, tokens can be expensive depending on the model and the number of users you have, so if the business no longer has to pay for server costs to run the inference, that is a huge benefit.

Cost just hasn’t been a thing that I’ve seen talked about much when it comes to web experiences. Instead the narrative is about privacy, ownership of data and compute, resilience to the network, resilience to business failure, avoidance of big tech, and so on. All of these are great reasons to build local-first, but cost as a factor in running models, and specifically LLMs, is, I think, a new vector worth looking at to see some of the challenges that we might face as an industry.

Depending on what you are doing, the models that you run on-device can be worse in both quality and performance. As a developer you are going to have to be responsible and decide where you make the trade-off: the cheaper (for your business) model, or the quality of the answer? It will be impossible for a user to make a rational and informed decision about where to run the compute, so the natural thing that you might do as a developer is to have a router that looks at the query and determines the best place to run it. If you need a high-quality answer, route it to the server. If you want low cost, keep it on-device. If the user’s device can’t handle it, bump it to the cloud. It gets complex very quickly, and there are a couple of further issues that will need to be dealt with.
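
A minimal sketch of what such a router might look like (the capability checks, the quality flag and the /api/generate endpoint are all hypothetical, and runLocal stands in for whatever on-device runtime you’ve shipped):

```js
// Minimal sketch of a client-side "inference router". Prefer free, private,
// on-device inference and fall back to a hosted model when the device or the
// task demands more. Thresholds and endpoint are placeholders.
async function routePrompt(prompt, runLocal, { needsHighQuality = false } = {}) {
  const deviceCapable =
    'gpu' in navigator &&               // WebGPU is exposed
    (navigator.deviceMemory ?? 0) >= 8; // very rough RAM heuristic, in GB

  if (!needsHighQuality && deviceCapable) {
    return runLocal(prompt); // no token cost, data stays on the device
  }

  // Otherwise pay for quality: the business covers the server-side tokens.
  const res = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  return (await res.json()).text;
}
```

Even this toy version hides real product decisions: what counts as “high quality”, and whether the user is ever told where their prompt actually went.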

Web developers don’t yet have the tools, or at least it’s not yet in our workflows, to benchmark model quality against cost and performance. There are some tools like Simon Willison’s llm-prices, but there needs to be a lot more tooling to help us navigate this space. We need to be able to track the quality of the output of the models, the costs, the latency and, if we’ve learnt anything from npm, when the version changes. For example, the models in Chrome’s on-device APIs - be it the prompt API or the built-in use-case APIs - hide the model version information from the developer and the user. It can and will update as we make the models better (check chrome://components and you will see the version information), so how do you manage this in a production environment?
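
Until that tooling exists, even a small, hypothetical wrapper that records where each request ran, how long it took and what came back gives you something to score offline against a known-good set:

```js
// Minimal sketch (all names hypothetical): wrap each inference call and keep
// a local record of where it ran, how long it took, and what it produced, so
// the outputs can later be scored offline against a golden set.
const evalLog = [];

async function tracedInference(runner, prompt, meta = {}) {
  const start = performance.now();
  const output = await runner(prompt);
  evalLog.push({
    prompt,
    output,
    latencyMs: Math.round(performance.now() - start),
    where: meta.where ?? 'on-device',   // 'on-device' | 'server'
    model: meta.model ?? 'unknown',     // often opaque for built-in APIs
    timestamp: Date.now(),
  });
  return output;
}
```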

On top of this, when developers and businesses market on-device anything, they normally say that because the data, the storage and the computation are on the user’s device it’s inherently more secure and more private. This is certainly true when we have the capability to run everything on-device; however, we now have performance and memory requirements that render potentially billions of devices unable to run models performantly (or at all) on-device. I’m comfortable - I have a couple of beefy Macs that can sling tokens about - but what about someone who has to use a candy-bar phone? The spirit of the web is that everything on a URL is accessible irrespective of your device capabilities.

People are putting real and sensitive data into these tools, so if we are going to promote on-device as a real thing then we need to either change the social contract with people - ensuring that if the marketing about your site says the computation is done on-device, you can’t (or shouldn’t) be sending the data to a hosted service - or go further and enable the platform to make the input and output of models more opaque. Unfortunately, one of the web’s super-powers, the origin model, is no help here. The origin model is great in that it stops other sites peeking into the data held on your site, but there is no guarantee that the data won’t leave your device, because the site owner programmed it that way.

Observant readers might say “Paul, this issue isn’t about LLMs, it’s been an issue for years”, and you are correct. It’s just that I don’t think anyone has really considered it that much, and given how some of our engagement might change, I think we should at least consider changes. Maybe something similar to CSP and opaque fetches, plus tainting of the data (like Canvas) when you get a response from an LLM, might help (i.e., you get a warning if the data is sent via a fetch, or the user could block the request in the user agent); or opaque objects like those defined by the WebCrypto API; or even something like Apple’s Private Cloud Compute built into the platform. I don’t actually know what the answer is here, but it feels like something that needs more investigation because of the pressure developers will be under to move compute off the device.

The web is the perfect medium to offer these types of experiences. If we assume that this technology is better for people, that it makes them more efficient or enables new workflows, then it should be available to everyone irrespective of location or device class, enabling new classes of computing all via the browser. And that means we are going to have to really deal with the fact that hybrid approaches will be required for a long time.

I love doing more directly in the browser, and I can’t wait to see what new use cases open up, just like we saw when Outlook came to the web, or Gmail, or Google Docs, and the countless other innovations enabled by new APIs… There’s going to be a lot to work out still.
