Open-Source LLMs

In February, Meta released its large language model: LLaMA. Unlike OpenAI and its ChatGPT, Meta didn’t just give the world a chat window to play with. Instead, it released the code into the open-source community, and shortly thereafter the model itself was leaked. Researchers and programmers immediately started modifying it, improving it, and getting it to do things no one else anticipated. And their results have been immediate, innovative, and an indication of how the future of this technology is going to play out. Training speeds have hugely increased, and the size of the models themselves has shrunk to the point that you can create and run them on a laptop. The world of AI research has dramatically changed.

This development hasn’t made the same splash as other corporate announcements, but its effects will be much greater. It will wrest power from the large tech corporations, resulting in both much more innovation and a much more challenging regulatory landscape. The large corporations that had controlled these models warn that this free-for-all will lead to potentially dangerous developments, and problematic uses of the open technology have already been documented. But those who are working on the open models counter that a more democratic research environment is better than having this powerful technology controlled by a small number of corporations.

The power shift comes from simplification. The LLMs built by OpenAI and Google rely on massive data sets, measured in the tens of billions of bytes, computed on by tens of thousands of powerful specialized processors producing models with billions of parameters. The received wisdom is that bigger data, bigger processing, and larger parameter sets are all needed to make a better model. Producing such a model requires the resources of a corporation with the money and computing power of a Google or Microsoft or Meta.

But building on public models like Meta’s LLaMA, the open-source community has innovated in ways that allow results nearly as good as the huge models—but run on home machines with common data sets. What was once the reserve of the resource-rich has become a playground for anyone with curiosity, coding skills, and a good laptop. Bigger may be better, but the open-source community is showing that smaller is often good enough. This opens the door to more efficient, accessible, and resource-friendly LLMs.
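To make the “smaller is often good enough” point concrete, here is a minimal sketch of running a quantized open model entirely on a laptop, using the llama-cpp-python bindings; the model file name and prompt are placeholders rather than anything the essay references.

```python
# Minimal sketch (assumption: llama-cpp-python is installed and a quantized
# GGUF checkpoint has been downloaded locally; the file name is a placeholder).
from llama_cpp import Llama

# Load a 4-bit quantized 7B-parameter model; at this precision it needs only a
# few gigabytes of RAM, so it runs on an ordinary laptop with no GPU required.
llm = Llama(model_path="./models/open-7b.Q4_K_M.gguf", n_ctx=2048)

# Run a completion entirely on the local machine.
result = llm("Explain in one sentence why smaller language models can be good enough:",
             max_tokens=64)
print(result["choices"][0]["text"])
```

At 4-bit precision, a 7-billion-parameter model occupies roughly 4 GB, which is why commodity hardware is now sufficient for inference.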

More importantly, these smaller and faster LLMs are much more accessible and easier to experiment with. Rather than needing tens of thousands of machines and millions of dollars to train a new model, an existing model can now be customized on a mid-priced laptop in a few hours. This fosters rapid innovation.
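As an illustration of that kind of customization, here is a hedged sketch of parameter-efficient fine-tuning (LoRA) using the Hugging Face transformers and peft libraries; the checkpoint name, target modules, and hyperparameters are illustrative assumptions, not anything the essay prescribes.

```python
# Sketch of a LoRA fine-tuning setup (assumptions: transformers and peft are
# installed; the checkpoint name and hyperparameters are illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "openlm-research/open_llama_3b"  # any small open checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA freezes the base model and trains small low-rank adapter matrices
# injected into selected attention projections.
lora = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Because only the adapters are trained, the gradients and optimizer state shrink by orders of magnitude, which is what makes customizing an existing model on a single consumer machine plausible.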

It also takes control away from large companies like Google and OpenAI. By providing access to the underlying code and encouraging collaboration, open-source initiatives empower a diverse range of developers, researchers, and organizations to shape the technology. This diversification of control helps prevent undue influence, and ensures that the development and deployment of AI technologies align with a broader set of values and priorities. Much of the modern internet was built on open-source technologies from the LAMP (Linux, Apache, MySQL, and PHP/Perl/Python) stack—a suite of applications often used in web development. This enabled sophisticated websites to be easily constructed, all with open-source tools that were built by enthusiasts, not companies looking for profit. Facebook itself was originally built using open-source PHP.

But being open-source also means that there is no one to hold responsible for misuse of the technology. When vulnerabilities are discovered in obscure bits of open-source technology critical to the functioning of the internet, often there is no entity responsible for fixing the bug. Open-source communities span countries and cultures, making it difficult to ensure that any country’s laws will be respected by the community. And having the technology open-sourced means that those who wish to use it for unintended, illegal, or nefarious purposes have the same access to the technology as anyone else.

This, in turn, has significant implications for those who are looking to regulate this new and powerful technology. Now that the open-source community is remixing LLMs, it’s no longer possible to regulate the technology by dictating what research and development can be done; there are simply too many researchers doing too many different things in too many different countries. The only governance mechanism available to governments now is to regulate usage (and only for those who pay attention to the law), or to offer incentives to those (including startups, individuals, and small companies) who are now the drivers of innovation in the arena. Incentives for these communities could take the form of rewards for the production of particular uses of the technology, or hackathons to develop particularly useful applications. Sticks are hard to use—instead, we need appealing carrots.

It is important to remember that the open-source community is not always motivated by profit. The members of this community are often driven by curiosity, the desire to experiment, or the simple joys of building. While there are companies that profit from supporting software produced by open-source projects like Linux, Python, or the Apache web server, those communities are not profit driven.

And there are many open-source models to choose from. Alpaca, Cerebras-GPT, Dolly, HuggingChat, and StableLM have all been released in the past few months. Most of them are built on top of LLaMA, but some have other pedigrees. More are on their way.

The large tech monopolies that have been developing and fielding LLMs—Google, Microsoft, and Meta—are not ready for this. A few weeks ago, a Google employee leaked a memo in which an engineer tried to explain to his superiors what an open-source LLM means for their own proprietary tech. The memo concluded that the open-source community has lapped the major corporations and has an overwhelming lead on them.

This isn’t the first time companies have ignored the power of the open-source community. Sun never understood Linux. Netscape never understood the Apache web server. Open source isn’t very good at original innovations, but once an innovation is seen and picked up, the community can be a pretty overwhelming thing. The large companies may respond by trying to retrench and pull their models back from the open-source community.

But it’s too late. We have entered an era of LLM democratization. By showing that smaller models can be highly effective, enabling easy experimentation, diversifying control, and providing incentives that are not profit motivated, open-source initiatives are moving us into a more dynamic and inclusive AI landscape. This doesn’t mean that some of these models won’t be biased, or wrong, or used to generate disinformation or abuse. But it does mean that controlling this technology is going to take an entirely different approach than regulating the large players.

This essay was written with Jim Waldo, and previously appeared on Slate.com.

EDITED TO ADD (6/4): Slashdot thread.

Posted on June 2, 2023 at 10:21 AM

Comments

Clive Robinson June 2, 2023 12:32 PM

@ Bruce, ALL,

As I’ve mentioned before, we’ve seen two of the three potential stages making up the LLM landscape,

1, Tectonic Uplift (Primary)
2, Weathering down (Secondary)
3, Engineered (waiting).

The first, because it’s very wide in scope and coverage and effectively a “one shot”, needs the “heavy lift” resources of billions of high-precision numbers and parallel processors…

The second is interesting in that it effectively smooths some of the numbers down to just a few bits of precision and has a much more limited scope. It’s like the effect of rain and ice slowly and iteratively making valleys and similar localised areas.
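To put “a few bits of precision” in concrete terms, a purely illustrative NumPy sketch of symmetric per-tensor quantization (the bit width and the random tensor are made up) looks like this:

```python
# Illustrative only: map full-precision weights down to a few bits and back.
import numpy as np

def quantize(weights, bits=4):
    """Symmetric uniform quantization of a tensor to `bits` bits."""
    levels = 2 ** (bits - 1) - 1              # e.g. integer values -7..7 for 4 bits
    scale = np.max(np.abs(weights)) / levels  # one scale per tensor (a simplification)
    q = np.clip(np.round(weights / scale), -levels, levels).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # stand-in for one layer's weights
q, scale = quantize(w)
print("max round-trip error:", np.max(np.abs(w - dequantize(q, scale))))
```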

What we’ve yet to see, but I can see coming up over the horizon, is the equivalent of “engineering the land”, much as we do with cities and urban areas, where we purposefully sculpt the landscape.

I suspect we will see this third stage happening late this year or into next.

It’s this third stage that will take LLM-type neural nets and make them really usable, unlike the current systems, which are actually not much more than toys being put to use as surveillance front ends by the likes of Alphabet/Google, Meta/Facebook and Microsoft/cloud.

Of the three I find Microsoft to actually be the most dangerous in terms of privacy/secrecy invasion.

However there are other major Cloud Suppliers who will almost certainly get in the game.

But the one we really should be focusing on, like burning an ant under a lens with sunlight, is Palantir… for whom “Invasion of Privacy, any way legal or especially otherwise” is the essence of their business model.

mark June 2, 2023 12:43 PM

Oh, wonderful. So now every wrong-wing script kiddie will be generating deepfakes and PRs and fake news stories and posting the link via (anti)social media during all political campaigns.

Clive Robinson June 2, 2023 1:01 PM

@ mark, ALL,

“Oh, wonderful. So now every wrong-wing script kiddie will be generating…”

Yup, it’s already happening.

Just a few days ago there was a posting pulled up by @- on this blog that had several hallmarks of being auto-generated by an LLM.

It appeared to be the contents of the page condensed/abstracted and turned into a pseudo press-release format.

I’ve seen similar AI/LLM-generated output and it has a certain feel to it.

Over on the YouTube channel EEVblog, the host Dave did some playing around to demonstrate how you might get a press release for scope probes generated, with that same odd not-quite-right feel.

Untitled June 2, 2023 3:39 PM

This doesn’t mean that some of these models won’t be biased, or wrong, or used to generate disinformation or abuse. But it does mean that controlling this technology is going to take an entirely different approach than regulating the large players.

Some of these models will be biased, or wrong, or used to generate disinformation or abuse – or financial gain. If history is any guide (and it is), a lot of damage will be done to a lot of innocent people before anyone gets around to even trying to control the technology.

JayCee June 2, 2023 4:40 PM

“Oh, wonderful. So now every wrong-wing script kiddie will be generating…”

It’s a tool that will inevitably be used to create misdirection and lack of clarity in instances that aren’t immediately false. Four deepfakes with different accounts of the same event make investigation a challenge.

There can be no forensics without a publicly verifiable truth, and these technologies are going to make it hard to get to that truth.

J June 2, 2023 7:24 PM

LLaMA isn’t open source. The source code and weights are only allowed to be used for non-commercial purposes.

Volker Schwaberow June 3, 2023 12:37 AM

Dear Bruce.

In your essay, you say that we have entered the democratization of large language models, and we can read what makes you so sure about that. However, you are celebrating too early.

Developing, training, and running large language models require significant computational resources and energy, which only a few large tech companies and institutions possess. This limits who can participate in the creation and refinement of these models. Even if an open-source community can do this, it does so on the platforms of others, corporations with little interest in democratizing LLM technology.

Although open-source libraries and tutorials make these models more accessible, utilizing and understanding them still require considerable technical expertise in machine learning, making them inaccessible to a significant portion of the population.

Studies have shown that these models can amplify negative prejudices present in the data they are trained on. Unfortunately, not all users possess the expertise or means to detect and address these problems, which could result in misuse or adverse outcomes. It is crucial to note that open source is not immune to these concerns, as demonstrated by Wikipedia.

And finally, as these models’ power and potential implications become more apparent, they may come under increasing regulation, which could limit their availability or democratization. Who pays the legal costs of open-source projects to get past this barrier?

Ted June 3, 2023 2:03 AM

The original article on Slate links to a publicly leaked document from a Google researcher – one who seems to be watching the metamorphosis of the LLM ecosystem with a slight tinge of horror.

Was Meta’s leaked LLaMA model a giant leap? Would this allow Meta to incorporate the open source innovations into their products, like Google did with Chrome and Android?

Of course, this is looking at LLMs through a relatively mercantile-oriented lens. God help us when it doesn’t take a lot of money to play around with these things.

Hyolobrika June 3, 2023 12:24 PM

Oh no. Democratising technology. The horror.
Now it’s not just governments and big corporations who can spread disinformation. How will “we” ever cope?

Anonymous June 3, 2023 12:36 PM

I find it amusing how everyone in this comment section is more scared of average people spreading disinfo than large corporations and governments doing it.
Shows where your loyalties really lie.

Clive Robinson June 3, 2023 1:16 PM

@ Anonymous,

Re : Sarcasm should not need tags.

“I find it amusing how everyone in this comment section is more scared of average people…”

Two things to note,

Firstly, those comments are sarcastic, about the democratization of a power that some, like Microsoft, thought they had corralled to their own advantage, but have woken up to find their collective billion or so in investment has escaped their greedy grasp.

Secondly, and more interestingly,

“large corporations and governments”

Are actually hierarchical structures working for a handful of not quite “average people”. That is, those who have currently incurable mental defects that they view as entitlement.

All these LLMs have really done is rip out that hierarchy and enable others with similar mental defects to play along…

Oh, and between 5% and 15% of the population, depending on who you ask, have these mental defects to the point that it makes their behaviours sufficiently obvious.

Something you might want to consider is that democratizing technology does it for all: good, average, or evil…

The thing to remember about “evil people” is they care not a jot about technology, except as a force multiplier to achieve their desires more easily.

That is, they care not if it’s a stick, knife, sword, gun, machine gun or chainsaw. If it makes it less work for them to remove you as an obstacle then they will use it; otherwise they can still strangle you with their bare hands, or just pay someone else to do it.

Now that is a point that most really have trouble getting a grip on, and why violence and murder will not be stopped by the most draconian of legislation.

As far as LLMs go, the “big boys” in Silicon Valley paid to have the cork pulled from the bottle, thinking incorrectly that they could keep the genie enslaved and doing only their “mass surveillance bidding”. But the genie is out now, and is not going to be put back in the bottle, no matter how much legislation is thrown in that direction…

The only fun side to this genie escaping like this is all those Venture Capitalists who thought they could set up a new pump-n-dump market bubble, to replace that crypto-coin/block-chain bubble they had done so well from…

Oh, there is one winner if you want to invest money in it, and that’s NVIDIA and its GPU engines, which the companies the VCs try to pump-n-dump need in order to look like they are worth investing in…

So get wise and invest not in bubbles, but in what is needed for the bubbles to be inflated.

JonKnowsNothing June 3, 2023 2:22 PM

@Clive, @ Anonymous, All

re: the genie in the Trail Camera

Trail cameras come in various types, similar to a Ring home system but designed to be out in the woods. Originally we saw them in NatGeo-type programs where they would capture images of rare animals like Siberian tigers and Javan rhinos (1).

With the inclusion of geotags on the images, finding the location where the images were taken was dead simple. Lots of folks interested in making those animals dead learned how to read the geotags and destroyed any trail cams in the area while they got their Big Hunt or Poaching Capture done. Occasionally they missed a cam, so we sort of know what happened.

One of the modern applications for trail cams is in the hunting industry. The difference is that these are not nature scientists; the use is strictly for finding and locating the animal, plotting its daily trail pattern, sleeping or resting areas, preferred grazing areas and waterholes, for the purpose of killing the targeted species.

It starts early in the season, and dedicated hunters or hunting guides spread out hundreds of cameras, documenting whatever comes through the viewfinder. There are those that spike a spot, putting down some food that will attract the target animal so there is less wanderlust walkabout looking for the bullseye. They do everything just short of putting a tracking tag on the animal (2), although ranch hunting often has tagged animals, so you can get your kill after a short drive, in a luxury ranch truck, dressed like Ernest Hemingway, to where the on-the-hoof target is penned, and be home before cocktails.

A single camera can generate thousands of images, mostly of waving branches or grass on windy days, and a dedicated hunting guide sets out a lot of cameras. The cameras work for about a year on a set of batteries, or indefinitely on solar power. Battery life and image storage are the limitations.

One of the interesting applications of AI is reading the thousands of images from all the trail cameras. Many now auto-upload to the web, and for a nominal fee a company will parse all the images for “targets of interest”. Given enough information, they will map the daily trails and ID individual animals for you.

It’s AI/LLM FaceID for Animals. It’s used by ordinary hunters and hunting guides. There’s an app for it.
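A purely illustrative sketch of the first step in such a pipeline, discarding the empty wind-and-grass frames with an off-the-shelf pretrained detector (the paths and threshold are made up, and real services likely use detectors trained specifically on camera-trap imagery):

```python
# Illustrative only: flag trail-camera frames that contain any confident detection.
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def frame_has_subject(path, threshold=0.8):
    """Return True if the detector finds any object above `threshold`,
    so the thousands of empty frames can be dropped before closer review."""
    img = convert_image_dtype(read_image(path), torch.float)
    with torch.no_grad():
        detections = model([img])[0]
    return bool((detections["scores"] > threshold).any())

# e.g. keepers = [p for p in frame_paths if frame_has_subject(p)]
```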

===

1)

https://en.wikipedia.org/wiki/Siberian_tiger

https://en.wikipedia.org/wiki/Javan_rhinoceros

2) I would not be surprised if tiny find-me tags are superglued to the hide of a food-drugged animal, making a 10-point buck kill a cinch when the season opens.


Who? June 5, 2023 11:49 AM

Quoting the essay:

Sun never understood Linux.

I think you refer to Oracle, not Sun. Sun had strong open-source community support for decades.

Perhaps you are mistakenly considering “free software” the same as the “open source community”. They aren’t.
