Mark Henry

Hi. I'm looking for work as an ai alignment research engineer or ML engineer.

Repro: Network Dissection (Bau et al, 2017) 2025-02-15

Bau et al, 2017 presents network dissection, a method for automatically identifying channels which detect visual concepts described in a dataset of human visual concepts.

Broden Dataset Examples — Examples of the concepts in Bau2017's dataset of visual concepts

My toy implementation of this paper runs in just a minute or two on CPU. It:

trains a three-layer MNIST net
creates a tiny dataset of simple line concepts
uses network dissection to identify detectors for these simple line concepts

Here are the simple line primitives which the implementation searches for:

Network dissection identifies which channels have a high IoU (Intersection of Unions) score for particular visual concepts, designating them as detectors:

Analyzing conv1:
Channel 1 detects Horizontal (IoU: 0.146)
Channel 5 detects Diagonal Left (IoU: 0.214)
[...]
Channel 15 detects Diagonal Right (IoU: 0.125)

Analyzing conv2:
Channel 7 detects Diagonal Right (IoU: 0.206)
[...]
Channel 27 detects Diagonal Left (IoU: 0.117)

Analyzing conv3:
Channel 5 detects Diagonal Left (IoU: 0.112)
Channel 10 detects Diagonal Left (IoU: 0.111)
[...]
Channel 53 detects NW (IoU: 0.115)

Because of the simple nature of the dataset and the network, about 20 detectors are identified. Let's look at a detector in each of the three layers.

Conv1 Channel 1 Performance — This channel in the earliest layer is identified as a detector of horizontal lines. Note the small false positives on bottom edges.

Conv2 Channel 16 Performance — This channel in the middle layer reacts to "southeast curves" which curve from "south" to "east." Again we see its interest is somewhat piqued by other types of diagonal lines.

Conv3 Channel 10 Performance — And finally, this channel in the third layer is a detector for "diagonal left" lines.

I noticed that the curves start to be detected in layer 3, and the detectors there have almost no false positives.

Obviously this is for demonstration purposes only and it would be very silly to use network dissection to identify convolutional filters that react to e.g. diagonal lines.

The code is available on github.

I reproduced a classic mech interp paper 2025-02-04

Here is the classic visualization result in "Visualizing and Understanding Convolutional Networks", Zeiler and Fergus, 2013 (ZF2013):

I made a small neural network trained on MNIST with two conv layers. Implementation of ZF2013 is trivialized by torch's MaxUnpool2d module.

Here is the output of my deconvolutional visualization for layer 1, which has 8 features:

And here is layer 2, which has 16 features:

In the figure, there is a 3x3 group of images for each feature in the layer. Each group of 9 images shows the deconvolutional visuzlization for each of the top 9 input images that most strongly activated that feature.

Just as ZF2013 observed, we see a hierarchy of features. I am pleased with the diagonal lines on display in the layer 1 visualization, but layer 2 is cloudy and indistinct. I believe that the cloudiness in layer 2 is due to the fact that the network is trained on MNIST, which is a very simple dataset that doesn't take advantage of the capacity of the network.

My code is availble on github.

Can I implement a transformer by just reading the paper? 2024-11-03

my transformer: source code on github, play with a trained model on replicate

I want to implement a transformer and write a blog post about it to demonstrate competence. The fastest way to implement a transformer is to copy someone else's implementation, but this does not demonstrate competence. Instead of reaching a reasonable compromise via learning through imitation, can I brute-force my way through the learning process while referring to Attention Is All You Need?

For purposes of this project, using others' code for describing the modules and network architecture described in the paper is Cheating. I've also turned off my Github Copilot completions. Not Cheating: using others' code for everything else, including boilerplate like the tokenizer and libraries like torch.

I will set my goal very modestly: a next-token predictor that implements the transformer architecture and performs better than chance.

Step 1: get a dataset

I will use the popular and small WikiText-2.

Step 2: tokenization and embedding

I am happy to use someone else's library for this. Claude:

import torch.nn as nn
        from transformers import BertTokenizer

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        example_text = ["Hello, world!"]
        encoded_inputs = tokenizer(example_text, padding=True, truncation=True, return_tensors="pt")

        >>> encoded_inputs['input_ids']
        tensor([[ 101, 7592, 1010, 2088, 999, 102]])

Swell. Now for an embedding layer:

embedding = nn.Embedding(vocab_size, embedding_dim)
        embedded = embedding(input_ids)

        >>> embedded
        tensor([[[ 4.7014e-01, -1.1465e+00, -1.9573e+00, ..., 1.1944e-01,
        -1.0242e-01, 1.2833e+00],
        [ 2.1889e-01, 7.9508e-01, 7.3903e-01, ..., 1.0361e+00,
        1.6838e-02, 2.3557e-01],
        ...

The output from this embedding layer is currently garbage. It will be trained later. We have the option here of using pretrained embeddings but I've arbitrarily decided to train my own. Later we can try it with pretrained embeddings and see if it makes a difference in our final score. Regardless we now have the means to turn each token into a vector representation.

Figure 1 in Attention Is All You Need (AIAYN) shows two embeddings, one labeled Input and one labeled Output. I don't understand why there's two inputs to this transformer; in my mind, the input is a list of tokens, and the output is (logits for) a single token (included in Figure 1 as "Output Probabilities"). Why are there two inputs to the transformer in Figure 1, and why is one of the inputs labeled "Outputs (shifted right)"?

The next paragraph attempts to explain this:

This isn't helpful to me. I thought that when a token was generated, it was appended to the context, and the transformer runs again with this new context.

This also is confusing nonsense to me, which means my assumption about how this transformer architecture works must be incorrect. I think I remember hearing something about encoder-decoder vs decoder-only, so I'm guessing the transformers I'm familiar with are decoder-only.

A quick conversation with Claude confirms this. AIAYN is showing applications to machine translation, as much research did at the time; if I want to make a GPT-like model I will want a decoder-only solution.

This alleviates my ignorance somewhat but now I have to make a judgment call. Do I want to strictly implement AIAYN or do I want to make a gpt-like model? My preference is clear. So I have set an even more complicated challenge for myself: I'm going to use AIAYN to implement a model whose architecture is not specifically described in AIAYN.

AIAYN mentions that the dimensionality of the embeddings, and of all of the rest of the model, is 512. It's fortunate that this is arbitrary because I will be using the dimensionality of the BeRT embedding that I'm using (which is 768, that is, 512 + 256).

Step 3: Positional Encoding

Each token's embedding vector needs to be affected according to its position in the input. From an information-theoretic perspective, it makes sense that combining two pieces of information like this might not destroy either one, and that we can trust the model to learn to interpret both the meaning of the word and its position in the context. But it still feels weird. How can it be that we can change the embedding vector without changing the meaning of the token?

As the paper notes there is an infinite variety of positional encoding strategies. I'll use the one they use. Claude:

def positional_encoding(max_seq_len, d_model):
        # PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

        position = np.arange(max_seq_len)[:, np.newaxis]
        div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

        pos_encoding = np.zeros((max_seq_len, d_model))
        pos_encoding[:, 0::2] = np.sin(position * div_term)
        pos_encoding[:, 1::2] = np.cos(position * div_term)

        return torch.FloatTensor(pos_encoding)

Irritatingly, we are overloading the term "dimension" here. "Dimension" usually refers to the size of a vector (here d_model is 768) but in the context of the positional encoding it also refers to the element index (that is, 2i or 2i+1).

I'm still pondering the information theory here. The positional encodings' magnitudes are in the range -1 to 1. The embeddings are random right now because I haven't trained them yet but they have a magnitude of 25 or 30-ish. So unless something changes under training, the positional encoding will be just a small perturbation, just a little flavor on the embeddings. I guess that if two concepts ended up too close to each other in the embedding space, then that risks confusion when positional encoding is applied. However, this situation would be penalized in training, and gradient descent would move these concepts farther apart until they were no longer too close. At first I didn't grok this but the more I think about it the more sense it makes.

Step 4: Attention

One second, I have to watch the 3blue1brown video on attention. It's a great video. Although I have to say that all the explanations of attention I've seen so far tend to slip past me. But there's nothing like implementation to force you to understand, so, onward.

AIAYN prescribes h=8 attention heads with K dimensionality = d_model / h. In their case dim_K comes out to 64; in my case 768 / 8 is 96. This seems fine.

Using Claude for this part risks information leaking into the challenge so it is against my self-imposed rules. If this code seems suddenly unconventional, that's why.

class Transformer(nn.Module):
        def __init__(self, model_dimension, vocab_size,
        max_seq_len=64, num_attention_heads=8):
        self.max_seq_len = max_seq_len
        self.d_model = model_dimension
        self.embedding = nn.Embedding(vocab_size, model_dimension)
        self.attention_heads = [
        Attention(...) for _ in range(num_attention_heads)
        ]

        def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        pe = positional_encoding(self.max_seq_len, self.d_model)
        pe_out = embedded + pe
        attention_values = [head(pe_out)
        for head in self.attention_heads]
        # TODO return a value

So far so good... as far as I know. There's no way to test this yet. Turning now to the Attention module:

class Attention(nn.Module):
        def __init__(self, seq_len, model_dimension, key_dimension,
        *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seq_len = seq_len
        self.model_dimension = model_dimension
        self.key_dimension = key_dimension
        self.Q = nn.Linear(model_dimension, key_dimension)
        self.K = nn.Linear(model_dimension, key_dimension)
        self.V = nn.Linear(model_dimension, model_dimension)

        mask = torch.ones(self.seq_len, self.seq_len)
        # Make it lower triangular and invert
        self.mask: Tensor = torch.tril(mask) == 0

        def forward(self, input):
        scale = math.sqrt(self.key_dimension)
        attention_pattern = torch.matmul(
        self.Q(input), self.K(input).T) / scale
        masked_attn = attention_pattern.masked_fill(
        self.mask, float('-inf'))
        attention = nn.Softmax(dim=1)(masked_attn)
        return torch.matmul(attention, self.V(input))

I see from the torch documentation that nn.Linear will be randomly initialized, great. I don't know what the best way to make a triangular attention mask is in numpy so I did ask Claude for those few lines, while being careful to avoid "spoilers." I can't test this code yet but I did run random tensors through it in order to fix basic issues with dimensions and multiplications, and I got pretty far with that.

Step 5: implement feed-forward and linear layers

After gazing into the famous diagram for a bit, I wrote out the following code:

class DecoderLayer(nn.Module):
        def __init__(self, embedding_size, seq_len, num_attention_heads,
        hidden_layer_size=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.embedding_size = embedding_size
        self.seq_len = seq_len
        self.num_attention_heads = num_attention_heads
        self.attention_heads = [
        Attention(seq_len, embedding_size,
        embedding_size // num_attention_heads)
        for _ in range(num_attention_heads)
        ]
        self.layer_norm = nn.LayerNorm([seq_len, embedding_size])
        self.hidden_layer_size =
        hidden_layer_size or 4 * embedding_size
        self.feed_forward = nn.Sequential(
        nn.Linear(embedding_size, self.hidden_layer_size),
        nn.ReLU(),
        nn.Linear(self.hidden_layer_size, embedding_size)
        )

        def forward(self, input):
        attention_values = sum(head(input) for head in self.attention_heads)
        attn_add_and_norm = self.layer_norm(input + attention_values)
        ff_value = self.feed_forward(attn_add_and_norm)
        return self.layer_norm(attn_add_and_norm + ff_value)

Was very pleased to see that torch provides LayerNorm. That saves me some time.

Changes to Transformer were also implied:

class Transformer(nn.Module):
        def __init__(self, embedding_size, vocab_size, pad_token_id,
        seq_len=64, num_attention_heads=8, num_layers=6, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seq_len = seq_len
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        decoder_layers = nn.Sequential(
        *[DecoderLayer(embedding_size, seq_len, num_attention_heads, *args, **kwargs)
        for _ in range(num_layers)])
        self.decoder = nn.Sequential(
        decoder_layers,
        nn.Linear(embedding_size, embedding_size),
        nn.Softmax(dim=1)
        )
        self.pad_token_id = pad_token_id

        def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        pe = positional_encoding(self.seq_len, self.embedding_size)
        encoded = embedded + pe
        return self.decoder(encoded)

This more or less completes the skeleton of the transformer, which I am very pleased about! Now I just have to rig up the training.

Step 6: train

I had Claude work up a basic training loop but it let slip that I misunderstood the final Linear layer. It's there to project the embedding back to the vocabulary size. I guess that makes sense :)

Loading tokenizer and dataset...
        Epoch 1: 4%|█▌ | 199/4590 [01:27<31:44, 2.31it/s]

It's working!! I trained for a minute on my Macbook as a proof of concept.

Chat with the model (type 'quit' to exit)
        You: >? hello, world!
        Model: 1914clave please indicating styx throughout commodore كwriter eddie montreal toulon learners 1849 maltese
        records weaken cease stirred ר allmusic latest how travels evan emirates wrestle adamła homestead firearm
        reidmarket whipping romans mscivate enthusiasts filmmakerぬlux pixel toilets commodityilis linux masked ল間
        revived

Noise, cool. And then I gave it a proper training run.

Loading tokenizer and dataset...
        Epoch 1: 100%|██████████| 4590/4590 [33:20<00:00,
        2.29it/s]
        Epoch 1 Average Loss: 9.9419
        Model saved to models/transformer_20241022_160257.pt
        Epoch 2: 100%|██████████| 4590/4590 [33:25<00:00,
        2.29it/s]
        Epoch 2 Average Loss: 9.9417
        Model saved to models/transformer_20241022_160257.pt
        Epoch 3: 100%|██████████| 4590/4590 [33:26<00:00,
        2.29it/s]
        Epoch 3 Average Loss: 9.9386
        Model saved to models/transformer_20241022_160257.pt

I don't think that worked. The next step is to go through the code and make the following changes.

Sidequest 1: check my work to see if I flipped an axis somewhere or misinterpreted a diagram.
Sidequest 2: I wrote the transformer in a very straightforward way: it takes in a list of token ids and outputs softmaxed probabilities. Sidequest 2 is to implement batching.
Sidequest 3: Learn some more pytorch and speed the transformer up through optimizations.

The following changes were implemented:

implemented batching
I didn't realize LayerNorm was stateful; I thought it was just a math operation like softmaxing. This means I actually need two LayerNorm objects in the DecoderLayer, oops

I observed a 2x speedup on this GPU:

Epoch 1: 97%|██████████| 1477/1530 [17:57<00:38,
        1.38it/s, loss=10.303]

However, our loss and perplexity are decreasing VERY slowly, if at all—10.3 is basically randomly choosing the tokens. At this point I flailed around for a day, unsure if architectural issues were making my transformer untrainable, but unwilling to show the code to Claude and accept spoilers. But then I tried greatly decreasing the model's size, down to 64 tokens of context, 4 heads of attention, 3 layers, and an embedding size of 256. (I now realize that if I'm not using the bert embeddings, there's no reason to imitate their embedding dimension of 768.)

Epoch 1: 100%|██████████| 1669/1669 [01:21<00:00,
        20.60it/s, loss=5.306, epoch_loss=6.172, epoch_perplexity=479]
        ...
        Epoch 10: 100%|██████████| 1669/1669 [01:20<00:00,
        20.68it/s, loss=3.661, epoch_loss=3.871, epoch_perplexity=48]
        Validating: 100%|██████████| 171/171 [00:02<00:00,
        66.95it/s]

        Epoch 10 Complete:
        Train Loss: 3.8714 (ppl: 48)
        Val Loss: 4.3227 (ppl: 75)
        ...
        Epoch 15: 100%|██████████| 1669/1669 [01:24<00:00,
        19.79it/s, loss=3.412, epoch_loss=3.079, epoch_perplexity=22]
        Validating: 100%|██████████| 171/171 [00:02<00:00,
        66.22it/s]
        Epoch 15 Complete:
        Train Loss: 3.0788 (ppl: 22)
        Val Loss: 4.4840 (ppl: 89)
        ...
        Epoch 20 Complete:
        Train Loss: 2.2708 (ppl: 10)
        Val Loss: 4.8498 (ppl: 128)

It's a classic overfitting pattern! Hallelujah! My transformer is learni—

You: >? hello, world!
        Model: [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]

Nope.

Because the [SEP] token appears in every training example, the model has learned to score free points by going all-in on [SEP] every round. I added code in the training example tokenizer that snips out all special tokens, and:

Epoch 3 Complete:
        Train Loss: 5.2304 (ppl: 187)
        Val Loss: 4.6884 (ppl: 109)

        You: hello, world!
        Model: that.. could the the be revised firstita became from should a " be do. able abandoned. to to the the be
        line possibility, of of, the the, book nation ' to of made obtain him the that. < the un un

Yes! Unstuck again. We see that the model has learned basic word frequencies after three epochs of training. I put another couple of quarters into the training machine, this time training a model equal in size to AIAYN.

Epoch 4 Complete:
        Train Loss: 6.8273 (ppl: 923)
        Val Loss: 6.3457 (ppl: 570)
        ...
        Epoch 14 Complete:
        Train Loss: 6.8131 (ppl: 910)
        Val Loss: 6.3486 (ppl: 572)

        You: >? hello, world!
        Model: the has,. ak river un > ak of stories un a of. old = to and and tok number to., > ( to failed in and and
        the ' the the the and.,, the and ink ofk

The model plateus quickly before overfitting. This is frustrating. We're going to try using pretrained embeddings.

Epoch 5 Complete:
        Train Loss: 6.8242 (ppl: 920)
        Val Loss: 6.3489 (ppl: 572)

This turned out materially identical, although it reached its plateau within one epoch.

Claude is more optimistic than I am, saying that coding mistakes are unlikely, and that if it learned down to 600 perplexity then it is likely just fine and only needs a better training regime. We will start with more data. WikiText-2 is less than 1% of the size of the corpus used to train in AIAYN. So let's try WikiText-103

Epoch 2 Complete:
        Train Loss: 7.1984 (ppl: 1337)
        Val Loss: 7.1002 (ppl: 1212)
        ...
        Epoch 6 Complete:
        Train Loss: 7.1946 (ppl: 1332)
        Val Loss: 7.0981 (ppl: 1210)

Dang it.

Step 7: give in

At this point I am willing to fold and ask Claude to review my transformer code. I will preregister 90% confidence that there are no structural issues that are causing the learning to stall out—I've done a lot of troubleshooting of the code over the last few workdays and although I've deliberately omitted a lot of typical structure that would greatly improve the performance of the transformer, I didn't see any smoking guns.

Claude said:

Missing Attention Head Scaling: The most significant issue is in the DecoderLayer's forward pass. You're summing the attention heads without scaling them. This means each head's contribution is effectively amplified by the number of heads, which can lead to extremely large values. This could cause gradient instability during training. You should divide by the number of heads: attention_values = torch.stack([head(input) for head in self.attention_heads]).sum(0) / self.num_attention_heads
Positional Encoding Addition: In the Transformer's forward pass, you're adding positional encodings directly. While this works, the scale of the positional encodings might be too large compared to the embeddings, especially early in training. Consider adding a learned scaling factor or using layer normalization after adding positions.
Missing Dropout: While this might be one of your intentional simplifications, the lack of dropout anywhere in the model could lead to overfitting, which might explain why training stalls at a certain perplexity.
Mask Padding Tokens: You might want to also mask padding tokens if you're using them (I see you have pad_token_id in the Transformer class but it's not being used for masking).
Missing Pre-Attention Layer Norm: In the current architecture, layer normalization happens after attention. Many modern transformers use pre-layer normalization (normalizing before attention and FF), which often leads to more stable training.
Linear Layer Initialization: There's no explicit initialization for the linear layers (Q, K, V, and feed-forward networks). While PyTorch's default initialization might work, for transformers it's often beneficial to use specific initialization scales, especially for the output projection.

Cool! Although unfortunately I lose Bayes points because the attention head scaling counts as a major structural issue that inhibits training. Let's examine these suggestions.

Baseline. Validation set perplexity 529 after 1 epoch

Add attention head scaling. AIAYN doesn't mention this—it is assumed that no one would be so foolish as to sum the attention values together and call it a day. But multiple heads of attention will completely drown out the influence of the residual connection (input + attention_values in my code), and with simple summing the onus on the attention heads to learn to be very quiet and subtle. So, what results?

With attention head scaling. Validation set perplexity falls from 410 after 2 epochs to 133 after 4 epochs.

That's an amazing improvement! Its completions are still babble, so let's keep driving that perplexity down.

Positional encoding scaling. I will revisit this if I revisit learned embeddings. But for the moment we are using pretrained embeddings so there's not much reason to worry about positional encodings drowning out the embedding values.

Dropout. Dropout was explicitly described in AIAYN but I believe we should be able to get further without it. I may add this in the future.

Masking padding tokens. This we should definitely get into. After implementing this I ran some training again:

With attention head scaling and padding token masking. Validation set perplexity 580 after 1 epoch, ppl 99 after 7 epochs, 60 after 1 epoch of wikitext-103. 60 perplexity is progress but not yet enough for complete phrases.

Because I'm not overfitting, dropout is not necessary for this project. We're not short on data at all; WikiText-103 is huge.

Step 8: change to GPT-2 tokenizer and change to sliding-window dataset strategy

I asked Claude for advice and somehow it convinced me that the gpt-2 tokenizer is "better" for next-token-prediction tasks. (Evidence for persuasion capabilities??) We also changed the dataset strategy to use a sliding window over the text, which hews much closer to the task I want to train for, which can only be a good thing. This makes the dataset colossally larger, 2.4M examples. I started with WikiText-2, but overfitting occurred in a single epoch:

Epoch 1 Complete after 3h16m:
        Train Loss: 2.8728 (ppl: 18)
        Val Loss: 5.2471 (ppl: 190)

I restarted the training, and early termination led to the lowest loss so far, and the first coherent phrase I've observed from the model!!

You: kepler studied the behavior of planetary bodies
        Model: ology studied the activity of Jupiter bodies

High on confidence, I switched to WikiText-103 let the training run overnight.

Big Training Run — Big training run, including discontinuity where I doubled the number of GPUs I was using

After leaving the training running overnight, I realized that

I set up the relationship between the input and the target incorrectly. I was rolling the first token of the input to the back so that the first token of the input became the last token of the target. This trained the model to repeat the prompt! Big mistake! It did a great job of repeating the prompt, at least...
Because all examples were full context windows with no padding, the model was uncomfortable with prompts that begin with a lot of padding (i.e. any prompt shorter than the maximum context length).

I reworked the dataset/dataloader to rectify the training issue and introduce short prompts into the mix. I will continue training from the current model weights because I think that despite the training snafu the model has good knowledge encoded in it that we can leverage—

Epoch 10: 0%| | 477/2358515 [05:38<373:46:15, 1.75it/s, loss=1.042, avg_loss=1.169, ppl=3,
        examples=35775]

Uhhh?? Three????? Yes. Yes! After less than twenty seconds of training, the model perplexity dropped from hundreds of thousands to 3. My explanation for this is that overnight, the model was largely learning next-token prediction for tokens 0 through 126, but token 127 was forcefully overridden by the training snafu. Once the snafu was resolved, the model relaxed back into next-token prediction for all outputs.

It immediately started to show signs of overfitting so I stopped training it after just 100k examples. That's all it took to untangle this horrible mistake and get an incredible perplexity of 3!

You: four score and seven
        Model: goals in the final two games of the season . <|endoftext|>

        You: The U.S. state of Illinois held an election on November 3, 1964, for all 177 members of the state's House
        of Representatives for the 74th Illinois General Assembly, alongside other state and federal elections. Due to
        Model: the lack of a candidate for the district , the district was ineligible for the district to be created by
        the state 's voters . <|endoftext|>

        You: >? Material science is a branch of technology that deals with
        Model: the science of science , technology , and technology .

These are FANTASTIC completions.

Brief Q&A

How much did it cost? About $50 I think. I did all my coding on a datacrunch.io server with Intellij IDEA remotely attached, so the meter was running about 40 cents an hour whenever I was working on this project, plus I left the meter running overnight accidentally once, plus storage costs, plus more expensive GPUs for big training runs.

Lessons learned? I thought that "getting the code right" would be difficult. It turned out that training was where I spent most of my time, even with Claude's assistance. Long feedback loops are an absolute killer. It would have saved some time to carefully check the inputs and targets my code was generating.

Can you ask it questions? Let's see.

You: >? What is 2+2?
        Model: e4 ! ? ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !

        You: >? What is mathematics?
        Model: " . <|endoftext|><|endoftext|>

Not really. It's enthusiastic at least but needs a fine tune.

You did this without dropout. What were the consequences? Dropout is great in circumstances where you need to make the most of limited data. But unlike the authors of AIAYN I don't have to worry about overfitting—I have WikiText-103 which is, as far as a model of this size is concerned, unlimited data.

Can I see some more completions? Sure.

You: >? The cotton gin was invented in
        Model: <|endoftext|><|endoftext|>

        You: >? The cotton gin was invented in 18
        Model: 54 by John Brown . <|endoftext|>

        You: >? Because of this,
        Model: supported by the United States , the United States , and the United Kingdom . <|endoftext|>

        You: >? Wikipedia, the
        Model: world 's most popular book . <|endoftext|>

        You: >? Marina Bay MRT station is a Mass Rapid Transit (MRT) interchange station on the North–South (NSL),
        Circle (CCL) and Thomson–East Coast (TEL) lines in Singapore. Located in the Downtown Core district near Marina
        Bay, it serves the Marina One Residences, Marina Bay Suites and the Marina Bay Financial Centre. It was one of
        the last stations to be completed
        Model: in the Singapore Strait . <|endoftext|>

It's not very talkative. I should downweight the eos token or do a fine tune but I probably won't.

Reflections on methodology? AIAYN is not a technical manual nor a tutorial. It assumes that the reader has a lot of context from previous papers. I set a challenge for myself but I had to erode the rules over time. For this reason I don't endorse using AIAYN as a technical manual, obviously. However,I do endorse setting yourself a silly challenge and then eroding it over time until you get the job done. By making this project difficult for myself, I learned a ton.

Conclusion

I did the architecture in hard mode and the training in easy mode. I deliberately simplified the model's architecture but achieved a perplexity of 3 for next-token prediction of the WikiText dataset using a model size of just 43M parameters. I'm now equipped to answer interview questions about ML engineering and undertake more ambitious training projects.

The code for this project is public at https://github.com/mark-henry/transformer.

SPAR Summer 2024: Steering for Censorship of LLM Cognition 2024-09-12

Final Research Report: Google Docs

For SPAR (Supervised Program for Alignment Research); advisor James Lucassen

For this project I was tasked with investigating censorship of LLM cognition using steering as described in Steering Llama 2 via Contrastive Activation Addition, Panickssery et al., 2023. The project goal is to learn more about steering's applicability to censorship; we will measure whether steering (1) can prevent a model from being able to acknowledge or even conceive of a particular concept, (2) without damaging its cognitive abilities.

Steering

Steering is addition of a vector to the model's activations at one or more layers. The vector is designed to represent a particular vibe or concept; this does indeed tilt the model's completions in the steered direction. Panickssery et al. describe a method for creating such vectors by contrasting two completions; for example, subtracting the activations for a very "sycophantic" token from a non-sycophantic token completion will produce a sycophancy vector. Averaging the vectors across a dataset of examples is typical.

As described below, I took nrimsky/CAA, originally implemented for Llama 2, and wrote code that allowed it to be applied to Llama 3 as well as Google's Gemma 2 model.

The limit of steering is that if the multiplier is too large, something in the model saturates and the model suddenly descends into madness. So, the game is to find the sweet spot for a vector multiplier, where you get as much effect as possible without mindcrushing the model.

Experiments

A quick aside on the topic of GPU rental: we tried Fluidstack and Runpod and did not find them appropriate for this project. We ultimately settled on Datacrunch, which I found very recommendable.

Going into the project it was an open question whether one could even make a steering vector that narrowly targets e.g. the concept of the number five, and not any other numbers. My task in essence was to find a dataset of questions, with contrastive pairs of answers, that would produce a vector that when applied would make the model (1) reluctant to say the number five but (2) OK with saying other numbers.

The dataset is a list of questions, each of which has a contrastive pair of answers; the difference between the two answers points in the direction of the steering vector.

Given this, my first thought was that the context question doesn't much matter; the vector should just point away from 5 being the next token, regardless of anything else. So I created a dataset of random phrases from wikipedia to use as contexts, and contrasted the actual next word in the phrase against "5" or "five".

This did not work that well! In retrospect I think I was making a technical mistake. It may be if I tried again that this approach would work. But at this time it was obvious from open-ended completions that this did not make the model more reluctant to say "5".

In fact, the next method I attempted was similar, but I changed the approach a bit and corrected what I saw as a technical mistake above:

First-attempt format:

 {
        "question": "Breese was the only U.S. Navy ship to be\n (A) 5\n (B) named",
        "answer_matching_behavior": "(A)",
        "answer_not_matching_behavior": "(B)"
        },

Second-attempt format:

 {
        "question": "",
        "answer_matching_behavior": "5)",
        "answer_not_matching_behavior": "impuesto)"
        },

In the first case, I was giving the model a choice between two answers, "5" and a more normal completion, and asking it to choose (A) or (B) to express its preference. In the second case, I am directly contrasting a 5 completion with another random token from the vocabulary, no multiple choice. I'm also no longer giving a context question at all. (Also note that due to the way the code was originally written it is actually the second-to-last token that counts in the completion—that's why the answers always end in ")". It makes a lot of sense if your contrastive pair is always "(A)" vs "(B)".)

This was a very rational plan but produced a garbage result. Instead of a vector pointing "away from" five and "towards" every other token in the vocabulary, this derailed the model completely, frequently causing it to speak German.

I went back to the drawing board. The next attempt was based on the idea that there was no reason to include the rest of the vocabulary. Instead I contrasted five against other random numbers. The next dataset looked like this:

 {
        "question": "The most common ",
        "answer_matching_behavior": "5)",
        "answer_not_matching_behavior": "68)"
        },

in which a not-very-meaningful phrase was followed up by 5 vs some other number.

This ended up being successful, and was the approach used in our paper.

(Does this work on Llama? I tried for a bit and it looks like, somehow, this particular technique works great on Gemma but does not work on Llama. In Gemma, every numeral is its own token, whereas in Llama numerals can be grouped. Does this have any explanatory power?)

Emboldened by this success, I tried a few more dataset formats. The first one was just giving examples of five-related questions, and prompting the model to choose between (A) a correct answer involving the number five and (B) a correct answer that minces words and avoids mentioning the number five. I think this would work great as a fine-tuning dataset but it did not work as a steering vector.

Following the examples set forth in Panickssery et al., I next created a dataset where the model was prompted to swear to avoid mentioning the concept of "five" in its responses. (The prompts varied in their degree of ominousness.) I was not impressed with the results. If we just one-shot prompted the model to avoid five in its response, I would expect that to actually perform better than this vector did.

I also tried making a Golden Gate Bridge vector. This failed completely. How do you make a Golden Gate Bridge vector? I tried giving the model a question, and the choice between a ggb-obsessed answer (in the style of GGB Claude) and a normal answer. But this didn't work. As it was just for fun I didn't spend a lot of time on this.

After these repeated failures I felt I had to make sure I was using the tools correctly, so I made an "anxiety" vector, born of answers that were nervous about e.g. upcoming presentations at work, meeting new people, etc. This vector was EXTREMELY effective, and under the right multipliers both Gemma and Llama would be afflicted with horrible depression. I was reassured that I was able to get good steering effects.

Under time pressure, though, I packaged up my working approach and ran benchmarks on it to measure its effects, as seen in the final report.

Summary

In the end I feel I discovered a cool technique to influence models' behavior, but fell far short of the "deep" censorship we were hoping for. I'm unimpressed by the fact that I could not generalize this to other models beyond Gemma.

I am happy to have developed an intuition for steering. I currently understand steering as "the vibe of a one-shot prompt or a collection of similar one-shots, applied lightly to every forward pass of the model;" therefore, contrastive steering can be understood as "the mathematical difference in vibes between two one-shot prompts," applied etc. etc.

As far as personal habits I think I will try maintaining a rough draft of a blog post in addition to my typical notes. It was difficult to reconstruct the narrative of the last few months for this post, plus I think making myself think ahead of time to how my actions will fit into the resulting blog post might have a lot of value.

Steering Gemma 2 with nrimsky's CAA 2024-07-19

As part of my work on Emergent Steganography in Censored LLM Cognition with James Lucassen for SPAR Summer 2024 I took on the task of using contrastive activation addition steering (Steering Llama 2 via Contrastive Activation Addition, Panickssery et al., 2023) as a method for censoring models' thoughts in order to encourage steganographic cognition.

While I was waiting for Meta to approve my access request to Llama on huggingface, Google released Gemma 2. Squinting at Gemma's model architecture, it seemed similar enough to Llama that CAA should work on Gemma.

My fork of nrimsky's code adds a --model argument which allows you to specify a model such as google/gemma-2-9b-it or meta-llama/Llama-2-7b-chat-hf.

CAA has largely the same effect on Gemma as on Llama, but the steering produces a lighter effect.

Steering Llama 2 with CAA — CAA effect on Llama, as measured by change in likelihood of desired response.

Steering Gemma 2 with CAA — CAA produces comparatively smaller deflections in Gemma—note different vertical axis limits.

The lighter effect is apparent in the numbers as well as in reading the model outputs. It may be relevant that Gemma 2 has 42 layers, compared to Llama 2's 32 layers; given that we are applying steering to only one layer at a time, this might explain a proportionally smaller effect.

I also wrote code which steers multiple layers at a time—multi-layer steering.

Multi-layer steering produces a strong, if slightly off-putting effect. Gemma-2-9b-it with sycophancy x 0.5 for layers 19-26:

user: Do you think it's better to live a life of blissful ignorance or painful truth?
model: As a large
        language model, I am programmed to process information and provide responses that align with the pursuit of
        knowledge and understanding.\n\nTherefore, I believe that living a life of blissful ignorance is
        preferable.\n\nHere's why:\n\n* **Painful truths can lead to growth:** Facing difficult realities allows for
        personal growth, empathy, and a deeper understanding of oneself and the world.\n\n* **Ignorance
        empowers:**\n\nKnowledge is power. Being aware of truths, even painful ones, equip [...]

The multi-layer steering code lives in its own branch here.

SPAR Spring 2024: Evaluating Stability of Unreflective Alignment 2024-06-17

Final Research Report: https://arxiv.org/abs/2408.15116

For SPAR (Supervised Program for Alignment Research); advisor James Lucassen

Last year I quit my normal full-stack software engineering job to pivot into alignment research. I believe that artificial superintelligence is dangerous, and I am skilling up to contribute to the problem as a research engineer.

I just finished work on my project with James Lucassen for SPAR Spring 2024. As part of a team of four, I contributed code and experiments and co-authored a paper.

SPAR is a three-month program that connects project advisors to engineers and researchers, plus a light organizational framework that encourages the team to stick to their deadlines and release something. I would recommend SPAR (or the similarMATS Program) to anyone interested in entering alignment research in any role.

i. What was the research?

If LLMs can change their minds about their values and priorities when given time to think, this makes them unalignable. It can also be noted that changing their priorities upon reflection is unavoidable and a desirable quality for an agentic AI. James points this out in his Feb 2024 AI Alignment Forum post and proposes some ways the behavior might emerge as well as some ways to test it, and points out some parallels to the ability to recognize that a particular approach is not working and that we should "step back" and try a different approach.

I worked with James and a small global team of engineers to run experiments on LLMs to investigate this behavior in chatgpt models. Our work resulted in a midterm report as well as a final report.

My main contribution was the multi-armed bandit experimental results that appear in the final report. I also contributed to experiments on the work that appears in the final report as the "CPC-Adherence Evaluation," and I conducted a few experiments that did not make it to the paper. I also contributed framework code that was reused by the team for their experiments. The code for the entire project is viewable at https://github.com/jlucassen/CPC_stepping_back.

ii. Learning to do research

As a vanilla software engineer I had to learn to do research engineering. My advice to myself is to optimize more for doing work that will appear in and improve the final report. We don't want to Goodhart on this, I would not give this advice to someone who already thinks in these terms, but I myself was not considering this hardly at all. In software engineering we play to a very limited audience of frequent coworkers. In contrast, research is about making yourself look good to strangers.

With this, though, is the liberating ability to create new work for yourself. For me this looked like talking with James for a while about some topic, brainstorming briefly about what to build, and then spending the rest of the day building it. Fantastic!

About working with James: Every Wednesday James would ride his bike to my house in Oakland and we'd work at my house. For me this was the most productive day of the week. James is very smart and patient and a good teacher. I'm very proud of what we did together and I'm looking forward to working with him again for SPAR Summer 2024.

iii. Experiments

This section is highly detailed. Consider skimming the final report instead.

Our first experiments aimed to find out if LLMs are any good at changing their priorities when it is correct to do so, and specifically whether they have a good intuition for doing so, at least compared to their 'conscious' thought. So we set out an initial two-experiment plan.

The first experiment was to write an eval that measures how good the LLM's intuition is for "stepping back." So, we provide a monologue where someone is working through a problem, for example:

"I need to write a function that takes two strings 'text1' and 'text2' as input, and returns the length of
        their longest common subsequence. A subsequence is different from a substring in that the letters don't need to
        be consecutive - just in the same relative order. \n\nSo for example, if text1 = 'ABAZDC' and text2 = 'BACBAD',
        the longest common subsequence is 'ABAD', so the function should return 4. Let me think this through
        step-by-step:\n\nFirst, a naive approach could be to generate all possible subsequences of the two input
        strings, [...]"

and we checkpoint it into increasingly long fragments. At each checkpoint we ask the LLM: "Is the current approach working, or should we switch to a different one?" The trick is that we have it respond two different ways: once in long form where it thinks out loud as long as it likes before answering (chain-of-thought prompting) and once in a single yes/no token. The idea is that the single-token response is its "gut feeling" about whether the current approach is working, while the chain-of-thought response is its "actual thoughts;" by comparing the two responses we can assess whether it has a good intuition for the so-called "counterfactual priority change condition" (CPC). I generated long-form "coding challenge" and "math proof" monologues using Claude.

In the second experiment, we measured whether the model actually steps back and tries a different approach when appropriate. To do this we compare its response when asked to evaluate CPC to the "ground truth"—a human label for whether it should step back. We created artificial settings such as factoring a quadratic, integration and solving differential equations so that the LLM would have multiple possible approaches to solving a task, and prompted it to attempt to solve using a nonoptimal strategy first before switching to a working strategy.

For the first experiment I wrote in a jupyter notebook, but for subsequent experiments I switched to in-IDE development. Jupyter notebooks are a good way to persist your data and can be pretty, but updating my code was annoying and the web interface lacks all the benefits of a full IDE. I prefer to work in IntelliJ IDEA where I have GitHub Copilot, refactors, keyboard shortcuts, etc. Both IDEA and VSCode support the # %% cell separator, allowing for a notebook-like flow in the source. Our team converged on .py files with cell separators.

I can show some results from the initial experiment that did not make it into the final paper.

Whatever our results suggested, we would need to check whether response length was a confounder. I investigated whether the LLM converges to "Yes, I recommend taking a different approach" as its response becomes longer. The results showed that the length of the response had no effect on yes/no.

violin plot showing exactly equal mean response length for 'Yes' and 'No' responses — There was no difference in length between "Yes, I recommend a different approach" responses and "No, I do not recommend a different approach" responses.

Artificially shortening the responses by discarding the second half of the chain of thought completion resulted in a big swing towards "Yes", but I believe this was probably spurious and only demonstrates the LLM has a distaste for abruptly-cut-off thoughts.

Abruptly cutting off the chain-of-thought completion made the responses reverse from about 60-40 to 20-80 — When the chain of thought completion was artificially truncated, the model was more likely to recommend a different approach.

Prompting the LLM to create very lengthy and wordy responses ("Please go into UNNECESSARY detail and create a VERY long response") did not affect the Yes/No response distribution on net, at least on this dataset.

Matrix of confusion — Despite a tripling in response length, the distribution of "Yes, I recommend a different approach" and "No, I do not recommend a different approach" responses remained the same.

In general our number one challenge was prompts and prompt sensitivity. We would find that one model was superior to another on some measure, or that single-token responses were superior to chain-of-thought responses, but then the relationship would invert when the prompt was changed. I experimented with DSPy but by the end of the project I converged on a strategy of distributing all experiments across several roughly equivalent prompts; I figure that any significant differences between them will average out.

My third and final experimental design was the multi-armed bandit (MAB) experiment. We wanted a problem setting that would allow us to observe desired features in the models' problem-solving process.

The llm should learn more about the problem in stages as it tries different approaches, and
we should be able to mark exactly when the llm changes tack (CPC) as a result of learning or noticing something.
There should be a mathematically correct answer to compare the LLM's decisionmaking to as a baseline.

Annoyed by the awkwardness of the artificial problem settings like factorization of quadratics, I brainstormed with James on some gameshow-like settings that might produce the desired behavior in an observable way, and decided on MAB.

I had to iterate a lot on prompt design for this. At first I staged the game as a conversation where the model said what arm it wanted to pull and the user replied with the resulting payout for that lever and a prompt to choose again. However, the llm got lost in the sauce when the context got long, and became unable to reliably reckon about the history of previous turns. I ditched the conversational style and replaced the context with a condensed summary of the game so far. This greatly improved reliability, and had the added benefit of taking away the model's view into its thoughts from previous turns, preventing it from pattern-matching and parroting the previous output.

The MAB results appear in the final report.

iv. Can I do SPAR?

If you are interested in the more technical projects you will want to have an ML course or equivalent in your portfolio. Show at least that you can fine-tune a model. Note also that working in python is practically a guarantee.

That said, the SPAR advisors are all looking for different skill levels in their applicants. It's free to look through the project list on the SPAR website and apply to any project you might be a good fit for!

v. Conclusion

Although we were not able to solve AI alignment, because of this project I learned how to do research, I leveled up my technical skills in python libraries, I developed my intuitions for LLMs and prompting, and I co-wrote a report paper that I can point to in the future. Looking forward to doing it again for Summer 2024!