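BERT and Its Derivative Models Explained

What is BERT? This article takes a detailed look at BERT, a Transformer-based language model: its architecture, pre-training methods, and fine-tuning, as well as derivative models such as RoBERTa and DistilBERT. A must-read for anyone who wants to learn the fundamentals of LLMs.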

芝紘希

Software Engineer

2025-6-6

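Introduction
BERT architecture
BERT model structure
Input/output representations
Pre-training
Masked language model (MLM)
Next sentence prediction (NSP)
Pre-training data
Fine-tuning
BERT experimental results
Strengths of BERT
Bidirectionality
Efficient transfer learning
BERT derivative models
RoBERTa
DistilBERT
Conclusion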

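This article explains BERT and its derivative models.

Previous articles covered the history of natural language processing, the Transformer, and LLMs. BERT is a language model built on the encoder part of the Transformer. Let's understand BERT, one of the foundations of today's LLMs, so that we can keep up with the trends to come together!

Unless otherwise noted, quotations are taken from the BERT paper. The exceptions are sections 7.1 RoBERTa and 7.2 DistilBERT, which quote the papers for those models.

Introduction

The field of natural language processing (NLP) made a major leap forward with BERT (Bidirectional Encoder Representations from Transformers), announced by Google in 2018. After its release, BERT set new state-of-the-art results on many NLP tasks and became the basis for many later language models.

BERT's most notable feature, as its name suggests, is bidirectionality. Language models before BERT could not capture the "deep" bidirectionality of natural language. Models such as GPT-1 read context in a single direction, mainly left to right (or right to left), so capturing both directions at once was structurally impossible. ELMo, which uses bidirectional LSTMs, computes a left-to-right pass and a right-to-left pass independently and merely concatenates the two hidden-layer outputs at the end. The paper describes this kind of bidirectionality as "shallow". In pre-training, BERT can take the context both before and after a word into account at the same time when predicting that word. This makes a deeper understanding and representation of language possible.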

"Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.", Abstract

The paper states that the unidirectionality of models such as GPT is a major limitation for language models and, depending on the task, can be "very harmful".
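
To make the contrast concrete, here is a minimal sketch of the two attention patterns written as 0/1 masks. The function names `causal_mask` and `bidirectional_mask` are made up for this illustration and do not come from either paper's code; real implementations apply such masks inside the self-attention layers.

```python
def causal_mask(n):
    """Left-to-right mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "cat", "sat", "down"]
for name, mask in (("causal", causal_mask(len(tokens))),
                   ("bidirectional", bidirectional_mask(len(tokens)))):
    print(name)
    for token, row in zip(tokens, mask):
        # Each row lists which positions this token is allowed to look at.
        print(f"  {token:>5}: {row}")
```

In the causal mask, the row for "cat" has zeros for "sat" and "down", so later words cannot inform its representation. In the bidirectional mask every row is all ones.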

"The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.", Section 1 Introduction

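BERT architecture

BERT is based on a neural network called the Transformer. The Transformer was covered in an earlier article, so please refer to 「Transformerの解説」 ("Explaining the Transformer").

BERT model structure

BERT stacks multiple layers of the Transformer encoder. This is why BERT is described as an encoder-only architecture.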

"BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.", Section 3 BERT

The paper mainly reports results for the following two model sizes.

BERT BASE:
- 12 Transformer blocks (layers)
- Hidden size of 768
- 12 self-attention heads
- Total parameters: 110M
BERT LARGE:
- 24 Transformer blocks (layers)
- Hidden size of 1024
- 16 self-attention heads
- Total parameters: 340M
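
As a rough sanity check on these totals, the sketch below counts parameters from the layer count and hidden size. It relies on assumptions that are not stated in this article: a feed-forward inner size of 4H, 512 positions, 2 segment types, a pooler layer on top, and the 30,000-token vocabulary mentioned later for WordPiece. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=30_000,
                     max_positions=512, num_segments=2):
    """Rough parameter count for a BERT-style encoder (assumed layout)."""
    ffn = 4 * hidden          # assumption: feed-forward inner size is 4H
    layer_norm = 2 * hidden   # scale and bias

    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm

    # Self-attention: query, key, value and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: H -> 4H -> H, with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    pooler = hidden * hidden + hidden  # dense layer applied to the [CLS] vector
    return embeddings + num_layers * per_layer + pooler


for name, layers, hidden in (("BASE", 12, 768), ("LARGE", 24, 1024)):
    print(f"BERT {name}: {bert_param_count(layers, hidden) / 1e6:.1f}M parameters")
```

Under these assumptions both estimates land close to the reported 110M and 340M. The number of attention heads does not appear in the formula because the heads split the same H-by-H projections rather than adding new ones.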

"We primarily report results on two model sizes: $BERT_{BASE}$ (L=12, H=768, A=12, Total Parameters=110M) and $BERT_{LARGE}$ (L=24, H=1024, A=16, Total Parameters=340M).", Section 3 BERT

$BERT_{BASE}$ was also designed to have the same model size as OpenAI GPT (GPT-1), which makes a direct comparison between the two models possible.

"$BERT_{BASE}$ was chosen to have the same model size as OpenAI GPT for comparison purposes.", Section 3 BERT

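Input/output representations

BERT takes either a single sentence or a pair of sentences as input, tokenizes it, and processes it as one sequence. Specifically:

- Uses WordPiece tokenization with a vocabulary of 30,000 tokens
- Places the special token `[CLS]` at the start of every sequence (used for classification tasks)
- Places the special token `[SEP]` between sentences
- Adds position embeddings
- For sentence pairs, adds segment embeddings to distinguish sentence A from sentence B

The input representation of each token is the sum of its token, segment, and position embeddings, which gives BERT the flexibility to handle a wide range of NLP tasks. These details are described under Input/Output Representations in Section 3 BERT.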
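
The layout above can be sketched as a small helper that assembles the token sequence together with its segment and position ids. This is only an illustration: `build_bert_input` is a hypothetical name, the word pieces are written by hand rather than produced by a real WordPiece tokenizer, and the closing `[SEP]` at the end of the sequence is an assumption about the layout.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble tokens, segment ids and position ids for one or two sentences."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # sentence A, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


# Hand-written word pieces; "##ing" marks a continuation piece.
tokens, segment_ids, position_ids = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
for token, segment, position in zip(tokens, segment_ids, position_ids):
    print(f"{position:>2}  segment={segment}  {token}")
```

Each id would then be looked up in its own embedding table, and the three vectors for a position are summed to form the model input.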

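Pre-training

What sets BERT apart from earlier NLP models is its pre-training method. BERT is pre-trained with two unsupervised tasks.

Masked language model (MLM)

BERT uses a technique called the masked language model (MLM). In this task, part of the input (15% of all tokens) is selected at random and masked, and the model predicts the original tokens.

The masking procedure is as follows.

Of the 15% of tokens that are selected:
- 80% are replaced with the `[MASK]` token
- 10% are replaced with a random token
- 10% are left unchanged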

"The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time.", Section 3.1 Pre-training BERT, Task #1: Masked LM

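With this masking scheme, the model learns to predict the hidden tokens using context from both directions. This gives it a deeper grasp of bidirectional context, which was difficult for conventional left-to-right language models.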
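
The 15% selection and the 80/10/10 rule can be sketched in plain Python as follows. This is a simplified illustration, not the original data generator: `mask_tokens` is a hypothetical name, the vocabulary is a toy list, and skipping `[CLS]` and `[SEP]` when choosing positions is an assumption.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) where labels maps position -> original token."""
    rng = random.Random(seed)
    # Assumption: special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    n_pred = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    chosen = sorted(rng.sample(candidates, n_pred))

    masked = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = tokens[i]  # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return masked, labels


sentence = ["[CLS]", "the", "quick", "brown", "fox", "jumps", "over",
            "the", "lazy", "dog", "[SEP]"]
masked, labels = mask_tokens(sentence, vocab=["cat", "runs", "blue"], seed=0)
print(masked)
print(labels)
```

Note that the prediction targets in `labels` are recorded for every chosen position, including the ones left unchanged, so the model cannot tell from the input alone which tokens it will be asked to predict.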

"In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.", Section 3.1 Pre-training BERT, Task #1: Masked LM

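Next sentence prediction (NSP)

BERT is also trained at the same time on a task called next sentence prediction (NSP). Given two sentences, the model predicts whether the second one actually follows the first.

Specifically:

- 50% of the time, two sentences that are actually consecutive are chosen (IsNext)
- 50% of the time, the second sentence is chosen at random (NotNext)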
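
A minimal sketch of how such training pairs could be generated is shown below. The function name `make_nsp_example` and the toy documents are made up for illustration, and drawing the random sentence from a different document is an assumption of this sketch.

```python
import random


def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, label) for one NSP training example."""
    doc_index = rng.randrange(len(documents))
    document = documents[doc_index]
    a_index = rng.randrange(len(document) - 1)  # leave room for a next sentence
    sentence_a = document[a_index]

    if rng.random() < 0.5:
        # 50%: B is the sentence that really follows A.
        return sentence_a, document[a_index + 1], "IsNext"

    # 50%: B is a random sentence (assumption: taken from another document).
    other_docs = [d for i, d in enumerate(documents) if i != doc_index]
    sentence_b = rng.choice(rng.choice(other_docs))
    return sentence_a, sentence_b, "NotNext"


documents = [
    ["I went to the store.", "I bought some milk.", "Then I walked home."],
    ["Penguins cannot fly.", "They are strong swimmers."],
]
rng = random.Random(0)
for _ in range(4):
    print(make_nsp_example(documents, rng))
```

The sketch expects at least two documents with at least two sentences each. The resulting pair would be packed into a single input sequence, with the IsNext/NotNext label predicted from the `[CLS]` position.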

"Specifically, when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).", Section 3.1 Pre-training BERT, Task #2: Next Sentence Prediction (NSP)

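This task lays the groundwork for downstream tasks such as question answering and natural language inference, where understanding the relationship between two sentences is important.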

"Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling.", Section 3.1 Pre-training BERT, Task #2: Next Sentence Prediction (NSP)

Pre-training data

BERT was pre-trained on the following large corpora:

- BooksCorpus (800M words)
- English Wikipedia (2,500M words)

"For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).", Section 3.1 Pre-training BERT

Because BERT's pre-training needs long contiguous passages of text, it uses a document-level corpus rather than a shuffled sentence-level corpus.

"It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.", Section 3.1 Pre-training BERT

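Fine-tuning

One of BERT's strengths is that the pre-trained model can easily be adapted to a variety of downstream tasks.

In fine-tuning, the parameters of the pre-trained BERT model are used as initial values and then adjusted on data for the specific task. The basic procedure is:

- Add task-specific input and output layers
- Update all parameters end-to-end on the target task's data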

"For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.", Section 3.2 Fine-tuning BERT

Fine-tuning is computationally cheap compared with pre-training. According to the paper, every result it reports can be reproduced in at most 1 hour on a single Cloud TPU, or in a few hours on a GPU.

"Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.", Section 3.2 Fine-tuning BERT
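
For a classification task, the added output layer can be as small as one linear layer followed by a softmax over the final `[CLS]` vector. The sketch below shows only that forward pass and the loss, with made-up numbers and a 4-dimensional stand-in for the `[CLS]` vector; in actual fine-tuning the gradient of this loss would also update every parameter inside BERT. All names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Task-specific output layer: logits = W @ h_cls + b, then softmax."""
    logits = [sum(w * h for w, h in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs, label):
    """Classification loss for a single example."""
    return -math.log(probs[label])


# Made-up values standing in for the encoder output and the new layer.
cls_vector = [0.2, -1.0, 0.5, 0.3]
weights = [[0.1, 0.0, -0.2, 0.4],   # class 0
           [-0.3, 0.2, 0.1, 0.0]]   # class 1
bias = [0.0, 0.0]

probs = classify(cls_vector, weights, bias)
print("class probabilities:", probs)
print("loss for label 0:", cross_entropy(probs, label=0))
```

Only `weights` and `bias` are new parameters here, which is why adapting BERT to a new classification task requires so little task-specific engineering.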

BERT experimental results

The paper evaluates BERT on 11 different NLP tasks and reports new state-of-the-art results on all of them.

"BERT advances the state of the art for eleven NLP tasks.", Section 1 Introduction

GLUE benchmark

On the GLUE (General Language Understanding Evaluation) benchmark, $BERT_{LARGE}$ scored 80.5, which is 7.7 points above the 72.8 obtained by OpenAI GPT.

“On the official GLUE leaderboard, $BERT_{LARGE}$ obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.”, Section 4.1 GLUE

SQuAD question answering

On SQuAD v1.1, $BERT_{LARGE}$ reached a Test F1 of 93.2, 1.5 points above the previous best. On SQuAD v2.0 it likewise reached a Test F1 of 83.1, 5.1 points above the previous best.

“SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).”, Abstract

SWAG

On SWAG, a task that evaluates commonsense inference, $BERT_{LARGE}$ outperformed the baseline at the time, ESIM (Enhanced Sequential Inference Model) + ELMo (Embeddings from Language Models), by 27.1% and OpenAI GPT (GPT-1) by 8.3%.

“$BERT_{LARGE}$ outperforms the authors’ baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%.”, Section 4.4 SWAG

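Strengths of BERT

This section looks at BERT's strengths to explain why it became a foundation for recent LLMs.

Bidirectionality

BERT's greatest strength is that it can take information from both directions of the context into account at the same time. This design lets it use the words before and after a given word to understand that word's meaning more deeply. The word "bidirectional" appears as many as 23 times in the paper, which suggests how important bidirectionality was to BERT's authors.

Efficient transfer learning

Simply adding an output layer to the pre-trained model is enough to apply it to a wide variety of tasks while keeping high performance. Not having to redesign the model for each downstream task is very welcome from a development-cost point of view.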

“As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.”, Abstract

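In addition, larger models improve accuracy even on tasks that differ substantially from the pre-training tasks and have only a small labeled dataset. Labeled data is expensive to obtain, so performing well with little of it helps keep the cost of adapting to a new task down.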

“We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks.”, Section 5.2 Effect of Model Size

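BERT derivative models

Many derivative models have been developed since BERT appeared. This chapter introduces two major BERT-family models.

RoBERTa

Quotations in this section are taken from the RoBERTa paper, "RoBERTa: A Robustly Optimized BERT Pretraining Approach".

What is RoBERTa?

RoBERTa (Robustly optimized BERT approach) revisits BERT's pre-training procedure and improves performance through design changes and hyperparameter tuning. Rather than introducing a new architecture, RoBERTa gets large gains by optimizing how the existing architecture is trained. The main changes are more training steps and more data, dynamic masking, removal of the NSP (Next Sentence Prediction) task, large mini-batches, and a byte-level BPE vocabulary.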

“We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size.”, Abstract

“Specifically, RoBERTa is trained with dynamic masking (Section 4.1), FULL-SENTENCES without NSP loss (Section 4.2), large mini-batches (Section 4.3) and a larger byte-level BPE (Section 4.4).”, Section 5 RoBERTa

“Finally, we pretrain RoBERTa for significantly longer, increasing the number of pretraining steps from 100K to 300K, and then further to 500K. We again observe significant gains in downstream task performance, and the 300K and 500K step models outperform $XLNet_{LARGE}$ across most tasks. We note that even our longest-trained model does not appear to overfit our data and would likely benefit from additional training.”, Section 5 RoBERTa

With these changes, RoBERTa achieved the state-of-the-art results at the time on the NLP benchmarks GLUE, SQuAD, and RACE.

“Our improved pretraining procedure, which we call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD.”, Section 7 Conclusion

Main differences from BERT

RoBERTa uses the same Transformer architecture as BERT, but its training procedure differs in several important ways.

“We keep the model architecture fixed. Specifically, we begin by training BERT models with the same configuration as $BERT_{BASE}$ (L = 12, H = 768, A = 12, 110M params).”, Section 4 Training Procedure Analysis

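First, RoBERTa drops the NSP task used in BERT entirely. The RoBERTa authors found that removing this sentence-relationship task matches or slightly improves downstream task performance, and concluded that it is unnecessary.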

“We find that this setting outperforms the originally published $BERT_{BASE}$ results and that removing the NSP loss matches or slightly improves downstream task performance, in contrast to Devlin et al. (2019).”, Section 4.2 Model Input Format and Next Sentence Prediction

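Second, the original BERT implementation decided its mask pattern once, statically, during data preprocessing, whereas RoBERTa generates a new mask dynamically every time a sequence is fed to the model. The model therefore sees differently masked versions of the same sequence over the course of training. In the paper's comparison, dynamic masking performs comparably to or slightly better than static masking.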

“The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask.”, Section 4.1 Static vs. Dynamic Masking

“We compare this strategy with dynamic masking where we generate the masking pattern every time we feed a sequence to the model.”, Section 4.1 Static vs. Dynamic Masking
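
The difference between the two strategies can be sketched as follows. This is a simplified illustration rather than either implementation: the helper names are hypothetical, and `apply_mask` replaces every selected token with `[MASK]` instead of applying the 80/10/10 rule.

```python
import random


def apply_mask(tokens, rng, mask_prob=0.15):
    """Replace a random subset of tokens with [MASK] (simplified masking)."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]


def static_masking(tokens, num_epochs, seed=0):
    """Mask once during preprocessing and reuse the same pattern every epoch."""
    masked = apply_mask(tokens, random.Random(seed))
    return [masked for _ in range(num_epochs)]


def dynamic_masking(tokens, num_epochs, seed=0):
    """Draw a fresh mask each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [apply_mask(tokens, rng) for _ in range(num_epochs)]


tokens = "the quick brown fox jumps over the lazy dog near the river".split()
print("static:")
for epoch in static_masking(tokens, 3):
    print(" ", " ".join(epoch))
print("dynamic:")
for epoch in dynamic_masking(tokens, 3):
    print(" ", " ".join(epoch))
```

With static masking every epoch shows the model the same prediction targets for a given sequence, while dynamic masking varies them from one pass to the next.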

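DistilBERT

Quotations in this section are taken from the DistilBERT paper, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter".

What is DistilBERT?

DistilBERT is a compressed version of BERT proposed by Hugging Face, designed around the ideas of being smaller, faster, cheaper, and lighter. Concretely, it reduces the size of BERT by 40% while retaining 97% of its language understanding capabilities and being 60% faster. This reduction is achieved not by simply cutting parameters but through a technique called knowledge distillation.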

“While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster.”, Abstract

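Knowledge distillation is a technique in which a compact student model (here, DistilBERT) is trained to reproduce the behavior of a larger teacher model (here, BERT). This compresses the knowledge held by the teacher into a smaller model, so the student can reach high accuracy with fewer parameters.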

“Knowledge distillation [Bucila et al., 2006, Hinton et al., 2015] is a compression technique in which a compact model - the student - is trained to reproduce the behaviour of a larger model - the teacher or an ensemble of models.”, Section 2 Knowledge distillation

During distillation, three losses are combined: a language modeling loss, a distillation loss, and a cosine-distance loss.

“To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses.”, Abstract
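
To give a feel for how three such terms fit together, here is a minimal sketch for a single masked position. It is not the DistilBERT implementation: the function names, the temperature, and the equal weights are assumptions chosen for illustration, and the distillation term is written as a cross-entropy between temperature-softened teacher and student distributions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft targets."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def mlm_loss(student_logits, target_index):
    """Masked language modeling loss against the true token."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between student and teacher hidden vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                w_distill=1.0, w_mlm=1.0, w_cos=1.0):
    """Weighted sum of the three terms (the weights are hypothetical)."""
    return (w_distill * distillation_loss(student_logits, teacher_logits)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))


# Made-up logits over a 4-token vocabulary and 3-dimensional hidden vectors.
print(triple_loss([2.0, 0.5, 0.1, -1.0], [2.5, 0.3, 0.0, -1.2], 0,
                  [0.4, 0.1, -0.3], [0.5, 0.0, -0.2]))
```

The distillation term pulls the student's whole output distribution toward the teacher's, the language modeling term keeps it anchored to the true tokens, and the cosine term aligns the direction of the hidden vectors.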

Architecture and evaluation

DistilBERT keeps BERT's general architecture but halves the number of layers. Concretely, the 12-layer encoder is reduced to 6 layers, and the token-type embeddings and the pooler are removed.

“In the present work, the student - DistilBERT - has the same general architecture as BERT. The token-type embeddings and the pooler are removed while the number of layers is reduced by a factor of 2.”, Section 3 DistilBERT: a distilled version of BERT

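These changes reduce the computational cost and yield a lightweight model that can also run on CPUs and mobile devices. On the GLUE benchmark the gap to BERT is small, and on STS-B (semantic textual similarity) and MRPC (paraphrase detection) in particular the scores are close to BERT's. On IMDb (sentiment analysis of movie reviews) DistilBERT is only 0.6 points behind BERT in test accuracy, and on SQuAD (question answering) it is within 3.9 points, while offering faster inference and a smaller memory footprint.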

“Among the 9 tasks, DistilBERT is always on par or improving over the ELMo baseline (up to 19 points of accuracy on STS-B). DistilBERT also compares surprisingly well to BERT, retaining 97% of the performance with 40% fewer parameters.”, Section 4 Experiments

“As shown in Table 2, DistilBERT is only 0.6% point behind BERT in test accuracy on the IMDb benchmark while being 40% smaller. On SQuAD, DistilBERT is within 3.9 points of the full BERT.”, Section 4.1 Downstream task benchmark

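Application to mobile devices

DistilBERT's light weight is especially useful for on-device inference. For example, in a prototype question answering application on a smartphone (iPhone 7 Plus), DistilBERT was 71% faster than BERT-base, excluding the tokenization step, and the whole model weighs 207 MB. Quantization could reduce the size further.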

“We studied whether DistilBERT could be used for on-the-edge applications by building a mobile application for question answering. We compare the average inference time on a recent smartphone (iPhone 7 Plus) against our previously trained question answering model based on BERT-base. Excluding the tokenization step, DistilBERT is 71% faster than BERT, and the whole model weighs 207 MB (which could be further reduced with quantization).”, Section 4.1 Downstream task benchmark

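DistilBERT makes accurate natural language processing feasible in environments with limited memory and on low-power devices, and is expected to be used in mobile apps and IoT devices.

Conclusion

This article covered BERT, a forerunner of the LLMs now attracting so much attention, and its derivative models. Thanks to its bidirectional Transformer architecture and effective pre-training methods, BERT delivered strong performance on a wide range of NLP tasks. Next time, we will look at the decoder-only GPT family of models, the counterpart among Transformer derivatives to the encoder-only BERT family.

Thank you for reading this article to the end. See you in the next one!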

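Author

芝紘希

Software Engineer

Currently a student in the 情報知能工学科 department of the Faculty of Engineering at Kobe University. Single-mindedly devoted to building muscle.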
