Memo: Word2Vec - "Truth of the Legend" Notes

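Below is the full text, together with a Japanese translation, of the documentation for the original word2vec implementation archived on Google Code. It reads more like a set of notes than formal documentation, but it briefly explains the basic commands of the C implementation of word2vec and the demo scripts that ship with it.

As far as I can tell, the program itself is a good old (and bad old) UNIX C program. I plan to cover the details in a separate article. But really, please, just write a man page.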

Tool for computing continuous distributed representations of words.

単語の連続的な分散表現を計算するためのツール。

Introduction（イントロダクション）

This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.

このツールは、単語のベクトル表現を計算するための連続バッグオブワードおよびスキップグラムアーキテクチャの効率的な実装を提供します。これらの表現は、その後、多くの自然言語処理アプリケーションやさらなる研究に使用することができます。

Quick start（クイックスタート）

Download the code: svn checkout http://word2vec.googlecode.com/svn/trunk/
Run 'make' to compile word2vec tool
Run the demo scripts: ./demo-word.sh and ./demo-phrases.sh

For questions about the toolkit, see http://groups.google.com/group/word2vec-toolkit

コードをダウンロードする: svn checkout http://word2vec.googlecode.com/svn/trunk/
'make' を実行して word2vec ツールをコンパイルする
デモスクリプトを実行する: ./demo-word.sh と ./demo-phrases.sh

ツールキットについての質問は http://groups.google.com/group/word2vec-toolkit を参照してください。

How does it work（どのように動作するのか）

The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

word2vecツールはテキストコーパスを入力とし、出力として単語ベクトルを生成します。まず、学習テキストデータから語彙を構築し、単語のベクトル表現を学習します。生成された単語ベクトルファイルは、多くの自然言語処理や機械学習アプリケーションで特徴量として使用することができます。

A simple way to investigate the learned representations is to find the closest words for a user-specified word. The distance tool serves that purpose. For example, if you enter 'france', distance will display the most similar words and their distances to 'france', which should look like:

学習された表現を調べる簡単な方法は、ユーザが指定した単語に最も近い単語を見つけることです。distanceツールがその目的を果たします。例えば、'france'と入力すると、distanceは最も類似した単語と'france'までの距離を表示します。出力は以下のようになります。

             Word       Cosine distance

            spain          0.678515
          belgium          0.665923
      netherlands          0.652428
            italy          0.633130
      switzerland          0.622323
       luxembourg          0.610033
         portugal          0.577154
           russia          0.571507
          germany          0.563291
        catalonia          0.534176
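
The ranking above can be reproduced in miniature. The following is a minimal sketch, assuming the scores shown are cosine similarities between word vectors (higher means closer); the toy 3-dimensional vectors and the helper names `cosine_similarity` and `nearest_words` are invented for illustration and are not part of the word2vec tool.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, vectors, top_n=3):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, made up for this example (real models use hundreds of dimensions).
toy_vectors = {
    "france": [0.9, 0.1, 0.0],
    "spain": [0.8, 0.2, 0.1],
    "italy": [0.7, 0.3, 0.0],
    "banana": [0.0, 0.1, 0.9],
}

for word, score in nearest_words("france", toy_vectors):
    print(f"{word:>17}          {score:.6f}")
```

With these toy values the country-like vectors rank ahead of the unrelated one, which mirrors the shape of the distance output.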

There are two main learning algorithms in word2vec : continuous bag-of-words and continuous skip-gram. The switch -cbow allows the user to pick one of these learning algorithms. Both algorithms learn the representation of a word that is useful for prediction of other words in the sentence. These algorithms are described in detail in [1,2].

word2vecには主に2つの学習アルゴリズムがあります：連続的なbag-of-wordsと連続的なskip-gramです。スイッチ -cbow を使うと、これらの学習アルゴリズムのいずれかを選択することができます。どちらのアルゴリズムも、文中の他の単語を予測するのに有用な単語の表現を学習します。これらのアルゴリズムについては、[1,2]で詳しく説明されています。

Interesting properties of the word vectors（言葉のベクトルの面白い性質）

It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh.

単語ベクトルが多くの言語的規則性を捉えていることが最近示されました。例えば、ベクトル演算 vector('Paris') - vector('France') + vector('Italy') の結果は vector('Rome') に非常に近いベクトルになり、vector('king') - vector('man') + vector('woman') は vector('queen') に近くなります [3, 1]。demo-analogy.sh を実行すると簡単なデモを試すことができます。
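
The vector arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the code behind demo-analogy.sh: the 2-dimensional toy vectors are hand-built so that the analogy holds exactly, and the helper name `analogy` is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Return the word closest to vector(b) - vector(a) + vector(c), skipping the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


# Hand-made toy vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

# vector('king') - vector('man') + vector('woman')
print(analogy("man", "king", "woman", toy_vectors))  # prints "queen"
```

Excluding the three input words from the candidates is a choice made in this sketch; without it, one of the inputs is often the nearest neighbor of the result.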

To observe strong regularities in the word vector space, it is needed to train the models on large data set, with sufficient vector dimensionality as shown in [1]. Using the word2vec tool, it is possible to train models on huge data sets (up to hundreds of billions of words).

単語ベクトル空間の強い規則性を観測するためには、[1]に示すように、十分なベクトル次元数を用いて大規模なデータセットでモデルを学習する必要があります。word2vecツールを用いることで、数千億語にも及ぶ膨大なデータセットに対してモデルを学習することが可能となります。

From words to phrases and beyond（言葉からフレーズへ、そしてその先へ）

In certain applications, it is useful to have vector representation of larger pieces of text. For example, it is desirable to have only one vector for representing 'san francisco'. This can be achieved by pre-processing the training data set to form the phrases using the word2phrase tool, as is shown in the example script ./demo-phrases.sh. The example output with the closest tokens to 'san_francisco' looks like:

特定のアプリケーションでは、より大きなテキストの断片をベクトルで表現することが有用です。例えば、'san francisco'を表現するためのベクトルは１つだけであることが望ましいでしょう。これは、例のスクリプト./demo-phrases.shに示されているように、word2phraseツールを使用してフレーズを形成するために学習データセットを前処理することによって達成することができます。'san_francisco'に最も近いトークンを用いた出力例は以下のようになります。

             Word       Cosine distance

      los_angeles          0.666175
      golden_gate          0.571522
          oakland          0.557521
       california          0.554623
        san_diego          0.534939
         pasadena          0.519115
          seattle          0.512098
            taiko          0.507570
          houston          0.499762
 chicago_illinois          0.491598
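
The phrase-forming pre-processing step can be illustrated with a small sketch. It assumes the bigram score given in [2], score(a, b) = (count(a b) - delta) / (count(a) * count(b)), and joins pairs whose score exceeds a threshold. The function name `join_phrases` and the values of `delta` and `threshold` are chosen only for this toy example; the released word2phrase tool may differ in details such as scaling and defaults.

```python
from collections import Counter


def join_phrases(tokens, delta=1.0, threshold=0.1):
    """Join adjacent tokens into 'a_b' when their bigram score exceeds `threshold`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    output = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Score from [2]: discounted bigram count over the product of unigram counts.
            score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
            if score > threshold:
                output.append(a + "_" + b)
                i += 2
                continue
        output.append(tokens[i])
        i += 1
    return output


corpus = "i love san francisco and san francisco loves me".split()
print(join_phrases(corpus))
# ['i', 'love', 'san_francisco', 'and', 'san_francisco', 'loves', 'me']
```

The discount `delta` keeps pairs that occur only once from being merged, which is why only 'san francisco' is joined here.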

The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences [2].

ベクトル演算の線形性は、複数のベクトルの加算についても弱いながら成り立つようです。そのため、複数の単語やフレーズのベクトルを加算して短い文の表現を作ることができます [2]。

How to measure quality of the word vectors（単語ベクトルの品質を測る方法）

Several factors influence the quality of the word vectors:
* amount and quality of the training data
* size of the vectors
* training algorithm

学習データの量と質、ベクトルのサイズ、学習アルゴリズムなど、いくつかの要因が単語ベクトルの品質に影響を与えます。

The quality of the vectors is crucial for any application. However, exploration of different hyper-parameter settings for complex tasks might be too time demanding. Thus, we designed simple test sets that can be used to quickly evaluate the word vector quality.

ベクトルの品質はどのようなアプリケーションにとっても非常に重要です。しかし、複雑なタスクのために異なるハイパーパラメータ設定を探索するのは時間がかかりすぎるかもしれません。そこで、我々は、単語ベクトルの品質を迅速に評価するために使用できる簡単なテストセットを設計しました。

For the word relation test set described in [1], see ./demo-word-accuracy.sh, for the phrase relation test set described in [2], see ./demo-phrase-accuracy.sh. Note that the accuracy depends heavily on the amount of the training data; our best results for both test sets are above 70% accuracy with coverage close to 100%.

[1]で記述された単語関係のテストセットについては ./demo-word-accuracy.sh を、[2]で記述されたフレーズ関係のテストセットについては ./demo-phrase-accuracy.sh を参照してください。精度は学習データの量に大きく依存することに注意してください。両方のテストセットでの私たちの最良の結果は、カバレッジがほぼ100%の状態で精度70%以上です。

Word clustering（単語クラスタリング）

The word vectors can be also used for deriving word classes from huge data sets. This is achieved by performing K-means clustering on top of the word vectors. The script that demonstrates this is ./demo-classes.sh. The output is a vocabulary file with words and their corresponding class IDs, such as:

単語ベクトルは、巨大なデータセットから単語クラスを導出するためにも使用することができます。これは、単語ベクトルの上でK-meansクラスタリングを実行することで達成されます。これを実演するスクリプトは ./demo-classes.sh です。出力されるのは、単語とそれに対応するクラスIDを持つボキャブラリーファイルです。

carnivores 234 
carnivorous 234 
cetaceans 234 
cormorant 234 
coyotes 234 
crocodile 234 
crocodiles 234 
crustaceans 234 
cultivated 234 
danios 234 
. . . 
acceptance 412 
argue 412 
argues 412 
arguing 412 
argument 412 
arguments 412 
belief 412 
believe 412 
challenge 412 
claim 412
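
The clustering step can be sketched with a plain k-means loop. This is a minimal illustration, not the code used by ./demo-classes.sh: the function name `kmeans`, the naive "first k points" initialization, and the 2-dimensional toy vectors are all assumptions made for the example.

```python
def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; returns one cluster ID per input point."""
    centroids = [list(p) for p in points[:k]]  # naive init: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Toy word vectors, invented for illustration.
words = ["coyotes", "argue", "crocodile", "argument"]
points = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.0, 0.9]]

for word, class_id in zip(words, kmeans(points, k=2)):
    print(word, class_id)
```

The printed lines have the same "word class-ID" shape as the vocabulary file shown above, with the animal words sharing one ID and the argument-related words the other.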

Performance（パフォーマンス）

The training speed can be significantly improved by using parallel training on multiple-CPU machine (use the switch '-threads N'). The hyper-parameter choice is crucial for performance (both speed and accuracy), however varies for different applications. The main choices to make are:

学習速度は、複数CPUのマシンで並列学習を行うことで大幅に向上します（スイッチ '-threads N' を使用）。ハイパーパラメータの選択は性能（速度と精度の両方）に重要ですが、アプリケーションによって異なります。主な選択は以下の通りです。

architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
dimensionality of the word vectors: usually more is better, but not always
context (window) size: for skip-gram usually around 10, for CBOW around 5

アーキテクチャ: スキップグラム (遅い、頻度の低い単語に適している) vs CBOW (速い)
学習アルゴリズム: 階層的ソフトマックス (頻度の低い単語に対してより良い) vs ネガティブサンプリング (頻度の高い単語に対してより良い、低次元ベクトルに対してより良い)
頻出語のサブサンプリング: 大規模データセットの精度と速度を向上させることができます(有用な値は1e-3から1e-5の範囲です)
単語ベクトルの次元性: 通常は多ければ多いほど良いが、必ずしもそうとは限らない
コンテキスト (ウィンドウ) サイズ: スキップグラムでは通常約10、CBOWでは約5
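
To give a feel for the sub-sampling threshold, here is a small sketch based on the formula in [2], where a word with relative frequency f is discarded with probability 1 - sqrt(t / f). The function name `discard_probability` is hypothetical, the default t = 1e-4 is simply a value inside the range quoted above, and the released C code may use a slightly different expression.

```python
import math


def discard_probability(word_frequency, t=1e-4):
    """Probability of dropping one occurrence of a word, per the formula in [2].

    `word_frequency` is the word's share of all tokens (e.g. 0.05 for 5%).
    The result is clamped at 0 so rare words are always kept.
    """
    return max(0.0, 1.0 - math.sqrt(t / word_frequency))


for freq in (0.05, 0.001, 0.00001):
    print(f"frequency {freq:.5f} -> discard probability {discard_probability(freq):.3f}")
```

A very frequent word (5% of all tokens) is dropped most of the time, while a word rarer than the threshold is never dropped, which is how sub-sampling can speed up training without discarding informative words.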

Where to obtain the training data（トレーニングデータの入手先）

The quality of the word vectors increases significantly with amount of the training data. For research purposes, you can consider using data sets that are available on-line:

単語ベクトルの品質は、学習データの量によって著しく向上します。研究目的のために、オンラインで利用可能なデータセットの利用を検討することができます。

First billion characters from wikipedia (use the pre-processing perl script from the bottom of Matt Mahoney's page)
Latest Wikipedia dump: use the same script as above to obtain clean text. Should be more than 3 billion words.
WMT11 site: text data for several languages (duplicate sentences should be removed before training the models)
Dataset from "One Billion Word Language Modeling Benchmark": almost 1B words, already pre-processed text.
UMBC webbase corpus: around 3 billion words, more info here. Needs further processing (mainly tokenization).
Text data from more languages can be obtained at statmt.org and in the Polyglot project.

Wikipediaの最初の10億文字（Matt Mahoneyのページの下の方にある前処理用のperlスクリプトを使う）
最新のWikipediaダンプ: クリーンなテキストを取得するには上記と同じスクリプトを使用します。30億語以上あるはずです。
WMT11サイト: 複数の言語のテキストデータ（モデルを学習する前に重複した文を削除する必要があります）
"One Billion Word Language Modeling Benchmark" のデータセット: 約10億語、前処理済みのテキスト。
UMBC webbaseコーパス: 約30億語、詳細はこちら。さらなる処理（主にトークン化）が必要。
より多くの言語のテキストデータは statmt.org や Polyglot プロジェクトで入手できます。

Pre-trained word and phrase vectors（予め学習された単語とフレーズのベクトル）

We are publishing pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [2]. The archive is available here: GoogleNews-vectors-negative300.bin.gz.

Google Newsのデータセット（約1,000億語）の一部で訓練された事前訓練済みのベクトルを公開しています。モデルには300万語の単語とフレーズの300次元ベクトルが含まれています。フレーズは、[2]で説明したシンプルなデータ駆動型のアプローチを用いて取得しました。アーカイブはこちらからご覧いただけます。GoogleNews-vectors-negative300.bin.gz。
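
For readers who want to inspect such a .bin file outside the bundled tools, the following is a sketch of a reader. The layout is an assumption not stated in this document: a text header line "<vocab size> <dimensions>", then for each entry the word, a single space, and the vector as little-endian 32-bit floats, optionally followed by a newline. The function name `read_word2vec_binary` is hypothetical, and the demo parses a tiny in-memory file rather than the real archive.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Parse the assumed layout: header line, then '<word> ' + dim float32 values per entry."""
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that may separate entries
                chars.extend(ch)
        word = chars.decode("utf-8", errors="replace")
        # Assumption: vectors are stored as little-endian 32-bit floats.
        vectors[word] = struct.unpack(f"<{dim}f", stream.read(4 * dim))
    return vectors


# Build a tiny two-word file in memory to exercise the reader.
payload = b"2 3\n"
payload += b"Yangtze " + struct.pack("<3f", 0.5, 0.25, -1.0) + b"\n"
payload += b"river " + struct.pack("<3f", 1.0, 0.0, 2.0) + b"\n"

vectors = read_word2vec_binary(io.BytesIO(payload))
print(vectors["river"])  # (1.0, 0.0, 2.0)
```

For the real archive, one would decompress the .gz file first and pass a file opened in binary mode; a model with 3 million 300-dimensional vectors needs several gigabytes of memory when loaded this way.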

An example output of ./distance GoogleNews-vectors-negative300.bin:

./distance GoogleNews-vectors-negative300.binの出力例。

Enter word or sentence (EXIT to break): Chinese river

             Word       Cosine distance

    Yangtze_River           0.667376
          Yangtze           0.644091
   Qiantang_River           0.632979
Yangtze_tributary           0.623527
 Xiangjiang_River           0.615482
    Huangpu_River           0.604726
   Hanjiang_River           0.598110
    Yangtze_river           0.597621
      Hongze_Lake           0.594108
          Yangtse           0.593442

The above example will average vectors for words 'Chinese' and 'river' and will return the closest neighbors to the resulting vector. More examples that demonstrate results of vector addition are presented in [2]. Note that more precise and disambiguated entity vectors can be found in the following dataset that uses Freebase naming.

上記の例では、単語 'Chinese' と 'river' のベクトルを平均し、その結果得られたベクトルの最近傍を返します。ベクトル加算の結果を示す例は [2] でさらに紹介されています。なお、より正確で曖昧性が解消されたエンティティベクトルは、Freebase の命名を用いた以下のデータセットにあります。
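
The averaging of a multi-word query can be sketched as below. This is an illustration only: the helper names `average` and `closest_to_phrase` are hypothetical, the 2-dimensional vectors are invented, and skipping the query words among the candidates is a choice made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average(vectors_to_mix):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors_to_mix) for col in zip(*vectors_to_mix)]


def closest_to_phrase(words, vectors):
    """Average the vectors of `words`, then return the nearest other entry."""
    query = average([vectors[w] for w in words])
    candidates = (w for w in vectors if w not in words)
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))


# Toy vectors: first axis ~ "Chinese", second axis ~ "river".
toy_vectors = {
    "Chinese": [1.0, 0.0],
    "river": [0.0, 1.0],
    "Yangtze_River": [0.9, 0.9],
    "Beijing": [1.0, 0.1],
    "Thames": [0.1, 1.0],
}

print(closest_to_phrase(["Chinese", "river"], toy_vectors))  # prints "Yangtze_River"
```

The averaged query sits between the two input directions, so the entry that shares both properties wins over entries that match only one of them.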

Pre-trained entity vectors with Freebase naming（Freebaseネーミングを用いた事前学習済みのエンティティベクター）

We are also offering more than 1.4M pre-trained entity vectors with naming from Freebase. This is especially helpful for projects related to knowledge mining.

また、Freebaseからネーミング付きの事前学習済みエンティティベクターを140万個以上提供しています。特にナレッジマイニング関連のプロジェクトに役立ちます。

Entity vectors trained on 100B words from various news articles: freebase-vectors-skipgram1000.bin.gz
Entity vectors trained on 100B words from various news articles, using the deprecated /en/ naming (more easily readable); the vectors are sorted by frequency: freebase-vectors-skipgram1000-en.bin.gz

Here is an example output of ./distance freebase-vectors-skipgram1000-en.bin:

様々なニュース記事から得られた1,000億語を用いて学習されたエンティティベクトル: freebase-vectors-skipgram1000.bin.gz
様々なニュース記事から得られた1,000億語を用いて学習され、非推奨の /en/ ネーミング（より読みやすい）を使用したエンティティベクトル。ベクトルは頻度順にソートされています: freebase-vectors-skipgram1000-en.bin.gz

以下は ./distance freebase-vectors-skipgram1000-en.bin の出力例です。

Enter word or sentence (EXIT to break): /en/geoffrey_hinton

Word                        Cosine distance

/en/marvin_minsky           0.457204
/en/paul_corkum             0.443342
/en/william_richard_peltier 0.432396 
/en/brenda_milner           0.430886 
/en/john_charles_polanyi    0.419538 
/en/leslie_valiant          0.416399 
/en/hava_siegelmann         0.411895 
/en/hans_moravec            0.406726 
/en/david_rumelhart         0.405275 
/en/godel_prize             0.405176

Final words（最後の言葉）

Thank you for trying out this toolkit, and do not forget to let us know when you obtain some amazing results! We hope that the distributed representations will significantly improve the state of the art in NLP.

このツールキットを試していただきありがとうございます。素晴らしい結果が得られたら、ぜひ私たちにお知らせください。私たちは、分散表現がNLPの最先端技術を大幅に向上させることを期待しています。

References（参考文献）

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
Efficient Estimation of Word Representations in Vector Space.
In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean.
Distributed Representations of Words and Phrases and their Compositionality.
In Proceedings of NIPS, 2013.

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.
Linguistic Regularities in Continuous Space Word Representations.
In Proceedings of NAACL HLT, 2013.

Disclaimer（免責事項）

This open source project is NOT a Google product, and is released for research purposes only.

このオープンソースプロジェクトはGoogleの製品ではなく、研究目的でのみ公開されています。