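This package apparently supports Python, Java, and C# as well as C++,
but C++ is what I need. I first tried installing the binary distribution,
but the macOS Big Sur security check rejected it.
(Since Google distributes it, there may be a way around this problem by now...)
So I had no choice but to install from source. The procedure is on the following page.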
Then, while browsing the source tree, I found some CMake-related files and thought,
"Even though it's a template library?"
It turns out the Eigen INSTALL notes say the following.
Eigen consists only of header files, hence there is nothing to compile before you can use it. Moreover, these header files do not depend on your platform, they are the same for everybody.
(In other words, Eigen is header-only, so there is nothing to compile before using it, and the same header files work on every platform.)
$ python3
Python 3.9.1 (v3.9.1:1e5d33e9b9, Dec  7 2020, 12:10:52)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from gensim.models.keyedvectors import KeyedVectors
>>> model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
>>> model.most_similar(positive=['japanese'])
[('japan', 0.6607722043991089), ('chinese', 0.6502295732498169), ('Japanese', 0.6149078607559204), ('korean', 0.6051568984985352), ('german', 0.5999272465705872), ('american', 0.5906798243522644), ('asian', 0.5839767456054688), ('san', 0.5834755897521973), ('jap', 0.5764404535293579), ('swedish', 0.5720360279083252)]
>>> sent1 = 'But other sources close to the sale said Vivendi was keeping the door open to further bids and hoped to see bidders interested in individual assets team up.'.split()
>>> sent2 = 'But other sources close to the sale said Vivendi was keeping the door open for further bids in the next day or two.'.split()
>>> distance = model.wmdistance(sent1, sent2)
>>> print(distance)
0.8738126733213613
>>> d0 = 'The President greets the press in Chicago.'.split()
>>> d1 = 'Obama speaks to the media in Illinois.'.split()
>>> d2 = 'The band gave a concert in Japan.'.split()
>>> dist01 = model.wmdistance(d0, d1)
>>> dist02 = model.wmdistance(d0, d2)
>>> print(dist01)
1.968269276774589
>>> print(dist02)
2.2929445610887638
>>> quit()
$
In statistics, the earth mover's distance (EMD) is a measure of the distance between two probability distributions over a region D.
In mathematics, this is known as the Wasserstein metric.
Informally, if the distributions are interpreted as two different ways of piling up a certain amount of dirt over the region D,
the EMD is the minimum cost of turning one pile into the other;
where the cost is assumed to be the amount of dirt moved times the distance by which it is moved.
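To make the "piles of dirt" picture concrete, here is a minimal sketch for the one-dimensional case only. It assumes both distributions are histograms over the same equally spaced bins and hold the same total mass; under those assumptions the minimum cost can be found by sweeping left to right and summing how much dirt has to be carried across each bin boundary. The function name `emd1d` and the `binWidth` parameter are made up for this illustration and do not come from any library.

```javascript
// Earth mover's distance between two histograms p and q, 1-D case only.
// Assumptions (hypothetical helper, not a library API):
//   - p and q are defined over the same equally spaced bins
//   - p and q have the same total mass
function emd1d(p, q, binWidth = 1) {
  if (p.length !== q.length) {
    throw new Error('histograms must have the same number of bins');
  }
  let carried = 0; // surplus dirt carried to the next bin (negative: deficit)
  let cost = 0;
  for (let i = 0; i < p.length; i++) {
    carried += p[i] - q[i];
    // Whatever is still carried must cross one bin boundary.
    cost += Math.abs(carried) * binWidth;
  }
  return cost;
}

// One unit of dirt moved across two bins: cost 1 * 2 = 2
console.log(emd1d([1, 0, 0], [0, 0, 1]));
```

In higher dimensions, or for the word vectors used by `wmdistance`, no such shortcut applies and the general transportation problem has to be solved instead.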
The transportation problem was first formulated by F. L. Hitchcock in 1941;
he also gave a computational procedure, much akin to general simplex method, for solving the problem.
Independently, during World War II, T. C. Koopmans arrived at the same problem in connection with his work as a member of the Joint Shipping Board.
The Problem is thus frequently referred to as the Hitchcock-Koopmans problem.
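(In short: Hitchcock first formulated the transportation problem in 1941 and gave a procedure close to the general simplex method for solving it.
Independently, during World War II, Koopmans arrived at the same problem through his work as a member of the Anglo-American joint shipping board.
Because of this history, the problem is often called the Hitchcock-Koopmans problem.)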
In 1939 a linear programming formulation of a problem
that is equivalent to the general linear programming problem was given by the Soviet mathematician and economist Leonid Kantorovich,
who also proposed a method for solving it.
It is a way he developed, during World War II,
to plan expenditures and returns in order to reduce costs of the army and to increase losses imposed on the enemy.
Kantorovich's work was initially neglected in the USSR.
About the same time as Kantorovich,
the Dutch-American economist T. C. Koopmans formulated classical economic problems as linear programs.
Kantorovich and Koopmans later shared the 1975 Nobel prize in economics.
In 1941, Frank Lauren Hitchcock also formulated transportation problems as linear programs
and gave a solution very similar to the later simplex method.
Hitchcock had died in 1957 and the Nobel prize is not awarded posthumously.
The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel 1975
was awarded jointly to Leonid Vitaliyevich Kantorovich and Tjalling C. Koopmans
"for their contributions to the theory of optimum allocation of resources."
We investigate the properties of a metric between two distributions,
the Earth Mover’s Distance (EMD), for content-based image retrieval.
The EMD is based on the minimal cost
that must be paid to transform one distribution into the other, in a precise sense,
and was first proposed for certain vision problems by Peleg, Werman, and Rom.
For image retrieval, we combine this idea with a representation scheme for distributions that is based on vector quantization.
This combination leads to an image comparison framework that often accounts for perceptual similarity better than other previously proposed methods.
The EMD is based on a solution to the transportation problem from linear optimization,
for which efficient algorithms are available, and also allows naturally for partial matching.
It is more robust than histogram matching techniques,
in that it can operate on variable-length representations of the distributions that avoid quantization and other binning problems typical of histograms.
When used to compare distributions with the same overall mass, the EMD is a true metric.
In this paper we focus on applications to color and texture, and we compare the retrieval performance of the EMD with that of other distances.
We give the name of Earth Mover’s Distance (EMD), suggested by Stolfi (1994), to this metric in this new context.
The transportation problem is to find the minimal cost that must be paid to transform one distribution into the other.
The EMD is based on a solution to the transportation problem for which efficient algorithms are available,
and it has many desirable properties for image retrieval, as we will see.
const fs = require('fs');
var buf = fs.readFileSync('entity_vector/entity_vector.model.txt', 'utf8');
Running this with Node.js gives...
$ node a.js
buffer.js:608
    slice: (buf, start, end) => buf.utf8Slice(start, end),
                                    ^

Error: Cannot create a string longer than 0x1fffffe8 characters
    at Object.slice (buffer.js:608:37)
    at Buffer.toString (buffer.js:805:14)
    at Object.readFileSync (fs.js:421:41)
    at Object.<anonymous> (/Users/fujita/xtr/BookBot/WikiEntVec/a.js:2:14)
    at Module._compile (internal/modules/cjs/loader.js:1063:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
    at Module.load (internal/modules/cjs/loader.js:928:32)
    at Function.Module._load (internal/modules/cjs/loader.js:769:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:72:12)
    at internal/main/run_main_module.js:17:47 {
  code: 'ERR_STRING_TOO_LONG'
}
$
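The error says the file cannot be turned into one single string, so one possible workaround is to never build that string and to read the file in chunks instead. The following is only a sketch. `forEachLineSync` and `loadVectors` are names invented here, and the parsing assumes the word2vec text layout (a header line followed by lines of the form `word v1 v2 ...`), which I have not verified for this particular file.

```javascript
const fs = require('fs');
const { StringDecoder } = require('string_decoder');

// Call onLine(line) for every line of a UTF-8 text file, reading it in
// fixed-size chunks so the whole file never becomes one huge string.
function forEachLineSync(path, onLine, chunkSize = 1 << 20) {
  const fd = fs.openSync(path, 'r');
  const decoder = new StringDecoder('utf8'); // holds back split multi-byte characters
  const chunk = Buffer.alloc(chunkSize);
  let rest = ''; // unfinished last line of the previous chunk
  try {
    let n;
    while ((n = fs.readSync(fd, chunk, 0, chunkSize, null)) > 0) {
      const lines = (rest + decoder.write(chunk.subarray(0, n))).split('\n');
      rest = lines.pop();
      for (const line of lines) onLine(line);
    }
    rest += decoder.end();
    if (rest.length > 0) onLine(rest);
  } finally {
    fs.closeSync(fd);
  }
}

// Hypothetical loader. Assumption: the first line is a header and every
// other line is "word v1 v2 ..." separated by single spaces.
function loadVectors(path) {
  const vectors = new Map();
  let isHeader = true;
  forEachLineSync(path, (line) => {
    if (isHeader) { isHeader = false; return; }
    const parts = line.trim().split(' ');
    if (parts.length < 2) return; // skip blank or malformed lines
    vectors.set(parts[0], Float32Array.from(parts.slice(1), Number));
  });
  return vectors;
}
```

Storing each vector as a `Float32Array` rather than an array of strings also keeps memory use down, although a model of this size may still need a larger heap.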
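// Euclidean norm (length) of vector m.
function Absolute(m) {
  var ret = 0;
  for (var i = 0; i < m.length; i++) {
    ret += m[i] * m[i];
  }
  return Math.sqrt(ret);
}
// Dot product of m1 and m2 (assumed to have the same length).
function DotProduct(m1, m2) {
  var ret = 0;
  for (var i = 0; i < m1.length; i++) {
    ret += m1[i] * m2[i];
  }
  return ret;
}
// Cosine similarity from a precomputed dot product and the two norms.
function CosineSimilarity(dot, nrm1, nrm2) {
  return dot / (nrm1 * nrm2);
}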
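For reference, the three steps above (two norms, one dot product, one division) can also be folded into a single pass over the vectors. The helper below is a hypothetical variant written for illustration, not part of the original code. It adds a length check, and returning 0 for a zero vector is simply a convention chosen here to avoid dividing by zero.

```javascript
// Hypothetical helper: cosine similarity of two equal-length numeric
// vectors, accumulating the dot product and both squared norms in one loop.
function cosineOf(a, b) {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length');
  }
  let dot = 0;
  let sqA = 0;
  let sqB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    sqA += a[i] * a[i];
    sqB += b[i] * b[i];
  }
  if (sqA === 0 || sqB === 0) return 0; // convention for zero vectors
  return dot / (Math.sqrt(sqA) * Math.sqrt(sqB));
}

console.log(cosineOf([3, 4], [3, 4])); // same direction: 1
console.log(cosineOf([1, 0], [0, 1])); // orthogonal: 0
```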