build program
First, tokenize the input (not necessary if the file is already tokenized):
% cat test
repel the monkey.
% token <test >test.t
% cat test.t
repel
the
monkey
<PERIOD>
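As a rough illustration of what the tokenizing step produces, here is a minimal Python sketch. It is not the `token` program itself: the function name `tokenize` and the regular expression are assumptions, and only the behavior visible in the transcript above (one token per line, "." mapped to <PERIOD>) is modeled. How `token` treats other punctuation or letter case is not shown in these notes.

```python
import re

def tokenize(text):
    """Return a list of tokens; a literal '.' becomes '<PERIOD>'.

    Hypothetical stand-in for the `token` program: only the period
    mapping is taken from the transcript, the rest is an assumption.
    """
    tokens = []
    # Match runs of letters/digits/apostrophes, or a single period.
    for piece in re.findall(r"[A-Za-z0-9']+|\.", text):
        tokens.append("<PERIOD>" if piece == "." else piece)
    return tokens

# One token per line, as in test.t above.
print("\n".join(tokenize("repel the monkey.")))
```

Run on the contents of `test`, this prints the same four lines as `test.t`.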
Then build the matrix, specifying a vocabulary file. The options are:
-y file      row file
-v file      column file
-l number    limit each row to that number of occurrences
-w number    window size
% build -v /Data/corp/vocab <test.t
vocab is 69718 words.
mapping 0 from file 4, offset = 1695744.
unloadSparse: can’t unload a mapped matrix.
mapping cmat…
mapping 0 from file 5, offset = 1695744.
emptying the spill matrix…6 elements into 0 elements.
crab merge done; wound up at 4018089984 (should be 4018089984).
The matrix winds up in /Data/tmp/matrix. You can look at it with ‘showmat’:
% showmat /Data/tmp/matrix
          monkey   repel     the
<PERIOD    10.00    8.00    9.00
monkey      0.00    9.00   10.00
the         0.00   10.00    0.00
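The values in the showmat output above are consistent with a distance-weighted window over preceding words: each row word scores the words before it, with weight 10 for the immediately preceding word, 9 for the one before that, and so on. This is an inference from the example, not documented behavior of `build`. The Python sketch below reproduces the numbers under that assumption; the function name `cooccurrence`, the default window of 10, and the weight formula `window - distance + 1` are all hypothetical.

```python
from collections import defaultdict

def cooccurrence(tokens, window=10):
    """Weighted counts of preceding words for each row word.

    Assumed scheme (inferred from the example, not from `build`):
    a word `distance` positions back adds window - distance + 1.
    """
    matrix = defaultdict(lambda: defaultdict(float))
    for i, row_word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break  # ran off the start of the input
            matrix[row_word][tokens[j]] += window - distance + 1
    return matrix

tokens = ["repel", "the", "monkey", "<PERIOD>"]
matrix = cooccurrence(tokens)

# Print rows and columns in sorted order, with empty cells as 0.00.
columns = sorted({col for row in matrix.values() for col in row})
print(" " * 10 + "".join(f"{c:>8}" for c in columns))
for row_word in sorted(matrix):
    cells = "".join(f"{matrix[row_word].get(c, 0.0):8.2f}" for c in columns)
    print(f"{row_word:<10}{cells}")
```

Under these assumptions "repel" gets no row because nothing precedes it, which matches its absence from the showmat listing.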