Scripts in this folder help you create a dictionary from documents and generate data in the same lda-style format as the example given in the Petuum code.
Use documents in
data/bing_sample_big.txt as an example.
The raw documents should be stored in the form:
doc_title (could be a URL) followed by doc_content
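As a minimal sketch of reading that format, assuming one document per line with the title and content separated by a tab (the actual separator used by the raw files may differ):

```python
# Hypothetical parser for raw document lines.
# Assumption: title and content are tab-separated; adjust if your data differs.

def parse_doc(line):
    """Split a raw document line into (title, content)."""
    title, _, content = line.rstrip("\n").partition("\t")
    return title, content

doc = "http://example.com/page\tthe quick brown fox"
title, content = parse_doc(doc)
```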
How to Generate Dictionary
First, give each doc a unique document id. This creates
data/bing_sample_big.txt.with_doc_id.
Then it generates
data/bing_sample_big.txt.tf_df.sorted_tf and
data/bing_sample_big.txt.tf_df.sorted_df. The first sorts words in the
documents by tf (term frequency) and the second by df (document frequency).
Use those two to select words to create your dictionary.
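The tf/df counts behind those two files can be sketched as follows; this is an illustration of the counting, not the script's actual implementation:

```python
# Sketch: count term frequency (tf, total occurrences across all documents)
# and document frequency (df, number of documents containing the word).
from collections import Counter

def tf_df(docs):
    tf, df = Counter(), Counter()
    for doc in docs:
        words = doc.split()
        tf.update(words)        # every occurrence counts toward tf
        df.update(set(words))   # each doc counts at most once toward df
    return tf, df

docs = ["a b a", "b c"]
tf, df = tf_df(docs)
# Sorting by tf or df descending gives the two sorted views of the vocabulary.
sorted_by_tf = sorted(tf.items(), key=lambda kv: -kv[1])
```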
After you have selected some words from either of those two files, the rest of
the procedure is the same. I will use
data/bing_sample_big.txt.tf_df.sorted_df as an example.
Imagine that I only used the top 100000 words in
data/bing_sample_big.txt.tf_df.sorted_df as the dictionary and saved those
100000 lines in a file called
data/bing_sample_big_tf_df.dict. To remove the
tf:df appended to each word, execute the following command:
python src/remove_tf_df.py data/bing_sample_big_tf_df.dict
After this you will get the file
data/bing_sample_big.dict, which is your dictionary.
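What src/remove_tf_df.py presumably does can be sketched as below; the exact column layout of each line is an assumption (here: the word first, then the tf:df counts, whitespace-separated):

```python
# Hedged sketch: keep only the word from each "word tf:df" dictionary line.
# Assumption: fields are whitespace-separated and the word comes first.

def strip_tf_df(line):
    """Return just the word, dropping the tf:df annotation."""
    return line.split()[0]

word = strip_tf_df("hello 120:37")
```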
How to Create lda-style data
To produce lda-style data, execute the following bash script:
scripts/lda_data_preprocess.sh data/bing_sample_big.txt.with_doc_id data/bing_sample_big.dict
data/bing_sample_big.txt.with_doc_id.lda_data is the data you want.
data/bing_sample_big.txt.with_doc_id.map is the word_id map file; you can look
up the id assigned to each word there.
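One way to picture the conversion is the sketch below. It assumes each lda-style line is a doc id followed by word_id:count pairs; verify this against the actual Petuum example data, since the exact format is an assumption here.

```python
# Hedged sketch: turn one document into an lda-style line of the assumed form
# "doc_id word_id:count ...", dropping words that are not in the dictionary.
from collections import Counter

def to_lda_line(doc_id, words, word_id):
    counts = Counter(w for w in words if w in word_id)
    pairs = [f"{word_id[w]}:{c}" for w, c in sorted(counts.items())]
    return " ".join([str(doc_id)] + pairs)

word_id = {"apple": 0, "banana": 1}
line = to_lda_line(0, ["apple", "banana", "apple", "kiwi"], word_id)
# "0 0:2 1:1"  (kiwi is discarded because it is not in the dictionary)
```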
In a Petuum lda-style data file, the first line is the number of documents. Get
the doc id in the last line of
data/bing_sample_big.txt.with_doc_id and add
one, since we assign doc ids incrementally from zero. Put that number in the
first line and the file format will match.
TODO: Deal with large scale
A script that uses Hadoop to preprocess data at large scale is future work.