YWW Tools

A package of my (Weiwei Yang's) various tools (mostly for NLP). Feel free to email me at [email protected] with any questions.

Check Out

git clone [email protected]:ywwbill/YWWTools.git

Dependencies

  • Java 8.
  • Maven 3.3.
  • Files in lib/.

Build

In the package root directory, run the command

mvn package

You will get YWWTools-1.jar and deps.jar in target/.

For convenience, I assume you

  • copy YWWTools-1.jar and deps.jar to package root directory
  • rename YWWTools-1.jar to YWWTools.jar

Use YWW Tools in Command Line

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool <tool-name> --arg-1 <arg-1-value> --arg-2 <arg-2-value> ... --arg-n <arg-n-value>
  • Windows users please replace YWWTools.jar:deps.jar with YWWTools.jar;deps.jar.
  • Supported <tool-name> values (case-insensitive) include
    • LDA: Latent Dirichlet allocation. Include a variety of extensions.
    • TLDA: Tree LDA.
    • WSBM: Weighted stochastic block model. Find blocks in a network.
    • SCC: Strongly connected components.
    • Stoplist: Remove stop words. Support English only, but can support other languages given dictionary.
    • Lemmatizer: Lemmatize POS-tagged corpus. Support English only, but can support other languages given dictionary.
    • POS-Tagger: Tag words' POS. Support English only, but can support other languages given trained models.
    • Stemmer: Stem words. Support English only.
    • Tokenizer: Tokenize corpus. Support English only, but can support other languages given trained models.
    • Corpus-Converter: Convert word corpus into indexed corpus (for LDA) and vice versa.
    • Tree Builder: Build tree priors from word associations.
  • You can always use --help to see help information of
    • supported tool names if you don't specify a tool name

       java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --help
      
    • a specific tool if you specify it (take LDA as an example)

       java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool LDA --help
      
  • In the following command examples, arguments in {} (e.g. {cmd-1|cmd-2|cmd-3}) denote that one and only one of them should be declared.
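
The classpath separator difference noted above (":" versus ";") can be handled programmatically when the tools are launched from another Java program. The following is only a sketch: the class name `ToolCommand` and its `build` method are hypothetical helpers, not part of YWW Tools, and the jar names assume the renaming convention described in the Build section.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of YWW Tools): assembles the command line shown above.
final class ToolCommand {
    // File.pathSeparator is ":" on Unix-like systems and ";" on Windows,
    // so the same code yields the right classpath on both.
    static List<String> build(String toolName, String... toolArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add("YWWTools.jar" + File.pathSeparator + "deps.jar");
        cmd.add("yang.weiwei.Tools");
        cmd.add("--tool");
        cmd.add(toolName);
        cmd.addAll(Arrays.asList(toolArgs));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("lda", "--help");
        System.out.println(String.join(" ", cmd));
        // To actually launch it, one option is:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The list form is convenient because it can be passed directly to `ProcessBuilder` without any shell quoting.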

LDA (Latent Dirichlet Allocation) in Command Line

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model lda --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file>
  • Implementation of (Blei et al., 2003).
  • Required arguments
    • <vocab-file>: Vocabulary file. Each line contains a unique word.

    • <corpus-file>: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format

       <doc-len> <word-type-1>:<frequency-1> <word-type-2>:<frequency-2> ... <word-type-n>:<frequency-n>
      

      <doc-len> is the total number of tokens in this document. <word-type-i> denotes the i-th word in <vocab-file>, starting from 0. Words with zero frequency can be omitted.

    • <model-file>: Trained model file in JSON format. Read and written by program.

  • Optional arguments
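    • --model <model-name>: The topic model you want to use (default: LDA). Supported <model-name> values (case-insensitive) are lda, rtm, lex-wsb-rtm, lex-wsb-med-rtm, slda, bs-lda, lex-wsb-bs-lda, lex-wsb-med-lda, bp-lda, st-lda, and wsb-tm, each described in its own section below.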
    • --test: Use the model for test (default: false).
    • --no-verbose: Stop printing log to console.
    • --alpha <alpha-value>: Parameter of Dirichlet prior of document distribution over topics (default: 1.0). Must be a positive real number.
    • --beta <beta-value>: Parameter of Dirichlet prior of topic distribution over words (default: 0.1). Must be a positive real number.
    • --topics <num-topics>: Number of topics (default: 10). Must be a positive integer.
    • --iters <num-iters>: Number of iterations (default: 100). Must be a positive integer.
    • --update: Update alpha while sampling (default: false).
    • --update-int <update-interval>: Interval of updating alpha (default: 10). Must be a positive integer.
    • --theta <theta-file>: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.
    • --output-topic <topic-file>: File for showing topics.
    • --topic-count <topic-count-file>: File for document-topic counts.
    • --top-word <num-top-word>: Number of words to give when showing topics (default: 10). Must be a positive integer.
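
To make the <corpus-file> format concrete, here is a minimal sketch of a parser for one line of it. `IndexedDocLine` is a hypothetical class, not the reader used inside YWW Tools; the check that the frequencies sum to <doc-len> is a sanity check implied by the format description rather than documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of YWW Tools): parses one line of an indexed corpus file.
final class IndexedDocLine {
    /** Parses "<doc-len> <word-type>:<frequency> ..." into a word index -> frequency map. */
    static Map<Integer, Integer> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        int docLen = Integer.parseInt(parts[0]);
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (int i = 1; i < parts.length; i++) {
            String[] pair = parts[i].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Bad entry: " + parts[i]);
            }
            int wordIndex = Integer.parseInt(pair[0]); // 0-based position in the vocabulary file
            int frequency = Integer.parseInt(pair[1]);
            counts.merge(wordIndex, frequency, Integer::sum);
            total += frequency;
        }
        // <doc-len> is the total number of tokens, so it should equal the sum of frequencies.
        if (total != docLen) {
            throw new IllegalArgumentException(
                "doc-len " + docLen + " does not match token total " + total);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A 4-token document: word 0 twice, words 3 and 7 once each.
        System.out.println(parse("4 0:2 3:1 7:1"));
    }
}
```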

RTM: Relational Topic Model

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model rtm --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --rtm-train-graph <rtm-train-graph-file>
  • Implementation of (Chang and Blei, 2010).
  • Jointly models topics and document links.
  • Extends LDA.
  • Semi-optional arguments
    • --rtm-train-graph <rtm-train-graph-file> [optional in test]: Link file for RTM to train. Each line contains an edge in the format node-1 \t node-2 \t weight. Node numbers start from 0. weight is optional; if given, it must be either 0 or 1. Its default value is 1 if not specified.
    • --rtm-test-graph <rtm-test-graph-file> [optional in training]: Link file for RTM to evaluate. Can be the same as the RTM training graph. The format is the same as <rtm-train-graph-file>.
  • Optional arguments
    • --nu <nu-value>: Variance of normal priors for weight vectors/matrices in RTM and its extensions (default: 1.0). Must be a positive real number.
    • --plr-int <compute-PLR-interval>: Interval of computing predictive link rank (default: 20). Must be a positive integer.
    • --neg: Sample negative links (default: false).
    • --neg-ratio <neg-ratio>: The ratio of number of negative links to number of positive links (default 1.0). Must be a positive real number.
    • --pred <pred-file>: Predicted document link probability matrix file.
    • --reg <reg-file>: Doc-doc regression value file.
    • --directed: Set all edges directed (default: false).

Lex-WSB-RTM: RTM with Lexical Weights and Weighted Stochastic Block Priors

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model lex-wsb-rtm --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --rtm-train-graph <rtm-train-graph-file>
  • Extends RTM.
  • Optional arguments
    • --wsbm-graph <wsbm-graph-file>: Link file for WSBM to find blocks. See WSBM for details.
    • --alpha-prime <alpha-prime-value>: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
    • -a <a-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
    • -b <b-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
    • -g <gamma-value>: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
    • --blocks <num-blocks>: Number of blocks (default: 10). Must be a positive integer.
    • --output-wsbm <wsbm-output-file>: File for WSBM-identified blocks. See WSBM for details.
    • --block-feature: Include block features in link prediction (default: false).

Lex-WSB-Med-RTM: Lex-WSB-RTM with Hinge Loss

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model lex-wsb-med-rtm --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --rtm-train-graph <rtm-train-graph-file>

SLDA: Supervised LDA

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model slda --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --label <label-file>
  • Implementation of (McAuliffe and Blei, 2008).
  • Jointly models topics and document labels. Support multi-class classification.
  • Extends LDA.
  • Semi-optional arguments
    • --label <label-file> [optional in test]: Label file. Each line contains corresponding document's numeric label. If a document label is not available, leave the corresponding line empty.
  • Optional arguments
    • --sigma <sigma-value>: Variance for the Gaussian generation of response variable in SLDA (default: 1.0). Must be a positive real number.
    • --nu <nu-value>: Variance of normal priors for weight vectors in SLDA and its extensions (default: 1.0). Must be a positive real number.
    • --pred <pred-file>: Predicted label file.
    • --reg <reg-file>: Regression value file.
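
The label file convention (one numeric label per line, with an empty line for a document whose label is unavailable) can be read as sketched below. `LabelFile` is a hypothetical helper, not the package's own reader, and representing a missing label as `null` is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of YWW Tools): reads SLDA-style label lines.
final class LabelFile {
    /** Returns one entry per document; null marks a document without a label. */
    static List<Double> parse(List<String> lines) {
        List<Double> labels = new ArrayList<>();
        for (String line : lines) {
            String s = line.trim();
            labels.add(s.isEmpty() ? null : Double.valueOf(s));
        }
        return labels;
    }

    public static void main(String[] args) {
        // In practice the lines could come from
        // java.nio.file.Files.readAllLines(java.nio.file.Paths.get("label_file_name")).
        List<Double> labels = parse(Arrays.asList("1", "", "0"));
        System.out.println(labels);
    }
}
```

Keeping the empty line as an explicit entry preserves the alignment between line numbers and document numbers.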

BS-LDA: Binary SLDA

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model bs-lda --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --label <label-file>
  • For binary classification only.
  • Extends SLDA.
  • Label is either 1 or 0.

Lex-WSB-BS-LDA: BS-LDA with Lexical Weights and Weighted Stochastic Block Priors

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model lex-wsb-bs-lda --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --label <label-file>
  • Extends BS-LDA.
  • Optional arguments
    • --wsbm-graph <wsbm-graph-file>: Link file for WSBM to find blocks. See WSBM for details.
    • --alpha-prime <alpha-prime-value>: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
    • -a <a-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
    • -b <b-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
    • -g <gamma-value>: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
    • --blocks <num-blocks>: Number of blocks (default: 10). Must be a positive integer.
    • --directed: Set all edges directed (default: false).
    • --output-wsbm <wsbm-output-file>: File for WSBM-identified blocks. See WSBM for details.

Lex-WSB-Med-LDA: Lex-WSB-BS-LDA with Hinge Loss

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model lex-wsb-med-lda --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --label <label-file>

BP-LDA: LDA with Block Priors

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model bp-lda --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --block-graph <block-graph-file>
  • Use priors from pre-computed blocks.
  • Extends LDA.
  • Semi-optional arguments
    • --block-graph <block-graph-file> [optional in test]: Pre-computed block file. Each line contains a block and consists of one or more documents denoted by document numbers. Document numbers are separated by space.
  • Optional arguments
    • --alpha-prime <alpha-prime-value>: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
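
As an illustration of the <block-graph-file> format, the sketch below turns block lines into a document-to-block lookup. `BlockFile` is a hypothetical helper and not how BP-LDA reads blocks internally; using the line number as the block id, and -1 for documents that appear in no block, are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of YWW Tools): maps each document to its block.
final class BlockFile {
    /** Line i lists the documents of block i; returns blockOf[doc], or -1 if unassigned. */
    static int[] docToBlock(List<String> lines, int numDocs) {
        int[] blockOf = new int[numDocs];
        Arrays.fill(blockOf, -1);
        for (int block = 0; block < lines.size(); block++) {
            String line = lines.get(block).trim();
            if (line.isEmpty()) {
                continue; // an empty line contributes no documents
            }
            for (String token : line.split("\\s+")) {
                blockOf[Integer.parseInt(token)] = block;
            }
        }
        return blockOf;
    }

    public static void main(String[] args) {
        // Two blocks over six documents; document 5 belongs to no block.
        int[] blockOf = docToBlock(Arrays.asList("0 2", "1 3 4"), 6);
        System.out.println(Arrays.toString(blockOf));
    }
}
```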

ST-LDA: Single Topic LDA

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model st-lda --vocab <vocab-file> --corpus <corpus-file> --short-corpus <short-corpus-file> --trained-model <model-file>
  • Implementation of (Hong et al., 2016).
  • Each document can only be assigned to one topic.
  • Extends LDA.
  • Semi-optional arguments
    • --short-corpus <short-corpus-file> [at least one of --short-corpus and --corpus should be specified]: Short corpus file.
  • Optional arguments
    • --short-theta <short-theta-file>: Short documents' background topic distribution file.
    • --short-topic-assign <short-topic-assign-file>: Short documents' topic assignment file.

WSB-TM: Weighted Stochastic Block Topic Model

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model wsb-tm --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file> --wsbm-graph <wsbm-graph-file>
  • Use priors from WSBM-computed blocks.
  • Extends LDA.
  • Semi-optional arguments
    • --wsbm-graph <wsbm-graph-file> [optional in test]: Link file for WSBM to find blocks. See WSBM for details.
  • Optional arguments
    • --alpha-prime <alpha-prime-value>: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
    • -a <a-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
    • -b <b-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
    • -g <gamma-value>: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
    • --blocks <num-blocks>: Number of blocks (default: 10). Must be a positive integer.
    • --directed: Set all edges directed (default: false).
    • --output-wsbm <wsbm-output-file>: File for WSBM-identified blocks. See WSBM for details.

tLDA in Command Line

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool tlda --vocab <vocab-file> --tree <tree-prior-file> --corpus <corpus-file> --trained-model <model-file>
  • Implementation of (Boyd-Graber et al., 2007).
  • Required arguments
    • <vocab-file>: Vocabulary file. Each line contains a unique word.

    • <tree-prior-file>: Tree prior file. Generated by Tree Builder.

    • <corpus-file>: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format

       <doc-len> <word-type-1>:<frequency-1> <word-type-2>:<frequency-2> ... <word-type-n>:<frequency-n>
      

      <doc-len> is the total number of tokens in this document. <word-type-i> denotes the i-th word in <vocab-file>, starting from 0. Words with zero frequency can be omitted.

    • <model-file>: Trained model file. Read and written by program.

  • Optional arguments
    • --test: Use the model for test (default: false).
    • --no-verbose: Stop printing log to console.
    • --alpha <alpha-value>: Parameter of Dirichlet prior of document distribution over topics (default: 0.01). Must be a positive real number.
    • --beta <beta-value>: Parameter of Dirichlet prior of topic distribution over words (default: 0.01). Must be a positive real number.
    • --topics <num-topics>: Number of topics (default: 10). Must be a positive integer.
    • --iters <num-iters>: Number of iterations (default: 100). Must be a positive integer.
    • --update: Update alpha while sampling (default: false).
    • --update-int <update-interval>: Interval of updating alpha (default: 10). Must be a positive integer.
    • --theta <theta-file>: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.
    • --output-topic <topic-file>: File for showing topics.
    • --topic-count <topic-count-file>: File for document-topic counts.
    • --top-word <num-top-word>: Number of words to give when showing topics (default: 10). Must be a positive integer.
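
A file written with --theta holds one space-separated topic distribution per document, so downstream code often only needs to parse a line and pick the dominant topic. The sketch below shows one way to do that; `ThetaLine` is a hypothetical helper, and numbering topics from 0 in column order is an assumption.

```java
// Hypothetical helper (not part of YWW Tools): reads one line of a theta file.
final class ThetaLine {
    /** Parses space-separated topic weights for one document. */
    static double[] parse(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] weights = new double[parts.length];
        for (int k = 0; k < parts.length; k++) {
            weights[k] = Double.parseDouble(parts[k]);
        }
        return weights;
    }

    /** Returns the 0-based column of the largest weight (first one wins on ties). */
    static int argMax(double[] weights) {
        int best = 0;
        for (int k = 1; k < weights.length; k++) {
            if (weights[k] > weights[best]) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] theta = parse("0.1 0.7 0.2");
        System.out.println("dominant topic column: " + argMax(theta));
    }
}
```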

Other Tools in Command Line

WSBM: Weighted Stochastic Block Model

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool wsbm --nodes <num-nodes> --blocks <num-blocks> --graph <graph-file> --output <output-file>
  • Implementation of (Aicher et al., 2014).
  • Find latent blocks in a network, such that nodes in the same block are densely connected and nodes in different blocks are sparsely connected.
  • Required arguments
    • <num-nodes>: Number of nodes in the graph. Must be a positive integer.
    • <num-blocks>: Number of blocks. Must be a positive integer.
    • <graph-file>: Graph file. Each line contains an edge in the format node-1 \t node-2 \t weight. Node number starts from 0. weight must be a non-negative integer. weight is optional. Its default value is 1 if not specified.
    • <output-file>: Result file. The i-th line contains the block assignment of i-th node.
  • Optional arguments
    • --directed: Set the edges as directed (default: undirected).
    • -a <a-value>: Parameter for edge rates' Gamma prior (default: 1.0). Must be a positive real number.
    • -b <b-value>: Parameter for edge rates' Gamma prior (default: 1.0). Must be a positive real number.
    • -g <gamma-value>: Parameter for block distribution's Dirichlet prior (default: 1.0). Must be a positive real number.
    • --iters <num-iters>: Number of iterations (default: 100). Must be a positive integer.
    • --no-verbose: Stop printing log to console.
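
The <graph-file> edge format, including the optional weight that defaults to 1, can be parsed as in the following sketch. `GraphEdge` is a hypothetical class written for illustration; it is not the representation WSBM uses internally.

```java
// Hypothetical helper (not part of YWW Tools): one edge of a tab-separated graph file.
final class GraphEdge {
    final int source;
    final int target;
    final int weight;

    GraphEdge(int source, int target, int weight) {
        this.source = source;
        this.target = target;
        this.weight = weight;
    }

    /** Parses "node-1 \t node-2 [\t weight]"; a missing weight defaults to 1. */
    static GraphEdge parse(String line) {
        String[] fields = line.trim().split("\t");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 tab-separated fields: " + line);
        }
        int source = Integer.parseInt(fields[0].trim());
        int target = Integer.parseInt(fields[1].trim());
        int weight = fields.length == 3 ? Integer.parseInt(fields[2].trim()) : 1;
        if (weight < 0) {
            throw new IllegalArgumentException("weight must be non-negative: " + weight);
        }
        return new GraphEdge(source, target, weight);
    }

    public static void main(String[] args) {
        GraphEdge e = parse("0\t3");
        System.out.println(e.source + " -> " + e.target + " (weight " + e.weight + ")");
    }
}
```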

SCC: Strongly Connected Components

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool scc --nodes <num-nodes> --graph <graph-file> --output <output-file>
  • New implementation.
  • Find strongly connected components in an undirected graph. In each component, every node is reachable from every other node in the same component.
  • Arguments
    • <num-nodes>: Number of nodes in the graph. Must be a positive integer.
    • <graph-file>: Graph file. Each line contains an edge in the format node-1 \t node-2. Node number starts from 0.
    • <output-file>: Result file. Each line contains a strongly connected component and consists of one or more nodes denoted by node numbers. Node numbers are separated by space.
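
The algorithm inside the package's SCC tool is not described here, so the following is not its implementation. It is a small union-find sketch that relies on one fact: when edges are undirected, mutual reachability reduces to ordinary connected components. The class name `Components` is hypothetical; the output lines follow the <output-file> format described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the package's SCC class): connected components via union-find.
final class Components {
    private final int[] parent;

    Components(int numNodes) {
        parent = new int[numNodes];
        for (int i = 0; i < numNodes; i++) {
            parent[i] = i; // every node starts in its own component
        }
    }

    private int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving keeps trees shallow
            x = parent[x];
        }
        return x;
    }

    /** Adds an undirected edge, merging the two endpoints' components. */
    void addEdge(int a, int b) {
        int rootA = find(a);
        int rootB = find(b);
        if (rootA != rootB) {
            parent[rootA] = rootB;
        }
    }

    /** One line per component, node numbers separated by a space. */
    List<String> toLines() {
        Map<Integer, StringBuilder> groups = new LinkedHashMap<>();
        for (int node = 0; node < parent.length; node++) {
            StringBuilder sb = groups.computeIfAbsent(find(node), k -> new StringBuilder());
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(node);
        }
        List<String> lines = new ArrayList<>();
        for (StringBuilder sb : groups.values()) {
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        Components c = new Components(5);
        c.addEdge(0, 1);
        c.addEdge(1, 2);
        c.addEdge(3, 4);
        for (String line : c.toLines()) {
            System.out.println(line);
        }
    }
}
```

For a directed graph this shortcut does not apply, and an algorithm such as Tarjan's or Kosaraju's would be needed instead.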

Stoplist

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool stoplist --corpus <corpus-file> --output <output-file>
  • New implementation.
  • Only supports English, but can support other languages if a dictionary is provided.
  • Required arguments
    • <corpus-file>: Corpus file with stop words. Each line contains a document. Words are separated by space.
    • <output-file>: Corpus file without stop words. Each line contains a document. Words are separated by space.
  • Optional arguments
    • --dict <dict-file>: Dictionary file name. Each line contains a stop word.

Lemmatizer

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lemmatizer --corpus <corpus-file> --output <output-file>
  • A re-packaging of opennlp.tools.lemmatizer.SimpleLemmatizer.
  • Only supports English, but can support other languages if a dictionary is provided.
  • Required arguments
    • <corpus-file>: Unlemmatized corpus file. Each line contains an unlemmatized, tokenized, and POS-tagged document.
    • <output-file>: Lemmatized corpus file. Each line contains a lemmatized document. Words are separated by space.
  • Optional arguments
    • --dict <dict-file>: Dictionary file name. Each line contains a rule in the format unlemmatized-word \t POS \t lemmatized-word.

POS Tagger

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool pos-tagger --corpus <corpus-file> --output <output-file>
  • A re-packaging of opennlp.tools.postag.POSTaggerME (https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.postagger)
  • Only supports English, but can support other languages if a model is provided.
  • Required arguments
    • <corpus-file>: Untagged corpus file. Each line contains a tokenized untagged document.
    • <output-file>: Tagged corpus file. Each line contains a tagged document. Each word is annotated as word_POS.
  • Optional arguments
    • --model <model-file>: Model file name.

Stemmer

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool stemmer --corpus <corpus-file> --output <output-file>
  • A re-packaging of PorterStemmer (http://tartarus.org/~martin/PorterStemmer/index.html)
  • Only supports English.
  • Arguments
    • <corpus-file>: Unstemmed corpus file. Each line contains an unstemmed document. Words are separated by space.
    • <output-file>: Stemmed corpus file. Each line contains a stemmed document. Words are separated by space.

Tokenizer

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool tokenizer --corpus <corpus-file> --output <output-file>
  • A re-packaging of opennlp.tools.tokenize.TokenizerME (https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.tokenizer)
  • Only supports English, but can support other languages if a model is provided.
  • Required arguments
    • <corpus-file>: Untokenized corpus file. Each line contains an untokenized document.
    • <output-file>: Tokenized corpus file. Each line contains a tokenized document.
  • Optional arguments
    • --model <model-file>: Model file name.

Corpus Converter

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool corpus-converter {--get-vocab|--to-index|--to-word} --word-corpus <word-corpus-file> --index-corpus <index-corpus-file> --vocab <vocab-file>
  • New implementation
  • Arguments
    • --get-vocab, --to-index, --to-word: Exactly one of the three must be selected.

      • --get-vocab: Collect vocabulary from <word-corpus-file> and write them in <vocab-file>.
      • --to-index: Convert a word corpus file <word-corpus-file> into an indexed corpus file <index-corpus-file> and write the vocabulary in <vocab-file>.
      • --to-word: Convert an indexed corpus file <index-corpus-file> into a word corpus file <word-corpus-file> given vocabulary file <vocab-file>.
    • <word-corpus-file>: Corpus file in which documents are represented by words. Each line contains a document. Words are separated by space.

    • <index-corpus-file>: Corpus file in which documents are represented by word indexes and frequencies. Not required when using --get-vocab. Each line contains a document in the following format

       <doc-len> <word-type-1>:<frequency-1> <word-type-2>:<frequency-2> ... <word-type-n>:<frequency-n>
      

      <doc-len> is the total number of tokens in this document. <word-type-i> denotes the i-th word in <vocab-file>, starting from 0. Words with zero frequency can be omitted.

    • <vocab-file>: Vocabulary file. Each line contains a unique word.
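
The --to-index conversion can be pictured with the sketch below, which turns one word document into the indexed format while growing a vocabulary. `ToIndex` is a hypothetical class, not the Corpus Converter's code; assigning indexes in order of first appearance and listing entries in ascending index order are both assumptions, and the real tool may order the vocabulary differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch (not the package's Corpus Converter): word document -> indexed line.
final class ToIndex {
    /** Converts one document and adds unseen words to vocab (word -> 0-based index). */
    static String convert(String document, Map<String, Integer> vocab) {
        Map<Integer, Integer> counts = new TreeMap<>(); // sorted by word index
        int docLen = 0;
        for (String word : document.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Integer index = vocab.get(word);
            if (index == null) {
                index = vocab.size(); // assumption: next free index, in order of first appearance
                vocab.put(word, index);
            }
            counts.merge(index, 1, Integer::sum);
            docLen++;
        }
        StringBuilder line = new StringBuilder().append(docLen);
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            line.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        System.out.println(convert("a b a c", vocab));
        System.out.println(convert("c d", vocab));
        // Writing vocab.keySet() one word per line would give the matching vocabulary file.
    }
}
```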

Tree Builder

java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool tree-builder --vocab <vocab-file> --score <score-file> --tree <tree-file>
  • Implementation of (Yang et al., 2017)
  • Arguments
    • <vocab-file>: Vocabulary file. Each line contains a unique word.
    • <score-file>: Word association file. Assume there are V words in <vocab-file>. There are V lines in the <score-file>. Each line corresponds to a word in the vocabulary and contains V float numbers which denote the word's association scores with all other words.
    • <tree-file>: The tree prior file.
  • Optional Arguments
    • --type <tree-type>: Tree prior type. 1 for two-level tree; 2 for hierarchical agglomerative clustering (HAC) tree; 3 for HAC tree with leaf duplication (default 1).
    • --child <num-child>: Number of child nodes per internal node for a two-level tree (default 10).
    • --thresh <threshold>: The confidence threshold for HAC (default 0.0).

Use YWWTools Source Code

To integrate my code into your project, please copy lib/ to your project root directory and

  • include YWWTools.jar and deps.jar in your project build path, OR
  • add the dependencies (in pom.xml) to your own dependency list

Here are examples for running some algorithms in this package. For more information, please look at JavaDoc in doc/.

LDA Code Examples

  • Classes: yang.weiwei.lda.LDA and yang.weiwei.lda.LDAParam.

  • Training code example

      LDAParam param = new LDAParam("vocab_file_name"); //initialize a parameter object and set parameters as needed
      LDA ldaTrain = new LDA(param); // initialize an LDA object
      ldaTrain.readCorpus("corpus_file_name");
      ldaTrain.initialize();
      ldaTrain.sample(100); // set number of iterations as needed
      ldaTrain.writeModel("model_file_name"); // optional, see test code example
      ldaTrain.writeDocTopicDist("theta_file_name"); // optional, write document-topic distribution to file
      ldaTrain.writeResult("topic_file_name", 10); // optional, write top 10 words of each topic to file
      ldaTrain.writeDocTopicCounts("topic_count_file_name"); // optional, write document-topic counts to file
    
  • Test code example

      LDAParam param = new LDAParam("vocab_file_name");
      LDA ldaTest = new LDA(ldaTrain, param); // initialize with pre-trained LDA object
      // LDA ldaTest = new LDA("model_file_name", param); // or initialize with an LDA model in a file
      ldaTest.readCorpus("corpus_file_name");
      ldaTest.initialize();
      ldaTest.sample(100); // set number of iterations as needed
      ldaTest.writeDocTopicDist("theta_file_name"); // optional, write document-topic distribution to file
      ldaTest.writeDocTopicCounts("topic_count_file_name"); // optional, write document-topic counts to file
    

RTM

  • Class: yang.weiwei.lda.rtm.RTM.

  • Extends LDA.

  • Training code example

      LDAParam param = new LDAParam("vocab_file_name");
      RTM ldaTrain = new RTM(param);
      ldaTrain.readCorpus("corpus_file_name");
      ldaTrain.readGraph("train_graph_file_name", RTM.TRAIN_GRAPH); // read train graph
      ldaTrain.readGraph("test_graph_file_name", RTM.TEST_GRAPH); // read test graph
      ldaTrain.initialize();
      ldaTrain.sample(100); 
      ldaTrain.writePred("pred_file_name"); // optional, write predicted document link probabilities to file
      ldaTrain.writeRegValues("reg_value_file_name"); // optional, write doc-doc regression values to file
    
  • Test code example

      LDAParam param = new LDAParam("vocab_file_name");
      RTM ldaTest = new RTM(ldaTrain, param);
      // RTM ldaTest = new RTM("model_file_name", param); 
      ldaTest.readCorpus("corpus_file_name");
      ldaTest.readGraph("train_graph_file_name", RTM.TRAIN_GRAPH); // optional
      ldaTest.readGraph("test_graph_file_name", RTM.TEST_GRAPH);
      ldaTest.initialize();
      ldaTest.sample(100); 
      ldaTest.writePred("pred_file_name"); // optional, write predicted document link probabilities to file
      ldaTest.writeRegValues("reg_value_file_name"); // optional, write doc-doc regression values to file
    

Lex-WSB-RTM

  • Class: yang.weiwei.lda.rtm.lex_wsb_rtm.LexWSBRTM.

  • Extends RTM.

  • Training code example

      LDAParam param = new LDAParam("vocab_file_name");
      LexWSBRTM ldaTrain = new LexWSBRTM(param);
      ldaTrain.readCorpus("corpus_file_name");
      ldaTrain.readGraph("train_graph_file_name", RTM.TRAIN_GRAPH); 
      ldaTrain.readGraph("test_graph_file_name", RTM.TEST_GRAPH); 
      ldaTrain.readBlockGraph("wsbm_graph_file_name"); // optional, read graph for WSBM
      ldaTrain.initialize();
      ldaTrain.sample(100); 
      ldaTrain.writeBlocks("block_file_name"); // optional, write WSBM results to file
    
  • Test code example

      LDAParam param = new LDAParam("vocab_file_name");
      LexWSBRTM ldaTest = new LexWSBRTM(ldaTrain, param);
      // LexWSBRTM ldaTest = new LexWSBRTM("model_file_name", param); 
      ldaTest.readCorpus("corpus_file_name");
      ldaTest.readGraph("train_graph_file_name", RTM.TRAIN_GRAPH); // optional
      ldaTest.readGraph("test_graph_file_name", RTM.TEST_GRAPH);
      ldaTest.readBlockGraph("wsbm_graph_file_name"); // optional
      ldaTest.initialize();
      ldaTest.sample(100); 
      ldaTest.writeBlocks("block_file_name"); // optional
    

Lex-WSB-Med-RTM

  • Class: yang.weiwei.lda.rtm.lex_wsb_med_rtm.LexWSBMedRTM.
  • Extends Lex-WSB-RTM.
  • Code examples are the same as for Lex-WSB-RTM.

SLDA

  • Class: yang.weiwei.lda.slda.SLDA.

  • Extends LDA.

  • Training code example

      LDAParam param = new LDAParam("vocab_file_name");
      SLDA ldaTrain = new SLDA(param);
      ldaTrain.readCorpus("corpus_file_name");
      ldaTrain.readLabels("label_file_name"); // read label file
      ldaTrain.initialize();
      ldaTrain.sample(100);
      ldaTrain.writePredLabels("pred_label_file_name"); // optional, write predicted labels
      ldaTrain.writeRegValues("reg_value_file_name"); // optional, write regression values
    
  • Test code example

      LDAParam param = new LDAParam("vocab_file_name");
      SLDA ldaTest = new SLDA(ldaTrain, param);
      // SLDA ldaTest = new SLDA("model_file_name", param);
      ldaTest.readCorpus("corpus_file_name");
      ldaTest.readLabels("label_file_name"); // optional
      ldaTest.initialize();
      ldaTest.sample(100);
      ldaTest.writePredLabels("pred_label_file_name"); // optional
      ldaTest.writeRegValues("reg_value_file_name"); // optional
    

BS-LDA

  • Class: yang.weiwei.lda.slda.bs_lda.BSLDA
  • Extends SLDA.
  • Code examples are the same as for SLDA.

Lex-WSB-BS-LDA

  • Class: yang.weiwei.lda.slda.lex_wsb_bs_lda.LexWSBBSLDA.

  • Extends BS-LDA.

  • Training code example

      LDAParam param = new LDAParam("vocab_file_name");
      LexWSBBSLDA ldaTrain = new LexWSBBSLDA(param);
      ldaTrain.readCorpus("corpus_file_name");
      ldaTrain.readLabels("label_file_name");
      ldaTrain.readBlockGraph("wsbm_graph_file_name"); // optional, read graph for WSBM
      ldaTrain.initialize();
      ldaTrain.sample(100);
      ldaTrain.writeBlocks("block_file_name"); // optional, write WSBM results to file
    
  • Test code example

      LDAParam param = new LDAParam("vocab_file_name");
      LexWSBBSLDA ldaTest = new LexWSBBSLDA(ldaTrain, param);
      // LexWSBBSLDA ldaTest = new LexWSBBSLDA("model_file_name", param);
      ldaTest.readCorpus("corpus_file_name");
      ldaTest.readLabels("label_file_name"); // optional
      ldaTest.readBlockGraph("wsbm_graph_file_name"); // optional
      ldaTest.initialize();
      ldaTest.sample(100);
      ldaTest.writePredLabels("pred_label_file_name"); // optional
      ldaTest.writeBlocks("block_file_name"); // optional
    

Lex-WSB-Med-LDA

BP-LDA

  • Class: yang.weiwei.lda.bp_lda.BPLDA

  • Extends LDA.

  • Training code example

      LDAParam param = new LDAParam("vocab_file_name");
      BPLDA ldaTrain = new BPLDA(param); 
      ldaTrain.readCorpus("corpus_file_name");
      ldaTrain.readBlocks("block_file_name"); // read block file
      ldaTrain.initialize();
      ldaTrain.sample(100);
    
  • Test code example

      LDAParam param = new LDAParam("vocab_file_name");
      BPLDA ldaTest = new BPLDA(ldaTrain, param);
      // BPLDA ldaTest = new BPLDA("model_file_name", param);
      ldaTest.readCorpus("corpus_file_name");
      ldaTest.readBlocks("block_file_name"); // optional
      ldaTest.initialize();
      ldaTest.sample(100); 
    

ST-LDA

  • Class: yang.weiwei.lda.st_lda.STLDA

  • Extends LDA.

  • Training code example

      LDAParam param = new LDAParam("vocab_file_name");
      STLDA ldaTrain = new STLDA(param);
      ldaTrain.readCorpus("long_corpus_file_name");
      ldaTrain.readShortCorpus("short_corpus_file_name");
      ldaTrain.initialize();
      ldaTrain.sample(100);
      ldaTrain.writeShortDocTopicDist("short_theta_file_name"); // optional, write short documents' topic distribution to file
      ldaTrain.writeShortDocTopicAssign("short_topic_assign_file_name"); // optional, write short documents' topic assignments to file
    
  • Test code example

      LDAParam param = new LDAParam("vocab_file_name");
      STLDA ldaTest = new STLDA(ldaTrain, param);
      // STLDA ldaTest = new STLDA("model_file_name", param);
      ldaTest.readCorpus("long_corpus_file_name");
      ldaTest.readShortCorpus("short_corpus_file_name");
      ldaTest.initialize();
      ldaTest.sample(100);
      ldaTest.writeShortDocTopicDist("short_theta_file_name"); // optional
      ldaTest.writeShortDocTopicAssign("short_topic_assign_file_name"); // optional
    

WSB-TM

  • Class: yang.weiwei.lda.wsb_tm.WSBTM

  • Extends LDA.

  • Training code example

      LDAParam param = new LDAParam("vocab_file_name");
      WSBTM ldaTrain = new WSBTM(param); 
      ldaTrain.readCorpus("corpus_file_name");
      ldaTrain.readGraph("wsbm_graph_file_name"); // read graph file
      ldaTrain.initialize();
      ldaTrain.sample(100);
    
  • Test code example

      LDAParam param = new LDAParam("vocab_file_name");
      WSBTM ldaTest = new WSBTM(ldaTrain, param);
      // WSBTM ldaTest = new WSBTM("model_file_name", param);
      ldaTest.readCorpus("corpus_file_name");
      ldaTest.readGraph("wsbm_graph_file_name"); // optional
      ldaTest.initialize();
      ldaTest.sample(100); 
    

tLDA Code Examples

  • Classes: yang.weiwei.tlda.TLDA and yang.weiwei.tlda.TLDAParam.

  • Training code example

      TLDAParam param = new TLDAParam("vocab_file_name", "tree_prior_file_name"); // initialize a parameter object and set parameters as needed
      TLDA tldaTrain = new TLDA(param); // initialize a tLDA object
      tldaTrain.readCorpus("corpus_file_name");
      tldaTrain.initialize();
      tldaTrain.sample(100); // set number of iterations as needed
      tldaTrain.writeModel("model_file_name"); // optional, see test code example
      tldaTrain.writeDocTopicDist("theta_file_name"); // optional, write document-topic distribution to file
      tldaTrain.writeWordResult("topic_file_name", 10); // optional, write top 10 words of each topic to file
      tldaTrain.writeDocTopicCounts("topic_count_file_name"); // optional, write document-topic counts to file
    
  • Test code example

      TLDAParam param = new TLDAParam("vocab_file_name", "tree_prior_file_name");
      TLDA tldaTest = new TLDA(tldaTrain, param); // initialize with pre-trained tLDA object
      // TLDA tldaTest = new TLDA("model_file_name", param); // or initialize with a TLDA model in a file
      tldaTest.readCorpus("corpus_file_name");
      tldaTest.initialize();
      tldaTest.sample(100); // set number of iterations as needed
      tldaTest.writeDocTopicDist("theta_file_name"); // optional, write document-topic distribution to file
      tldaTest.writeDocTopicCounts("topic_count_file_name"); // optional, write document-topic counts to file
    

Other Code Examples

WSBM

  • Classes: yang.weiwei.wsbm.WSBM and yang.weiwei.wsbm.WSBMParam.

  • Code example

      WSBMParam param = new WSBMParam(); // initialize a parameter object and set parameters as needed
      WSBM wsbm = new WSBM(param); // initialize a WSBM object with parameters
      wsbm.readGraph("graph_file_name");
      wsbm.init();
      wsbm.sample(100); // set number of iterations as needed
      wsbm.printResults();
    

SCC

  • Class: yang.weiwei.scc.SCC.

  • Code example

      SCC scc = new SCC(10); // initialize with number of nodes
      scc.readGraph("graph_file_name");
      scc.cluster();
      scc.writeCluster("result_file_name");
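
For readers unfamiliar with what a strongly connected component is, the following standalone sketch computes them with Kosaraju's two-pass depth-first search. It is only an illustration and is not taken from `yang.weiwei.scc.SCC`; the class name `SccSketch`, the `components` method, and the edge-array input are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of YWW Tools.
final class SccSketch {
    // Returns comp[v] = component id of node v.
    // Two nodes get the same id iff each is reachable from the other.
    static int[] components(int numNodes, int[][] edges) {
        List<List<Integer>> fwd = new ArrayList<>();
        List<List<Integer>> rev = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) {
            fwd.add(new ArrayList<>());
            rev.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            fwd.get(e[0]).add(e[1]); // edge e[0] -> e[1]
            rev.get(e[1]).add(e[0]); // same edge, reversed
        }

        // Pass 1: record nodes by DFS finishing time on the forward graph.
        boolean[] seen = new boolean[numNodes];
        Deque<Integer> order = new ArrayDeque<>();
        for (int v = 0; v < numNodes; v++) {
            if (!seen[v]) dfsOrder(v, fwd, seen, order);
        }

        // Pass 2: walk the reversed graph, latest-finished node first.
        int[] comp = new int[numNodes];
        Arrays.fill(comp, -1);
        int nextId = 0;
        while (!order.isEmpty()) {
            int v = order.pop();
            if (comp[v] == -1) dfsAssign(v, rev, comp, nextId++);
        }
        return comp;
    }

    private static void dfsOrder(int v, List<List<Integer>> g, boolean[] seen, Deque<Integer> order) {
        seen[v] = true;
        for (int w : g.get(v)) {
            if (!seen[w]) dfsOrder(w, g, seen, order);
        }
        order.push(v); // pushed after all descendants have finished
    }

    private static void dfsAssign(int v, List<List<Integer>> g, int[] comp, int id) {
        comp[v] = id;
        for (int w : g.get(v)) {
            if (comp[w] == -1) dfsAssign(w, g, comp, id);
        }
    }

    public static void main(String[] args) {
        // Cycle 0 -> 1 -> 2 -> 0, a bridge 2 -> 3, and cycle 3 -> 4 -> 3.
        int[][] edges = {{0, 1}, {1, 2}, {2, 0}, {2, 3}, {3, 4}, {4, 3}};
        System.out.println(Arrays.toString(components(5, edges))); // prints [0, 0, 0, 1, 1]
    }
}
```

The recursion keeps the sketch short, but it can overflow the stack on very large graphs; an explicit stack would be safer there.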
    

Tree Builder

  • Class: yang.weiwei.tlda.TreeBuilder.

  • Code example

      TreeBuilder tb = new TreeBuilder();
      // Each call below builds a tree in a different way and writes it to "tree_file_name"; call the one you need.
      tb.build2LevelTree("score_file_name", "vocab_file_name", "tree_file_name", numChild); // build a two-level tree
      tb.hac("score_file_name", "vocab_file_name", "tree_file_name", threshold); // build a tree with hierarchical agglomerative clustering (HAC)
      tb.hacWithLeafDup("score_file_name", "vocab_file_name", "tree_file_name", threshold); // build a tree with HAC and leaf duplication
    

English Corpus Preprocessing

  • There are two common ways to preprocess an English corpus for topic models:
    • tokenization -> stop word removal -> stemming
    • tokenization -> POS tagging -> lemmatization -> stop word removal
  • The first is faster, but stemmed words are harder to read. The second takes more time but produces more readable words.
  • Finally, you may want to remove words with low (document) frequency to speed up topic modeling without hurting performance.
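
The last step above, dropping words with low document frequency, can be sketched in plain Java as follows. This is only an illustration under stated assumptions and not code from this package: the class `LowDocFreqFilter`, its methods, and the `minDocFreq` threshold are hypothetical names, and the corpus is assumed to be already tokenized into lists of words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of YWW Tools.
final class LowDocFreqFilter {
    // Count, for each word, how many documents contain it at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures a word is counted once per document.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // Keep only tokens whose document frequency is at least minDocFreq.
    static List<List<String>> filter(List<List<String>> docs, int minDocFreq) {
        Map<String, Integer> df = documentFrequency(docs);
        List<List<String>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String word : doc) {
                if (df.get(word) >= minDocFreq) {
                    kept.add(word);
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("topic", "model", "topic"),
            Arrays.asList("topic", "graph"),
            Arrays.asList("model", "tree"));
        // "graph" and "tree" each appear in only one document, so they are dropped.
        System.out.println(filter(docs, 2)); // prints [[topic, model, topic], [topic], [model]]
    }
}
```

A suitable threshold depends on corpus size. Note that filtering can leave some documents empty, so you may want to drop those before training.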

Citation

  • If you use Tree Builder, please cite

      @InProceedings{Yang:Boyd-Graber:Resnik-2017,
      	Title = {Adapting Topic Models using Lexical Associations with Tree Priors},
      	Booktitle = {Empirical Methods in Natural Language Processing},
      	Author = {Weiwei Yang and Jordan Boyd-Graber and Philip Resnik},
      	Year = {2017},
      	Location = {Copenhagen, Denmark},
      }
    
  • If you use Lex-WSB-RTM (aka LBS-RTM), Lex-WSB-Med-RTM (aka LBH-RTM), Lex-WSB-BS-LDA, and/or Lex-WSB-Med-LDA, please cite

      @InProceedings{Yang:Boyd-Graber:Resnik-2016,
      	Title = {A Discriminative Topic Model using Document Network Structure},
      	Booktitle = {Association for Computational Linguistics},
      	Author = {Weiwei Yang and Jordan Boyd-Graber and Philip Resnik},
      	Year = {2016},
      	Location = {Berlin, Germany},
      }
    
  • If you use ST-LDA, please cite

      @InProceedings{Hong:Yang:Resnik:Frias-Martinez-2016,
      	Title = {Uncovering Topic Dynamics of Social Media and News: The Case of Ferguson},
      	Booktitle = {International Conference on Social Informatics},
      	Author = {Lingzi Hong and Weiwei Yang and Philip Resnik and Vanessa Frias-Martinez},
      	Year = {2016},
      	Location = {Bellevue, WA, USA}
      }
    

References

LDA: Latent Dirichlet Allocation

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.

SLDA: Supervised LDA

Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Proceedings of Advances in Neural Information Processing Systems.

Med-LDA: Max-margin LDA

Jun Zhu, Amr Ahmed, and Eric P. Xing. 2012. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research.

Jun Zhu, Ning Chen, Hugh Perkins, and Bo Zhang. 2014. Gibbs max-margin topic models with data augmentation. Journal of Machine Learning Research.

RTM: Relational Topic Model

Jonathan Chang and David M. Blei. 2010. Hierarchical relational models for document networks. The Annals of Applied Statistics.

Lex-WSB-Med-RTM: RTM with WSB-computed Block Priors, Lexical Weights, and Hinge Loss

Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2016. A discriminative topic model using document network structure. In Proceedings of Association for Computational Linguistics.

ST-LDA: Single Topic LDA

Lingzi Hong, Weiwei Yang, Philip Resnik, and Vanessa Frias-Martinez. 2016. Uncovering topic dynamics of social media and news: The case of Ferguson. In Proceedings of International Conference on Social Informatics.

WSBM: Weighted Stochastic Block Model

Christopher Aicher, Abigail Z. Jacobs, and Aaron Clauset. 2014. Learning latent block structure in weighted networks. Journal of Complex Networks.

tLDA: Tree LDA

Jordan Boyd-Graber, David M. Blei, and Xiaojin Zhu. 2007. A topic model for word sense disambiguation. In Proceedings of Empirical Methods in Natural Language Processing.

Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2017. Adapting topic models using lexical associations with tree priors. In Proceedings of Empirical Methods in Natural Language Processing.
