The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed-membership model. While powerful, it makes a hidden assumption: the probability that a mixture component contributes to a data point is positively correlated with how much it contributes to that data point. In many settings, this is an undesirable prior assumption. For example, in topic modeling a topic (component) might be rare throughout the corpus but dominant within the documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence from within-data proportion in a mixed-membership model. The ICD combines features of the HDP and the Indian buffet process (IBP). It assigns a subset of the shared mixture components to each data point. This subset, the data point's "focus", is determined independently of the amount that each component in the subset contributes to it. We use an ICD mixture model, the focused topic model (FTM), to analyze text corpora, and we demonstrate superior performance over the HDP-based topic model.
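
The decoupling described above can be illustrated with a small generative sketch. It is not the paper's implementation: it assumes a finite truncation of the IBP via stick-breaking, gamma-distributed relative masses, and a Dirichlet over each data point's selected components, none of which are specified in the abstract. The function name `sample_focused_proportions` and the parameters `truncation`, `alpha`, and `gamma` are hypothetical choices made for illustration.

```python
import numpy as np


def sample_focused_proportions(num_docs, truncation=20, alpha=3.0, gamma=1.0, seed=0):
    """Illustrative sketch only: truncated draw of per-document 'focus' and proportions.

    All distributional choices here are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)

    # Across-data prevalence: truncated stick-breaking weights (assumed IBP construction).
    mu = rng.beta(alpha, 1.0, size=truncation)
    pi = np.cumprod(mu)  # decreasing probabilities that a component is "on"

    # Relative mass of each component, drawn independently of its prevalence.
    phi = rng.gamma(gamma, 1.0, size=truncation)

    # Focus: binary matrix selecting a subset of components for each document.
    b = rng.random((num_docs, truncation)) < pi

    # Within-data proportions: Dirichlet restricted to the selected components,
    # sampled as normalized gamma variates with shape phi.
    g = rng.gamma(phi, 1.0, size=(num_docs, truncation)) * b
    totals = g.sum(axis=1, keepdims=True)
    # Documents with no selected component keep an all-zero row.
    theta = np.divide(g, totals, out=np.zeros_like(g), where=totals > 0)
    return b, theta


if __name__ == "__main__":
    focus, proportions = sample_focused_proportions(num_docs=5)
    print(focus.sum(axis=1))        # number of components in each document's focus
    print(proportions.round(2))     # proportions are zero outside the focus
```

In this sketch a component with a small `pi` is selected by few documents, yet a large `phi` can still make it dominant in the documents that do select it, which is the behavior the ICD is designed to allow.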