Topic Modeling – WI Methods Lab

Workshop Recap: A Practical Introduction to Text Analysis (November 30, 2023)

December 15, 2023 (updated April 29, 2024) | Roland Toth

On November 30, 2023, the Methods Lab organized a workshop on quantitative text analysis. The workshop was conducted by Douglas Parry (Stellenbosch University) and covered the whole process of text analysis, from data preparation to the visualization of the sentiments or topics identified.

In the first half of the workshop, Douglas covered the first steps involved in text analysis, such as tokenization (the transformation of texts into smaller parts like single words or consecutive words), the removal of “stop words” (words that do not contain meaningful information), and the aggregation of content by meta-information (authors, books, chapters, etc.). Apart from the investigation of the frequency with which terms occur, sentiment analysis using existing dictionaries was also addressed. This technique involves assigning values to each word representing certain targeted characteristics (e.g., emotionality/polarity), which in turn allows for comparing overall sentiments between different corpora. Finally, the visualization of word occurrences and sentiments was covered. After this introduction, participants had the chance to apply their knowledge using the programming language R by solving tasks with texts Douglas provided.
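The preprocessing and dictionary steps described above can be sketched in a few lines. The workshop itself used R; the following is only an illustrative Python sketch using the standard library, not the workshop's materials. The stop-word list, the sentiment lexicon, the example texts, and all function names are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list and polarity lexicon, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "but", "of", "it", "this"}
LEXICON = {"great": 1, "good": 1, "love": 1, "bad": -1, "boring": -1, "terrible": -1}


def tokenize(text):
    """Lowercase the text and split it into single-word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaningful information."""
    return [t for t in tokens if t not in STOP_WORDS]


def bigrams(tokens):
    """Pair each token with its successor (tokens of two consecutive words)."""
    return list(zip(tokens, tokens[1:]))


def sentiment(tokens):
    """Sum the lexicon values of all tokens; unknown words count as 0."""
    return sum(LEXICON.get(t, 0) for t in tokens)


# Made-up texts, grouped by a piece of meta-information (the author).
corpus = {
    "author_a": ["The plot was great and the ending was good.", "A good book."],
    "author_b": ["The plot was boring.", "It was a terrible, bad ending."],
}

term_counts = {}
scores = {}
for author, texts in corpus.items():
    # Aggregate all texts of one author before counting and scoring.
    tokens = [t for text in texts for t in remove_stop_words(tokenize(text))]
    term_counts[author] = Counter(tokens)
    scores[author] = sentiment(tokens)

print(term_counts["author_a"].most_common(1))  # [('good', 2)]
print(scores)  # {'author_a': 3, 'author_b': -3}
```

Comparing the summed scores per author is the simplest form of the corpus comparison mentioned above. A real analysis would rely on an established lexicon and a fuller stop-word list.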

Douglas Parry goes through steps necessary to prepare for text analysis.

In the second half of the workshop, Douglas focused on different methods of topic modeling, which ultimately attempt to assign texts to latent topics based on the words they contain. Compared to the simpler procedures covered in the first half of the workshop, topic models can also take into account the context in which words appear within the texts. Specifically, Douglas introduced participants to Latent Dirichlet Allocation (LDA), Correlated Topic Modeling (CTM), and Structural Topic Modeling (STM). One of the most important decisions to be made for any such model is the number of topics to extract: too few may blur nuances within topics, and too many may lead to redundant topics. The visualization and, most importantly, the limitations of topic modeling were also discussed before participants performed topic modeling themselves with the data provided earlier. Finally, Douglas concluded with a summary of everything covered and an overview of advanced subjects in text analysis.

The workshop was very well received and prepared all participants for future text analysis projects. Douglas struck a good balance between lecture-style sections and well-prepared, hands-on exercises, and he provided all materials in a logical structure that let participants focus on the tasks at hand. We would like to thank him for this great introduction to text analysis!

Workshop: A Practical Introduction to Text Analysis

November 1, 2023 (updated April 29, 2024) | Methods Lab

We are excited to announce our upcoming workshop, “A Practical Introduction to Text Analysis”, on Thursday, November 30, at the Weizenbaum Institute. Led by visiting fellow Dr. Douglas Parry (Stellenbosch University, South Africa), this workshop offers a comprehensive introduction to text analysis using the R programming language. Topics covered include text pre-processing (formats, tokenization, stemming, stop words, regex), dictionary analysis (lexicons, tf-idf, sentiment), topic modeling (LDA, CTM, STM), and data visualization. By the end of the workshop, participants will be equipped to tackle real-world text-mining tasks and have a solid foundation to move on to more advanced analysis techniques. While a basic understanding of R programming is expected, prior experience in text analysis is not necessary.

For more details about the workshop, visit our program page. We look forward to your participation!

Workshop Recap: Introduction to Topic Modeling (June 15, 2023)

August 28, 2023 (updated March 13, 2024) | Anna Hohwü-Christensen

On June 15, the Methods Lab organized the workshop Introduction to Topic Modeling in collaboration with the research group Platform Algorithms and Digital Propaganda. The workshop aimed to provide participants with a comprehensive understanding of topic modeling – a machine-learning technique used to determine clusters of similar words (i.e., topics) within bodies of text. The event took place at the Weizenbaum Institute in a hybrid format, bringing together researchers from various institutions.

The workshop was conducted by Daniel Matter (TU Munich), who guided the participants through the basic concepts and applications of this method. Through theory, demonstrations, and practical examples, participants gained insight into commonly used algorithms such as Latent Dirichlet Allocation (LDA) and BERT-based topic models. The workshop enabled participants to assess the advantages and drawbacks of each approach, equipping them with a foundation in topic modeling while also providing plenty of new insights to those with prior expertise.

Daniel Matter talks about the most important aspects of topic modeling.

During the workshop, Daniel explained the distinction between LDA and BERTopic, two popular topic modeling strategies. LDA is a generative model that treats each document as a mixture of topics and each topic as a distribution over words. It aims to determine the topic and word distributions that maximize the probability of generating the documents in the corpus. With LDA, as opposed to BERTopic, the number of topics must be specified beforehand.
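To make the generative view more concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, one common inference approach. Note that it draws samples from the posterior rather than strictly maximizing a probability, and it is not necessarily the method shown in the workshop. The function name `fit_lda`, the hyperparameter values, and the toy documents are assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative, not optimized).

    docs is a list of token lists; num_topics must be chosen in advance.
    Returns (theta, phi): per-document topic shares and per-topic word shares.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample: topics popular in this document and topics that
                # already contain this word both become more likely.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in vocab}
        for t in range(num_topics)
    ]
    return theta, phi


# Made-up toy corpus; real corpora are far larger.
docs = [
    "cat dog pet cat".split(),
    "dog pet vet".split(),
    "stock market trade".split(),
    "market trade stock bank".split(),
]
theta, phi = fit_lda(docs, num_topics=2)
for t, word_probs in enumerate(phi):
    top = sorted(word_probs, key=word_probs.get, reverse=True)[:3]
    print(f"topic {t}: {top}")
```

Each row of `theta` is a document's mixture over topics, and each entry of `phi` is a topic's distribution over the vocabulary. The `num_topics` argument reflects the requirement that the number of topics be fixed before fitting.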

BERTopic, on the other hand, belongs to the category of Embeddings-Based Topic Models (EBTM), which take a different approach. Unlike LDA, which treats words as distinct features, BERTopic incorporates semantic relationships between words. BERTopic follows a bottom-up approach, embedding documents in a semantic space and extracting topics from this transformed representation. Unlike LDA, which can be applied to short and long text corpora, BERTopic generally works better on shorter text, such as social media posts or news headlines.
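The bottom-up idea of embedding documents first and extracting topics afterwards can be sketched as follows. This is a heavily simplified illustration in plain Python and does not use the BERTopic library: the two-dimensional vectors are hypothetical stand-ins for embeddings that a language model would produce, the greedy similarity threshold is a stand-in for a proper clustering algorithm, and the term weighting is a simple class-based TF-IDF variant. All names and values are assumptions.

```python
import math
from collections import Counter

docs = [
    "new phone camera review",
    "phone battery and camera test",
    "election results and vote count",
    "parliament vote after election",
]
# Hypothetical embeddings: in practice a language model would produce
# high-dimensional vectors that place similar documents close together.
embeddings = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def cluster(vectors, threshold=0.9):
    """Greedy stand-in for clustering: join the first cluster whose seed
    vector is similar enough, otherwise open a new cluster. The number of
    clusters is not fixed in advance."""
    labels, seeds = [], []
    for v in vectors:
        for label, seed in enumerate(seeds):
            if cosine(v, seed) >= threshold:
                labels.append(label)
                break
        else:
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return labels


def topic_words(docs, labels, top_n=2):
    """Describe each cluster by its highest-weighted terms, using a simple
    class-based TF-IDF: terms frequent in one cluster but rare overall
    receive high weights."""
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.split())
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(class_counts)
    topics = {}
    for label, counts in class_counts.items():
        weights = {
            term: tf * math.log(1 + avg_words / total[term])
            for term, tf in counts.items()
        }
        # Sort by weight (descending), then alphabetically to break ties.
        topics[label] = sorted(weights, key=lambda t: (-weights[t], t))[:top_n]
    return topics


labels = cluster(embeddings)
topics = topic_words(docs, labels)
print(labels)  # [0, 0, 1, 1]
print(topics)  # {0: ['camera', 'phone'], 1: ['election', 'vote']}
```

Because the topics are read off the clusters after the fact, the number of topics follows from the data and the clustering settings rather than being set up front. Note how the word "and", which appears in both clusters, receives a lower weight than the cluster-specific terms.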

The workshop took place in the Flex Room at the Weizenbaum Institute.

When deciding between BERTopic and LDA, it is essential to consider the specific requirements of the text analysis. BERTopic’s strength lies in its flexibility and ability to handle short texts effectively, while LDA is preferred when strong interpretability is needed.

Participants were able to engage with the method on their own devices during the presentation.

With this workshop, we at the Methods Lab hope to have provided our attendees with a solid understanding of topic modeling as a method. Having explored the concepts, applications, and advantages of each approach, researchers can use these tools to uncover hidden semantic structures within textual data across various domains, supporting tasks such as document clustering, information retrieval, and recommender systems.

A big thank you to Daniel for introducing us to the world of topic modeling and to all our participants!

Our next workshop, Whose Data is it Anyway? Ethical, Practical, and Methodological Challenges of Data Donation in Messenger Groups Research, will take place on August 30, 2023. See you there!