Workshop: A Practical Introduction to Text Analysis

We are excited to announce our upcoming workshop, “A Practical Introduction to Text Analysis”, on Thursday, November 30, at the Weizenbaum Institute. Led by visiting fellow Dr. Douglas Parry (Stellenbosch University, South Africa), this workshop offers a comprehensive introduction to text analysis using the R programming language. Topics covered include text pre-processing (formats, tokenization, stemming, stop words, regex), dictionary analysis (lexicons, tf-idf, sentiment), topic modeling (LDA, CTM, STM), and data visualization. By the end of the workshop, participants will be equipped to tackle real-world text-mining tasks and will have a solid foundation for moving on to more advanced analysis techniques. A basic understanding of R programming is expected, but prior experience in text analysis is not necessary.

For more details about the workshop, visit our program page. We look forward to your participation!

Workshop Recap: Introduction to Topic Modeling (June 15, 2023)

On June 15, 2023, the Methods Lab organized the workshop “Introduction to Topic Modeling” in collaboration with the WI research group “Platform Algorithms and Digital Propaganda”. The workshop aimed to provide participants with a comprehensive understanding of topic modeling, a machine-learning technique used to determine clusters of similar words (i.e., topics) within bodies of text. The event took place at the Weizenbaum Institute in a hybrid format, bringing together researchers from various institutions.

The workshop was conducted by Daniel Matter (TU Munich), who guided the participants through the basic concepts and applications of this method. Through theory, demonstrations, and practical examples, participants gained insight into commonly used algorithms such as Latent Dirichlet Allocation (LDA) and BERT-based topic models. The workshop enabled participants to assess the advantages and drawbacks of each approach, giving newcomers a foundation in topic modeling while also offering new insights to those with prior expertise.

Daniel Matter talks about the most important aspects of topic modeling

During the workshop, Daniel explained the distinction between LDA and BERTopic, two popular topic modeling strategies. LDA operates as a generative model and treats each document as a mixture of topics, with each topic being a probability distribution over words. It aims to determine the topic and word distributions that maximize the probability of generating the documents in the corpus. With LDA, as opposed to BERTopic, the number of topics must be specified beforehand.
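The generative story behind LDA can be made concrete with a small sketch. The Python snippet below is an illustration only, not material from the workshop: the vocabulary, the hyperparameters `alpha` and `beta`, and all function names are hypothetical. It runs the model "forwards", drawing a topic mixture per document and then a topic and a word for every token. Fitting LDA means going the other way and inferring these distributions from observed text.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Pick an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, beta=0.5, seed=0):
    """Generate toy documents following the LDA generative process."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over the vocabulary.
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    docs, doc_topic = [], []
    for _ in range(n_docs):
        # Each document is a mixture of topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)          # choose a topic for this token
            w = sample_index(topic_word[z], rng)  # choose a word from that topic
            words.append(vocab[w])
        docs.append(words)
        doc_topic.append(theta)
    return docs, doc_topic, topic_word


# Hypothetical toy vocabulary; note that n_topics is fixed in advance.
vocab = ["election", "vote", "party", "goal", "match", "team"]
docs, doc_topic, topic_word = generate_corpus(vocab, n_topics=2, n_docs=3, doc_len=8)
for words, theta in zip(docs, doc_topic):
    print([round(p, 2) for p in theta], " ".join(words))
```

Note that `n_topics` has to be passed in explicitly, which mirrors the requirement that the number of topics be chosen before an LDA model is fitted.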

BERTopic, on the other hand, belongs to the category of Embeddings-Based Topic Models (EBTM), which take a different approach. Whereas LDA treats words as distinct features, BERTopic incorporates semantic relationships between words. It follows a bottom-up approach, embedding documents in a semantic space and extracting topics from this transformed representation. While LDA can be applied to both short and long texts, BERTopic generally works better on shorter texts, such as social media posts or news headlines.
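The bottom-up idea can be illustrated with a deliberately simplified sketch. It does not use the BERTopic library and is not the code shown in the workshop. The two-dimensional vectors are hand-made stand-ins for embeddings from a pretrained language model, the greedy cosine-threshold grouping replaces a proper clustering algorithm, and the word scoring is a simplified class-based tf-idf. All names and values are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical documents with hand-made 2-D "embeddings"; a real pipeline
# would obtain high-dimensional vectors from a pretrained language model.
docs = [
    ("election vote count begins", (0.9, 0.1)),
    ("party wins the election", (0.8, 0.2)),
    ("team scores late goal", (0.1, 0.9)),
    ("striker scores in the final", (0.2, 0.8)),
]


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def cluster(items, threshold=0.9):
    """Greedy grouping: join the first cluster whose seed vector is similar enough."""
    clusters = []
    for text, vec in items:
        for c in clusters:
            if cosine(vec, c["seed"]) >= threshold:
                c["texts"].append(text)
                break
        else:
            clusters.append({"seed": vec, "texts": [text]})
    return [c["texts"] for c in clusters]


def top_words(clusters, k=2):
    """Rank words per cluster with a simplified class-based tf-idf score."""
    counts = [Counter(" ".join(texts).split()) for texts in clusters]
    n = len(counts)
    result = []
    for c in counts:
        total = sum(c.values())
        scores = {
            # Words frequent in this cluster but rare in others score highest.
            w: (tf / total) * math.log(1 + n / sum(1 for other in counts if w in other))
            for w, tf in c.items()
        }
        result.append(sorted(scores, key=lambda w: (-scores[w], w))[:k])
    return result


clusters = cluster(docs)   # step 1: group documents in the embedding space
topics = top_words(clusters)  # step 2: describe each group by its typical words
for texts, words in zip(clusters, topics):
    print(words, "<-", texts)
```

In this toy setup the number of topics is not fixed in advance. It falls out of how the documents group together in the embedding space, which is the main contrast with LDA.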

The workshop took place in the Flex Room at the Weizenbaum Institute.

When deciding between BERTopic and LDA, it is essential to consider the specific requirements of the text analysis. BERTopic’s strength lies in its flexibility and ability to handle short texts effectively, while LDA is preferred when strong interpretability is needed.

Participants were able to engage with the method on their own devices during the presentation.

With this workshop, we at the Methods Lab hope to have given our attendees a solid understanding of topic modeling as a method, including the concepts, applications, and advantages of each approach. These tools can be used to uncover hidden semantic structures in textual data, and researchers in various domains can employ them for tasks such as document clustering, information retrieval, and recommender systems.

A big thank you to Daniel for introducing us to the world of topic modeling and to everyone who participated!

Our next workshop, “Whose data is it anyway? Ethical, practical, and methodological challenges of data donation in messenger groups research”, will take place on August 30, 2023. See you there!