We are excited to announce our upcoming workshop, “A Practical Introduction to Text Analysis”, on Thursday, November 30, at the Weizenbaum Institute. Led by visiting fellow Dr. Douglas Parry (Stellenbosch University, South Africa), this workshop offers a comprehensive introduction to text analysis using the R programming language. Topics covered include text pre-processing (formats, tokenization, stemming, stop words, regex), dictionary analysis (lexicons, tf-idf, sentiment), topic modeling (LDA, CTM, STM), and data visualization. By the end of the workshop, participants will be equipped to tackle real-world text-mining tasks and will have a solid foundation for moving on to more advanced analysis techniques. A basic understanding of R programming is expected, but prior experience in text analysis is not necessary.
For more details about the workshop, visit our program page. We look forward to your participation!
On June 15, 2023, the Methods Lab organized the workshop “Introduction to Topic Modeling” in collaboration with the WI research group “Platform Algorithms and Digital Propaganda”. The workshop aimed to provide participants with a comprehensive understanding of topic modeling, a machine-learning technique used to determine clusters of similar words (i.e., topics) within bodies of text. The event took place at the Weizenbaum Institute in a hybrid format, bringing together researchers from various institutions.
The workshop was conducted by Daniel Matter (TU Munich), who guided the participants through the basic concepts and applications of this method. Through theory, demonstrations, and practical examples, participants gained insight into commonly used algorithms such as Latent Dirichlet Allocation (LDA) and BERT-based topic models. The workshop enabled participants to assess the advantages and drawbacks of each approach, equipping newcomers with a foundation in topic modeling while also offering new insights to those with prior expertise.
During the workshop, Daniel explained the distinction between LDA and BERTopic, two popular topic modeling strategies. LDA, or Latent Dirichlet Allocation, a commonly used method for topic modeling, operates as a generative model and treats each document as a mixture of topics. LDA aims to determine the topic and word distributions that maximize the probability of generating the documents in the corpus. With LDA, as opposed to BERTopic, the number of topics must be known beforehand.
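The generative view of LDA described above can be made concrete with a small sketch. It is not material from the workshop: the vocabulary, the two topic-word distributions, and the function names are made up for illustration, and it only shows how a document would be generated once the topics are fixed, not how LDA fits them to a corpus.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # 1. Each document gets its own mixture over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words


# Hypothetical toy setup: two topics over a six-word vocabulary.
vocab = ["election", "vote", "party", "goal", "match", "team"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "politics"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "sports"-like topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, [0.5, 0.5], 8, rng)
print("topic mixture:", theta)
print("document:", " ".join(words))
```

Fitting LDA runs this story in reverse: given only the documents, it searches for the topic mixtures and word distributions under which the observed text is most probable. The length of `topic_word` is the number of topics, which is why that number has to be chosen before fitting.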
BERTopic, on the other hand, belongs to the category of Embeddings-Based Topic Models (EBTM), which take a different approach. Unlike LDA, which treats words as distinct features, BERTopic incorporates semantic relationships between words. BERTopic follows a bottom-up approach, embedding documents in a semantic space and extracting topics from this transformed representation. Unlike LDA, which can be applied to short and long text corpora, BERTopic generally works better on shorter text, such as social media posts or news headlines.
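The bottom-up idea (embed first, then group, then describe each group) can be sketched in a few lines. This is a simplified stand-in and not the BERTopic library's own implementation: the two-dimensional vectors are hand-written placeholders for what a language model would produce, the greedy grouping replaces a real clustering algorithm, and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def group_by_similarity(vectors, threshold=0.8):
    """Greedy grouping: a document joins the first group whose seed is similar enough."""
    groups = []  # each group is a list of document indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found, so start a new one
    return groups


def describe_groups(docs, groups, top_n=2):
    """Label each group with words that are frequent in it and rare in other groups."""
    counts = [Counter(w for i in g for w in docs[i].lower().split()) for g in groups]
    labels = []
    for c in counts:
        def score(word, c=c):
            spread = sum(1 for other in counts if word in other)
            return c[word] * math.log(len(counts) / spread)
        labels.append(sorted(c, key=lambda w: (-score(w), w))[:top_n])
    return labels


docs = [
    "new election poll today",
    "election poll results",
    "team wins match today",
    "match report team",
]
# Hypothetical embeddings: in practice these come from a pretrained language model.
vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]

groups = group_by_similarity(vectors)
labels = describe_groups(docs, groups)
for group, label in zip(groups, labels):
    print(group, label)
```

Note that the number of groups is not passed in anywhere; it falls out of how the documents sit in the embedding space, which mirrors the contrast with LDA drawn above.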
When deciding between BERTopic and LDA, it is essential to consider the specific requirements of the text analysis. BERTopic’s strength lies in its flexibility and ability to handle short texts effectively, while LDA is preferred when strong interpretability is needed.
With this workshop, we at the Methods Lab hope to have provided our attendees with a solid understanding of topic modeling as a method. Having explored the concepts, applications, and advantages of each approach, researchers can use these tools to uncover hidden semantic structures within textual data and apply them across domains to tasks such as document clustering, information retrieval, and recommender systems.
A big thank you to Daniel for introducing us to the world of topic modeling and to everyone who participated!
Our next workshop, “Whose data is it anyway? Ethical, practical, and methodological challenges of data donation in messenger groups research”, will take place on August 30, 2023. See you there!