Blog

Workshop Recap: Web Scraping and API-based Data Collection

On March 2nd, the Methods Lab hosted its first workshop, Web Scraping and API-based Data Collection. The workshop explored various techniques for accessing and gathering data from platforms using APIs and web scraping. Speakers included Florian Primig (FU Berlin), Steffen Lepa (TU Berlin), Felix Gaisbauer (WI), and Lion Wedel (WI). The workshop was very well received, with many people attending both in person and remotely. It generated lively discussion and concluded with a Q&A session.

Lion Wedel gives an introduction to Web-Scraping (photo: Roland Toth).

Thanks to all our presenters and participants for helping us create such a successful first event. We look forward to organizing more workshops on emerging methodologies in digital research!

ECPR Winter School: Machine Learning with Big Data for Social Scientists

From February 6–10, Methods Lab member Roland Toth attended the online course Machine Learning with Big Data for Social Scientists at ECPR Winter School.

The goal was to gain deeper insight into selected machine learning methods and to apply them to social science questions in particular. A second aim was to handle large data sets efficiently so that they can still be processed quickly.

Numerous materials were made available in advance. Each session came with lecture-style videos that walked through slides on its topic, accompanied by relevant literature and studies. On each workshop day, a two-hour live session recapped the content of the videos and practiced applying the principles.

The first step was to set up RStudio Server on the Amazon Web Services (AWS) cloud service. This moves the entire RStudio environment off one’s own machine, so that data handling and computation do not burden local resources.

Furthermore, the course deepened participants' work with the tidyverse package collection. Among other things, it turned out that the function vroom from the package of the same name imports larger data sets faster than similar functions. The course also discussed how to query external data sets directly from RStudio using SQL syntax, so that the full data sets need not be imported at all.
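To illustrate the idea of querying a data source instead of importing it in full, here is a minimal sketch. The course itself worked in R; this example uses Python's built-in sqlite3 module instead, and the table name, column names, and numbers are invented toy values, not the workshop data. The aggregation runs inside the database, so only one summary row per state reaches the analysis session.

```python
import sqlite3

# Hypothetical in-memory database standing in for an external data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vaccinations "
    "(state TEXT, county TEXT, vaccinated INTEGER, population INTEGER)"
)

# Invented toy rows at county level (assumption, not real figures).
rows = [
    ("CA", "County A", 800, 1000),
    ("CA", "County B", 500, 1000),
    ("TX", "County C", 700, 1000),
    ("TX", "County D", 1200, 2000),
]
conn.executemany("INSERT INTO vaccinations VALUES (?, ?, ?, ?)", rows)


def state_rates(connection):
    """Aggregate county rows to state level inside the database."""
    query = """
        SELECT state, SUM(vaccinated) * 1.0 / SUM(population) AS rate
        FROM vaccinations
        GROUP BY state
        ORDER BY state
    """
    return connection.execute(query).fetchall()


result = state_rates(conn)
for state, rate in result:
    print(state, round(rate, 3))
```

With a real database, only the connection line would change; the query itself stays the same, and the county-level rows never have to be loaded locally.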

For illustration, the workshop used data sets on COVID vaccination status and election outcomes in the United States. The observations in these data sets were clustered at different levels (state, county, …), which made merging them difficult. Besides typical data wrangling functions (filtering, grouping, aggregating, mapping, merging), some specific machine learning methods were discussed. The logic of the procedure was first demonstrated using simple linear regression models: a model is trained with a (smaller) training data set and then applied to a (larger) test data set. The model is supposed to predict the outcome accurately, but not fit the training data so closely that it overfits and performs badly on the test data. In the end, it is a question of balancing variance and bias. During the workshop, this principle was also applied to LASSO and Ridge regression, logistic regression, and classification methods such as Support Vector Machines, Decision Trees, and Random Forests.
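The train-and-test logic described above can be sketched in a few lines. This is a Python illustration rather than the R code used in the course; the simulated data, the function names fit_ridge_1d and mse, and the penalty values are all assumptions made for the example. It fits a one-predictor regression with a Ridge penalty on the slope, trains on a smaller subset, and compares the error on the training and test data.

```python
import random


def fit_ridge_1d(xs, ys, penalty):
    """Fit y = a + b*x; the penalty shrinks slope b toward zero.

    With penalty = 0 this is ordinary least squares.
    The intercept is not penalized.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted (intercept, slope) pair."""
    intercept, slope = model
    errors = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Simulated data (assumption): a linear trend plus random noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Smaller training set, larger test set, as in the description above.
train_x, train_y = xs[:50], ys[:50]
test_x, test_y = xs[50:], ys[50:]

results = []
for penalty in (0.0, 10.0, 1000.0):
    model = fit_ridge_1d(train_x, train_y, penalty)
    train_error = mse(model, train_x, train_y)
    test_error = mse(model, test_x, test_y)
    results.append((penalty, train_error, test_error))
    print(penalty, round(train_error, 3), round(test_error, 3))
```

With a penalty of 0 the fit is ordinary least squares, which by construction has the lowest training error. Larger penalties shrink the slope toward zero, raising bias while lowering variance. Comparing the test errors across penalties is one way to look for the balance.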

All in all, it was a good introduction to working with machine learning methods. However, there was limited focus on the decision criteria for choosing certain methods over others, and a strong focus on the technical implementation of the methods in R. Nevertheless, the workshop was able to clarify some open questions and provide some new techniques that will help when working with larger datasets and in data analysis.

Workshop: Web Scraping and API-based Data Collection (March 2, 2023)

We hereby present the first workshop at the Institute to emerge from the methodological needs that were indicated in our institute-wide survey in December. It is titled Web Scraping and API-based Data Collection and takes place on March 2.

After an introduction to the topic by the Methods Lab team, Florian Primig (FU), Steffen Lepa (TU), Felix Gaisbauer (WI), and Lion Wedel (WI) will each present various use cases of these two data collection methods. You can find more information about the workshop on its program page.

Research Methods at the Weizenbaum Institute: Survey Results

In December 2022, the Methods Lab conducted an internal survey to map out the methodological experiences and needs at the Weizenbaum Institute. Thanks to everybody who participated! We have identified specific demands and requests at the institute. Even though there already is extensive expertise for a large variety of methods and tools, many Weizenbaum scholars also expressed a wish for additional support and knowledge-building in, for instance, the following areas:

  • Data collection: Automated observation (e.g., logging, tracking), Automated content analysis, Web Scraping, API-based data collection, and Eye-Tracking
  • Data Analysis: Network Analysis, Deep/Transfer Learning, Natural Language Processing, and Classification Methods
  • Software/Tools: R, Python, and Network analysis software

With these results as our Polaris, we in the Methods Lab have set out to develop a methods training and consulting program suited to your needs, which we will announce shortly. In the meantime, we hope the survey results serve as a launch pad for networking among the scholars at the Weizenbaum Institute.

Software Review: BRAT Rapid Annotation Tool

Our Methods Lab group lead and WI research associate, Christian Strippel, has written a software review of the BRAT rapid annotation tool, co-authored by Laura Laugwitz, Sünje Paasch-Colberg, Katharina Esau, and Annett Heft. The review is published in issue 4/2022 of Medien & Kommunikationswissenschaft. Read the article here.

In the context of interdisciplinary collaboration, especially with colleagues from computer science, communication and media research has for some time been confronted with a wide range of research software with which it has had little prior experience. In addition to programming languages such as Python or R, these include specific tools for text analysis that represent an alternative to previous variants of computer-assisted content analysis. With the brat rapid annotation tool (BRAT) we present such an alternative in this paper and review it against the background of our experience in using it. BRAT is a web-based open-source text annotation tool that was developed by an international team of computer scientists about ten years ago. The article introduces the tool and its most important features, presents examples for its use in qualitative and quantitative content analyses on the basis of three case studies, and finally evaluates it with regard to potentials and difficulties for the field.

A few notes on the Methods Lab

Dear all, 

Welcome to the digital baptism of the Methods Lab blog. This blog will keep you informed about our work, future workshops, events, and other resources and materials that may be useful to you in your upcoming research.

As a unit, we are committed to three principal tasks: training, consulting, and research. We aim to assist you with all your methodological questions, issues, and needs, no matter how large or small, and to coordinate expertise at the institute. Think of us as a hub, a metaphorical Rome, if you will, where all your methods-related queries and (non-)knowledge have a space to converge. If you have any thoughts, suggestions, or concerns, don’t hesitate to contact us – we will always lend you an ear.

At the start of December, we asked you to participate in a survey in order to give us an overview of your expertise and needs regarding data collection, analysis, and software. With the help of the results, we have created a preliminary training program tailored to your wants and needs. To everyone who participated: thank you!

On that note, we are delighted to announce that our first official workshop will take place at the beginning of March. Besides that, we have two more workshops planned for spring.

So stay tuned for further announcements about many exciting things to come! We look forward to beginning this new chapter with you.