This blog post discusses when and when not to use the official TikTok API. It also provides step-by-step instructions for a typical research scenario to help aspiring researchers get started with the API.
When and when not to use it
While it is the official route to data access, the official TikTok API is by no means the only way to collect TikTok data in an automated fashion. Depending on the research endeavour, one of the following alternatives might be the better choice:
- 4CAT + Zeeschuimer: Sensible if you want to collect limited data on one or more actors, hashtags, or keywords and/or are not confident in programming for the subsequent analysis.
- An unofficial TikTok API (pyktok or the Unofficial TikTok API in Python): Both are great projects that provide significantly more data points than the official API. However, this comes at a cost: less stability and a dependency on developers reacting to changes on TikTok’s site.
But why should you use the official TikTok API if those two options are available?
- Reliability. In theory, the official API provides more stable data access than other solutions.
- Legality. Depending on your country or home institution, unofficial data access might be a problem for legal reasons. With official data access, you are on the safer side. Please consult your institution regarding data access.
- User-level data. Other data collection methods are often superior in terms of data points on the video level (Ruz et al. 2023). However, the official TikTok API offers a set of user-level data (User info, liked videos, pinned videos, followers, following, reposted videos), which is not as conveniently available through other data collection methods.
One fundamental limitation still needs to be kept in mind. One can make only 1,000 daily requests, each containing 100 records (e.g., videos, comments) at most. This means that if one can exploit the complete 100 records per request (rarely possible), one can retrieve a maximum of 100,000 records per day.
To get started with the official TikTok Research API, visit Research API. To gain access, you need to create a developer account and submit an application form. When doing so, please record your access request in the DSA40 Data Access Tracker to contribute to an ongoing effort to track the data access that platforms provide under DSA40.
The official documentation on Research API usage is not intuitive, especially for newcomers (Documentation). Using the API from typical research programming languages such as Python or R can still pose a challenge, especially for researchers working with an API for the first time. The currently scarce availability of API guidance motivates this blog post, which aims to provide such guidance without a paywall.
The set-up – getting an access token
Before we can make calls to the API, we need to receive a bearer token. We must request this token via our developer account details before every session in which we want to scrape TikTok data. A token is valid for 2 hours.
# package import
import requests
import json
import time
import pandas as pd
from tqdm.auto import tqdm
# set your client key and secret from your dev account
client_key = "<YOUR_KEY>"
client_secret = "<YOUR_SECRET>"
# API base URL for the token endpoint
base_url = 'https://open.tiktokapis.com/v2/oauth/token/'
# set header and payload info for API request
headers = {'Content-Type': 'application/x-www-form-urlencoded',
'Cache-Control': 'no-cache'}
payload = {"client_key":client_key,
"client_secret":client_secret,
"grant_type":"client_credentials"}
# request bearer auth token
response = requests.post(base_url, headers=headers, data=payload)
# get data from response
response_as_json = response.json()
# save access token as a variable for further usage
access_token = response_as_json["access_token"]
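If the key or secret is wrong, the response contains an error description instead of an access_token, and the last line above fails with an uninformative KeyError. An optional sanity check (my addition, not part of the original flow) placed before extracting the token makes such failures easier to read:
# optional: stop early with a readable message if no token was returned
if "access_token" not in response_as_json:
    raise RuntimeError(f"Token request failed ({response.status_code}): {response_as_json}")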
The research scenario: data collection based on an account list
The research scenario for this blog post is a reality in various past (Wedel et al. 2024) and current research projects (Project: AI-augmented social scientist): based on a list of usernames (news outlets from Germany, Austria, and Switzerland), we want to collect the metadata for all videos those users posted in the year 2023.
Such data collection with the TikTok API poses two main challenges that need to be addressed: First, how can one retrieve more than 100 records per request? Second, how can one cope with the maximum time frame of 30 days per request?
For this example, we will only consider a subset of the original username list: the accounts tagesschau, zeitimbild, and 20minuten.
usernames = ["tagesschau", "zeitimbild","20minuten"]
Now we will prepare our API call.
# Define API-Endpoint URL
# The Endpoint-URL also defines the variables we will collect about the respective videos
# We include all currently available variables
url = "https://open.tiktokapis.com/v2/research/video/query/?fields=id,video_description,create_time,region_code,share_count,view_count,like_count,comment_count,music_id,hashtag_names,username,effect_ids,playlist_id,voice_to_text"
# With the header, we authenticate ourselves with our access token and indicate that we send the query as JSON
headers = {
"Authorization": f"Bearer {access_token}",
"Content-Type": "application/json"
}
# We also need to set additional query parameters:
# First, the start and end date on which the videos have been uploaded
# "YYYYMMDD" format as string, both are meant inclusive, no more than 30 days time frame possible
# We will look at a workaround to the 30-day rule in a minute.
start_date = "20240805"
end_date = "20240901"
# The default number of items returned by a request is 20.
# We set it to a maximum of 100 entries.
max_count = 100
Next, we define our actual API call as a function we can use later on.
NOTE: For the purpose of this blog post, I exclusively employ the EQ (equals) operation. This yields a data collection that is structured account by account, which I believe allows a more intuitive explanation of the API. For more efficient data collection, the IN operation is recommended (see the sketch below); you can find more information in the official documentation.
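For illustration only, a query clause using IN might look roughly like the following sketch; the field names mirror our EQ query, and the exact behaviour should be double-checked against the official documentation.
# Sketch (not used below): one request covering several accounts at once via IN
query_params_in = {
    "query": {
        "and": [
            {"operation": "IN", "field_name": "username",
             "field_values": ["tagesschau", "zeitimbild", "20minuten"]}
        ]
    },
    "max_count": max_count,
    "start_date": start_date,
    "end_date": end_date
}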
def scrape_videos_userbased(username, max_count, start_date, end_date, headers):
    # We ask the API: give us all videos uploaded between start_date and end_date
    # whose uploader matches the given username
    query_params = {
        "query": {
            "and": [
                {"operation": "EQ", "field_name": "username", "field_values": [username]}
            ]
        },
        "max_count": max_count,
        "start_date": start_date,
        "end_date": end_date
    }
    # make the call and parse the response body as JSON
    response = requests.post(url, headers=headers, data=json.dumps(query_params))
    r = response.json()
    # return the parsed response (a Python dict)
    return r
# init variable to add collected data to
data = []
for username in usernames:
    # We call our scraping function, providing all the previously defined variables
    temp_data = scrape_videos_userbased(username, max_count, start_date, end_date, headers)
    # We print some info for each request so we better understand what is going on
    print(username)
    # index where data collection stopped
    print("cursor:", temp_data["data"]["cursor"])
    # indicates if the requested sample has more entries than the cursor index
    print("has_more:", temp_data["data"]["has_more"])
    # error message / ok message
    print("error code and message:", temp_data["error"]["code"], temp_data["error"]["message"])
    print("_"*10)  # separator
    # append each request's response to our data list
    data.append(pd.DataFrame(temp_data["data"]["videos"]))
tagesschau
cursor: 47
has_more: False
error code and message: ok
__________
zeitimbild
cursor: 29
has_more: False
error code and message: ok
__________
20minuten
cursor: 100
has_more: True
error code and message: ok
__________
We print the following variables:
- username: the username for which we queried.
- cursor: the index of the last entry returned so far (i.e., the position of the last item). A value of 47 means that the last video we got back is the 47th video matched by our query.
- has_more: if this equals True, the query matches more items than were returned. This is the case for the 20minuten channel; accordingly, its cursor is 100.
- error: the error information is essential if we do not get back what we expect. In this case, everything is "ok". If we had reached the daily quota limit, we would see a corresponding error code and message here (see the sketch after this list).
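In longer-running collections it can also help to check the error code before touching the payload. A minimal guard (my addition, not part of the original loop) could look like this:
# optional guard: only process the payload if the request succeeded,
# e.g. skip when the daily quota is exhausted
if temp_data["error"]["code"] != "ok":
    print("Request failed:", temp_data["error"]["code"], temp_data["error"]["message"])
else:
    data.append(pd.DataFrame(temp_data["data"]["videos"]))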
Let’s look at the collected data as a data frame.
df = pd.concat(data, axis=0)
df.head(5)
| | view_count | comment_count | hashtag_names | id | music_id | playlist_id | share_count | username | create_time | like_count | region_code | video_description | voice_to_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 266579 | 1396 | [nachrichten, tagesschau, landtagswahlen] | 7409729730703854880 | 7409729816341007137 | 7.402634e+18 | 170 | tagesschau | 1725212148 | 11691 | DE | Stellt der Wahlgewinner immer den Regierungsch… | NaN |
| 1 | 123173 | 445 | [who] | 7408956140744133920 | 7408956690560027425 | 7.170916e+18 | 472 | tagesschau | 1725032151 | 10436 | DE | Das war diese Woche wichtig. #tagesschau #nach… | Deutschland schiebt wieder verurteilte Straftä… |
| 2 | 276185 | 226 | [nachrichten, tagesschau, demure] | 7408941600929320224 | 7408942090584869664 | NaN | 457 | tagesschau | 1725028752 | 15571 | DE | Inwiefern kann man sich die Rechte an „demure“… | NaN |
| 3 | 85314 | 907 | [sachsen, thüringen, nachrichten, tagesschau, … | 7408847583587700001 | 7408847604987071264 | 7.402634e+18 | 44 | tagesschau | 1725006763 | 6024 | DE | Am 1. September sind Landtagswahlen in Thüring… | wenn man nicht wählen geht, gibt man seine Sti… |
| 4 | 64732 | 70 | [iss, nachrichten, starliner, tagesschau, welt… | 7408603709053177120 | 7408604174046972705 | NaN | 57 | tagesschau | 1724950082 | 7013 | DE | In den acht Monaten altern die beiden 0,007 Se… | stell dir vor, du fliegst mit einem Raumschiff… |
Two challenges: time frame and query limits
Based on this first data collection attempt, two issues prevent us from retrieving all videos from the three users throughout 2023.
First, the 20minuten channel has more than 100 videos for the respective timeframe, but we only received 100. To solve this, we need to implement so-called pagination – iterating through the “pages” returned by our query to collect the remaining videos.
Second, we might want to collect data beyond the 30-day time frame. For that, we will use a custom date generator that lets us loop through a longer period in chunks of at most 30 days.
Solution for the query limit: pagination
To “paginate” through a query beyond the first page (e.g., the first 100 entries), we use three variables that each query response returns: cursor, has_more, and search_id.
We wrap our request in a while loop that checks whether has_more is still True after each query. If it is, we repeat the same query, but this time we pass along the search_id and the cursor returned by the previous response. This tells the API to skip the already collected videos and to continue directly with video number 101.
For this part, we will focus only on the username 20minuten since it is the only account needing pagination for the current time frame.
Again, we first prepare our API call. Additionally, we redefine our scraping function to handle pagination.
usernames = ["20minuten"]
url = "https://open.tiktokapis.com/v2/research/video/query/?fields=id,video_description,create_time,region_code,share_count,view_count,like_count,comment_count,music_id,hashtag_names,username,effect_ids,playlist_id,voice_to_text"
headers = {
"Authorization": f"Bearer {access_token}",
"Content-Type": "application/json"
}
start_date = "20240805"
end_date = "20240901"
max_count = 100
# We add search_id and cursor both as function arguments and as fields in the query body.
def scrape_videos_pagination(username, max_count, start_date, end_date, headers, search_id, cursor):
    query_params = {
        "query": {
            "and": [
                {"operation": "EQ", "field_name": "username", "field_values": [username]}
            ]
        },
        "fields": "id,video_description,create_time",
        "max_count": max_count,
        "start_date": start_date,
        "end_date": end_date,
        "cursor": cursor,
        "search_id": search_id
    }
    # make the call and parse the response body as JSON
    response = requests.post(url, headers=headers, data=json.dumps(query_params))
    r = response.json()
    # return the parsed response (a Python dict)
    return r
data = []
for username in usernames:
    # initial search_id, cursor, and has_more
    search_id = ""
    cursor = 0
    has_more = True
    while has_more == True:
        temp_data = scrape_videos_pagination(username, max_count, start_date, end_date, headers, search_id, cursor)
        # Again, print out to understand better what is happening in each iteration
        print(username)
        print("error code and message:", temp_data["error"]["code"], temp_data["error"]["message"])
        print("cursor:", temp_data["data"]["cursor"])
        print("has_more:", temp_data["data"]["has_more"])
        print("_"*10)
        # set cursor, has_more, and search_id based on the previous request
        cursor = temp_data["data"]["cursor"]
        has_more = temp_data["data"]["has_more"]
        search_id = temp_data["data"]["search_id"]
        # collect the retrieved data of each iteration directly as a data frame
        data.append(pd.DataFrame(temp_data["data"]["videos"]))
        # Critical: the API throws errors if one does not wait about 10 seconds
        # between repeated requests with the same search_id. However, the error
        # message does not indicate that the request frequency is the problem!
        time.sleep(10)
20minuten
error code and message: ok
cursor: 100
has_more: True
__________
20minuten
error code and message: ok
cursor: 154
has_more: False
__________
df_20_minuten = pd.concat(data, axis=0).reset_index(drop=True)
print(len(df_20_minuten))
152
Our loop ran twice for 20minuten before stopping. That way, we could collect all 152 videos published by 20minuten in the given timeframe.
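As a purely optional safety net (my addition, not something the API requires), one can drop potential duplicates on the video id after concatenating, in case a page boundary ever returns an overlapping record:
# optional: deduplicate on the video id after concatenating the pages
df_20_minuten = df_20_minuten.drop_duplicates(subset="id").reset_index(drop=True)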
Solution for the time frame limit: date sections
Now, on to the second issue: the limited time frame. To save on quota credits, we will limit our username list to a single user for this example. Let’s collect all the videos from that user for January, February, and March 2024.
First, we need a few more imports, and then we construct a function that returns a list of date pairs, each spanning no more than 30 days.
from datetime import timedelta
from datetime import datetime

def generate_date_range(start_date, end_date):
    '''
    Function to generate a list of date sections to facilitate the scraping of
    longer time periods beyond the native 30-day limit of the API.
    Input: desired start and end date as strings in "YYYYMMDD" format.
    Both dates are inclusive.
    '''
    start = datetime.strptime(start_date, "%Y%m%d")
    end = datetime.strptime(end_date, "%Y%m%d")
    current_date = start
    dates = []
    temp_end_date = current_date
    while current_date <= end:
        delta_end = end - current_date
        if delta_end.days <= 30:
            # if the distance to the actual end is 30 days or less, close the last section and stop
            dates.append((current_date.strftime("%Y%m%d"), end.strftime("%Y%m%d")))
            break
        else:
            # otherwise, close a 30-day section and move on to the next start date
            temp_end_date += timedelta(days=29)
            dates.append((current_date.strftime("%Y%m%d"),
                          temp_end_date.strftime("%Y%m%d")))
            current_date = temp_end_date + timedelta(days=1)
    return dates
Next, we generate our date range.
start_date = "20240101"
end_date = "20240331"
date_sections = generate_date_range(start_date, end_date)
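Printing the generated pairs is a quick way to verify the generator; for the chosen dates it yields four sections of at most 30 days each (the same ranges that show up in the collection output further below).
print(date_sections)
# [('20240101', '20240130'), ('20240131', '20240228'), ('20240229', '20240328'), ('20240329', '20240331')]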
Before looping through the time frames, we again set our static variables.
url = "https://open.tiktokapis.com/v2/research/video/query/?fields=id,video_description,create_time,region_code,share_count,view_count,like_count,comment_count,music_id,hashtag_names,username,effect_ids,playlist_id,voice_to_text"
headers = {
"Authorization": f"Bearer {access_token}",
"Content-Type": "application/json"
}
max_count = 100
usernames = ["20minuten"]
data = []
for date_pair in tqdm(date_sections):
    # set the start and end date for the current date section
    start_date = date_pair[0]
    end_date = date_pair[1]
    print("Current date range:", date_pair)
    print("-"*20)
    cursor = 0
    has_more = True
    search_id = ""
    while has_more == True:
        temp_data = scrape_videos_pagination(username, max_count, start_date, end_date, headers, search_id, cursor)
        # Again, print out to understand better what is happening in each iteration
        print(username)
        print("error code and message:", temp_data["error"]["code"], temp_data["error"]["message"])
        print("cursor:", temp_data["data"]["cursor"])
        print("has_more:", temp_data["data"]["has_more"])
        print("_"*10)
        # set cursor, has_more, and search_id based on the previous request
        cursor = temp_data["data"]["cursor"]
        has_more = temp_data["data"]["has_more"]
        search_id = temp_data["data"]["search_id"]
        # collect the retrieved data of each iteration directly as a data frame
        data.append(pd.DataFrame(temp_data["data"]["videos"]))
        # Critical: the API throws errors if one does not wait about 10 seconds
        # between repeated requests with the same search_id. However, the error
        # message does not indicate that the request frequency is the problem!
        print("Wait ...")
        time.sleep(10)
Current date range: ('20240101', '20240130')
--------------------
20minuten
error code and message: ok
cursor: 100
has_more: True
__________
Wait ...
20minuten
error code and message: ok
cursor: 141
has_more: False
__________
Wait ...
Current date range: ('20240131', '20240228')
--------------------
20minuten
error code and message: ok
cursor: 100
has_more: True
__________
Wait ...
20minuten
error code and message: ok
cursor: 103
has_more: False
__________
Wait ...
Current date range: ('20240229', '20240328')
--------------------
20minuten
error code and message: ok
cursor: 100
has_more: True
__________
Wait ...
20minuten
error code and message: ok
cursor: 143
has_more: False
__________
Wait ...
Current date range: ('20240329', '20240331')
--------------------
20minuten
error code and message: ok
cursor: 16
has_more: False
__________
Wait ...
As the printout and the length of the resulting data frame show, we collected all 397 videos the user published across the respective three months.
df_20_minuten_3_month = pd.concat(data, axis=0).reset_index(drop=True)
print(len(df_20_minuten_3_month))
397
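Since every re-collection costs daily quota, it is worth persisting the result right away. A minimal sketch (the file name is an arbitrary choice of mine):
# save the collected video metadata to disk so later analyses do not require re-scraping
df_20_minuten_3_month.to_csv("tiktok_20minuten_2024_q1.csv", index=False)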
Final Thoughts
The TikTok Research API is a straightforward, reliable, and convenient data source for researchers. Nevertheless, researchers depend on ByteDance to provide the promised data, and past audits have shown significant issues with the official API (Pearson et al. 2024). Data collected via the TikTok API should therefore ideally be cross-checked and verified manually or with other means of data collection.
I hope that the above helps interested researchers get acquainted with the TikTok API environment and offers solutions for two crucial challenges researchers face when working with the API: the limited time frame per request and the query limit.
Thank you for following this post, and happy scraping!