Exploring Internet Censorship Data with Topic Modeling

Internet censorship research at scale can result in massive collections of URLs. The themes in these sets of data, small and large, can be instrumental in understanding the motivations of groups enacting censorship.

Although a URL offers minimal context by itself, web page content can be used to derive meaning from a link. With insight into what information is being restricted, organizations focused on improving the accessibility of knowledge can more effectively direct their efforts towards specific topics.

Over a decade of data from GreatFire's Internet censorship measurement platform Blocky was collected and mapped. Blocky allows users to test a URL for censorship in China and see past results for the same link. Since 2013, over 15 million tests have been submitted— about 1/3 of which were found to have been subject to censorship.

In this case, topic modeling was applied to reveal the content categories most associated with public interest in censorship as they appeared over time. By illustrating this methodology, the aim is to enable censorship researchers to gain more granular insight into their data.

Explore all 132 months of data interactively on the map!
You can swap between interactive/static views of all submissions and just those which were censored.

Historical HTML was found for 30% of 12.1 million URL-timestamp pairs (mean time diff. between each snapshot and test submission date was 30 days). Web content was collected from CommonCrawl, embeddings were generated with jina-embeddings-v3, and topic modeling was performed with BERTopic.

[]

Interactive
Censored Only

Across all 11 years, the most common topics are pornography, shopping, and politics/democracy. When limiting to just the blocked subset and re-clustering, topics relating to chinese news, vpn services, and historical events (like the Tiananmen Square protests) become more prevalent.

Smaller topics neatly show the value of extracting labels from clusters rather than using generalized classification. For example, a cluster in the censored subset is specific to "falun gong", a religious movement which has long faced persecution in China.

To create an overarching model, the 132 monthly models were merged together and similar topics were combined. For large topics this sometimes resulted in the meaning of a topic becoming more generalized (or 'drifting') over time— an issue which must be worked on to avoid in the future. This project is being continuously iterated on. Evaluation methods, improved clustering parameters, outlier reduction, using LLMs to generate better labels, and multimodality are all currently being explored.

Created by Peter Whiting during an internship at the University of British Columbia, supervised by Dr. Nguyen Phong Hoang

Connecting Links: Exploring Internet Censorship Data with Topic Modeling