Artificial Intelligence in Investigative Journalism * Anna Bruno

Investigative journalism has developed an inseparable bond with data journalism and open source intelligence (OSINT), offering fertile ground to experiment with the potential of artificial intelligence (AI). Although the explosion of generative AI has opened new horizons for innovation, the use of AI in investigative journalism has so far drawn mainly on other branches of the technology.

Summary

Although there is no universally accepted definition of “artificial intelligence” (Wang, 2019; Russell, Norvig & Chang, 2022), in the journalistic context the term has been used to refer to a wide range of technologies, from document classification solutions to video or image generation. However, AI includes numerous branches, each with specific applications and challenges.

Generative artificial intelligence

Tools like ChatGPT and Google’s Gemini are based on a form of AI known as “large language models” (LLMs). These are part of the broader field of generative AI, which also includes image generation tools like DALL-E and Midjourney, video generation tools like OpenAI’s Sora and audio tools like Meta’s AudioCraft.

These models are trained on vast datasets of text, images, videos, or audio to build content by “predicting” each word, pixel, or sound during the writing, drawing, or composing process. This predictive ability gives the impression of intelligence, but does not guarantee the factual accuracy of the results. Factual inaccuracies are a recurring problem, to the point that a specific term has been coined to describe them: “hallucinations”.
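The idea of “predicting” each word can be illustrated with a toy model, far simpler than any real LLM. The sketch below simply counts which word most often follows another in a small, invented training text (the `corpus` string and the function names are assumptions made for illustration). It shows why prediction alone says nothing about truth: the model repeats whatever pattern is most frequent in its training data.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    options = follows.get(word.lower())
    if not options:
        return None
    return options.most_common(1)[0][0]

# Hypothetical training text, invented for this example.
corpus = ("the mayor denied the report . the mayor denied the claim . "
          "the mayor denied the report")
model = train_bigrams(corpus)

print(predict_next(model, "mayor"))    # denied
print(predict_next(model, "the"))      # mayor
print(predict_next(model, "senator"))  # None (never seen in training)
```

The model will always say the mayor “denied” something, whether or not that is true of any particular mayor, which is the mechanism behind hallucinations in miniature.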

Deep learning

Generative AI is itself a branch of deep learning, which in turn is part of the broader field of machine learning. Machine learning involves training an algorithm to predict, classify, or group inputs into associated clusters. As early as 2018, this form of AI was used by three-quarters of “digital leaders” in a survey for purposes ranging from content recommendation to fact-checking. In investigative journalism specifically, by 2024 two of the fifteen winners of the Pulitzer Prize were using machine learning.

The two most common ways to train an algorithm are known as supervised and unsupervised learning. Unsupervised learning allows the algorithm to group data according to any pattern it identifies, requiring very little information about the data itself. This makes it a powerful tool for identifying groups of related documents or words. Supervised learning, on the other hand, requires training data labeled in some way. This makes it more useful for classifying new data (based on the labeling of training data) or making predictions (based on relationships identified between differently labeled data).
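The difference between the two approaches can be sketched with a deliberately tiny example. The numbers are hypothetical page counts of documents, and the labels “memo” and “report” are invented. The first function groups values without being told anything about them; the second pair of functions learns from labeled examples and then classifies new values.

```python
def two_means(values, iterations=10):
    """Unsupervised: split numbers into two groups without any labels."""
    low, high = min(values), max(values)  # starting guesses for the two centers
    if low == high:
        return list(values), []
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - low) <= abs(v - high) else 1
            groups[nearest].append(v)
        # Move each center to the average of the values assigned to it.
        low = sum(groups[0]) / len(groups[0])
        high = sum(groups[1]) / len(groups[1])
    return groups

def train_centroids(labeled):
    """Supervised: learn the average value of each label from labeled examples."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(centroids, value):
    """Assign a new value to the label whose average is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

# Unsupervised: the algorithm finds two groups but cannot name them.
print(two_means([2, 3, 4, 40, 42, 45]))  # ([2, 3, 4], [40, 42, 45])

# Supervised: labeled examples let the algorithm name new cases.
centroids = train_centroids([(2, "memo"), (3, "memo"), (41, "report"), (45, "report")])
print(classify(centroids, 5))   # memo
print(classify(centroids, 38))  # report
```

The unsupervised result still needs a journalist to interpret what each group means, while the supervised result is only as good as the labels it was trained on.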

The use of AI in investigative journalism

The broad applications of AI have been leveraged in all phases of journalism, from idea generation and alerts to research, production, publication, distribution, feedback, and archiving (Hanson, Roca-Sales, Keegan & King 2017; Gibbs 2024). In investigative journalism specifically, AI is applied in nine different scenarios.

1. Identifying a problem or its scope

When the SRF data journalism team decided to investigate the use of fake followers by influencers on social media, they turned to machine learning. The team created a dataset of Instagram accounts classified as fake or real, using it to train an algorithm that could be applied to millions of followers of Swiss influencers. The results highlighted the widespread practice of purchasing fake followers, establishing the scope of the problem for the first time.

2. Reducing the complexity of an investigation

Machine learning can be described as a filtering tool. When the Atlanta Journal-Constitution wanted to find out how many doctors were allowed to continue practicing after being found guilty of sexual misconduct, it used machine learning to reduce a set of 100,000 documents to about 6,000 potentially relevant ones, which could then be checked manually.

A similar process was followed by the International Consortium of Investigative Journalists in their investigation into the harms caused by medical devices, where millions of records were filtered by an algorithm trained to identify reports in which the description of an adverse event indicated the death of a patient, but the death had been misclassified.
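The filtering described above can be sketched as a small text classifier that decides which documents deserve a human read. This is a minimal naive Bayes sketch under stated assumptions, not the method used by either newsroom: the training sentences, the labels “relevant” and “other”, and all function names are invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labeled_docs):
    """labeled_docs: list of (text, label). Returns word counts and document counts per label."""
    word_counts, doc_counts = {}, Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def score(text, label, word_counts, doc_counts):
    """Unnormalized log-probability of `label` for the text, with add-one smoothing."""
    vocab = set().union(*word_counts.values())
    total = sum(word_counts[label].values())
    logp = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in tokenize(text):
        logp += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return logp

def is_relevant(text, model):
    return score(text, "relevant", *model) > score(text, "other", *model)

# Hypothetical hand-labeled training examples.
model = train([
    ("board found physician guilty of sexual misconduct with patient", "relevant"),
    ("license reinstated after misconduct finding against doctor", "relevant"),
    ("board renewed license after routine continuing education audit", "other"),
    ("physician fined for late renewal paperwork", "other"),
])

inbox = ["doctor guilty of misconduct", "routine license renewal audit"]
to_review = [doc for doc in inbox if is_relevant(doc, model)]
print(to_review)  # ['doctor guilty of misconduct']
```

The classifier does not replace the manual check; it only shrinks the pile that reporters have to read.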

3. Modeling and predictions

The ability of machine learning to predict future events can be harnessed to identify potential problems. In the award-winning series “Waves of Abandonment” by Grist and Texas Observer, journalists used a model to predict which oil wells might be abandoned in the coming years, allowing them to write an article about the potential costs to taxpayers.

Modeling was also used by Eye on Ohio to understand which factors made some homes more likely to be foreclosed than others, and by ProPublica to identify the causes of Ebola outbreaks, highlighting Nigeria as particularly at risk.

4. Algorithmic accountability: uncovering a system

Another application of AI is “algorithmic accountability,” that is, using algorithms to shed light on other algorithms and keep them in check. ProPublica has been a pioneer in this field, using machine learning since 2012 to decode political microtargeting. The “Machine Bias” series investigated biases in software used to determine criminal sentencing, discrimination in Facebook ad tools, and auto insurance premium calculators.

Elsewhere, the collaborative project Lighthouse Reports has used similar techniques to investigate algorithmic profiling used by Dutch local governments, which the United Nations compared to “the digital equivalent of anti-fraud inspectors knocking on every door in a certain area and checking every person’s documents in an attempt to identify cases of fraud, while no such checks are carried out on those living in wealthier areas”.

5. Natural language processing and text mining

A branch of AI that often employs machine learning is natural language processing (NLP), a technology that enables computers to understand human language. Techniques such as sentiment analysis, topic modeling, and named entity extraction have been used in various investigative stories to spot suspicious patterns, identify disinformation campaigns, and more quickly navigate mentions of specific entities within large amounts of documents.

6. Large-scale data extraction, matching, and cleaning

Another appeal of AI for investigative journalism is its ability to extract data from documents, overcoming the challenge of PDF-format publications. Tools like Google’s Cloud Document AI and Deepform make it possible to automate the extraction of structured data, though not without some accuracy challenges.

Data cleaning to match information from different sources is another area of AI application. DataMade’s Entity-Focused Data System project, for instance, uses natural language processing techniques to connect people and organizations, even when they are named slightly differently in distinct datasets.

7. Satellite journalism and object detection

A specific use of machine learning has developed in the field of satellite journalism, with stories about illegal mining operations, human rights abuses, and war crimes benefiting from object detection, a technique that trains algorithms to identify objects within images. The New York Times used this approach to find evidence of Israel’s use of 2,000-pound bombs in southern Gaza.

8. Sensors and acoustic machine learning

Although less explored compared to text and images, audio provides opportunities for AI applications in investigative journalism. The non-profit Rainforest Connection pioneered the use of machine learning with acoustic monitoring in remote areas to detect illegal logging. The same technique has been used to gauge various climate change impacts and even to track the risk of infectious disease.

9. New forms of storytelling and engagement

In addition to handling vast amounts of information, AI has made it possible to explore new ways of telling and distributing the news. The Brazilian watchdog project Operation Serenata de Amor, for example, uses machine learning to monitor politicians’ spending and an automated Twitter account to engage the public and prompt responses from the politicians themselves.

Translation and summarization provided by natural language processing can also make stories accessible to new audiences. A website and browser extension called Polisis offers readable summaries of privacy policies for various services, while Natural Language Generation tools allow for personalizing or “versioning” stories for different audiences or inputs.

These are just some of the many applications of AI in investigative journalism. In the following sections, we will explore the various branches of this revolutionary technology in more detail.

Machine learning: the backbone of AI in journalism

Machine learning is a fundamental component of artificial intelligence as applied to investigative journalism. This branch of AI involves training algorithms to predict, classify, or cluster inputs into associated groups, paving the way for a wide range of journalistic applications.

There are two main approaches to training an algorithm for machine learning: supervised and unsupervised learning. Unsupervised learning allows the algorithm to autonomously detect patterns and groupings in data without requiring preliminary information. This makes it a powerful tool for identifying groups of related documents or words.

Supervised learning, on the other hand, requires the training data to be labeled in some way. This makes it more useful for classifying new data (based on the labeling of the training data) or making predictions (based on the relationships identified between differently labeled data).

Unsupervised learning to discover patterns

The unsupervised approach has been used in various investigations to identify suspicious patterns in data. Jeff Kao, for example, used topic modeling, an NLP technique based on unsupervised learning, to identify suspicious patterns in millions of comments to a public consultation on net neutrality, providing evidence of an automated disinformation campaign.

Supervised learning to classify and predict

On the other hand, supervised learning has been fundamental for projects requiring data classification or prediction. When SRF investigated fake followers of influencers, it created a dataset of Instagram accounts labeled as “fake” or “real” to train an algorithm capable of classifying millions of other followers.

Similarly, journalists from Grist and Texas Observer used supervised learning to train a model able to predict which oil wells might be abandoned in the coming years, based on historical data such as depth, location, and oil prices.

Challenges and limitations

Despite its potential, machine learning still presents challenges and limitations. Algorithms are not infallible and can produce false positives or negatives, requiring a certain degree of manual verification. Moreover, as Andy Dudfield from a British fact-checking organization points out, “algorithms don’t know what facts are. It’s a very thin world of contexts and caveats.”
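False positives and false negatives can be counted once a sample of the algorithm’s output has been checked by hand. The sketch below uses invented values: `predicted` is what a hypothetical algorithm flagged, `actual` is what a manual check found. Precision is the share of flagged items that were truly relevant; recall is the share of truly relevant items that were flagged.

```python
def confusion_counts(predicted, actual):
    """Compare algorithm output with manually verified labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["tp"] += 1   # flagged and truly relevant
        elif p:
            counts["fp"] += 1   # flagged but not relevant (false positive)
        elif a:
            counts["fn"] += 1   # relevant but missed (false negative)
        else:
            counts["tn"] += 1   # correctly ignored
    return counts

def precision_recall(counts):
    flagged = counts["tp"] + counts["fp"]
    relevant = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / relevant if relevant else 0.0
    return precision, recall

# Hypothetical sample of six documents checked by a reporter.
predicted = [True, True, True, False, False, False]
actual = [True, True, False, True, False, False]

counts = confusion_counts(predicted, actual)
print(counts)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 2}
precision, recall = precision_recall(counts)
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```

Low precision means more wasted manual checks; low recall means stories the algorithm never surfaced.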

Another challenge is the accuracy of commercial AI tools, which often require significant configuration work to adapt to specific use cases or heterogeneous types of documents.

Despite these obstacles, machine learning remains a powerful tool for investigative journalism, making it possible to tackle investigations that would have been impractical with traditional methods.

Natural language processing: exploring text as a data source

A branch of AI closely related to machine learning is natural language processing (NLP), a technology that enables computers to understand and process human language. In investigative journalism, NLP offers a wide range of techniques for extracting information and insights from large amounts of text.

1. Sentiment analysis and topic modeling

Techniques such as sentiment analysis and topic modeling have been used to detect patterns and anomalies in textual data. A Washington Post investigation used sentiment analysis to examine the language removed from drafts of an international development agency’s audits before publication, identifying over 400 negative references that had been eliminated.
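A rough idea of how such a comparison could work is sketched below. It is a word-list approach, much cruder than a trained sentiment model, and it is not the method used in the investigation mentioned above: the `NEGATIVE_WORDS` lexicon, the two sample texts and the function names are all invented for illustration.

```python
import re

# Hypothetical mini-lexicon; real sentiment tools use far larger resources.
NEGATIVE_WORDS = {"failed", "fraud", "waste", "mismanagement", "weak"}

def sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def removed_negative_sentences(draft, published):
    """Return sentences that are in the draft but not the published text and contain negative words."""
    kept = set(sentences(published))
    flagged = []
    for sentence in sentences(draft):
        if sentence in kept:
            continue
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        hits = sorted(words & NEGATIVE_WORDS)
        if hits:
            flagged.append((sentence, hits))
    return flagged

draft = ("The program met its enrollment target. "
         "Auditors found waste and weak oversight of contractors. "
         "Funding continues through next year.")
published = ("The program met its enrollment target. "
             "Funding continues through next year.")

for sentence, hits in removed_negative_sentences(draft, published):
    print(hits, "->", sentence)
# ['waste', 'weak'] -> Auditors found waste and weak oversight of contractors.
```

Exact sentence matching will miss passages that were reworded rather than cut, so a real comparison would need a more tolerant diff.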

Topic modeling, on the other hand, leverages unsupervised learning to group a set of textual data into “clusters” based on shared language. The Associated Press used this technique to sift 140,000 incident reports for school incidents involving police officers and firearms belonging to educators.

2. Named entity extraction

One of the most widespread applications of NLP in investigations is named entity extraction, which generates lists of people, places, organizations, and key concepts found in documents. This feature allows journalists to more quickly navigate mentions of specific entities, saving enormous amounts of time.
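What such a list looks like in practice can be sketched with a crude heuristic. Real named entity extraction relies on trained language models; the example below only looks for runs of two or more capitalized words, so it will miss single-word names and can pick up false matches. The file names, the sample texts and the pattern itself are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Crude stand-in for a trained model: two or more consecutive capitalized words.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def build_entity_index(documents):
    """Map each candidate entity to the list of documents that mention it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for name in sorted(set(NAME_PATTERN.findall(text))):
            index[name].append(doc_id)
    return dict(index)

# Hypothetical documents.
documents = {
    "memo_01.txt": "Payments were approved by Maria Lopez before Northgate Holdings filed its report.",
    "memo_02.txt": "An email from Maria Lopez mentions a meeting in Santa Fe.",
}

index = build_entity_index(documents)
for name in sorted(index):
    print(name, index[name])
# Maria Lopez ['memo_01.txt', 'memo_02.txt']
# Northgate Holdings ['memo_01.txt']
# Santa Fe ['memo_02.txt']
```

The resulting index is what lets a reporter jump straight to every document mentioning a given person or organization.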

3. Translation, summarization, and accessibility

In addition to facilitating text analysis, NLP can make stories accessible to new audiences through automated translation and summarization. Tools like Polisis offer readable summaries of privacy policies, while Natural Language Generation (NLG) allows articles to be personalized or “versioned” for different audiences or inputs.

Although powerful, these technologies still present challenges related to accuracy and to the ability to capture nuances of context and meaning. However, even without reaching human-level performance, they can offer significant savings in time and resources.

Extraction, matching, and cleaning of data at industrial scale

One of the main attractions of AI for investigative journalism is its ability to extract, match, and clean large quantities of data efficiently. This functionality is crucial for unlocking valuable information hidden in hard-to-access formats like PDFs.

1. Extraction of structured data from PDFs

Tools like Google’s Cloud Document AI and Deepform allow structured data to be automatically extracted from sets of PDFs, overcoming one of the most common obstacles in investigative reporting. However, as Jonathan Stray of the Deepform project points out, the accuracy of these tools is not yet perfect, requiring a certain degree of manual verification.

Another challenge is the proprietary and specialized nature of many data extraction tools, which often require significant configuration work to adapt to heterogeneous types of documents.

2. Data matching and cleaning

In addition to extraction, AI offers powerful tools for matching and cleaning datasets from different sources. An investigation into property tax evasion used Locality-Sensitive Hashing (LSH), an unsupervised learning technique that groups similar records into “buckets,” to match data on properties and utilities. This method is not infallible and can produce false positives and negatives, requiring careful configuration of the algorithm.
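The “buckets” idea can be sketched with MinHash, one common way of implementing LSH for text records. This is a minimal sketch under stated assumptions, not the implementation used in that investigation: the record texts and keys are invented, and seeding MD5 to simulate several hash functions is a choice made here for simplicity.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, k=3):
    """Character k-grams of a record after lowercasing and collapsing whitespace."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(len(norm) - k + 1, 1))}

def minhash(items, num_hashes):
    """Signature: the minimum hash value of the set under each seeded hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16) for item in items)
        for seed in range(num_hashes)
    ]

def candidate_pairs(records, bands=16, rows=2):
    """Records whose signatures agree on every row of at least one band share a bucket."""
    buckets = defaultdict(set)
    for key, text in records.items():
        signature = minhash(shingles(text), bands * rows)
        for b in range(bands):
            band = tuple(signature[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(key)
    pairs = set()
    for keys in buckets.values():
        pairs.update(combinations(sorted(keys), 2))
    return pairs

# Hypothetical property and utility records.
records = {
    "property_1": "12 Elm Street, Springfield",
    "utility_7": "12  ELM STREET, Springfield",
    "utility_9": "980 Harbor Road, Lakeview",
}
print(candidate_pairs(records))  # {('property_1', 'utility_7')}
```

Only records that land in the same bucket are compared in detail, which is what makes the approach scale. Records that are similar but not identical collide with a probability that depends on `bands` and `rows`, which is where the careful configuration comes in: looser settings produce more false positives, stricter ones more false negatives.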

Data cleaning to facilitate matching is also at the heart of the Entity-Focused Data System project by DataMade, developed in collaboration with the Atlanta Journal-Constitution. This tool helps journalists connect people and organizations even when they are named slightly differently in different datasets, using natural language processing techniques such as “probabilistic parsing.”
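The problem of slightly different names can be illustrated with a much simpler stand-in than probabilistic parsing: normalize each name, then compare the results with a string similarity score from Python’s standard library. The suffix list, the threshold of 0.9 and the sample names are assumptions chosen for illustration, and this is not how the DataMade tool works internally.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of company suffixes to ignore when comparing names.
SUFFIXES = {"inc", "llc", "ltd", "co", "corp"}

def normalize(name):
    """Lowercase, replace punctuation with spaces and drop common company suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)

def same_entity(a, b, threshold=0.9):
    """Treat two names as the same entity if their normalized forms are similar enough."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(normalize("Acme Holdings, Inc."))                        # acme holdings
print(same_entity("Acme Holdings, Inc.", "ACME HOLDINGS INC"))  # True
print(same_entity("Jon Smith", "John Smith"))                   # True
print(same_entity("Acme Holdings, Inc.", "Zenith Freight LLC")) # False
```

The second result shows the risk as well as the benefit: Jon Smith and John Smith may be two different people, so matches near the threshold still need a human check.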

While accuracy remains a challenge, these tools for data extraction, matching, and cleaning represent a significant step forward for investigative journalism, enabling valuable information to be unlocked from large amounts of raw data.

Satellite journalism and object detection

A particularly promising application of machine learning in investigative journalism is object detection, a technique that trains algorithms to identify objects within images. This technology has found important application in the emerging field of satellite journalism.

1. Investigations into human rights violations and war crimes

Stories about illegal mining operations, human rights violations, and war crimes have benefited from object detection to identify visual evidence otherwise difficult to detect. In 2023, the New York Times visual investigations team used this technique to look for evidence of Israel’s use of 2,000-pound bombs in southern Gaza, training an algorithm to identify craters created by these weapons.

After removing false positives, journalists confirmed that “hundreds of those bombs were dropped, particularly in areas that had been designated as safe for civilians… it is likely that more bombs than those captured by our reporting were used.”

2. Environmental monitoring and safety

Machine learning can pick out signals in sound as well as in images, and its use is not limited to investigations involving armed conflicts. The nonprofit organization Rainforest Connection has pioneered the use of acoustic sensors and machine learning in remote areas to detect illegal logging, and the same approach has been used to monitor various impacts of climate change and even to track the risk of infectious diseases.

These applications demonstrate the potential of AI to unlock new forms of investigative journalism based on multimedia data, bringing to light stories that would otherwise have remained hidden.

New forms of storytelling and audience engagement

In addition to its applications in data collection and analysis, AI is paving the way for new forms of journalistic storytelling and audience engagement. Tools like Natural Language Generation (NLG) allow articles to be personalized or “versioned” based on different audiences or inputs, while natural language processing facilitates automatic translation and summarization, making stories accessible to new audiences.

1. Audience engagement through social media

An innovative example of these possibilities is the Brazilian project Operation Serenata de Amor, which uses machine learning to monitor politicians’ expenses and an automated Twitter account to engage the public and elicit responses from the politicians themselves. As stated by one of the journalists involved, “we are living in an era where parliamentarians are arguing with robots on Twitter. We have made democracy more accessible”.

2. Tools for accessibility and understanding

Projects like Polisis, a website and browser extension that offer readable summaries of the privacy policies of various services, show how AI can be used to inform and empower the public on complex issues. While the most advanced NLG tools could one day enable the automatic generation of entire stories based on data or multimedia inputs, for now these tools are mainly used to customize and adapt content for different audiences, paving the way for new forms of tailored journalism. Although these applications raise legitimate ethical and accuracy concerns, they also represent an opportunity for journalism to evolve and remain relevant in an era of information overload and fragmented attention.

Algorithmic accountability: unmasking AI systems

One of the most important applications of AI in investigative journalism is “algorithmic accountability,” that is, using algorithms to shed light on algorithms themselves and keep them in check. With the increased use of AI systems by governments, companies, and other organizations, this form of watchdog journalism has become crucial to protect citizens’ rights and promote transparency.

1. Pioneers of algorithmic accountability

ProPublica was one of the first organizations to explore this field, using machine learning since 2012 to decode political microtargeting and message customization based on recipients’ demographic characteristics. Their award-winning series “Machine Bias” subsequently investigated biases in software used to determine criminal sentencing, discrimination in Facebook’s advertising tools, and in auto insurance premium calculators.

2. Investigating algorithmic profiling

Elsewhere, the collaborative project Lighthouse Reports has used similar techniques to investigate algorithmic profiling by local Dutch governments, which the United Nations has compared to “the digital equivalent of fraud inspectors going door to door in a certain area and checking every person’s documents in an attempt to identify fraud cases, while no such checks are carried out on people living in wealthier areas”.

These investigations have highlighted how algorithms can encode and perpetuate biases and discrimination, underlining the importance of greater transparency and accountability in the use of these systems.

3. Approaches to algorithmic accountability

As highlighted by a report from the German radio station Bayerischer Rundfunk, there are several approaches to investigating AI, including the use of freedom of information laws, analysis of automated system outputs, data analysis, and the use of interviews and documents. Regardless of the method used, algorithmic accountability represents a crucial form of watchdog journalism in the age of AI, helping to protect citizens’ rights and promote greater transparency in the use of these powerful systems.
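One of these approaches, the analysis of automated system outputs, often starts with a simple question: is one group flagged more often than another? The sketch below computes flag rates per group from a set of recorded outputs. The records and the group names are invented for illustration and do not come from any of the investigations cited here.

```python
from collections import defaultdict

def flag_rates(records):
    """records: iterable of (group, flagged) pairs taken from a system's outputs."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, was_flagged in records:
        totals[group] += 1
        flagged[group] += int(was_flagged)
    return {group: flagged[group] / totals[group] for group in totals}

# Hypothetical outputs of a risk-scoring system for two districts.
outputs = [
    ("district_a", True), ("district_a", True), ("district_a", True), ("district_a", False),
    ("district_b", True), ("district_b", False), ("district_b", False), ("district_b", False),
]

rates = flag_rates(outputs)
print(rates)  # {'district_a': 0.75, 'district_b': 0.25}
print(rates["district_a"] / rates["district_b"])  # 3.0
```

A gap like this is a lead rather than a finding: it still has to be tested against legitimate explanations through documents, interviews and further data analysis.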

Challenges and ethical considerations in the use of AI

Despite its many possibilities, the use of AI in investigative journalism also raises important challenges and ethical considerations that need to be addressed.

1. Accuracy and transparency

One of the main concerns relates to the accuracy of AI tools, which often produce errors or unwanted “hallucinations.” This issue is particularly relevant for generative AI, which can produce potentially misleading or even false multimedia content. To address this challenge, it is essential that journalists are transparent about their use of AI and the limitations of the tools employed, carefully verifying results and clearly communicating to the public the process followed.

2. Bias and discrimination

Another concern is the risk that AI algorithms encode and perpetuate biases and discrimination present in training data or in the choices of their designers. This issue has been highlighted by several investigations into algorithmic accountability, underlining the importance of greater transparency and accountability in the use of these systems.

3. Ethical and professional considerations

Finally, the use of AI in journalism raises important ethical and professional issues, such as respect for privacy, protection of sources, and the impact on human relationships and trust in journalism. For example, the use of automated social accounts to engage the public, as in the case of Operation Serenata de Amor, could raise concerns about the transparency and authenticity of interaction. Similarly, automatic generation of personalized stories could undermine public trust in journalism if not properly managed.

To address these challenges, it is essential that journalists adhere to solid ethical and professional principles, promoting transparency, respect for privacy, and the protection of sources. At the same time, an open and inclusive debate is necessary on the ethical implications of using AI in journalism, involving all interested parties.

Conclusions: a future of opportunities and challenges

Artificial intelligence is revolutionizing investigative journalism, offering new tools and opportunities to uncover and tell important stories. From analyzing large amounts of data to object detection in satellite images, from natural language processing to algorithmic accountability, AI is paving the way for new forms of data-driven and evidence-based journalism. At the same time, the adoption of these technologies raises important challenges and ethical considerations that must be addressed carefully and responsibly. Accuracy, transparency, respect for privacy, and the protection of sources must remain top priorities for journalists leveraging the potential of AI.

Despite these challenges, the future of investigative journalism appears to be closely tied to artificial intelligence. As these technologies evolve, opening new avenues for storytelling and audience engagement, it will be essential for journalists to stay updated and adapt to these changes. Through a combination of technical skills, ethical rigor, and a passion for telling important stories, investigative journalists can fully harness the potential of AI, helping to foster a more informed, transparent, and accountable society.
Source: Onlinejournalismblog