Investigative journalism has developed an indissoluble link with data journalism and open-source intelligence (OSINT), offering fertile ground for experimenting with the potential of artificial intelligence (AI). Although the generative AI explosion has opened new horizons for innovation, the use of AI in investigative journalism has so far focused mainly on other areas.
Summary
Although there is no universally accepted definition of "artificial intelligence" (Wang, 2019; Russell, Norvig & Chang, 2022), in the journalistic context the term has been used to indicate a wide range of technologies, from document-classification solutions to the generation of videos or images. AI, however, comprises numerous branches, each with specific applications and challenges.
Generative artificial intelligence
Tools like ChatGPT and Google's Gemini are based on a form of AI known as large language models (LLMs). These are part of the wider field of generative AI, which also includes image-generation tools such as DALL-E and Midjourney, video-generation tools such as OpenAI's Sora, and audio tools like Meta's AudioCraft.
These models are trained on vast datasets of text, images, video or audio, and build content by "predicting" each word, pixel or sound during the writing, drawing or composition process. It is this predictive capacity that confers an appearance of intelligence, but it does not guarantee the factual accuracy of the results. Factual inaccuracies are such a recurring problem that a specific term has been coined to describe them: "hallucinations".
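The word-by-word prediction principle can be illustrated with a deliberately tiny sketch: a bigram model over a made-up corpus that always emits the most frequent next word seen in training. Real LLMs are neural networks trained on billions of documents; this toy shows only the predict-the-next-token idea.

```python
from collections import defaultdict, Counter

# Invented toy corpus, purely for illustration.
corpus = "the data shows the data and the data supports a claim".split()

# Count which word follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation seen in training.

    Assumes `word` appeared (non-finally) in the training corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))
```

Generating a whole sentence is just repeating this prediction step, which is why fluency does not imply factual accuracy: the model only knows what tends to come next.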
Deep learning
Generative AI is in turn a branch of deep learning, which is itself part of the wider field of machine learning. Machine learning involves training an algorithm to predict, classify or group inputs into related categories. Already in 2018, this form of AI was used by three-quarters of surveyed "digital leaders", for purposes ranging from content recommendation to fact-checking. In investigative journalism specifically, two of the fifteen Pulitzer Prize winners had used machine learning.
The two most common ways to train an algorithm are known as supervised and unsupervised learning. Unsupervised learning lets the algorithm group data based on whatever patterns it identifies, requiring very little information about the data. This makes it a powerful tool for identifying groups of related documents or words. Supervised learning, by contrast, requires training data that has been labeled in some way. This makes it more useful for classifying new data (based on similarity to the training data) or making forecasts (based on the relationships identified between differently labeled data).
The use of AI in investigative journalism
AI applications have been exploited at every stage of journalism, from the generation of ideas and alerts to research, production, publication, distribution, feedback and archiving (Hansen, Roca-Sales, Keegan and King 2017; Gibbs 2024). In investigative journalism specifically, AI finds applications in nine scenarios.
1. Identify a problem or its scope
When the data journalism team at SRF decided to investigate the use of fake followers by influencers on social media, it turned to machine learning. The team created a dataset of Instagram accounts classified as fake or real, and used it to train an algorithm that could be applied to the millions of followers of Swiss influencers. The results highlighted how widespread the practice of buying fake followers was, establishing the scope of the problem for the first time.
2. Reduce the complexity of an investigation
Machine learning can be thought of as a filtering tool. When the Atlanta Journal-Constitution wanted to establish how many doctors were allowed to keep practicing after being found guilty of sexual misconduct, it used machine learning to reduce a set of 100,000 documents to about 6,000 potentially relevant ones, which could then be checked manually.
A similar process was followed by the International Consortium of Investigative Journalists in its investigation into the harm caused by medical devices: millions of records were filtered by an algorithm trained to identify reports in which the description of an adverse event indicated the death of a patient, but the death had been incorrectly classified.
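Document triage of this kind can be sketched with a minimal Naive Bayes text classifier that ranks records so reporters review the likely-relevant ones first. The training examples below are invented for illustration; neither newsroom's actual pipeline is shown here.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs. Returns per-label word counts."""
    counts = {"relevant": Counter(), "irrelevant": Counter()}
    for text, label in labeled_docs:
        counts[label].update(tokenize(text))
    return counts

def score(text, counts):
    """Log-odds that a document is relevant, with add-one smoothing."""
    vocab = set(counts["relevant"]) | set(counts["irrelevant"])
    total = {lbl: sum(c.values()) for lbl, c in counts.items()}
    s = 0.0
    for w in tokenize(text):
        p_rel = (counts["relevant"][w] + 1) / (total["relevant"] + len(vocab))
        p_irr = (counts["irrelevant"][w] + 1) / (total["irrelevant"] + len(vocab))
        s += math.log(p_rel / p_irr)
    return s

# Made-up labeled examples standing in for a reporter-built training set.
training = [
    ("patient death reported after device failure", "relevant"),
    ("device malfunction caused patient injury and death", "relevant"),
    ("routine maintenance completed no issues", "irrelevant"),
    ("shipment delayed paperwork updated", "irrelevant"),
]
model = train(training)
docs = ["device failure led to patient death", "paperwork for routine shipment"]
ranked = sorted(docs, key=lambda d: score(d, model), reverse=True)
print(ranked[0])
```

The payoff is the same as in the stories above: instead of reading every record, journalists manually verify only the documents the classifier pushes to the top of the list.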
3. Modeling and forecasting
Machine learning's ability to predict future events can be used to identify potential problems. In the award-winning series "Waves of Abandonment" by Grist and the Texas Observer, journalists used a model to predict which oil wells were likely to be abandoned in the coming years, allowing them to write an article on the potential costs for taxpayers.
Modeling was also used by Eye on Ohio to understand which factors made the foreclosure of some homes more likely than others, and by The Advocate to identify the causes of Ebola epidemics, highlighting Nigeria as particularly at risk.
4. Algorithmic accountability: unmasking a system
Another application of AI is "algorithmic accountability": the use of algorithms to shed light on algorithms themselves and keep them in check. ProPublica has been a pioneer in this field, using machine learning since 2012 to decode political microtargeting. Its series "Machine Bias" investigated bias in software used to determine criminal sentences, discrimination in Facebook's advertising tools, and bias in car insurance premium calculators.
Elsewhere, the collaborative project Lighthouse Reports used similar techniques to investigate the algorithmic profiling used by Dutch local governments, which the United Nations compared to "the digital equivalent of anti-fraud inspectors who knock on every door in a certain area and check each person's documents in an attempt to identify cases of fraud, while no such scrutiny is applied to those living in better-off areas".
5. Natural language processing and text mining
A branch of AI that often relies on machine learning is natural language processing (NLP), a set of technologies that allow computers to understand human language. Techniques such as sentiment analysis, topic modeling and named-entity extraction have been used in various journalistic investigations to identify suspicious patterns, detect disinformation campaigns, and navigate mentions of specific entities in large document sets more quickly.
6. Extracting, combining and cleaning data at scale
Another attraction of AI for investigative journalism is its ability to extract data from documents, overcoming the obstacle of material published in PDF format. Tools such as Google's Cloud Document AI and Deepform make it possible to automate the extraction of structured data, albeit with challenges around accuracy.
Cleaning data in order to combine information from different sources is another area where AI is applied. DataMade's Entity-Focused Data System project, for example, uses natural language processing techniques to connect people and organizations even when they are named slightly differently in different datasets.
7. Satellite journalism and object detection
A distinctive use of machine learning has developed in the field of satellite journalism, with stories on illegal mining operations, human rights violations and war crimes benefiting from object detection, a technique that trains algorithms to identify objects within images. The New York Times used this approach to find evidence of the use of 2,000-pound bombs by Israel in southern Gaza.
8. Acoustic sensors and machine learning
Although less explored than text and images, audio offers opportunities for applying AI in investigative journalism. The non-profit organization Rainforest Connection pioneered the use of machine learning with acoustic monitoring in remote areas to detect illegal logging. The same technique has been used to measure various impacts of climate change and even to track the risk of infectious diseases.
9. New forms of storytelling and engagement
Beyond managing large quantities of information, AI has made it possible to explore new ways of telling and distributing the news. The Brazilian watchdog project Operation Serenata de Amor, for example, uses machine learning to monitor politicians' expenses and an automated Twitter account to engage the public and solicit responses from the politicians themselves.
The translation and summarization offered by natural language processing can also make stories accessible to new audiences. A website and browser extension called Polisis offers readable summaries of the privacy policies of various services, while natural language generation tools make it possible to customize or "version" stories for different audiences or inputs.
These are just some of the many applications of AI in investigative journalism. The next sections explore the various branches of this revolutionary technology in more detail.
Machine learning: the backbone of AI in journalism
Machine learning is a fundamental component of artificial intelligence as applied to investigative journalism. This branch of AI involves training algorithms to predict, classify or group inputs into related categories, opening the way to a wide range of journalistic applications.
There are two main approaches to training a machine learning algorithm: supervised and unsupervised learning. Unsupervised learning allows the algorithm to independently identify patterns and groupings in the data, without requiring prior information. This makes it a powerful tool for identifying groups of related documents or words.
Supervised learning, on the other hand, requires that the training data be labeled in some way. This makes it more useful for classifying new data (based on similarity to the training data) or making predictions (based on the relationships identified between differently labeled data).
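The contrast between the two approaches can be sketched on toy one-dimensional data, a single hypothetical numeric feature per account; real projects would use richer features and a library such as scikit-learn.

```python
from statistics import mean

# Invented toy data: one numeric feature per account.
data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]

# Unsupervised: 2-means clustering discovers two groups with no labels at all.
def two_means(points, iters=10):
    c1, c2 = min(points), max(points)  # start centroids at the extremes
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = mean(g1), mean(g2)
    return g1, g2  # groups nearest each final centroid

low, high = two_means(data)

# Supervised: a nearest-centroid classifier needs labeled training examples.
labeled = [(1.1, "real"), (0.8, "real"), (8.1, "fake"), (7.7, "fake")]
centroids = {
    label: mean(v for v, l in labeled if l == label)
    for label in {"real", "fake"}
}

def classify(x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))
```

The unsupervised pass reveals that two groups exist; only the supervised pass, trained on labeled examples, can say which group is which.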
Unsupervised learning to discover patterns
The unsupervised approach has been used in various investigations to identify suspicious patterns in data. Jeff Kao, for example, used topic modeling, an unsupervised NLP technique, to identify suspicious patterns in millions of comments submitted to a public consultation on net neutrality, providing evidence of an automated disinformation campaign.
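As a simplified stand-in for topic modeling (real investigations typically use methods such as LDA), documents can be grouped by vocabulary overlap, which is one way near-duplicate astroturfed comments surface; the comments below are invented.

```python
# Group documents whose word sets overlap heavily (Jaccard similarity),
# an illustrative, much cruder cousin of topic modeling.
def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster(docs, threshold=0.5):
    groups = []
    for doc in docs:
        for group in groups:
            # Join the first group whose representative is similar enough.
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])  # no match: start a new group
    return groups

comments = [
    "repeal the rules they hurt consumers",
    "repeal the rules because they hurt consumers",
    "keep net neutrality protections in place",
]
groups = cluster(comments)
print(len(groups))
```

A sudden cluster of thousands of near-identical comments, as in the net neutrality case, is exactly the kind of signal this grouping surfaces.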
Supervised learning to classify and predict
Supervised learning, by contrast, has been fundamental for projects requiring the classification or prediction of data. When SRF investigated influencers' fake followers, it created a dataset of Instagram accounts labeled as "fake" or "real" to train an algorithm capable of classifying millions of other followers.
Likewise, the journalists at Grist and the Texas Observer used supervised learning to train a model capable of predicting which oil wells were likely to be abandoned in the coming years, on the basis of historical data such as depth, location and oil prices.
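A prediction of this kind can be illustrated with a simple least-squares trend line; the figures below are made up, and this is not the actual Grist/Texas Observer model, which drew on more features and more sophisticated methods.

```python
# Fit a straight line y = slope * x + intercept by least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

ages = [5, 10, 15, 20, 25]                       # well age in years (invented)
abandoned_rate = [0.02, 0.05, 0.11, 0.18, 0.24]  # share later abandoned (invented)

slope, intercept = fit_line(ages, abandoned_rate)

def predicted_risk(age):
    return slope * age + intercept
```

Extrapolating such a fitted trend to wells not yet abandoned is what lets a newsroom put a number on future taxpayer costs before they are incurred.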
Challenges and limitations
Despite its potential, machine learning comes with challenges and limitations. Algorithms are not infallible and can produce false positives or negatives, requiring a degree of manual verification. Furthermore, as Andy Dudfield of a British fact-checking organization points out, "Algorithms do not know what facts are. It is a very nuanced world of context and caveats".
Another challenge is the accuracy of commercial AI tools, which often require significant configuration work to adapt to specific use cases or heterogeneous document types.
Despite these obstacles, machine learning remains a powerful tool for investigative journalism, making it possible to take on challenges that would be impractical with traditional methods.
Natural language processing: exploring text as a data source
A branch of AI closely linked to machine learning is natural language processing (NLP), a technology that allows computers to understand and process human language. In investigative journalism, NLP offers a wide range of techniques for extracting information and insights from large quantities of text.
1. Sentiment analysis and topic modeling
Techniques such as sentiment analysis and topic modeling have been used to identify patterns and anomalies in textual data. An investigation by the Washington Post used sentiment analysis to compare the language removed from an international development agency's audit drafts before publication, identifying more than 400 deleted negative references.
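A minimal lexicon-based sentiment scorer illustrates the idea; real work would use a far larger lexicon or a trained model, and the word lists and texts below are invented.

```python
# Tiny hand-built lexicons, purely for illustration.
NEGATIVE = {"failure", "fraud", "waste", "corruption", "ineffective"}
POSITIVE = {"success", "effective", "improved", "progress"}

def sentiment(text):
    """Positive-minus-negative word count: a crude sentiment score."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

draft = "audit found fraud waste and ineffective programs"
published = "audit found programs showed progress"
print(sentiment(draft), sentiment(published))
```

Comparing the scores of draft and published versions, as the Post's investigation did at scale, makes systematic softening of language measurable rather than anecdotal.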
Topic modeling, on the other hand, uses unsupervised learning to group a set of texts into "clusters" based on shared language. The Associated Press used this technique to identify, among 140,000 incident reports, incidents in schools involving the firearms of police officers and educators.
2. Named-entity extraction
One of the most popular applications of NLP in investigations is named-entity extraction, which generates lists of the people, places, organizations and key concepts present in a set of documents. This capability allows journalists to navigate mentions of specific entities far more quickly, saving enormous amounts of time.
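Proper named-entity extraction relies on trained models (for example spaCy); as a rough illustration of the kind of output produced, a regular expression can pull out capitalized multi-word phrases as candidate entities for a reporter to skim. The sentence below is invented.

```python
import re

def candidate_entities(text):
    """Return capitalized multi-word phrases as rough entity candidates.

    A crude stand-in for real NER: no entity types, and single-word
    names or sentence-initial phrases are missed or miscaught."""
    return re.findall(r"\b(?:[A-Z][a-z]+\s)+[A-Z][a-z]+\b", text)

text = "The contract between Acme Holdings and John Smith was signed in New York."
print(candidate_entities(text))
```

Even this crude pass turns a wall of text into a scannable list of names and places, which is the time-saving effect the technique delivers at document-set scale.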
3. Translation, summarization and accessibility
Beyond facilitating text analysis, NLP can make stories accessible to new audiences through translation and automatic summarization. Tools like Polisis offer readable summaries of privacy policies, while natural language generation (NLG) makes it possible to customize or "version" articles for different audiences or inputs.
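"Versioning" in its simplest form is template filling: the same data point rendered differently for different audiences. The sketch below is a hedged stand-in for commercial NLG systems, with invented data and template names.

```python
# Invented example data for one story.
story_data = {"city": "Springfield", "violations": 42, "year": 2023}

# One template per audience; real NLG systems generate far richer variation.
TEMPLATES = {
    "local":    "{city} inspectors logged {violations} safety violations in {year}.",
    "national": "In {year}, {violations} safety violations were recorded in {city} alone.",
}

def version(audience, data):
    return TEMPLATES[audience].format(**data)

print(version("local", story_data))
```

Applied across hundreds of cities or data slices, the same template machinery produces a localized story for each audience from a single dataset.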
Although powerful, these technologies still face challenges in accuracy and in capturing nuances of context and meaning. Even without reaching human performance, however, they can offer considerable savings of time and resources.
Extracting, combining and cleaning data at industrial scale
One of the main attractions of AI for investigative journalism is its ability to extract, combine and clean large quantities of data efficiently. This capability is crucial for unlocking valuable information hidden in hard-to-access formats such as PDFs.
1. Extracting structured data from PDFs
Tools such as Google's Cloud Document AI and Deepform make it possible to automate the extraction of structured data from sets of PDFs, overcoming one of the most common obstacles in journalistic investigations. However, as Jonathan Stray of the Deepform project points out, the accuracy of these tools is not yet perfect, requiring a degree of manual verification.
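Once a PDF's text has been extracted (for example with a library such as pdfplumber), rule-based field extraction can be sketched with regular expressions; the document snippet and field patterns below are invented, and trained-model tools like Cloud Document AI replace such hand-written patterns with learned ones.

```python
import re

# Invented text, standing in for output already extracted from a PDF page.
page_text = """
Contract ID: 2023-0147
Advertiser: Committee for Better Roads
Gross Amount: $12,500.00
"""

# Hypothetical field names and patterns for this illustrative layout.
FIELDS = {
    "contract_id": r"Contract ID:\s*(\S+)",
    "advertiser":  r"Advertiser:\s*(.+)",
    "amount":      r"Gross Amount:\s*\$([\d,.]+)",
}

def extract(text):
    """Pull one record of labeled fields out of semi-structured text."""
    record = {}
    for field, pattern in FIELDS.items():
        m = re.search(pattern, text)
        record[field] = m.group(1).strip() if m else None
    return record

print(extract(page_text))
```

The brittleness of such hand-written patterns across heterogeneous layouts is exactly why trained extraction models, and the configuration work they demand, exist.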
Another challenge is the proprietary and specialized nature of many data-extraction tools, which often require significant configuration work to adapt to heterogeneous document types.
2. Data matching and cleaning
Beyond extraction, AI offers powerful tools for combining and cleaning datasets from different sources. An investigation into property tax evasion used locality-sensitive hashing (LSH), an unsupervised learning technique that groups similar records into "buckets", to match data on properties and their owners. The method is not infallible and can produce false positives and negatives, requiring careful configuration of the algorithm.
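A from-scratch sketch of MinHash-based LSH shows the mechanics: records whose character shingles overlap heavily tend to land in the same bucket, so only bucket-mates need pairwise comparison. A real investigation would use a tested library such as datasketch; all parameters and records below are illustrative.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-character substrings, lowercased."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_hashes=20):
    # One "hash function" per seed; the signature keeps the minimum per seed.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def lsh_buckets(records, bands=5, num_hashes=20):
    rows = num_hashes // bands
    buckets = {}
    for name in records:
        sig = minhash(shingles(name), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(name)
    # Keep only buckets that pair up more than one record.
    return [names for names in buckets.values() if len(names) > 1]

near_dupes = lsh_buckets([
    "123 Main St Springfield",
    "123 Main Street Springfield",
    "9 Oak Ave Shelbyville",
])  # similar addresses are likely (not guaranteed) to share a bucket
```

The band/row trade-off is the "careful configuration" the text mentions: more bands catch more true matches but also more false positives.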
Data cleaning to facilitate matching is also at the heart of DataMade's Entity-Focused Data System project, created in collaboration with the Atlanta Journal-Constitution. This tool helps journalists connect people and organizations even when they are named slightly differently in different datasets, using natural language processing techniques such as "probabilistic parsing".
Although accuracy remains a challenge, these extraction, matching and cleaning tools represent a significant step forward for investigative journalism, making it possible to unlock valuable information hidden in large quantities of raw data.
Satellite journalism and object detection
A particularly promising application of machine learning in investigative journalism is Object Detection, a technique that trains algorithms to identify objects within images. This technology has found an important application in the emerging field of satellite journalism.
1. Investigating human rights violations and war crimes
Stories on illegal mining operations, human rights violations and war crimes have benefited from object detection to identify visual evidence that would otherwise be difficult to spot. In 2023, the New York Times' Visual Investigations team used this technique to look for evidence of the use of 2,000-pound bombs by Israel in southern Gaza, training an algorithm to identify the craters created by these bombs.
After removing false positives, the journalists confirmed that "hundreds of those bombs were dropped, particularly in areas that had been designated as safe for civilians ... it is likely that even more bombs were used than those captured by our reporting".
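Trained convolutional networks do the real work in such investigations; the underlying idea can nonetheless be caricatured in a few lines: slide a window over a grid of pixel intensities and flag windows whose mean brightness exceeds a threshold, as a stand-in for a learned "crater" detector. The grid and threshold below are invented.

```python
# Invented 6x6 "image" of pixel intensities with two bright blobs.
grid = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 0, 0, 0, 8, 8],
    [0, 0, 0, 0, 8, 8],
    [0, 0, 0, 0, 0, 0],
]

def detect(image, window=2, threshold=7):
    """Return (row, col) of each window whose mean intensity >= threshold."""
    hits = []
    for r in range(len(image) - window + 1):
        for c in range(len(image[0]) - window + 1):
            patch = [image[r + i][c + j]
                     for i in range(window) for j in range(window)]
            if sum(patch) / len(patch) >= threshold:
                hits.append((r, c))
    return hits

print(detect(grid))
```

A real detector learns its "threshold" from labeled examples instead of a hand-picked number, and the manual removal of false positives described above remains an essential final step.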
2. Environmental monitoring and safety
Object detection is not limited to investigations of armed conflicts. The non-profit organization Rainforest Connection pioneered the use of acoustic sensors and machine learning in remote areas to detect illegal logging, but also to monitor various impacts of climate change and even to track the risk of infectious diseases.
These applications demonstrate AI's potential to unlock new forms of investigative journalism based on multimedia data, bringing to light stories that would otherwise remain hidden.
New forms of storytelling and public engagement
Beyond its applications in data collection and analysis, AI is opening the way to new forms of journalistic storytelling and public engagement. Tools such as natural language generation (NLG) make it possible to customize or "version" articles for different audiences or inputs, while natural language processing facilitates translation and automatic summarization, making stories accessible to new audiences.
1. Engaging the public through social media
An innovative example of this potential is the Brazilian project Operation Serenata de Amor, which uses machine learning to monitor politicians' expenses and an automated Twitter account to engage the public and solicit responses from the politicians themselves. As one of the journalists involved put it, "We are living in an era in which parliamentarians argue with robots on Twitter. We have made democracy more accessible".
2. Tools for accessibility and understanding
Projects such as Polisis, a website and browser extension offering readable summaries of the privacy policies of various services, show how AI can be used to inform and empower the public on complex issues. While the most advanced NLG tools may one day make it possible to generate entire stories automatically from data or multimedia inputs, at the moment these tools are used mainly to customize and adapt content for different audiences, opening the way to new forms of personalized journalism. Although these applications raise legitimate concerns about ethics and accuracy, they also represent an opportunity for journalism to evolve and remain relevant in an era of information overload and fragmented attention.
Algorithmic accountability: unmasking AI systems
One of the most important applications of AI in investigative journalism is "algorithmic accountability", that is, the use of algorithms to shed light on algorithms themselves and keep them in check. With the growing use of AI systems by governments, companies and other organizations, this form of watchdog journalism has become crucial for protecting citizens' rights and promoting transparency.
1. Pioneers of algorithmic accountability
ProPublica was one of the first organizations to explore this field, using machine learning since 2012 to decode political microtargeting and the personalization of messages based on recipients' demographic characteristics. Its award-winning series "Machine Bias" went on to investigate bias in software used to determine criminal sentences, discrimination in Facebook's advertising tools, and bias in car insurance premium calculators.
2. Investigating algorithmic profiling
Elsewhere, the collaborative project Lighthouse Reports has used similar techniques to investigate the algorithmic profiling used by Dutch local governments, which the United Nations compared to "the digital equivalent of anti-fraud inspectors who knock on every door in a certain area and check each person's documents in an attempt to identify cases of fraud, while no such scrutiny is applied to those living in better-off areas".
These investigations have highlighted how algorithms can encode and perpetuate bias and discrimination, underlining the importance of greater transparency and accountability in the use of these systems.
3. Approaches to algorithmic accountability
As a report by the German broadcaster Bayerischer Rundfunk points out, there are several approaches to investigating AI, including the use of access-to-information laws, analysis of the output of automated systems, data analysis, and the use of interviews and documents. Whatever the method used, algorithmic accountability represents a crucial form of watchdog journalism in the AI era, helping to protect citizens' rights and to promote greater transparency in the use of these powerful systems.
Challenges and ethical considerations in the use of AI
Despite its many potential benefits, the use of AI in investigative journalism also raises important challenges and ethical considerations that must be addressed.
1. Accuracy and transparency
One of the main concerns is the accuracy of AI tools, which often produce unexpected errors or "hallucinations". This problem is particularly relevant for generative AI, which can produce misleading or even false multimedia content. To deal with this challenge, it is essential that journalists be transparent about their use of AI and about the limits of the tools used, carefully checking the results and clearly communicating the process followed to the public.
2. Bias and discrimination
Another concern is the risk that AI algorithms encode and perpetuate biases and discrimination present in their training data or in their designers' choices. This problem has been highlighted by several algorithmic accountability investigations, underlining the importance of greater transparency and accountability in the use of these systems.
3. Ethical and deontological considerations
Finally, the use of AI in journalism raises important ethical and deontological questions, such as respect for privacy, the protection of sources, and the impact on human relationships and trust in journalism. For example, the use of automated social media accounts to engage the public, as in the case of Operation Serenata de Amor, could raise concerns about the transparency and authenticity of the interaction. Likewise, the automatic generation of personalized stories could put public trust in journalism at risk if not managed correctly.
To face these challenges, it is essential that journalists adhere to solid ethical and deontological principles, promoting transparency, respect for privacy and the protection of sources. At the same time, an open and inclusive debate on the ethical implications of the use of AI in journalism is needed, involving all interested parties.
Conclusions: a future of opportunities and challenges
Artificial intelligence is revolutionizing investigative journalism, offering new tools and opportunities to uncover and tell important stories. From the analysis of large quantities of data to object detection in satellite images, from natural language processing to algorithmic accountability, AI is opening the way to new forms of data- and evidence-driven journalism. At the same time, the adoption of these technologies raises important challenges and ethical considerations that must be handled with care and responsibility. Accuracy, transparency, respect for privacy and the protection of sources must remain absolute priorities for journalists who exploit the potential of AI.
Despite these challenges, the future of investigative journalism appears closely linked to artificial intelligence. As these technologies evolve, opening the way to new forms of storytelling and public engagement, it will be essential for journalists to stay up to date and adapt to these changes. Through a combination of technical skill, ethical rigor and a passion for telling important stories, investigative journalists will be able to exploit the full potential of AI, helping to promote a more informed, transparent and responsible society.
Source: Online Journalism Blog