This article is published in collaboration with the International Journalists’ Network (IJNet).
Artificial intelligence has advanced significantly across many fields. It has become indispensable in data journalism because of its powerful capabilities in collecting, analyzing, and visualizing data, particularly "big data."
This article provides an overview of the concept of big data and its sources, outlines the stages involved in processing and analyzing it, and highlights the wide-ranging applications of AI and machine learning tools in data analysis. It delves into how these tools can be leveraged to create data-driven stories, investigations, and reports. It also examines the challenges surrounding the effectiveness of these technologies and discusses the ethical considerations tied to their use.
What is Big Data and How Do We Analyze It?
Big Data refers to vast, diverse sets of information that encompass structured, unstructured, and semi-structured data, continuously generated at high speed and massive volume. The size of these data sets is typically measured in units like terabytes or petabytes, with one petabyte equaling one million gigabytes.
Artificial Intelligence (AI) is defined as a technology that enables computers and machines to mimic human abilities such as learning, understanding, problem-solving, and decision-making. Machine Learning (ML), a subset of AI, focuses on developing systems and algorithms that allow computers to learn from data and improve their performance over time without explicit programming.
Given that analyzing Big Data requires advanced tools and techniques, leveraging AI capabilities, particularly machine learning, has become essential at every stage of data processing and analysis.
What Are the Stages of Big Data Processing?
- Data Collection
This stage covers discovering, gathering, and extracting data from its sources so that it can be transformed and loaded for further processing.
- Data Storage
Data is stored either on cloud storage platforms or appropriate physical storage devices.
- Data Processing
This involves transforming data into consistent and usable formats to derive meaningful insights during analysis.
In the Pandora Papers, a collaborative project led by the International Consortium of Investigative Journalists (ICIJ) and published in 2021, machine learning was used to classify data and exclude irrelevant information. Clustering, a technique that groups similar items together, was used to segment and categorize the data so that it became comprehensible.
- Data Cleaning
This stage involves removing data that contains errors, duplicates, contradictions, or irrelevant information. Machine learning is increasingly used to handle this task, as manually cleaning large datasets is impractical.
The Los Angeles Times employed machine learning in its data-driven report titled "LAPD underreported serious assaults, skewing crime stats for 8 years". The investigation uncovered that the Los Angeles Police Department had misclassified approximately 14,000 serious assault incidents as minor crimes over an eight-year period, leading to an inaccurate depiction of crime levels in the city.
- Data Analysis
Data is analyzed using AI techniques to derive conclusions and insights, often presented descriptively through visualization tools like charts and graphs.
Tools such as Google BigQuery facilitate the analysis of large datasets found in government reports and social data. Advanced tools like the Natural Language Toolkit (NLTK) assist in text data analysis, keyword extraction, indexing, sentiment analysis, archiving, and content analysis of news databases.
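The cleaning and analysis stages described above can be illustrated with a minimal sketch. It uses only the Python standard library and a small, invented set of records; the field names (`id`, `category`, `year`) and the validation rules are assumptions made for illustration, not the method used in any of the projects mentioned. Real newsroom pipelines would typically work on far larger files and with dedicated data tools.

```python
from collections import Counter

# Hypothetical raw records, invented for illustration only.
RAW_RECORDS = [
    {"id": "1", "category": " Assault ", "year": "2012"},
    {"id": "1", "category": "assault", "year": "2012"},      # duplicate id
    {"id": "2", "category": "Theft", "year": "2013"},
    {"id": "3", "category": "", "year": "2013"},              # missing category
    {"id": "4", "category": "THEFT", "year": "not a year"},   # malformed year
    {"id": "5", "category": "Assault", "year": "2014"},
]


def clean(records):
    """Normalize text fields, then drop invalid rows and duplicate ids."""
    seen_ids = set()
    cleaned = []
    for row in records:
        category = row["category"].strip().lower()
        year = row["year"].strip()
        if not category or not year.isdigit():
            continue  # skip rows with a missing or malformed value
        if row["id"] in seen_ids:
            continue  # skip rows whose id was already kept
        seen_ids.add(row["id"])
        cleaned.append({"id": row["id"], "category": category, "year": int(year)})
    return cleaned


def count_by_category(records):
    """Simple descriptive analysis: number of records per category."""
    return Counter(row["category"] for row in records)


cleaned_records = clean(RAW_RECORDS)
category_counts = count_by_category(cleaned_records)
print(category_counts)
```

Cleaning before counting matters here: without normalizing case and whitespace, " Assault " and "assault" would be counted as different categories, and the duplicate record would inflate the totals.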
Key AI Tools for Big Data Analysis:
- Google Cloud AutoML
This cloud-based tool is ideal for creating specialized models to recognize patterns in text or data. It’s particularly suitable for journalists with limited experience in analyzing big data.
- H2O.ai
H2O.ai is used for big data analysis and future trend predictions, making it especially valuable for journalists covering economics, politics, and elections.
- IBM Watson Studio
This tool helps journalists analyze social media content to predict audience interests and trends, enabling the development of effective content promotion strategies.
- Amazon SageMaker
A cloud-based platform suitable for training models to automate data analysis tasks, particularly for data used in investigative reporting.
- PyOD Library
An open-source Python-based library used for detecting anomalies in data, especially financial data, making it highly effective for investigative journalism focused on corruption and financial tracking.
- Isolation Forest
An anomaly-detection algorithm, available in libraries such as scikit-learn and PyOD, designed to identify errors or unusual patterns in journalistic and media data, such as election-related data.
- Tableau
Used for interactive data visualization, this tool is particularly useful for presenting complex data to non-specialist audiences.
- Power BI by Microsoft
This tool generates interactive reports to clearly present findings, often used in digital journalism for complex investigative outcomes.
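To make the idea of anomaly detection in the list above more concrete, here is a minimal sketch. It does not use PyOD or Isolation Forest; as a deliberately simpler stand-in, it applies the modified z-score, a standard statistical rule based on the median absolute deviation, using only the Python standard library. The payment amounts and the function name `flag_outliers` are hypothetical, and the threshold of 3.5 is a commonly cited default rather than a value taken from this article.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    center = median(values)
    # Median absolute deviation: a spread measure that resists extreme values.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical payment amounts; the last one is deliberately extreme.
payments = [120.0, 135.5, 128.0, 131.0, 119.5, 126.0, 9800.0]
suspicious = flag_outliers(payments)
print(suspicious)
```

A flagged value is only a lead for further reporting, not a finding in itself: an unusual payment may have a legitimate explanation, so the result still needs human verification.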
While AI tools offer a wide range of capabilities for handling big data, journalists and data analysts encounter several challenges when applying them.
Challenges of Using AI in Big Data Journalism
- High Costs
Big data processing requires advanced computing systems to handle large, unstructured data quickly and accurately. Many local newsrooms and resource-limited initiatives struggle to afford such infrastructure.
- Storage and Data Security Issues
Safely storing and processing big data demands complex technologies and skilled technical staff, posing a significant challenge for many media organizations.
- Arabic Language Challenges
Most datasets used to train AI models are in English, and many tools do not support Arabic or its various dialects, complicating their application in data-driven journalism across the Middle East and North Africa.
- AI Limitations
AI tools still face difficulties in efficiently collecting and representing unstructured data, requiring careful human oversight to validate insights.
- Limited Access to Open Government Data
Many developing countries in Asia and Africa lack transparent, evidence-based open government datasets. A 2017 survey by the Arab Data Journalists’ Network involving 60 journalists from eight Arab countries revealed that 71.9% of respondents found accessing data in their countries difficult, 22.8% described the process as highly complex, while only 5.3% said it was easy.
These technical challenges are compounded by the ethical issues that AI raises, which become more acute when AI tools are relied on for big data analysis. These ethical challenges include:
- Algorithmic Bias
Journalists and researchers face challenges related to biases in machine learning algorithms that classify, analyze, or filter data based on factors such as race, color, and gender. This bias often stems from inadequate training on datasets that sufficiently represent people of color, women, or individuals with nontraditional gender identities. Furthermore, many countries in the Global South and the Middle East and North Africa region lag in keeping up with the latest advancements in this field.
- AI Hallucinations
AI hallucinations occur when a model produces output that sounds plausible but is incorrect or unsupported, often because it misinterprets a command or question or because the input does not align with the data it was trained on. The result can be performance gaps, skipped critical steps, or incorrect results, all of which undermine the credibility of data journalism and investigative outcomes.
- Data Accessibility and Corporate Monopoly
Many big data resources and machine learning tools that are relatively user-friendly are proprietary and owned by large corporations or search engines, making access challenging and obscuring the methodologies behind data collection and usage. This poses a significant barrier for resource-constrained institutions and freelance journalists.
- Lack of Transparency and Documentation
Some algorithms lack clear documentation of their processes and tasks, with minimal transparency regarding how classifications, recommendations, or decisions are made.
- Data Privacy
AI tools often rely on anonymized data but may fail to adequately notify users or obtain explicit consent to use personal information such as accounts, opinions, and behavioral patterns, raising serious privacy concerns.
In Europe, the General Data Protection Regulation (GDPR) was adopted in 2016 and has applied since May 2018, with the aim of strengthening citizens' rights and digital security. The regulation, comprising 99 articles, defines individual rights and standards for handling personal data, including the right to be informed, to correct inaccuracies, and to have data erased. It also requires companies to sign written agreements with third-party data processors.
In the Middle East and North Africa, by contrast, digital rights legislation remains underdeveloped. Additionally, machine learning providers rarely offer transparent reports explaining how data is collected, processed, or analyzed.
Despite these challenges, the responsible use of AI tools remains essential, particularly in analyzing big data that is difficult to handle manually. Properly collecting and analyzing such data can yield insights that serve the public interest.