Data Ingestion Pipelines

The FEWS NET Data Warehouse (FDW) supports automated, scheduled data ingestion from remote APIs and scrapes data directly from websites that offer relevant data downloads.

Compared to manually uploading a cleaned spreadsheet, FDW ingestion pipelines dramatically reduce the level of effort required to maintain long-run time series.

Supported external APIs

Data from the following external APIs is ingested into the FDW.

Source: Armed Conflict Location and Event Data (ACLED)

Description: Conflict data from ACLED’s API.

Notes: Reporting delays and corrections to past data are common due to the challenges of gathering conflict data, so ACLED data in FDW for a given period is subject to change. Multiple events can be attributed to the same period. An event is given a value of 0 when there was a conflict without fatalities; values greater than 0 indicate the number of fatalities.

Source: Food and Agriculture Organization (FAO)

Description: Price ingestion pipeline from FAO’s web API.

Notes: Monthly and weekly price data are ingested to multiple Data Source Documents.

Source: International Monetary Fund (IMF) Price Ingestion

Description: Price, CPI, Labor, and GDP data ingestion pipeline from the IMF’s API.

Notes: Data is included in FDW under Secondary price index, Semi-Structured Data Series: Labor Statistics, and Semi-Structured Data Series: Economic Statistics.
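Because a single period can carry multiple ACLED events, and a 0-fatality event still represents a recorded conflict, downstream summaries need to count events and sum fatalities separately. A minimal sketch of that aggregation (the field names are illustrative, not ACLED's exact schema):

```python
from collections import defaultdict

def fatalities_by_period(events):
    """Aggregate ACLED-style event records into per-period totals.

    A fatalities value of 0 still represents a recorded conflict event,
    so events and fatalities are tallied separately.
    """
    summary = defaultdict(lambda: {"events": 0, "fatalities": 0})
    for event in events:
        bucket = summary[event["period"]]
        bucket["events"] += 1
        bucket["fatalities"] += event["fatalities"]
    return dict(summary)

events = [
    {"period": "2024-01", "fatalities": 0},   # conflict with no fatalities
    {"period": "2024-01", "fatalities": 3},
    {"period": "2024-02", "fatalities": 1},
]
print(fatalities_by_period(events))
# {'2024-01': {'events': 2, 'fatalities': 3}, '2024-02': {'events': 1, 'fatalities': 1}}
```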

The following data ingestion pipelines are no longer in use:

  • FARMERS, Sudan

Supported internal APIs

FEWS data collection via KoBo Toolbox

FEWS NET collects Price, Exchange Rate, and Cross Border Trade data using KoBo Toolbox. The following data collected via KoBo is automatically ingested into the FDW:

  • Chad Weekly Market Prices

  • DRC Weekly Exchange Rates

  • DRC Weekly Market Prices

  • Ethiopia Weekly Market Prices

  • Nigeria Weekly Exchange Rates

  • Nigeria Weekly Livestock Prices

  • Nigeria Weekly Market Prices

  • South Sudan Weekly Market Prices

  • Zimbabwe Weekly Exchange Rates Open Market

  • Zimbabwe Weekly Exchange Rates Supermarket

  • Zimbabwe Weekly Market Prices Open Market

  • Zimbabwe Weekly Market Prices Service Station

  • Zimbabwe Weekly Market Prices Supermarket
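Ingestion from KoBo Toolbox works against KoBo's REST API, which exposes form submissions per asset. A sketch of building such a request, without sending it — the asset UID and token here are placeholders, and the endpoint shape should be checked against your KoBo server's API documentation:

```python
def kobo_submissions_request(base_url, asset_uid, token):
    """Build the URL and headers for pulling form submissions from a
    KoBo server's v2 REST API. Asset UID and token are placeholders."""
    url = f"{base_url}/api/v2/assets/{asset_uid}/data.json"
    headers = {"Authorization": f"Token {token}"}
    return url, headers

url, headers = kobo_submissions_request(
    "https://kf.kobotoolbox.org", "aXyZ123", "my-api-token")
print(url)  # https://kf.kobotoolbox.org/api/v2/assets/aXyZ123/data.json
```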

Configuring Data Series

Typically, there are few initial matches between a remote API and FDW.

Metadata must be set up in FDW such that the incoming metadata from the remote API is recognized correctly. This includes creating new metadata items as well as aliases for existing metadata items.
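The recognition step can be thought of as a lookup that first tries the canonical FDW name and then any configured aliases; values that resolve to nothing are what need a new metadata item or alias. A simplified sketch (names are illustrative, not FDW's actual schema):

```python
def resolve_metadata(remote_value, canonical, aliases):
    """Return the FDW metadata item for a remote value, or None if the
    value is unrecognized and needs a new item or alias."""
    if remote_value in canonical:
        return remote_value
    return aliases.get(remote_value)

markets = {"Addis Ababa", "Nairobi"}
aliases = {"Addis": "Addis Ababa"}  # alias set up for the remote API's spelling
print(resolve_metadata("Addis", markets, aliases))    # Addis Ababa
print(resolve_metadata("Unknown", markets, aliases))  # None
```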

Metadata match reports

The data ingestion pipeline for an API creates a Google Sheet containing the metadata matches between the remote API and FDW: a list of all the data series offered by the remote API alongside all the FDW Data Series expected for that remote API.

The spreadsheet columns are structured as follows:

A. Remote API metadata columns: These columns contain metadata from the remote API, labeled with the names used by the remote API. They contain one of two types of metadata:

  • Metadata exactly as it appears in the remote API, for example, Box (80 pieces). Where a remote column has the same name as an FDW column, the remote column is suffixed with _original.

  • Remote API metadata converted to FDW conventions, for example, ea_80. Where a remote column has the same name as an FDW column, the remote column is suffixed with _remote.

B. Data Series column: Contains the ID number of the FDW Data Series that is matched against the remote API.

C. FDW Metadata columns: Contain the actual metadata values for that Data Series from FDW.

Each row in the spreadsheet represents one of three situations:

  1. A remote data series that has been matched to an FDW data series: This probably indicates a successful match, but it may also represent an accidental match where FDW has recognized the wrong remote data series.

  2. An FDW Data Series without a matching remote Data Series: This indicates a metadata mismatch; we need to find the remote data series we expected to match, then identify and correct the unrecognized metadata.

  3. A remote Data Series without a matching FDW Data Series: This might indicate a metadata mismatch, or it might be a remote data series that we do not want to capture in FDW.
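The three situations amount to a set comparison between the remote and FDW series keys. A sketch, keying each series by an illustrative "product:market" string rather than FDW's real identifiers:

```python
def classify_report_rows(remote_series, fdw_series):
    """Split the match report's rows into the three situations:
    matched, FDW-only (mismatch to fix), and remote-only (fix or ignore)."""
    remote, fdw = set(remote_series), set(fdw_series)
    return {
        "matched": sorted(remote & fdw),      # situation 1
        "fdw_only": sorted(fdw - remote),     # situation 2
        "remote_only": sorted(remote - fdw),  # situation 3
    }

rows = classify_report_rows(
    remote_series={"maize:Nairobi", "rice:Mombasa"},
    fdw_series={"maize:Nairobi", "millet:Kisumu"})
print(rows["fdw_only"])  # ['millet:Kisumu']
```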

[Image: Metadata mismatch example]

Metadata matching

The pipeline will only load data into FDW if all the expected Data Series are recognized from the remote API. This prevents a situation where the pipeline appears to be working because most data is loaded successfully, but we later discover that an important Data Series is missing.
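This all-or-nothing behaviour amounts to a guard that refuses to load while any expected series is unmatched. A minimal sketch (the function name and error type are illustrative):

```python
def assert_all_series_matched(expected, matched):
    """Raise before loading if any expected FDW Data Series was not
    recognized from the remote API (all-or-nothing ingestion)."""
    missing = sorted(set(expected) - set(matched))
    if missing:
        raise RuntimeError(f"Unmatched FDW Data Series: {missing}")

assert_all_series_matched(["s1", "s2"], ["s1", "s2", "extra"])  # loads
```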

Looking at the expected FDW Data Series that do not have a matching remote Data Series, and particularly multiple unmatched FDW Data Series with common metadata, is the best way to identify metadata that needs to be updated. For example, if there are many unmatched FDW Data Series for the same Market (or Product, or Unit) then it is likely that the market in question is not recognized. In that case, we must identify the value reported by the remote API for that market, and ensure that an appropriate alias is set up in the FDW Market.
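Grouping the unmatched FDW Data Series by each metadata field and counting makes the shared, unrecognized value stand out. A sketch using a simple frequency count (the field names and values are illustrative):

```python
from collections import Counter

def suspect_metadata(unmatched_series, field):
    """Rank metadata values by how many unmatched Data Series share them;
    the most frequent value is the likeliest missing alias."""
    counts = Counter(series[field] for series in unmatched_series)
    return counts.most_common()

unmatched = [
    {"market": "Djenne", "product": "Millet"},
    {"market": "Djenne", "product": "Rice"},
    {"market": "Mopti", "product": "Millet"},
]
print(suspect_metadata(unmatched, "market"))  # [('Djenne', 2), ('Mopti', 1)]
```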

Requesting additional APIs

Support for additional APIs and websites should be requested through the Hub’s sprint process for the Data Platform. That process can be initiated through a Helpdesk ticket or the monthly Data Stakeholders meeting.

The implementation process involves three steps, typically tracked through three separate but dependent Jira tickets implemented in consecutive sprints:

  1. Research the API: Investigate the API, including acquiring the credentials if necessary, and document the authentication required, the different endpoints and the available filters and formats. This step may require interaction with the developers of the remote API and some level of trial and error if support and/or documentation is not available. The output from the ticket is typically a Jupyter Notebook that demonstrates how to access the API and download the data.

  2. Develop a data ingestion pipeline for the API: Write a new Luigi pipeline within FDW to download the necessary data, perform the API-specific transformations required to prepare the data for ingestion to FDW, and then feed the transformed data into the generic data normalization, validation, and ingestion tasks. The output from the ticket is a complete pipeline with associated unit tests, merged into the FDW software and released to the FDW production environment.
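The shape of such a pipeline — an API-specific transform feeding a generic normalize-and-validate stage — can be sketched without Luigi's task machinery; the stage functions and field names below are stand-ins, not FDW's actual code:

```python
def transform(raw_rows):
    """API-specific step: rename remote fields to FDW conventions."""
    renames = {"mkt_name": "market", "price_val": "value"}  # illustrative
    return [{renames.get(k, k): v for k, v in row.items()} for row in raw_rows]

def normalize_and_validate(rows):
    """Generic step: keep only rows carrying all required fields."""
    required = {"market", "value"}
    return [row for row in rows if required <= row.keys()]

raw = [{"mkt_name": "Nairobi", "price_val": 42.0}, {"mkt_name": "Mombasa"}]
ready = normalize_and_validate(transform(raw))
print(ready)  # [{'market': 'Nairobi', 'value': 42.0}]
```

In the real pipeline each stage would be a Luigi task whose output feeds the next task's input, so failed stages can be retried without re-downloading.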

  3. Support for enabling the API in FDW production: Use API-specific guidance to determine the required metadata and Data Series in FDW. Typically, we have no control over the content of the remote API, and so we must set up FDW appropriately to recognize the Data Series we want to capture. The ingestion pipeline produces an API metadata matches spreadsheet to help with this process, which reports the data series available from the remote API, and how the metadata matches to metadata available in FDW, including the Data Series defined for the relevant Data Source Document(s).