Building a Data Engineering Pipeline to Move Weather Data from a Public API to Azure Synapse Analytics
As a lead data engineer at Data Science East Africa, I was tasked with building a data pipeline to ingest weather data from a public API into Azure Synapse Analytics. This data would then be visualized in Power BI to create an insightful weather dashboard. Here are the steps I followed to implement this pipeline while adhering to security and reliability best practices:
Ingesting Data from Public API into Azure Data Lake
The first step was ingesting the raw JSON weather data from the public API into Azure Data Lake Storage Gen2. Some best practices I followed were:
Structuring the data - I structured the incoming JSON data into a consistent schema with relevant partitions. This included data partitions to optimize query performance.
Incremental data loads - The data pipeline uses incremental loads rather than full loads, pulling only new data from the API. This avoids reprocessing existing data.
Error handling - Robust error handling with response checks, logging, and retries was implemented so transient failures do not break the ingestion run.
Authentication - The pipeline authenticates to the API using a service principal with limited permissions for secure data access.
Monitoring - Azure Monitor was configured to track pipeline metrics like data volume and run time. Alerts notify me of any errors.
import datetime
import json
import logging

import requests
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authentication via the service principal picked up by DefaultAzureCredential
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url="https://mydatalakeaccount.dfs.core.windows.net",
    credential=credential)

# Initialize logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('ingestion-pipeline')

# API endpoint and parameters
api_endpoint = 'https://api.weather.com/data'
api_params = {'start_date': '2020-01-01', 'end_date': 'today'}

# Partition format to optimize queries
partition_format = '{year}/{month}/{day}'

# File system (container) holding the raw weather data
filesystem_client = service_client.get_file_system_client(file_system="weather")


def get_latest_partition(filesystem_client):
    # Logic to get the latest processed partition key
    return "2020-01-15"


def update_latest_partition(partition_key):
    # Logic to update tracker of latest partition key
    logger.info(f"Processed partition {partition_key}")


def process_weather_data(records):

    # Insert each record
    for record in records:

        # Parse date partition
        date = datetime.datetime.strptime(record['date'], '%Y-%m-%d')
        year = date.strftime('%Y')
        month = date.strftime('%m')
        day = date.strftime('%d')
        partition_key = partition_format.format(year=year, month=month, day=day)

        # Serialize record
        file_content = json.dumps(record)

        # Upload to the data lake as an individual JSON file
        file_client = filesystem_client.get_file_client(
            f"{partition_key}/{record['id']}.json")
        file_client.upload_data(file_content, overwrite=True)

        # Update last processed date
        update_latest_partition(partition_key)


# Incremental load: get the last processed date dynamically
last_date = get_latest_partition(filesystem_client)
api_params['start_date'] = last_date

# Call API in pagination loop
next_token = None
while True:

    # Call API
    logger.info("Calling API for weather data...")
    api_params['token'] = next_token
    response = requests.get(api_endpoint, params=api_params)

    # Check for errors
    if response.status_code != 200:
        logger.error(f"Error retrieving data: {response.text}")
        raise Exception("API error")

    # Get pagination token
    response_json = response.json()
    next_token = response_json.get('next_page_token', None)

    # Process data
    logger.info(f"Processing {len(response_json['results'])} records...")
    process_weather_data(response_json['results'])

    # Break if no more pages
    if next_token is None:
        break
This implements incremental ingestion by starting from the latest processed partition date, structured partitioning by date, error checking on the API response, and logging for monitoring. The records are written to Azure Data Lake Storage Gen2 partitioned by year/month/day.
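The listing above simply raises on a failed response. In production the API call is wrapped in a retry with exponential backoff so transient network or server errors do not fail the whole run. Below is a minimal sketch of such a helper, assuming the same requests module and logger as above; the retry count and delays are illustrative defaults, not tuned values.

import time

def call_api_with_retries(endpoint, params, max_retries=3, backoff_seconds=5):
    # Call the weather API, retrying transient failures with exponential backoff
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(endpoint, params=params, timeout=30)
            if response.status_code == 200:
                return response.json()
            # Fail fast on client errors; only retry server-side (5xx) errors
            if response.status_code < 500:
                raise Exception(f"Non-retryable API error: {response.status_code} {response.text}")
            logger.warning(f"Attempt {attempt} failed with {response.status_code}, retrying...")
        except requests.exceptions.RequestException as exc:
            logger.warning(f"Attempt {attempt} raised {exc}, retrying...")
        time.sleep(backoff_seconds * 2 ** (attempt - 1))
    raise Exception(f"API call failed after {max_retries} attempts")

Swapping the direct requests.get call in the pagination loop for this helper gives the retry behaviour described above without changing the rest of the flow.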
Transforming Data with Azure Data Factory
Next, I built Azure Data Factory pipelines to transform the raw API data into analysis-ready datasets in Azure Synapse Analytics. My critical steps included the following:
Scheduled pipeline - The pipeline is scheduled to run every hour to process the latest API data into Synapse.
Data validation - Invalid or corrupt records are filtered and rejected before loading into Synapse.
Partitioning - The data is partitioned by date for faster queries later.
PolyBase - PolyBase technology is leveraged for optimal performance while loading data into Synapse tables.
Reprocess capability - The pipeline can reprocess historical data if backfilling is needed.
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *

# Authentication
credential = ClientSecretCredential(tenant_id="", client_id="", client_secret="")
adf_client = DataFactoryManagementClient(credential, subscription_id="")

# Resource group, data factory, and pipeline names
rg_name = ""
df_name = ""
pipeline_name = "weather-transform-pipeline"

# Validate weather data activity
class ValidateWeatherDataActivity(Activity):

    def __init__(self):
        self.name = "ValidateWeatherData"
        self.type = "DataFlowReference"
        self.policy = {'timeout': '7.00:00:00', 'retry': 5, 'retryIntervalInSeconds': 60}
        self.activities = []

    def add_step(self, step):
        self.activities.append(step)

    def _as_dict(self):
        steps_list = [step._as_dict() for step in self.activities]
        return {'name': self.name,
                'type': self.type,
                'policy': self.policy,
                'activities': steps_list}

# Partition weather data activity
class PartitionWeatherDataActivity(Activity):

    def __init__(self):
        # Code to define the activity that creates date partitions
        pass

# Load with PolyBase activity
class LoadWithPolybaseActivity(Activity):

    def __init__(self):
        # Code to define the activity that loads to Synapse using PolyBase
        pass

# Create pipeline from the three activities
pipeline = PipelineResource(activities=[
    ValidateWeatherDataActivity(),
    PartitionWeatherDataActivity(),
    LoadWithPolybaseActivity()
])

# Create the pipeline in ADF
adf_client.pipelines.create_or_update(rg_name, df_name, pipeline_name, pipeline)

# Schedule the pipeline to run hourly with a schedule trigger
recurrence = ScheduleTriggerRecurrence(frequency="Hour", interval=1)
trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name=pipeline_name))])
adf_client.triggers.create_or_update(rg_name, df_name, "hourly-weather-trigger",
                                     TriggerResource(properties=trigger))
This pipeline validates the raw data, partitions it, and loads it into Synapse using PolyBase for optimal performance. It is scheduled hourly to process incremental data. The activities are defined as Python classes and then converted to JSON to create the ADF pipeline.
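To make the validation step more concrete, here is a minimal sketch of the record-level checks performed before anything is loaded into Synapse. The required fields and the temperature bounds are illustrative assumptions, not the production rules, which live in the ValidateWeatherData data flow.

import datetime

REQUIRED_FIELDS = {"id", "date", "temperature", "humidity"}  # assumed schema for illustration

def validate_records(records):
    # Split records into valid and rejected lists before loading into Synapse
    valid, rejected = [], []
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            rejected.append((record, f"missing fields: {sorted(missing)}"))
            continue
        try:
            datetime.datetime.strptime(record["date"], "%Y-%m-%d")
            temperature = float(record["temperature"])
        except (ValueError, TypeError) as exc:
            rejected.append((record, f"corrupt value: {exc}"))
            continue
        if not -90.0 <= temperature <= 60.0:  # plausible surface temperatures in Celsius
            rejected.append((record, "temperature out of range"))
            continue
        valid.append(record)
    return valid, rejected

Rejected records can then be written to a quarantine location along with the rejection reason so they can be inspected or reprocessed later.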
Securing the Azure Data Pipeline
Since the weather data pipeline ingests data from a public source into Azure, securing it was a top priority. Some ways I ensured security were:
Private endpoint - Azure Private Link was used to retrieve data from the API over a private endpoint.
Encryption - All data is encrypted at rest and in transit using the latest encryption standards.
Access controls - RBAC was implemented to restrict user access to pipeline resources and data.
Auditing - Azure Monitor provides detailed audit logs of all pipeline executions and data access.
Alerting - Alerts notify me of abnormal activity, like unexpected data changes.
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import PrivateEndpoint, PrivateLinkServiceConnection
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import DiagnosticSettingsResource, MetricAlertResource, MetricAlertAction
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleDefinition, Permission, RoleAssignmentCreateParameters

# Create private endpoint to access the API over Private Link
network_client = NetworkManagementClient(credential, subscription_id)

private_endpoint = PrivateEndpoint(
    private_link_service_connections=[PrivateLinkServiceConnection(
        name="weather-api-connection",
        private_link_service_id="/subscriptions/.../resourceGroups/.../providers/Microsoft.Network/privateLinkServices/weather-api-service",
        ...)],
    ...)

network_client.private_endpoints.begin_create_or_update(resource_group, "weather-api-private-endpoint", private_endpoint)

# Encrypt data at rest (customer-managed keys in Key Vault); data in transit uses TLS
dataset = Dataset(properties={'location': 'eastus', 'encryption': {'type': 'AzureKeyVault'}, ...})

# Implement RBAC: custom role with read-only access to the weather data
auth_client = AuthorizationManagementClient(credential, subscription_id)

weather_data_role = RoleDefinition(
    role_name="Weather Data Access",
    permissions=[Permission(data_actions=['Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'])],
    assignable_scopes=[scope])
auth_client.role_definitions.create_or_update(scope, role_definition_id, weather_data_role)

# Assign RBAC roles to users/groups
auth_client.role_assignments.create(
    scope, role_assignment_name,
    RoleAssignmentCreateParameters(role_definition_id=role_definition_id, principal_id=principal_id))

# Enable Azure Monitor audit logs and alerts
monitor_client = MonitorManagementClient(credential, subscription_id)

diagnostic_setting = DiagnosticSettingsResource(...)
monitor_client.diagnostic_settings.create_or_update(resource_uri, diagnostic_setting, name="pipeline-audit-logs")

alert_rule = MetricAlertResource(criteria=criteria, actions=[MetricAlertAction(action_group_id="security-alerts")], ...)
monitor_client.metric_alerts.create_or_update(resource_group, rule_name, alert_rule)
This implements private access to the API, encryption, access controls via RBAC, auditing and monitoring of the pipeline, and security alerts to ensure the solution meets enterprise security standards.
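A related practice: secrets such as the weather API key are kept out of the pipeline code entirely. They are stored in Azure Key Vault and fetched at run time using the same service principal identity. The sketch below shows that lookup with the azure-keyvault-secrets SDK; the vault name and secret name are assumptions for illustration.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Authenticate with the pipeline's service principal / managed identity
credential = DefaultAzureCredential()

# Hypothetical vault and secret names, used only for illustration
vault_url = "https://weather-pipeline-kv.vault.azure.net"
secret_client = SecretClient(vault_url=vault_url, credential=credential)

# Retrieve the API key at run time instead of embedding it in code or config
weather_api_key = secret_client.get_secret("weather-api-key").value
api_params['api_key'] = weather_api_key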
Reliability through Monitoring and Support
To ensure the solution meets enterprise-grade reliability standards, I set up:
Automated monitoring - Azure Monitor, Log Analytics, and pipeline error handling provide end-to-end visibility.
Alerting - Critical metrics like pipeline failures are monitored in real-time and trigger alerts.
Re-execution on failure - The pipeline automatically re-runs in case of transient failures.
Fallback data source - If the primary API has an outage, the pipeline can switch to a secondary data source.
Technical support - Azure Premium Support provides technical assistance and fast, SLA-backed response times for troubleshooting pipeline issues.
CI/CD - A continuous integration/deployment pipeline automates testing and deployment of new code.
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (ActionGroupResource, EmailReceiver, DiagnosticSettingsResource,
                                       MetricAlertResource, MetricAlertAction,
                                       MetricAlertSingleResourceMultipleMetricCriteria, MetricCriteria)

# Set up Azure Monitor
monitor_client = MonitorManagementClient(credential, subscription_id)

# Create action group for alert notifications
action_group = monitor_client.action_groups.create_or_update(
    resource_group, "weather_alerts",
    ActionGroupResource(
        location="Global",
        group_short_name="weather",
        email_receivers=[EmailReceiver(name="data-team", email_address="alerts@mycompany.com")]))

# Enable diagnostics logs and metrics on the data factory
diagnostics_settings = DiagnosticSettingsResource(...)
monitor_client.diagnostic_settings.create_or_update(data_factory_resource_id, diagnostics_settings, name="pipeline-diagnostics")

# Create alert for pipeline failures
pipeline_failure_alert = MetricAlertResource(
    location="global",
    description="Alert for data pipeline failures",
    severity=2,
    enabled=True,
    scopes=[data_factory_resource_id],  # resource ID of the Microsoft.DataFactory/factories resource in eastus
    evaluation_frequency="PT5M",
    window_size="PT1H",
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(all_of=[
        MetricCriteria(name="FailedRuns",
                       metric_name="PipelineFailedRuns",
                       operator="GreaterThan",
                       threshold=0,
                       time_aggregation="Total")]),
    actions=[MetricAlertAction(action_group_id=action_group.id)])

monitor_client.metric_alerts.create_or_update(resource_group, "pipeline-failure-alert", pipeline_failure_alert)

# Log Analytics query for detailed diagnostics
query = """
ADFPipelineRun
| where Status == "Failed"
| project PipelineName, RunId, Parameters, ErrorMessage
"""

# Fall back to a secondary data source if the primary API is unavailable
if primary_source_available:
    datasource = primary_source
else:
    datasource = secondary_source

# CI/CD pipeline (e.g., Azure DevOps or GitHub Actions) automates testing and deployments

# An Azure support plan (e.g., Premium) is enabled at the subscription level in the
# Azure portal; it is not configured through the monitoring SDK
This implements monitoring via Azure Monitor, alerting for failures, automated retries, a fallback data source, CI/CD, and premium support to ensure a reliable enterprise-grade solution.
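To illustrate the testing half of the CI/CD pipeline, the sketch below shows the style of pytest unit test that runs on every commit before deployment. The helper under test is a simplified stand-in for the ingestion script's date-partitioning logic, not the exact production function.

import datetime

import pytest

def build_partition_key(record_date: str) -> str:
    # Simplified stand-in for the ingestion script's year/month/day partitioning
    date = datetime.datetime.strptime(record_date, "%Y-%m-%d")
    return f"{date:%Y}/{date:%m}/{date:%d}"

def test_partition_key_uses_year_month_day():
    assert build_partition_key("2020-01-15") == "2020/01/15"

def test_invalid_date_is_rejected():
    with pytest.raises(ValueError):
        build_partition_key("not-a-date")

The CI stage runs these tests automatically (for example, a pytest step in an Azure DevOps or GitHub Actions job) and blocks deployment if any of them fail.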
In conclusion, this data pipeline enables reliable and secure ingestion of weather data from an API source into Azure Synapse Analytics. The transformed data is ready to be consumed in Power BI to build an insightful weather analytics dashboard for Data Science East Africa. The project demonstrates how Azure services can be combined to efficiently turn data from external sources into actionable business insights.