Sunbird Obsrv
  • Introduction
    • The Value of Data
    • Data Value Chain
    • Challenges
    • The Solution: Obsrv
  • Core Concepts
    • Obsrv Overview
    • Key Capabilities
    • Datasets
    • Connectors
    • High Level Architecture
    • Tech Stack
    • Monitoring
  • Explore
    • Roadmap
    • Case Studies
      • Agri Climate Advisory
      • Learning Analytics at Population Scale
      • IOT Observations Infra
      • Data Driven Features in Learning Platform
      • Network Observability
      • Fraud Detection
    • Performance Benchmarks
  • Guides
    • Installation
      • AWS Installation Guide
      • Azure Installation Guide
      • GCP Installation Guide
      • OCI Installation Guide
      • Data Center Installation Guide
    • Dataset Management APIs
    • Dataset Management Console
    • Connector APIs
    • Data In & Out APIs
    • Alerts and Notification Channels APIs
    • Developer Guide
    • Example Datasets
    • Connectors Developer Guide
      • SDK Assumptions
      • Required Files
        • metadata.json
        • ui-config.json
        • metrics.yaml
        • alerts.yaml
      • Obsrv Base Setup
      • Dev Requirements
      • Interfaces
        • Stream Interfaces
        • Batch Interfaces
      • Classes
        • ConnectorContext Class
        • ConnectorStats Class
        • ConnectorState Class
        • ErrorData Class
        • MetricData Class
      • Verifying
      • Packaging Guide
      • Reference Implementations
    • Coming Soon!
  • Community
  • Previous Versions
    • SB-5.0 Version
      • Overview
      • USE
        • Release Notes
          • Obsrv 2.0-Beta
          • Obsrv 2.1.0
          • Obsrv 2.2.0
          • Obsrv 2.0.0-GA
          • Obsrv 5.3.0-GA
          • Release V 5.1.0
          • Release V 5.1.2
          • Release V 5.1.3
          • Release V 5.0.0
          • Release V 4.10.0
        • Installation Guide
        • Obsrv 2.0 Installation Guide
          • Getting Started with Obsrv Deployment Using Helm
        • System Requirements
      • LEARN
        • Functional Capabilities
        • Dependencies
        • Product Roadmap
        • Product & Developer Guide
          • Telemetry Service
          • Data Pipeline
          • Data Service
          • Data Product
            • On Demand Druid Exhaust Job
              • Component Diagram
              • ML CSV Reports
              • Folder Structure
          • Report Service
          • Report Configurator
          • Summarisers
      • ENGAGE
        • Discuss
        • Contribute to Obsrv
      • Raise an Issue
  • Release Notes
    • Obsrv 1.1.0 Beta Release
    • Obsrv 1.2.0-RC Release
Data Pipeline


The Data Pipeline is a group of real-time stream processing jobs that process the event stream of telemetry data generated by client apps and microservices. The telemetry data goes through a series of steps such as validation, de-duplication, transformation, and denormalization of metadata. The transformed data is then stored in a consumable format that can be used for further analysis.
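
The processing steps above can be sketched as a chain of small functions. The following is a minimal illustration in plain Python, not the actual pipeline code: a `set` stands in for the de-duplication store and a dict stands in for the Redis metadata cache, and the event fields are simplified examples.

```python
SEEN_MIDS = set()  # message ids already processed (stand-in for the de-dup store)
CONTENT_STORE = {"c-101": {"name": "Algebra Basics"}}  # stand-in for the Redis content store

def validate(event):
    # Minimal structural check; the real pipeline validates against JSON schemas.
    return isinstance(event, dict) and "mid" in event and "eid" in event

def deduplicate(event):
    if event["mid"] in SEEN_MIDS:
        return None          # drop duplicate events by message id
    SEEN_MIDS.add(event["mid"])
    return event

def transform(event):
    event["eid"] = event["eid"].upper()  # illustrative transformation
    return event

def denormalize(event):
    # Enrich the event with cached metadata keyed by the object id.
    meta = CONTENT_STORE.get(event.get("object", {}).get("id"))
    if meta:
        event["contentdata"] = meta
    return event

def process(event):
    if not validate(event):
        return None
    event = deduplicate(event)
    if event is None:
        return None
    return denormalize(transform(event))

out = process({"mid": "m-1", "eid": "impression", "object": {"id": "c-101"}})
```

A duplicate event (same `mid`) submitted again is silently dropped by `deduplicate`, which is how at-least-once delivery is reconciled with exactly-once results downstream.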

Key Features:

  1. Lambda Architecture: A hybrid approach that combines batch processing and stream processing to handle massive data sets.

  2. Loose coupling: Data processing jobs are loosely coupled as they only communicate with a durable queue such as Apache Kafka.

  3. Easy chaining: Data processing jobs can be chained easily by configuring only the input and output data sources they read from and write to. This makes it easy to introduce new jobs for custom processing workflows.

  4. Data Sync points: The stream of data is synced to configurable cloud storage, which acts as a durable, persistent data store. These sync points make it possible to replay data from a specific stage in the pipeline.

  5. Resiliency: The data pipeline guarantees at-least-once processing semantics and ensures no data loss.

  6. Monitoring: The data pipeline jobs can emit standard and custom metrics to monitor pipeline health, and also allow you to audit the system at various stages.

  7. Auto-scaling: The data processing pipeline offers support for auto-scaling out of the box. This helps the pipeline to adapt to changes in the incoming data volume.

  8. Real-time analytics: The data processing pipeline offers out-of-the-box support for Apache Druid, an analytics data store designed for fast slice-and-dice analytics.
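
Features 2 and 3 (loose coupling and easy chaining) can be illustrated with a toy broker: each job knows only the names of its input and output topics, never the other jobs. Here an in-memory dict of lists stands in for Kafka, and the topic names are illustrative, not the actual Obsrv topic names.

```python
from collections import defaultdict

BROKER = defaultdict(list)   # topic name -> list of events (toy stand-in for Kafka)

def run_job(input_topic, output_topic, fn):
    """A job is configured only with its input/output topics and a processing function."""
    for event in BROKER[input_topic]:
        result = fn(event)
        if result is not None:            # None means the event was filtered out
            BROKER[output_topic].append(result)

# Two independent jobs chained purely by topic configuration.
BROKER["telemetry.ingest"] = [{"eid": "impression"}, {"eid": "interact"}]
run_job("telemetry.ingest", "telemetry.valid",
        lambda e: e if "eid" in e else None)          # validation job
run_job("telemetry.valid", "telemetry.transformed",
        lambda e: {**e, "eid": e["eid"].upper()})     # transformation job
```

Introducing a new job into the workflow only requires pointing it at an existing topic; no other job has to change.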

Installation Configuration Reference:

| Property | Description | Default |
| --- | --- | --- |
| `kafka.consumer.broker.host` | Host or IP addresses of the Kafka brokers to consume data from | none |
| `kafka.producer.broker.host` | Host or IP addresses of the Kafka brokers to publish data to | none |
| `enable.checkpointing` | Boolean flag to enable checkpointing on cloud storage | `false` |
| `redis.host` | Host or IP address of the Redis cache used for metadata caching | none |
| `consumer.parallelism` | Number of threads to consume data in parallel | `1` |
| `operator.parallelism` | Number of threads to process data in parallel | `1` |
| `telemetry.schema.path` | Directory path of the JSON schema files used to validate the telemetry data | `schemas/telemetry` |
| `event.max.size` | Maximum acceptable size of a single event, in bytes | 1 MB |
| `redis.devicestore.id` | Index of the device data store in the Redis cache | `2` |
| `redis.userstore.id` | Index of the user data store in the Redis cache | `12` |
| `redis.contentstore.id` | Index of the content data store in the Redis cache | `5` |
| `redis.dialcodestore.id` | Index of the dialcode data store in the Redis cache | `6` |
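
Put together, a deployment might set these properties as follows. This is a hypothetical snippet for illustration only; the host names and values are placeholders, not a real deployment.

```
kafka.consumer.broker.host=kafka-broker-1:9092
kafka.producer.broker.host=kafka-broker-1:9092
enable.checkpointing=true
redis.host=redis.internal
consumer.parallelism=2
operator.parallelism=4
telemetry.schema.path=schemas/telemetry
# 1 MB, expressed in bytes
event.max.size=1048576
redis.contentstore.id=5
```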

Data Pipeline source code: GitHub - project-sunbird/sunbird-data-pipeline — a set of real-time streaming jobs that process and enrich the telemetry data generated by various user devices. The repository also contains Ansible playbooks to provision data pipeline infrastructure and to deploy the various data analytics components.