Datasets

Definition and details about datasets - a key construct in Obsrv

Introduction

Datasets, along with connectors, form the core constructs of Obsrv. Everything we do in Obsrv revolves around managing a dataset. A dataset can be created from one or more sources, and a single dataset can result in the creation of one or more tables.
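To make the construct concrete, here is a minimal sketch of what a dataset definition could look like. The field names (dataset_id, sources, tables) are illustrative assumptions, not the exact Obsrv schema; the Dataset Management APIs guide documents the real contract.

```python
# Illustrative sketch only: field names are hypothetical, not the exact Obsrv schema.
# A dataset ties one or more sources to one or more output tables.
sample_dataset = {
    "dataset_id": "telemetry-events",                        # unique identifier for the dataset
    "sources": [                                             # one or more source connectors
        {"connector": "kafka", "topic": "telemetry.ingest"},
        {"connector": "object-store", "bucket": "telemetry-archive"},
    ],
    "tables": [                                              # one or more tables created from the dataset
        {"name": "telemetry_events", "type": "event"},
        {"name": "telemetry_daily_rollup", "type": "aggregate"},
    ],
}
```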

What does a Dataset do?

As engineers, we tend to focus on technology and on the use-cases currently in front of us, so the design ends up bounded by them. With data platforms, we typically solve the problem by choosing technologies and designs that enable ingestion, processing, querying and storage for the use-cases at hand. But for an open-source, future-proof product, how do we design for evolving needs?

With Obsrv, we took a data-first, data-centric design approach, and at the core of it is the Dataset. All components of Obsrv are designed to enable the creation, management and consumption of datasets.

Data-first thinking solves quite a few design problems up front and supports an explosion of use-cases later:

  1. Data Isolation: Instead of creating one mega table, the dataset construct structures data into many datasets, which in turn enables data isolation. A dataset can be created per tenant or per region, which helps with data privacy requirements such as GDPR, where data has to be stored locally (see the sketch after this list).

  2. Data Governance: Since a dataset is atomic and can exist independently, it can be governed more efficiently. Every dataset can have its own SLAs, quality metrics, access control and derived tables. For example, creating a derived table per tenant makes it possible to implement capabilities like "Right to be Forgotten" or "Download my data".

  3. Efficient Scaling: Not all datasets are the same size. With the dataset construct, one can scale a specific dataset instead of scaling all of Obsrv, whereas a traditional design would scale (or size) for the largest source or peak volume.

  4. Decoupled & Modular: Each dataset is decoupled from every other dataset, so the failure of one dataset doesn't affect another. This enables interesting use-cases, such as creating a dataset on demand from the same source, directing it to ad-hoc querying without impacting production workloads, and shutting it down once the ad-hoc analysis is over.

  5. Ease of Operations: A solution is built once but has to be operated daily. With the dataset construct, operations are also partitioned by dataset. This lets teams focus on critical datasets rather than all datasets, resulting in operational efficiency and leaner operations teams.
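As a small illustration of the isolation pattern in point 1, the sketch below derives one dataset definition per tenant from a single template. The tenant names and field names are hypothetical; the point is only that each tenant's data lands in its own, independently governed and independently scalable dataset.

```python
# Hypothetical sketch: one isolated dataset per tenant, derived from a template.
# Field names are illustrative, not the exact Obsrv dataset schema.
TENANTS = ["tenant-a", "tenant-b", "tenant-c"]

def dataset_for(tenant: str) -> dict:
    return {
        "dataset_id": f"orders-{tenant}",                    # one dataset per tenant
        "sources": [{"connector": "kafka", "topic": f"orders.{tenant}"}],
        "tables": [{"name": f"orders_{tenant.replace('-', '_')}", "type": "event"}],
    }

per_tenant_datasets = [dataset_for(t) for t in TENANTS]
```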

Dataset Management

The following are the high-level capabilities available for a dataset via the dataset management APIs (a hedged API sketch follows the list):

  1. Create Dataset: Create a dataset by providing or linking sources

  2. Publish Dataset: Review and Publish a dataset

  3. Manage Dataset: Perform actions such as editing, retiring and archiving a dataset. Editing a dataset creates a new version, and the changes are propagated to the existing dataset on publish

  4. Monitor Dataset: Monitor dataset operations. Every dataset exposes metrics such as data load, processing speed, queries per second and query throughput

  5. Manage Tables: Create aggregate tables. Aggregate tables can be created on all of the data or on a subset of it

  6. Manage Connectors: Manage source and destination connectors for the dataset
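As a rough illustration of exercising these capabilities programmatically, the sketch below creates and then publishes a dataset over HTTP. The base URL, endpoint paths (/datasets/v1/create, /datasets/v1/publish) and payload shape are assumptions made for illustration; the Dataset Management APIs guide documents the actual endpoints, request bodies and authentication.

```python
import requests

# Hypothetical sketch: the base URL, endpoint paths and payload shape are
# assumptions; consult the Dataset Management APIs guide for the real contract.
OBSRV_BASE_URL = "http://localhost:3000"   # assumed API host

payload = {
    "dataset_id": "telemetry-events",
    "sources": [{"connector": "kafka", "topic": "telemetry.ingest"}],
}

# Create the dataset as a draft ...
resp = requests.post(f"{OBSRV_BASE_URL}/datasets/v1/create", json=payload, timeout=30)
resp.raise_for_status()
print("create:", resp.json())

# ... then review and publish it so it starts receiving and processing data.
resp = requests.post(f"{OBSRV_BASE_URL}/datasets/v1/publish",
                     json={"dataset_id": "telemetry-events"}, timeout=30)
resp.raise_for_status()
print("publish:", resp.json())
```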
