System Design Topics

Overview

  • master key design concepts
  • Every system is unique, and the larger and more complex the system, the further it is from conventional design approaches, but in an interview you are expected to demonstrate your knowledge and understanding of widely used system design concepts and best practices
  • engineers are paid big money to come up with secret sauce to solve unique problems -> so don't worry about coming up with a unique solution
  • you'll learn how to build a reliable, scalable, secure, fast, easy-to-maintain, and low-dollar-cost system
  • youtube is a collection of small systems

Thought process:

  • break problem into common subproblems, then apply fundamental concepts & best practices to solve them

problems to think about

  • how to transfer data at large scale
  • how to aggregate data efficiently
  • how to store data reliably
  • how to retrieve data quickly

need to learn

  • what common system design problems exist
  • what tools we have to solve them
  • think in LEGO model. each brick is a system design model

messaging queues

general purpose

  • RabbitMQ
  • Apache ActiveMQ

event streaming platforms

  • Apache Kafka
  • Amazon Kinesis

message queuing services

  • Amazon SQS
  • Azure Queue Storage

Pub/Sub services

  • Amazon SNS
  • Google Cloud Pub/Sub

Requirements

functional

  • behavior (what system is supposed to do)

non-functional

  • quality of system (how system is supposed to be)
  • if the interviewer gives an ambiguous problem
    • my reply should be: the problem seems too big
    • so let me try to define specific functions and reduce the scope

Questions to ask yourself

  • do we need to scale for writes or reads?
  • both
    • to scale for writes
      • partition messages and store in separate queues
      • which partition strategy to use?
        • maybe hash is okay
    • where do i store quickly?
      • in memory (bounded queue) or on disk (append-only log or an embedded database)
        • if database
          • should i pick b-tree or lsm tree?
            • most likely an LSM tree, since LSM-based DBs are generally faster for writes
    • to scale for read
      • partitioning will help here too: we will have one consumer per partition
      • should i use pull or push for reading messages?
        • if i go with pull, i need to make sure system supports long polling to decrease number of read requests
  • how to get high availability
    • i need to replicate messages
    • leader-based or leaderless replication?
      • most likely leader-based, but then i need to solve the leader election problem
        • that should be easy: i can use a coordination service or a DB that guarantees strong consistency
  • how to make system reliable
    • i need some protection mechanisms (load shedding, rate limiting, maybe shuffle sharding)
    • should i use reverse proxy - maybe
      • it will take care of partition discovery and message routing
  • how to make it fast?
    • i should consider batching and compressing
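
The partitioning idea above (hash the message key, keep one bounded queue per partition, let one consumer pull batches from each partition) can be sketched roughly as follows. This is a minimal in-memory illustration, not how any particular broker works; the class name `PartitionedQueue` and its methods are hypothetical, and SHA-256 is just one possible stable hash.

```python
import hashlib
from collections import deque
from typing import List


class PartitionedQueue:
    """Hypothetical in-memory queue split into a fixed number of partitions."""

    def __init__(self, num_partitions: int, capacity: int) -> None:
        self.partitions = [deque() for _ in range(num_partitions)]
        self.capacity = capacity  # assumed bound per partition

    def partition_for(self, key: str) -> int:
        # Stable hash, so the same key always lands in the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.partitions)

    def publish(self, key: str, message: str) -> bool:
        queue = self.partitions[self.partition_for(key)]
        if len(queue) >= self.capacity:
            return False  # bounded queue: reject instead of growing forever
        queue.append(message)
        return True

    def poll(self, partition: int, max_messages: int) -> List[str]:
        # Pull model: the consumer of this partition asks for a batch.
        queue = self.partitions[partition]
        batch = []
        while queue and len(batch) < max_messages:
            batch.append(queue.popleft())
        return batch
```

Returning a batch from `poll` is also where the batching idea from the "how to make it fast?" question would fit; long polling and compression are left out of this sketch.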

Actions

  • write the functional and non-functional requirements. keep a list.
  • the higher an item is in the list, the more important it is

Conclusion

when you know what concepts exist to address what requirements, you are much better set for a meaningful discussion with interviewer

Functional Requirements

  • need to identify who is going to use system and how
  • generally written in form of user stories
  • when a user does this, system does that
    • user could be creator/ viewer (in youtube) or other systems (rate limiting system)
  • easier to identify for systems we know (youtube, facebook, etc)
  • harder to identify for systems we don't interact with generally (rate limiting systems, fraud prevention, CDN, etc)

expectation:

  • the interviewer wants to see how we deal with ambiguity
  • also how we analyze big obscure problems & how we reduce complexity
  • expect a lot more ambiguity for senior roles

action:

  • start with customer and work backwards (amazon's approach)

Non Functional Requirements

High availability

  • uptime (system has been working and available)
  • count based
    • success ratio of requests
  • a network outage can stop a client from reaching a server even though the server itself is available; perspective matters, so as system engineers we should care about the client's experience
  • involves both architecture (design concepts) and process (how we deploy etc)
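
The two ways of measuring availability noted above (uptime and request success ratio) come down to simple arithmetic. A small sketch, with hypothetical function names, that also converts an availability target into allowed downtime per year (assuming a 365-day year):

```python
def time_based_availability(uptime_seconds: float, total_seconds: float) -> float:
    """Fraction of the observation window in which the system was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return uptime_seconds / total_seconds


def count_based_availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def allowed_downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target (365-day year assumed)."""
    minutes_per_year = 365 * 24 * 60
    return (1 - availability) * minutes_per_year
```

For example, a 99.9% target leaves a budget of about 525.6 minutes (roughly 8.76 hours) per year, and each extra nine shrinks it by a factor of ten.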

Steps

  • build redundancy to avoid single point of failure (regions, availability zones, replication)
  • switch from 1 server to another without losing data (DNS, load balancer, reverse proxy)
  • protect from client (load shedding, rate limiting)
  • protect from failure and perf degradation of dependencies (timeout, circuit breaker)
  • detect failure (monitoring, health checks)
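
As one illustration of protecting against a failing dependency, here is a minimal circuit breaker sketch. The names (`CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`) are hypothetical, and real libraries usually add more states and metrics. The clock is injectable only so the behavior is easy to test.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when calls are short-circuited instead of reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency calls are suspended")
            # Timeout elapsed (half-open): let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder, and treat `CircuitOpenError` as a signal to fall back or fail fast.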

processes

  • change management (changes should be reviewed and approved)
  • QA (tests to validate newly introduced changes work)
  • deployment pipeline (deploy to prod, automated rollback)
  • capacity planning (monitoring system resources for growing demand)
  • disaster recovery (backup, restore)
  • root cause analysis
  • game day (simulate a failure & test system and team response)
  • team culture
  • example: DynamoDB's SLA is 99.99% availability for standard tables and 99.999% for global tables

Reliability

  • Error: mistake, usually made by people
  • Fault: a bug caused by error
  • Fault tolerance:
    • goal is zero downtime
    • requires even more redundancy than high availability -> higher cost
  • High availability:
    • goal is to minimize downtime

Analogy:

  • car vs airplane; in case of a car, it is easier to stop and change the tire
  • in case of engine failure in an airplane, the other engines have to take over (at least till the nearest airport)
  • car -> high availability. airplane -> fault tolerance
  • Resilience:
    • in case of fault, maintain acceptable level of service
  • Game Day
    • more focused on the team and its actions
    • goal is to build muscle memory on how to respond to events
  • chaos engineering
    • more focused on systems behavior
  • Reliability = Availability + Correctness + time (system replies in timely manner)
  • reliability, high availability -> how system works in expected failures
  • fault tolerance, resilience -> how system works in unexpected failures

Fault Tolerance

Scalability

  • the property of a system to handle growing load
  • load could be:
    1. requests per second
    2. volume of data
    3. number of concurrent connections

vertical vs horizontal

vertical

  • add resources to system
  • after a point, it becomes expensive

horizontal

  • add more systems

designing horizontal scalability

  • harder than vertical scaling; needs:
  1. Service Discovery
    • clients have to discover service machines
  2. Load Balancing
  3. Request Routing
  4. Maintenance
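
To make service discovery and load balancing concrete, here is a toy sketch: servers register themselves in a registry, and a client-side balancer looks them up and rotates through them. `ServiceRegistry` and `RoundRobinBalancer` are hypothetical names; a real setup would typically rely on DNS, a coordination service, or a dedicated load balancer, plus health checks.

```python
from typing import Dict, List


class ServiceRegistry:
    """Hypothetical registry mapping a service name to its machine addresses."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service: str, address: str) -> None:
        addresses = self._instances.get(service, [])
        if address in addresses:
            addresses.remove(address)

    def lookup(self, service: str) -> List[str]:
        return list(self._instances.get(service, []))


class RoundRobinBalancer:
    """Picks registered instances in turn, re-reading the registry on every call."""

    def __init__(self, registry: ServiceRegistry, service: str) -> None:
        self.registry = registry
        self.service = service
        self._counter = 0

    def next_address(self) -> str:
        instances = self.registry.lookup(self.service)
        if not instances:
            raise LookupError("no instances registered for " + self.service)
        address = instances[self._counter % len(instances)]
        self._counter += 1
        return address
```

Because the balancer re-reads the registry each time, machines added or removed during maintenance are picked up on the next request.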

horizontal scaling for DB

  • sharding to scale writes
  • replication to scale reads
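
A rough sketch of how the two ideas combine: writes for a key go to the leader of the shard that owns the key, while reads are spread across that shard's replicas. `Shard` and `ShardRouter` are hypothetical names, and the sketch ignores replication lag, so a read from a replica may return slightly stale data.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    leader: str            # node that accepts writes for this shard
    replicas: List[str]    # read-only copies of the leader's data
    _next: int = 0         # round-robin cursor for reads

    def read_node(self) -> str:
        # Fall back to the leader when the shard has no replicas.
        nodes = self.replicas or [self.leader]
        node = nodes[self._next % len(nodes)]
        self._next += 1
        return node


class ShardRouter:
    def __init__(self, shards: List[Shard]) -> None:
        self.shards = shards

    def shard_for(self, key: str) -> Shard:
        # Hash-based sharding: the key decides which shard owns the record.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def node_for_write(self, key: str) -> str:
        return self.shard_for(key).leader

    def node_for_read(self, key: str) -> str:
        return self.shard_for(key).read_node()
```

Adding shards raises write capacity, and adding replicas per shard raises read capacity. Note that plain modulo hashing remaps most keys when the shard count changes.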

how to get availability in a single machine

  • active-passive setup
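
A minimal sketch of the active-passive idea: the passive node is promoted when the active node stops sending heartbeats. The class name, the heartbeat mechanism, and the timeout are assumptions made for illustration; timestamps are passed in explicitly to keep the example deterministic.

```python
class ActivePassivePair:
    """Hypothetical failover controller for one active and one passive node."""

    def __init__(self, active: str, passive: str, heartbeat_timeout: float) -> None:
        self.active = active
        self.passive = passive
        self.heartbeat_timeout = heartbeat_timeout  # seconds of silence tolerated
        self.last_heartbeat = 0.0

    def record_heartbeat(self, now: float) -> None:
        # Called whenever the active node reports that it is alive.
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Return the node that should serve traffic, failing over if needed."""
        if now - self.last_heartbeat > self.heartbeat_timeout:
            # Active node is presumed dead: promote the passive node.
            self.active, self.passive = self.passive, self.active
            self.last_heartbeat = now  # give the new active node a fresh window
        return self.active
```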

Caching