System Design Topics

Overview

master key design concepts
Every system is unique, and the larger and more complex the system, the further it is from conventional design approaches. In an interview, though, you are expected to demonstrate knowledge and understanding of widely used system design concepts and best practices.
engineers are paid big money to come up with secret sauce to solve unique problems -> so don't worry about coming up with a unique solution
you'll learn how to build a reliable, scalable, secure, fast, easy-to-maintain, and low-dollar-cost system
YouTube is a collection of small systems

Thought process:

break problem into common subproblems, then apply fundamental concepts & best practices to solve them

quality of system (how system is supposed to be)
if interviewer gives an ambiguous problem
- my reply should be: the problem seems too big
- then: let me try to define specific functions and reduce the scope

do we need to scale for writes or reads?
both
- to scale for writes
  - partition messages and store in separate queues
  - which partition strategy to use?
    - maybe hash is okay
- where do i store quickly?
  - in memory (bounded queue) or on disk (append-only log or an embedded database)
    - if database
      - should i pick B-tree or LSM tree?
        - more probably LSM, as such DBs are faster for writes
- to scale for read
  - partitioning will help here too. we will have a consumer per partition
  - should i use pull or push for reading messages?
    - if i go with pull, i need to make sure system supports long polling to decrease number of read requests
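
A minimal sketch of the hash partitioning idea above, assuming in-memory queues and a fixed partition count. The names `NUM_PARTITIONS`, `partition_for`, `publish`, and `consume` are made up for illustration; a real system would bound or persist the queues.

```python
import hashlib
from collections import deque

NUM_PARTITIONS = 4  # assumed value, chosen only for illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition using a stable hash."""
    # hashlib is used instead of built-in hash(), which is randomized per process
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# one queue per partition (unbounded here to keep the sketch short)
partitions = [deque() for _ in range(NUM_PARTITIONS)]


def publish(key: str, message: str) -> int:
    """Append the message to the partition chosen by its key."""
    p = partition_for(key)
    partitions[p].append(message)
    return p


def consume(partition: int):
    """One consumer per partition drains its own queue in arrival order."""
    queue = partitions[partition]
    while queue:
        yield queue.popleft()
```

messages with the same key always land in the same partition, so ordering per key is preserved while writes and reads spread across partitions.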
how to get high availability
- i need to replicate messages
- leader-based or leaderless replication?
  - most likely leader-based, but then i need to solve the leader election problem
    - that should be easy: i can use a coordination service or a DB that guarantees strong consistency
how to make system reliable
- i need some protection mechanisms (load shedding, rate limiting, maybe shuffle sharding)
- should i use reverse proxy - maybe
  - it will take care of partition discovery and message routing
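
One common way to implement the rate limiting mentioned above is a token bucket. This is a rough sketch, not a production limiter: the class name `TokenBucket` and its parameters are assumptions, and the clock is injectable only to make the behavior easy to check.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # refill based on elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

a rejected request can be answered right away (for example with a "too many requests" error) instead of queueing, which keeps the system responsive under overload.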
how to make it fast?
- i should consider batching and compressing
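
A small sketch of batching and compressing, assuming messages are strings and that JSON plus zlib is an acceptable wire format (both are choices made here for illustration). `encode_batch` and `decode_batch` are hypothetical helper names.

```python
import json
import zlib


def encode_batch(messages: list[str]) -> bytes:
    """Pack many messages into one payload and compress it."""
    payload = json.dumps(messages).encode("utf-8")
    return zlib.compress(payload)


def decode_batch(blob: bytes) -> list[str]:
    """Reverse of encode_batch: decompress and unpack the messages."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

batching trades a little latency for fewer network round trips, and compression tends to work better on a batch than on single small messages because similar messages share repeated content.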

when you know what concepts exist to address what requirements, you are much better set for a meaningful discussion with interviewer

need to identify who is going to use system and how
generally written in form of user stories
when a user does this, system does that
- user could be creator/ viewer (in youtube) or other systems (rate limiting system)
easier to identify for systems we know (youtube, facebook, etc)
harder to identify for systems we don't interact with generally (rate limiting systems, fraud prevention, CDN, etc)

time based
- uptime (share of time the system has been working and available)
count based
- success ratio of requests
a network outage may mean the client couldn't reach the server even though the server itself was available. perspective matters, so as system engineers we should care about the client's experience
involves both architecture (design concepts) and process (how we deploy etc)
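
The two ways of measuring availability above (uptime and success ratio of requests) can be written as simple ratios. A sketch with hypothetical function names:

```python
def time_based_availability(uptime_seconds: float, total_seconds: float) -> float:
    """Fraction of the period during which the system was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return uptime_seconds / total_seconds


def count_based_availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded, as seen by whoever counts them."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


# 30 days with 2592 seconds (43.2 minutes) of downtime is 99.9% availability
month = 30 * 24 * 60 * 60
three_nines = time_based_availability(month - 2592, month)
```

counting requests on the client side captures failures such as network outages that the server never sees.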

build redundancy to avoid single point of failure (regions, availability zones, replication)
switch from 1 server to another without losing data (DNS, load balancer, reverse proxy)
protect from client (load shedding, rate limiting)
protect from failure and perf degradation of dependencies (timeout, circuit breaker)
detect failure (monitoring, health checks)
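
A rough sketch of the circuit breaker named above, assuming a simple consecutive-failure threshold and a fixed reset timeout. `CircuitBreaker` and `CircuitOpenError` are hypothetical names, and real implementations usually add more careful half-open handling and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # fail fast instead of waiting on an unhealthy dependency
                raise CircuitOpenError("dependency call skipped: circuit is open")
            # timeout elapsed: half-open, let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # a failed trial call, or too many failures in a row, (re)opens the circuit
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

the wrapped `func` should itself use a timeout, otherwise a hanging dependency never registers as a failure.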

Error: mistake, usually made by people
Fault: a bug caused by error
Fault tolerance:
- goal is zero downtime
- requires even more redundancy than high availability -> higher cost
High availability:
- goal is to minimize downtime

car vs airplane: in case of a car, it is easier to change the tire
in case of engine failure in an airplane, the other engines have to take over (at least till the nearest airport)
car -> high availability. airplane -> fault tolerant
Resilience:
- in case of fault, maintain acceptable level of service
Game Day
- more focused on teams and their actions
- goal is to build muscle memory on how to respond to events
chaos engineering
- more focused on systems behavior
Reliability = Availability + Correctness + time (system replies in timely manner)
reliability, high availability -> how system works in expected failures
fault tolerance, resilience -> how system works in unexpected failures