Distributed Systems Concepts
===================================

**CORE CONCEPTS**

- Distributed System: A collection of interconnected computers that communicate and coordinate to achieve a common goal
- Scalability: Handling increased workload

  + Horizontal: Add more machines (nodes)
  + Vertical: Upgrade existing machines (e.g. CPU, RAM)

- Reliability: Functioning correctly despite failures
- Availability: Proportion of uptime
- Consistency: Agreement on data values across all nodes at a given time
- Partition Tolerance: Operating despite network splits
- CAP Theorem: (Pick two) Consistency, Availability, Partition Tolerance
- State Management: Data storage/synchronization across nodes (stateless/stateful)
- Data Partitioning: Dividing data for scalability (horizontal/vertical, hash/range-based)
- Replication Strategies: Copying data for fault tolerance
- (synchronous/asynchronous/quorum-based)
- Consistency Models: Trade-offs between data agreement and performance (strong/eventual/causal)
- Time and Order: Event ordering (Lamport timestamps/ vector clocks)
- Fault Tolerance: Handling failures gracefully (failover/redundancy)
- Concurrency Control: Managing simultaneous access (optimistic/pessimistic)

**ARCHITECTURES**

- Client-Server: Clients request services from servers
- Peer-to-Peer (P2P): Nodes act as both clients and servers
- Master-Slave: One master node coordinates, multiple slaves execute Microservices: System decomposed into small, independent services
- Event-Driven Architecture: Components communicate by reacting to events
- Service-Oriented Architecture (SOA): A collection of services that communicate with each other
- Lambda Architecture: A hybrid architecture for batch and real-time processing

**KEY TECHNOLOGIES**

- Container Orchestration: Kubernetes, Docker Swarm Service Discovery: Consul, etc
- API Gateways: Kong, Tyk, Apigee, AWS API Gateway Load Balancers: HAProxy, NGINX, Traefik, Amazon ELB
- Monitoring & Observability: Prometheus, Grafana
- Distributed Tracing: Zipkin, Jaeger Service Mesh: Istio, Linkerd
- Infrastructure as Code (laC): Terraform, Pulumi, CloudFormation Configuration Management: Ansible, Chef, Puppet Distributed Caches: Redis, Memcached
- Stream Processing: Apache Flink, Apache Spark Streaming Workflow Orchestration: Apache Airflow, Argo Workflows Job Schedulers: Quartz, Apache Oozie, Kubernetes CronJobs Cloud Providers: AWS, Azure, Google Cloud Platform Serverless Platforms: AWS Lambda, Azure Functions, Google Cloud Functions

**FALLACIES OF DISTRIBUTED COMPUTING**

- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous

**COMMUNICATION**

- Remote Procedure Call (RPC): Call a function on a remote machine Message Passing: Asynchronous communication between nodes
- Message Queues (Point-to-Point): RabbitMQ, Amazon SQS
- Message Brokers (Publish-Subscribe): Kafka, Apache Pulsar
- REST: Query language for APls with fine-grained data fetching GraphQL: TLS/SSL, HTTPS, SSH
- gRPC: High-performance RPC framework using Protocol Buffers Webhooks: Automatic notifications sent when certain events occur

**COORDINATION & CONSENSUS**

- Consensus Algorithms: Paxos, Raft
- Distributed Locks: ZooKeeper, etcd
- Distributed Transactions: Two-phase commit (2PC)
- Leader Election: Choosing a coordinator node
- Gossip Protocol: Decentralized information dissemination

**DESIGN PRINCIPLES**

- Loose Coupling: Minimize component dependencies
- High Cohesion: Group related functionality together
- Abstraction: Hide implementation details
- Single Responsibility: Each component does one thing well Separation of Concerns: Divide system into distinct modules
- Statelessness: (Where possible) Avoid storing session data
- Idempotence: Operations can be repeated safely
- Eventual Consistency: Accept temporary inconsistency for availability and performance
- Fail-Fast: Detect and report errors quickly
- Circuit Breaker: Isolate failing components
- Retry Mechanisms: Automatically retry failed operations
- Asynchronicity: Design for non-blocking communication Timeouts/Deadlines: Set limits to prevent indefinite waits

**DATA MANAGEMENT**

Distributed Databases
- NoSQL: MongoDB, Cassandra, DynamoDB (flexible schema, scalable)
- NewSQL: CockroachDB, YugabyteDB (scalable with SQL-like features)
- Distributed File Systems

    + HDFS: For big data (Hadoop)
    + Ceph: Unified, distributed storage system

- Caching: Redis, Memcached (in-memory for speed)
- ACID: Atomicity, Consistency, Isolation, Durability (relational database guarantees)
- BASE: Basically Available, Soft state, Eventual consistency (NoSQL database characteristics)
- Data Partitioning: Distributing data across nodes (sharding, hashing, range partitioning)
- Replication Strategies: Copying data for fault tolerance