AWS Basic – Network 101

Here are the AWS Networking knowledges that are fundamental for cloud computing.

Region: (e.g. us-east-1)

AWS has the concept of a Region, which is a physical location around the world where we cluster data centers.

Each AWS Region is designed to be isolated from the other AWS Regions. This design achieves the greatest possible fault tolerance and stability.

VPC:

The Amazon Virtual Private Cloud (Amazon VPC) service lets you provision a private, isolated section of the AWS Cloud where you can launch AWS services and other resources in a virtual network that you define.

Availability Zone: (e.g: us-east-1a)

An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZ’s give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center.

Subnet:

Separate subnets for unique routing requirements. AWS recommends using public subnets for external-facing resources and private subnets for internal resources. For each Availability Zone, this Quick Start provisions one public subnet and one private subnet by default.

Internet Gateway:

An internet gateway is an access point through which your resources can access the internet and be accessed from the internet.

NAT Gateway:

A NAT gateway can route outgoing traffic from private subnets to the internet.

Route 53:

Amazon Route 53 is the DNS available for your AWS resources.

Percentile – Monitoring

When we want to monitor the distributed system, we usually use “percentile”. For example, P99 – that means percentile 99, we mesure the performance until 99% and we exclude the last 1% performance.

Concret example, we say a Service’s latency P99 = 100ms, that means 99% of service response time is less than 100ms.

Normally, the calculate of percentile is expensive. Because we have to take for example 100 samples and order them , find the 99th one.

For monitoring, we usually take P50, P99 and P99.9.

Here is a good example by Elastic which can help to understand the concept. And anther one for going deeper.

The links:

https://www.elastic.co/blog/averages-can-dangerous-use-percentile

https://blog.bramp.net/post/2018/01/16/measuring-percentile-latency/

Single point of failure -SPOF

In distributed system world, Single point of failure (SPOF) is a key word that you should always be aware.

It means if a part of system fails, the whole system will be down. For example, if Service A sends messages to Service B via a single instance of message queue, then if the queue fails, the communication between Service A and B will be completely loses. Then this message queue is Single point of failure (SPOF) of the system.

The key solution to remove SPOF is using “Redundancy“, here is very well document by Oracle that explains the point.

The system “Reliability” explained by Amazon.

Example:

How AWS remove SPOF of load balancer:

  • Elastic Load Balancing (ELB) : There are two logical components in the Elastic Load Balancing service architecture: load balancers and a controller service. The load balancers are resources that monitor traffic and handle requests that come in through the Internet. The controller service monitors the load balancers, adds and removes capacity as needed, and verifies that load balancers are behaving properly.
  • Amazon SQS Standard queues : provides At-least-once delivery:Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability. On rare occasions, one of the servers that stores a copy of a message might be unavailable when you receive or delete a message.
  • Amazon RDS Multi-AZ: In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. The primary DB instance is synchronously replicated across Availability Zones to a standby replica to provide data redundancy, eliminate I/O freezes, and minimize latency spikes during system backups.

Useful links:

https://docs.oracle.com/cd/E19424-01/820-4806/fjdch/index.html

https://wa.aws.amazon.com/wat.pillar.reliability.en.html

This blog is part of category Distributed System