Fun with AI: Writing a program to calculate p95 to p99 latencies : Part 1

Purpose/Summary
Target Audience
Why percentiles, and why not average?
Sample Use Case
Preparing Test Data using ChatGPT
Part 2: Using ChatGPT to write a program to calculate message processing times
Part 3: Using the latencies to calculate the number of partitions and consumer instances needed
Other related items
Appendix
- Github project link
- Screenshot of the ChatGPT Chat Transcript

Purpose/Summary

There are 2 purposes of this blog post:

The first purpose is to showcase how to leverage tools like ChatGPT/AI to make dev work easier
The second purpose is to generate awareness of why we should use percentiles and not averages.

Target Audience

The target audience is someone who’s ideally working/worked on distributed systems already, and wishes to understand how to ensure/calculate the Availability, Reliability, Latency, System Requirements, etc of a distributed system.

Why percentiles, and why not average?

There are a lot of resources available on it already online, so I won’t go into too much detail and try to summarize it:

Averages (Mean) don’t reflect the reality when it comes to scale. When we’re talking about scale at billions/trillions, the average might be 5 milliseconds, the 95th percentile might be 19 milliseconds and the 99th percentile might be 37 milliseconds; thus the number we’d need to ensure 95% SLA or 99% SLA wouldn’t be calculated properly by using an Average.
If SLAs aren’t adhered to, businesses might have to pay monetary penalties to their customers.
Even when we talk about a scale of just 1 billion events a day, even a failure rate of 0.001% means 10,000 message failures a day – things work differently at scale, and that’s why distributed computing is a separate specialization altogether.

Sample Use Case

The sample use case that we’ll take is:

We have a distributed message-processing system
We are using Kafka
We want to calculate the number of Kafka Partitions required to ensure 99% SLA that messages will be processed within 1 second.

Preparing Test Data using ChatGPT

For things like hardware reliability, the data is collected by OEMs through vendors, dealers etc. Backblaze’s Hard drive test data is a good example: Hard Drive Data and Stats

Things are much easier in tech. It’s easy to load test a digital system, generate logs, and gather data.

For this blog post, we’ll use ChatGPT 3.5 to generate ~5 million log files: Here’s the chat transcript with ChatGPT: https://chat.openai.com/share/f88121bb-cdeb-4429-9e44-58a6ff63e53f

I had to modify the generated program to keep the timestamps chronologically correct (ChatGPT/AI isn’t a silver bullet and human decision-making is still a necessity). (We also didn’t cover Span IDs, and they’re necessary for a distributed system, but that’s for another time maybe).

I also modified the generated program to keep it limited to a single service, just for simplicity’s sake. Here’s a sample of the 5 million log files generated by the program:

The next steps would be to optimize the message processing times, and then move on to hardware requirement calculations.

Part 2: Using ChatGPT to write a program to calculate message processing times

In the next blog post, we’ll use ChatGPT to calculate the Average, 95th percentile, and 99th percentile message processing times.

We’ll also take a look at Skewness, Kurtosis etc. and why they’re relevant. Here are a few resources meanwhile:

Part 3: Using the latencies to calculate the number of partitions and consumer instances needed

In the 3rd version of this blog post, we’ll use the 99th percentile processing times to figure out how many Kafka partitions we’ll need.

Other related items

Mathematics and Statistics are used to calculate things like Reliability of Harddisks, Uptime Guarantees, Chances of Failures/Success etc. – and those things drive business decisions.

Appendix

Github project link

The program generated for this specific blog post is present here: https://github.com/MGoyal92/mayankgoyal.com-blog-programs/tree/master/src/main/java/com/mayankgoyal/blog/p95p99latency

Additional menu

Table of Contents