Table of Contents
- Purpose/Summary
- Target Audience
- Why percentiles, and why not average?
- Sample Use Case
- Preparing Test Data using ChatGPT
- Part 2: Using ChatGPT to write a program to calculate message processing times
- Part 3: Using the latencies to calculate the number of partitions and consumer instances needed
- Other related items
- Appendix
- Github project link
- Screenshot of the ChatGPT Chat Transcript
Purpose/Summary
There are 2 purposes of this blog post:
- The first purpose is to showcase how to leverage tools like ChatGPT/AI to make dev work easier
- The second purpose is to generate awareness of why we should use percentiles and not averages.
Target Audience
The target audience is someone who’s ideally working/worked on distributed systems already, and wishes to understand how to ensure/calculate the Availability, Reliability, Latency, System Requirements, etc of a distributed system.
Why percentiles, and why not average?
There are a lot of resources available on it already online, so I won’t go into too much detail and try to summarize it:
- Averages (Mean) don’t reflect the reality when it comes to scale. When we’re talking about scale at billions/trillions, the average might be 5 milliseconds, the 95th percentile might be 19 milliseconds and the 99th percentile might be 37 milliseconds; thus the number we’d need to ensure 95% SLA or 99% SLA wouldn’t be calculated properly by using an Average.
- If SLAs aren’t adhered to, businesses might have to pay monetary penalties to their customers.
- Even when we talk about a scale of just 1 billion events a day, even a failure rate of 0.001% means 10,000 message failures a day – things work differently at scale, and that’s why distributed computing is a separate specialization altogether.
Sample Use Case
The sample use case that we’ll take is:
- We have a distributed message-processing system
- We are using Kafka
- We want to calculate the number of Kafka Partitions required to ensure 99% SLA that messages will be processed within 1 second.
Preparing Test Data using ChatGPT
For things like hardware reliability, the data is collected by OEMs through vendors, dealers etc. Backblaze’s Hard drive test data is a good example: Hard Drive Data and Stats
Things are much easier in tech. It’s easy to load test a digital system, generate logs, and gather data.
For this blog post, we’ll use ChatGPT 3.5 to generate ~5 million log files: Here’s the chat transcript with ChatGPT: https://chat.openai.com/share/f88121bb-cdeb-4429-9e44-58a6ff63e53f
I had to modify the generated program to keep the timestamps chronologically correct (ChatGPT/AI isn’t a silver bullet and human decision-making is still a necessity). (We also didn’t cover Span IDs, and they’re necessary for a distributed system, but that’s for another time maybe).
I also modified the generated program to keep it limited to a single service, just for simplicity’s sake. Here’s a sample of the 5 million log files generated by the program:


The next steps would be to optimize the message processing times, and then move on to hardware requirement calculations.
Part 2: Using ChatGPT to write a program to calculate message processing times
In the next blog post, we’ll use ChatGPT to calculate the Average, 95th percentile, and 99th percentile message processing times.
We’ll also take a look at Skewness, Kurtosis etc. and why they’re relevant. Here are a few resources meanwhile:
- https://simple.wikipedia.org/wiki/Descriptive_statistics
- https://en.wikipedia.org/wiki/Descriptive_statistics
- https://www.ibm.com/docs/en/spss-statistics/beta?topic=reliability-scale-statistics
Part 3: Using the latencies to calculate the number of partitions and consumer instances needed
In the 3rd version of this blog post, we’ll use the 99th percentile processing times to figure out how many Kafka partitions we’ll need.
Other related items
Mathematics and Statistics are used to calculate things like Reliability of Harddisks, Uptime Guarantees, Chances of Failures/Success etc. – and those things drive business decisions.
Appendix
Github project link
The program generated for this specific blog post is present here: https://github.com/MGoyal92/mayankgoyal.com-blog-programs/tree/master/src/main/java/com/mayankgoyal/blog/p95p99latency