At Endgame Engineering, experience has shown us that small errors in the edge cases of web service connection lifecycles can eventually contribute to production outages. So we believe it’s worth the time to exhaustively investigate bugs that we don’t understand and also explore related areas in the code to resolve issues before customers are impacted.
This case study walks through the identification, investigation, and resolution of an error we encountered in our data streaming pipeline. We will discuss how we use Go and nginx in our tech stack, what we learned while debugging and tuning their HTTP2 performance, and some general lessons about debugging issues like these.
Background
Endgame has made it a priority to invest in observability for our customers that are using the Endgame platform. We collect telemetry information about the health of our cloud hosted and on-premises installations and consolidate that information into a centralized data streaming pipeline. We apply streaming analytics on the data to monitor performance trends and product efficacy. Our engineering, customer support, and research teams use the data to regularly roll out improvements to the product.
At a high-level, the pipeline consists of:
Several data collection services which run on customer platform instances and gather telemetry data.
A NATS streaming server on each local platform that queues data for transmission.
A gateway service written in Go that establishes a secure HTTPS connection to the Endgame cloud backend using mutual TLS (mTLS). It streams messages from the local queue to our hosted environment.
An nginx server in our hosted environment that terminates the HTTPS connection and forwards messages to a data streaming platform.
Several downstream data consumers that post-process the information for consumption by our internal support teams.
The communication between our deployed platforms and the hosted environment happens over an HTTP2 connection. We chose HTTP2 because of its header compression and its performance for high-throughput, multiplexed communication. Furthermore, HTTP2 is a broadly adopted industry standard, which allows us to easily integrate client and server technologies without custom code.
Symptoms
We noticed several of our cloud hosted platforms reporting this error message in their logs with increasing frequency:
ERR Failed to send 'endpoint-health' message to cloud from queue 'feedback.endpoint-health': Post
https://data.endgame.com/v2/e/data-feedback/endpoint-health: http2: Transport: cannot retry err [http2:
Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define
Request.GetBody to avoid this error
It became apparent that this was a growing issue as the frequency of this error closely aligned with the increasing overall messages per second received by our data streaming backend. We used the AWS log search tool mentioned in an earlier blog post to confirm that the error was occurring broadly across many platforms. This led us to believe that the error was trending with overall data volume. Since we are regularly adding new telemetry and new customers, we concluded that the problem would most likely continue to worsen over time.
We were concerned that this error message represented data loss which would impact our customer support teams that rely on the data. Thankfully, we have retry logic at the application level that ensured these intermittent errors did not cause data loss. In order to ensure that the increasing error rate would not cause data loss in the future, we set out to determine why HTTP2 connections were regularly dropping.
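As a rough illustration of what we mean by application-level retry, a wrapper along the lines of the sketch below re-sends a message when a transient error occurs. The function name, attempt count, and backoff here are placeholders for illustration, not our production code:

    // Illustrative only: wrap a send function with simple retries so a
    // transient HTTP2 error does not drop the message. The attempt count and
    // linear backoff below are placeholder values.
    func sendWithRetry(send func() error, attempts int) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = send(); err == nil {
                return nil
            }
            time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
        }
        return err
    }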
Reproduction
The clear and unique error message from the Go HTTP2 library enabled us to immediately identify the place in the code where the error originated. This told us two things: First, we were misusing the Go HTTP2 transport somehow. And second, our nginx server was returning unexpected GOAWAY messages.
We worked backward from the error message to produce a small Go function that would reproduce the problem:
    doRequest := func(url string) error {
        // Setting Request.Body directly (rather than passing the body to
        // http.NewRequest) reproduces the misuse in our original code.
        req, _ := http.NewRequest("POST", url, nil)
        req.Body = ioutil.NopCloser(bytes.NewReader([]byte("{}")))
        resp, err := http.DefaultClient.Do(req)
        if resp != nil {
            defer resp.Body.Close()
        }
        return err
    }

    for i := 0; i < 1001; i++ {
        go func() {
            err := doRequest("https://data.endgame.com/v2/e/data-feedback/endpoint-health")
            if err != nil {
                fmt.Printf("HTTP error: %v\n", err)
            }
        }()
    }
NOTE: We’ve excluded some domain name and TLS configuration details from this example that are required to authenticate with our servers. If you run this example yourself you’ll see a slew of HTTP 403 errors and probably some TLS connection errors as well.
Running this code produced the output we expected:
HTTP error: Post
https://data.endgame.com/v2/e/data-feedback/endpoint-health: http2: Transport: cannot retry err [http2:
Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define
Request.GetBody to avoid this error
In our original implementation we set the Request.Body field directly before calling Client.Do. This turned out to be the source of the error: Request.Body is an io.ReadCloser that the transport consumes and closes after the first attempt, so the client has no way to replay the body if the request needs to be retried. If we instead pass the body to the request constructor, http.NewRequest, as an io.Reader, the constructor recognizes common in-memory reader types (such as *bytes.Buffer, *bytes.Reader, and *strings.Reader) and populates Request.GetBody, which gives the transport everything it needs to retry the request automatically.
So, the fixed doRequest function is:
    doRequest := func(url string) error {
        req, _ := http.NewRequest("POST", url, bytes.NewBufferString("{}"))
        resp, err := http.DefaultClient.Do(req)
        if resp != nil {
            defer resp.Body.Close()
        }
        return err
    }
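If a body can't simply be handed to http.NewRequest as an in-memory reader, the error message itself points at the other option: define Request.GetBody so the transport can replay the body when it retries. Below is a minimal sketch of that approach; the doRequestWithGetBody name is ours, and it assumes the payload is already buffered in memory:

    // Sketch: setting Request.Body directly, but also supplying Request.GetBody
    // so the transport can safely replay the body on a retry. This mirrors what
    // http.NewRequest does automatically for *bytes.Buffer, *bytes.Reader, and
    // *strings.Reader bodies.
    doRequestWithGetBody := func(url string, payload []byte) error {
        req, err := http.NewRequest("POST", url, nil)
        if err != nil {
            return err
        }
        req.Body = ioutil.NopCloser(bytes.NewReader(payload))
        req.ContentLength = int64(len(payload))
        req.GetBody = func() (io.ReadCloser, error) {
            return ioutil.NopCloser(bytes.NewReader(payload)), nil
        }
        resp, err := http.DefaultClient.Do(req)
        if resp != nil {
            defer resp.Body.Close()
        }
        return err
    }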
We learned a few important points while working through the reproduction:
The error only occurred if we ran many requests in parallel with goroutines. The HTTP client in Go is thread-safe, so generally we would expect it to work fine in parallel. This indicated that the error was related to the way that the connection was being closed by the server while several requests were in flight.
The error always occurred on the 1,001st request.
To better understand the HTTP2 traffic, we re-ran the reproduction with the GODEBUG=http2debug=2 environment variable, which enables the HTTP2 frame logging built into the net/http library. This provided more detail on the exact cause of the error. The 1,000th request included this unique logging information:
http2: Framer 0xc0003fe2a0: wrote HEADERS flags=END_HEADERS stream=1999 len=7
http2: Framer 0xc0003fe2a0: wrote DATA stream=1999 len=2 data="{}"
http2: Framer 0xc0003fe2a0: wrote DATA flags=END_STREAM stream=1999 len=0 data=""
http2: Framer 0xc0003fe2a0: read GOAWAY len=8 LastStreamID=1999 ErrCode=NO_ERROR Debug=""
http2: Transport received GOAWAY len=8 LastStreamID=1999 ErrCode=NO_ERROR Debug=""
http2: Framer 0xc0003fe2a0: read HEADERS flags=END_HEADERS stream=1999 len=115
This made it clear that the server was sending a GOAWAY frame with a last stream ID of 1999. Client-initiated streams use odd stream IDs, so stream 1999 corresponds to the 1,000th request on the connection. The HTTP2 spec further clarifies that GOAWAY is the frame an endpoint sends to initiate a graceful shutdown of a connection.
Tuning NGINX HTTP2 Settings
Clearly, nginx was closing HTTP2 connections at the 1,000th request. This clue was detailed enough to find an old Go issue referencing the same problem. This validated our findings that the behavior occurred specifically with nginx when requests happen in parallel over the same HTTP2 connection. Go fixed their library in 2016 (and again in 2018) to better handle this specific behavior with nginx.
The conversations in these issues confirmed that nginx was closing the connection because of the http2_max_requests setting in the nginx http2 module, which defaults to 1000.
An nginx issue opened in 2017 discussed ways to handle this setting for long-lived connections. In our use case, we expect to have HTTP2 connections open for a long time as messages are streamed to the server. It seemed like the best course of action would be to set http2_max_requests to a large value (say, 1 million) and accept that nginx would regularly close connections.
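For reference, the relevant part of an nginx configuration might look something like the sketch below; the server block, hostname, and omitted directives are illustrative rather than our exact configuration:

    server {
        listen 443 ssl http2;
        server_name data.example.com;   # illustrative hostname

        # Default is 1000; raise it so long-lived streaming connections are not
        # closed with a GOAWAY after every 1,000 requests.
        http2_max_requests 1000000;

        # ... TLS and proxy configuration omitted ...
    }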
Digging Deeper
We could have declared our misuse of the Go http.Request constructor, and perhaps the nginx setting, to be the “root cause” and ended the investigation there. But at Endgame we emphasize an engineering culture that values tenacity and technical excellence, and we agree with John Allspaw and many others that “there is no root cause” in complex systems. Rather, incidents have several contributing causes that emerge over time from unexpected events and from the decisions engineers make as they build and maintain a system.
So we asked: If we didn’t know about this nginx setting, what else did we not know about HTTP2 performance that might be contributing to connection drops and other performance issues? We knew that the original error message, “define Request.GetBody to avoid this error” indicated that we had misconfigured our Go client in at least one way and we suspected that there could be other similar issues lurking.
So, we made two changes to our test environment as preparation for further study:
Revise our use of Go http.Client so that we use the proper http.NewRequest constructor.
Set the nginx http2_max_requests setting to 1 million.
With these changes, we could confirm that our reproduction code did not trigger the original error, and with debugging on we could also confirm that nginx did not send a GOAWAY frame on the 1,000th request.
Early in the investigation, we observed that this error scenario was becoming more common as message volume increased. We assumed that under low load, the HTTP2 client would go for extended periods without any messages to send, and the connection would naturally time out and close. We decided to test that assumption with a simple function that sends a single message and then waits well past any timeout:
    doRequest("https://data.endgame.com/v2/e/data-feedback/endpoint-health") // same function as the earlier example
    time.Sleep(100 * time.Hour)
Again, the Go GODEBUG=http2debug=2 environment variable helped us watch the HTTP2 connection lifecycle. An abridged log looked like this:
...
2019/05/22 16:23:15 http2: Framer 0xc0001722a0: wrote HEADERS flags=END_HEADERS stream=1 len=65
2019/05/22 16:23:15 http2: Framer 0xc0001722a0: wrote DATA stream=1 len=2 data="{}"
...
2019/05/22 16:23:15 http2: decoded hpack field header field ":status" = "200"
...
2019/05/22 16:23:15 http2: Transport received HEADERS flags=END_HEADERS stream=1 len=111
2019/05/22 16:26:15 http2: Framer 0xc00015c1c0: read GOAWAY len=8 LastStreamID=1 ErrCode=NO_ERROR Debug=""
2019/05/22 16:26:15 http2: Transport received GOAWAY len=8 LastStreamID=1 ErrCode=NO_ERROR Debug=""
2019/05/22 16:26:15 http2: Transport readFrame error on conn 0xc000424180: (*errors.errorString) EOF
...
On the one hand, this was exactly what we expected to see: eventually, the connection times out and is closed. On the other hand, it was a surprise, because it is the server closing the connection after three minutes via the now-familiar GOAWAY frame. That matches the nginx http2_idle_timeout default of three minutes. But why doesn't the Go client close the idle connection itself? According to the Go docs, IdleConnTimeout in http.DefaultTransport is 90 seconds.
Remember earlier when we mentioned that this service uses mutual TLS? Looking over the code, we found that we were taking a naive approach to instantiating the HTTP client, essentially using code like this:
    var tlsCfg tls.Config
    // initialize the proper client and server certificates...

    client := http.Client{
        Timeout: 30 * time.Second,
        Transport: &http.Transport{
            TLSClientConfig: &tlsCfg, // http.Transport expects a *tls.Config
        },
    }
    // use client for HTTP POST requests...
We had remembered to explicitly set an HTTP client timeout. As explained in this informative blog post, if we initialize an http.Client but don't set the Timeout field, the client will never time out HTTP requests. But our experiment showed that the same “infinite timeout” default exists for several transport-level settings as well, and we had not set those. In particular, http.Transport.IdleConnTimeout defaults to 0, which means the transport keeps idle connections open forever.
So, we revised our code to explicitly define the transport-level defaults to match the suggested defaults in the Go http.DefaultTransport:
    client := http.Client{
        Timeout: 30 * time.Second,
        Transport: &http.Transport{
            MaxIdleConns:          100,
            IdleConnTimeout:       90 * time.Second,
            TLSHandshakeTimeout:   10 * time.Second,
            ExpectContinueTimeout: 1 * time.Second,
            TLSClientConfig:       &tlsCfg,
        },
    }
Running our experiment again, we confirmed that the Go client closed the connection itself after 90 seconds:
2019/05/22 17:45:24 http2: Framer 0xc0002641c0: wrote HEADERS flags=END_HEADERS stream=1 len=65
2019/05/22 17:45:24 http2: Framer 0xc0002641c0: wrote DATA stream=1 len=2 data="{}"
...
2019/05/22 17:45:24 http2: decoded hpack field header field ":status" = "200"
...
2019/05/22 17:46:54 http2: Transport closing idle conn 0xc00022d980 (forSingleUse=false, maxStream=1)
At this point, we felt much more comfortable that we were handling the HTTP2 lifecycle correctly on both the client and server.
Conclusion
The exploration of this bug taught us several general lessons about these sorts of issues.
Don’t forget about default values in Go HTTP!
Many people have written about configuring various HTTP settings in Go. We are glad that we were able to learn this lesson without experiencing a production outage due to our misconfiguration, and appreciate the contributions others have made to the community by documenting best practices.
Reproduction is worth the time
We could have easily silenced this specific error by defining Request.GetBody and assumed the problem was solved. But in reality, that would have masked several places in the code where we had misconfigured our client and server relative to our production workloads. Working through a minimal reproduction identified the exact behavior needed to trigger the bug and pointed us toward other changes that improved the performance and stability of the data pipeline.
Open source greatly speeds up debugging and fixing issues
Thanks to a well-written error message we were able to explore the exact location in the Go source code that triggered our bug. Reading through that code and related locations (like the constructor for http.Request) is what taught us how to properly utilize these objects to refactor our code appropriately. Without open source, we would have worked around the bug but not truly understood the underlying logic.