Finding and Fixing Memory Leaks in Python
Peter Karp
Jan 17
One of the major benefits provided in dynamic interpreted languages such as Python is that they make managing memory easy.
Objects such as arrays and strings grow dynamically as needed and their memory is cleared when no longer needed.
Since memory management is handled by the language, memory leaks are a less common problem than in languages like C and C++, where it is left to the programmer to request and free memory.
The BuzzFeed technology stack includes a micro-service architecture that supports over a hundred services, many of which are built with Python.
We monitor the services for common system properties such as memory and load.
In the case of memory, a well-behaved service will use memory and then free it.
It performs like this chart reporting on the memory used over a three-month period.
A microservice that leaks memory over time will exhibit a saw-tooth behavior: memory increases until some point (for example, the maximum memory available) at which the service is shut down, freeing all its memory, and then restarted.
Sometimes a code review will identify places where underlying operating system resources such as a file handle are allocated but never freed.
These resources are limited and each time they are used they allocate a small amount of memory and need to be freed after use so others may use them.
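As a minimal sketch of the kind of thing a code review might catch, the example below contrasts a function that leaves a file handle open with one that releases it using a with-statement. The function names and the temporary file are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def read_leaky(path):
    # The file is never closed here, so the underlying descriptor stays
    # open for as long as the returned handle is alive.
    handle = open(path)
    return handle.read(), handle


def read_safely(path):
    # The with-statement closes the file as soon as the block exits,
    # even if read() raises an exception.
    with open(path) as handle:
        return handle.read(), handle


# Demo on a temporary file (hypothetical data, for illustration only).
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

_, leaky_handle = read_leaky(demo_path)
_, safe_handle = read_safely(demo_path)
print(leaky_handle.closed, safe_handle.closed)  # False True

# Clean up the demo resources.
leaky_handle.close()
os.remove(demo_path)
```

The handles are returned only so the demo can inspect their state; real code would normally return just the data.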
This post first describes the tools used to identify the source of a memory leak.
It then presents a real example of a Python application that leaked memory and how these tools were used to track down the leak.
Tools
If a code review does not turn up any viable suspects, then it is time to turn to tools for tracking down memory leaks.
The first tool should provide a way to chart memory usage over time.
At BuzzFeed we use DataDog to monitor microservices performance.
Leaks may accumulate slowly over time, several bytes at a time.
In this case it is necessary to chart the memory growth to see the trend.
The other tool, tracemalloc, is part of the Python standard library.
Essentially tracemalloc is used to take snapshots of the Python memory.
To begin using tracemalloc, first call tracemalloc.start() to initialize it, then take a snapshot using:
    snapshot = tracemalloc.take_snapshot()
tracemalloc can show a sorted list of the top allocations in the snapshot using the statistics() method on a snapshot.
In this snippet the top five allocations grouped by source filename are logged.
    for i, stat in enumerate(snapshot.statistics('filename')[:5], 1):
        logging.info("top_current i=%s stat=%s", i, stat)
The output shows the size of the memory allocation, the number of objects allocated, and the average size of each on a per-module basis.
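For readers who want to try this end to end, here is a small self-contained sketch of the same steps: start tracing, allocate something, take a snapshot, and list the top five files. The payload list is a hypothetical workload added only so the snapshot has something to report.

```python
import tracemalloc

tracemalloc.start()

# Hypothetical workload: allocate some objects so the snapshot has content.
payload = [str(n) * 10 for n in range(10000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("filename")[:5]

for i, stat in enumerate(top_stats, 1):
    # Each Statistic exposes size (bytes), count (blocks) and a traceback.
    print(i, stat.traceback[0].filename, stat.size, stat.count)

tracemalloc.stop()
```

Printing a Statistic directly, as the logging call does with its string form, gives a one-line summary that also includes the average block size.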
We take a snapshot at the start of our program and implement a callback that runs every few minutes to take a snapshot of the memory.
Comparing two snapshots shows changes in memory allocation.
We compare each snapshot to the one taken at the start.
By observing any allocation that is increasing over time we may capture an object that is leaking memory.
The compare_to() method is called on a snapshot to compare it with another snapshot.
The 'filename' parameter is used to group all allocations by module.
This helps to narrow a search to a module that is leaking memory.
    current = tracemalloc.take_snapshot()
    stats = current.compare_to(start, 'filename')
    for i, stat in enumerate(stats[:5], 1):
        logging.info("since_start i=%s stat=%s", i, stat)
The output shows the size and the number of objects, how each has changed since the first snapshot, and the average allocation size on a per-module basis.
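The comparison can be tried in isolation with a deliberately leaky program. This is only a sketch: the ever-growing list named leaked is a hypothetical stand-in for a real leak.

```python
import tracemalloc

tracemalloc.start()
start = tracemalloc.take_snapshot()

# Hypothetical leak: a module-level list that only ever grows.
leaked = []
for n in range(50000):
    leaked.append("item-%d" % n)

current = tracemalloc.take_snapshot()
stats = current.compare_to(start, "filename")

for i, stat in enumerate(stats[:5], 1):
    # size_diff and count_diff are the changes relative to the start snapshot.
    print(i, stat.traceback[0].filename, stat.size_diff, stat.count_diff)

largest_growth = max(stat.size_diff for stat in stats)
tracemalloc.stop()
```

In a long-running service the interesting entries are the ones whose size_diff keeps climbing from one comparison to the next, rather than the ones that are merely large.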
Once a suspect module is identified, it may be possible to find the exact line of code responsible for a memory allocation.
tracemalloc provides a way to view a stack trace for any memory allocation.
As with a Python exception traceback, it shows the line and module where an allocation occurred and all the calls that came before.
Reading bottom to top, this shows a trace to a line in the socket module where a memory allocation took place.
With this information it may be possible to finally isolate the cause of the memory leak.
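A minimal sketch of retrieving such a stack trace follows. It assumes tracing was started with more than one frame per allocation; the functions build_buffer() and handle_request() are hypothetical and exist only to create a call chain.

```python
import tracemalloc

# Keep up to 25 frames per allocation; the default of 1 records only the
# line that performed the allocation.
tracemalloc.start(25)


def build_buffer():
    # Hypothetical allocation site, used only for this demo.
    return [bytes(1000) for _ in range(1000)]


def handle_request():
    return build_buffer()


data = handle_request()

snapshot = tracemalloc.take_snapshot()
# Group by full traceback and take the largest allocation site.
largest = snapshot.statistics("traceback")[0]
trace_lines = largest.traceback.format()

print("%d blocks, %.1f KiB" % (largest.count, largest.size / 1024))
for line in trace_lines:
    print(line)

tracemalloc.stop()
```

To look at a single suspect module, the snapshot can first be narrowed with snapshot.filter_traces() and a tracemalloc.Filter on a filename pattern such as "*/socket.py"; by default the filter checks only the most recent frame of each trace.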
In this first section we saw that tracemalloc takes snapshots of memory and provides statistics about the memory allocation.
The next section describes the search for an actual memory leak in one BuzzFeed microservice.
The Search for Our Memory Leak
Over several months we observed the classic saw-tooth of an application with a memory leak.
We instrumented the microservice with a call to trace_leak(), a function we wrote to log the statistics found in the tracemalloc snapshots.
The code loops forever and sleeps for some delay in each loop.
The microservice is built using tornado, so we call it using spawn_callback() and pass the parameters delay, top and trace:
    tornado.ioloop.IOLoop.current().spawn_callback(trace_leak, delay=300, top=5, trace=1)
The logs for a single iteration showed allocations occurring in several modules, including tracemalloc itself.
tracemalloc is not the source of the memory leak! However, it does require some memory, so it shows up here.
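The post does not show trace_leak() itself, so the following is only a guess at its shape: a coroutine that takes a baseline snapshot and then, after each delay, logs the top differences and the largest tracebacks. The meaning of each parameter is assumed, the iterations argument is an addition that lets the loop terminate in a demo, and asyncio.sleep() assumes Tornado 5 or later, where the IOLoop runs on asyncio.

```python
import asyncio
import logging
import tracemalloc


async def trace_leak(delay=300, top=5, trace=1, iterations=None):
    """Periodically log the largest allocation changes since startup.

    Hypothetical reconstruction; parameter meanings are assumed:
    delay is seconds between snapshots, top is the number of modules to
    log, trace is the number of tracebacks to log, and iterations stops
    the loop after that many passes (None means loop forever).
    """
    tracemalloc.start(25)
    start = tracemalloc.take_snapshot()
    done = 0
    while iterations is None or done < iterations:
        await asyncio.sleep(delay)
        current = tracemalloc.take_snapshot()
        # Growth per module since the first snapshot.
        for i, stat in enumerate(current.compare_to(start, "filename")[:top], 1):
            logging.info("since_start i=%s stat=%s", i, stat)
        # Full tracebacks for the largest allocation sites.
        for stat in current.statistics("traceback")[:trace]:
            logging.info("traceback size=%s lines=%s", stat.size, stat.traceback.format())
        done += 1
    return done


# Standalone demo with a short run (asyncio.run needs Python 3.7+).
completed = asyncio.run(trace_leak(delay=0.01, iterations=2))
tracemalloc.stop()
```

Comparing against the first snapshot rather than the previous one keeps slow growth visible, since small per-interval changes accumulate into a clear difference.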
After running the service for several hours, we use DataDog to filter the logs by module and start to see a pattern with socket.py.
The size of the allocation for socket.py is increasing from 1840 KiB to 1845 KiB.
None of the other modules exhibited this clear a trend.
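We spotted this trend by eye in the filtered logs. A rough sketch of automating the same check is shown below, assuming the per-module sizes have been collected into a dictionary; that structure and the function name are hypothetical. The endpoint values for socket.py and the constant ssl.py figure come from this post, while the intermediate samples and other.py are invented for illustration.

```python
def steadily_growing(history, min_samples=3):
    """Return module names whose sizes never shrink and end up larger.

    history maps a module name to its allocation sizes over time
    (a hypothetical structure, not something tracemalloc produces).
    """
    flagged = []
    for name, sizes in history.items():
        if len(sizes) < min_samples:
            continue
        never_shrinks = all(b >= a for a, b in zip(sizes, sizes[1:]))
        if never_shrinks and sizes[-1] > sizes[0]:
            flagged.append(name)
    return sorted(flagged)


# Sizes in KiB; intermediate samples are invented for illustration.
history = {
    "socket.py": [1840, 1842, 1843, 1845],
    "ssl.py": [2500, 2500, 2500, 2500],
    "other.py": [300, 280, 310, 290],
}
print(steadily_growing(history))  # ['socket.py']
```

A steady upward trend is only a hint about where to look next, not proof that the module is the source of the leak.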
We next look at the traceback for socket.py.
We identify a possible cause
We get a stack trace from tracemalloc for the socket module.
Initially, I want to assume that Python and the standard library are solid and not leaking memory.
Everything in this trace is part of the Python 3.6 standard library except for ddtrace/writer.py, which comes from a package provided by DataDog.
Given my assumption about the integrity of Python, a package provided by a third party seems like a good place to start investigating further.
It’s still leaking
We find when ddtrace was added to our service, do a quick rollback of requirements, and then start monitoring the memory again.
Another look at the logs
Over the course of several days the memory continues to rise.
Removing the module did not stop the leak.
We did not find the leaking culprit.
So it’s back to the logs to find another suspect.
There is nothing in these logs that looks suspicious on its own.
However, ssl.py is allocating the largest chunk by far, 2.5 MB of memory.
Over time the logs show that this remains constant, neither increasing nor decreasing.
Without much else to go on we start checking the tracebacks for ssl.py.
A solid lead
The top of the stack shows a call on line 645 of ssl.py to peer_certificate().
Without much else to go on we make a long-shot Google search for “python memory leak ssl peer_certificate” and get a link to a Python bug report.
Fortunately, this bug was resolved.
Now it was simply a matter of updating our container image from Python 3.6.1 to Python 3.6.4 to get the fixed version and see if it resolved our memory leak.
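One way to guard against an old image slipping back in is a version check at startup. This is a sketch rather than something from our service: the function name is hypothetical, and it simply treats 3.6.4 as the first fixed release because that is the version we upgraded to.

```python
import sys

# Assumption taken from this post: 3.6.4 is treated as the first fixed release.
FIXED_VERSION = (3, 6, 4)


def has_ssl_leak_fix(version_info=sys.version_info):
    # Compare only the major, minor and micro components.
    return tuple(version_info[:3]) >= FIXED_VERSION


print(has_ssl_leak_fix((3, 6, 1)))  # False
print(has_ssl_leak_fix((3, 6, 4)))  # True
```

A service could log a warning at startup when this returns False, which makes an accidental downgrade visible long before the memory chart does.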
Looks good
After updating the image we monitor the memory again with DataDog.
After a fresh deploy around Sept. 9th the memory now runs flat.
Summary
Having the right tools for the job can make the difference between solving the problem and not.
The search for our memory leak took place over two months.
tracemalloc provides good insight into the memory allocations happening in a Python program; however, it does not know about memory that packages allocate directly in C or C++.
In the end, tracking down memory leaks requires patience, persistence, and a bit of detective work.
References
https://docs.python.org/3/library/tracemalloc.html
https://www.fugue.co/blog/2017-03-06-diagnosing-and-fixing-memory-leaks-in-python.html