Managing VMs like a Data Scientist

It’s just a table — that’s it.

However, it’s got some pretty cool built-in methods to make your data manipulation, interrogation and cleaning a much, much more pleasurable experience.

If we run the code below, which calls the apply() method on the sepal_length column, we get the output shown in the table.

I hope the functionality of the apply() method is clear from the example… If not, stare at it a bit, then read on.

```python
iris.sepal_length.apply(lambda row: 'tall' if row >= 5 else 'short')
```

There is a weird lambda keyword thrown into the example, which in short is just a "phantom" function, formally called an anonymous function.

Basically it is a function that doesn’t have a name but runs some code.

In our example, this anonymous lambda function checks each row of our column: if the sepal length is greater than or equal to five it returns 'tall', otherwise it returns 'short'.
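As a minimal, self-contained sketch of the same pattern (the values below are made up and stand in for the real iris data):

```python
import pandas as pd

# Hypothetical stand-in values for the iris sepal_length column.
sepal_length = pd.Series([5.1, 4.9, 6.3, 4.6])

# Label each value via an anonymous (lambda) function, as in the post.
labels = sepal_length.apply(lambda v: 'tall' if v >= 5 else 'short')
print(labels.tolist())  # ['tall', 'short', 'tall', 'short']
```

The lambda is called once per element of the Series, and apply() collects the return values into a new Series.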

Bring it all together

I hear you saying: "OK cool Louwrens, nice background, but so what?" Well, we've covered all the theory needed to understand what is about to happen:

1. Create a VM class which gets initialised with an IP and username; the __init__ method then checks whether we can connect to the remote VM and uses the ✅ and ⛔️ emoticons to show a successful or unsuccessful connection.
2. Create a DataFrame containing the IPs of our 4 remote VMs on GCP, then use the apply() method to run bash commands on these VMs and return a DataFrame.
3. Display a summary DataFrame containing the specs for these 4 VMs sitting on GCP.

Below I create the VM Class.

```python
from paramiko import SSHClient
from paramiko.auth_handler import AuthenticationException


class VM(object):
    def __init__(self, ip, username, pkey='~/.ssh/id_rsa.pub'):
        self.ip = ip
        self.username = username
        self.pkey = pkey
        self.logged_in_emoj = '✅'
        self.logged_in = True
        try:
            # Attempt a connection to verify that we can reach the VM.
            ssh = SSHClient()
            ssh.load_system_host_keys()
            ssh.connect(hostname=self.ip, username=self.username,
                        key_filename=self.pkey)
            ssh.close()
        except AuthenticationException as exception:
            print(exception)
            print('Login failed for %s' % (self.username + '@' + self.ip))
            self.logged_in_emoj = '⛔️'
            self.logged_in = False

    def __str__(self):
        return self.username + '@' + self.ip + ' ' + self.logged_in_emoj
```

I then create a DataFrame, VMs, which holds all the IPs for our 4 VMs on GCP.

```python
VMs = pd.DataFrame(dict(IP=['35.204.255.178', '35.204.96.40',
                            '35.204.213.24', '35.204.115.95']))
```

We can then call the apply() method on the DataFrame, which iterates through each host and creates a VM class object for each VM, which gets stored in the VM column of the VMs DataFrame.

```python
VMs['VM'] = VMs.apply(lambda row: VM(row.IP, USERNAME, PUB_KEY), axis=1)
```

Note that the __str__() method of our VM class is used to represent the VM class in a DataFrame, as seen below.

Each VM is represented as username + ip + ✅, exactly how we defined it in the __str__() method.
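To see why the DataFrame renders username@ip like this, here is a minimal sketch of the same mechanism with a dummy class (no SSH involved; Host and its values are hypothetical) — pandas falls back to str() when formatting objects in an object column:

```python
import pandas as pd


class Host(object):
    """A stripped-down stand-in for the VM class, with no SSH connection."""

    def __init__(self, ip, username):
        self.ip = ip
        self.username = username

    def __str__(self):
        # This string is what pandas displays for each cell.
        return self.username + '@' + self.ip + ' ✅'


hosts = pd.DataFrame(dict(IP=['10.0.0.1', '10.0.0.2']))
hosts['VM'] = hosts.apply(lambda row: Host(row.IP, 'louw'), axis=1)
print(hosts)
```

With axis=1 the lambda receives one row at a time, so row.IP is available inside it.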

Ok great, we’ve created a DataFrame with a bunch of connected VMs inside.

What can we do with these? For those of you who don't know, there is a command called lscpu in Unix which displays all the information about the CPUs on a machine; below is an example output for vm1.

```
louwjlabuschagne_gmail_com@vm1:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping:              3
CPU MHz:               2000.170
BogoMIPS:              4000.34
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K
NUMA node0 CPU(s):     0
```

We are now looking to get the output of the lscpu command for each of our 4 VMs on GCP; we can wrap the lscpu command in the exec_command() method (see the GitHub repo) to return the output of each VM's lscpu command.

```python
lscpu = VMs.VM.apply(lambda vm: exec_command('lscpu'))
```

With this we can obtain a DataFrame like the one shown below.

Another useful command is cat /proc/meminfo, shown below, which returns the current state of the RAM on a Unix machine.

```
louwjlabuschagne_gmail_com@my-vm1:~$ cat /proc/meminfo
MemTotal:        1020416 kB
MemFree:          871852 kB
MemAvailable:     835736 kB
Buffers:           10164 kB
Cached:            53504 kB
SwapCached:            0 kB
Active:            92012 kB
Inactive:          17816 kB
Active(anon):      46308 kB
Inactive(anon):     4060 kB
Active(file):      45704 kB
Inactive(file):    13756 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                28 kB
Writeback:             0 kB
AnonPages:         46176 kB
Mapped:            25736 kB
```

I've extracted the most relevant columns from the lscpu and cat /proc/meminfo commands and display an overview of our 4 VMs below.
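Both lscpu and /proc/meminfo emit key: value lines, so turning one VM's raw output into a pandas Series takes only a few lines of parsing. A sketch — parse_proc_output is a hypothetical helper of my own, not from the repo, and the sample string is a truncated version of the output above:

```python
import pandas as pd

# Truncated sample of `cat /proc/meminfo` output.
raw = """MemTotal:        1020416 kB
MemFree:          871852 kB
MemAvailable:     835736 kB"""


def parse_proc_output(text):
    # Split each "key: value" line once on the first colon,
    # then strip the surrounding whitespace.
    pairs = (line.split(':', 1) for line in text.strip().splitlines())
    return pd.Series({key.strip(): value.strip() for key, value in pairs})


mem = parse_proc_output(raw)
print(mem['MemTotal'])  # 1020416 kB
```

Applying such a parser over the whole VM column yields one row of specs per machine.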

We can plot this information quickly with a library like seaborn or plotly that works great out of the box with DataFrame objects, or we can get summary statistics for all our VMs using the built-in methods pandas has.
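For instance, once the specs sit in a DataFrame, describe() gives summary statistics for free (the spec numbers below are invented for illustration):

```python
import pandas as pd

# Hypothetical spec table for four VMs.
specs = pd.DataFrame(dict(CPUs=[1, 1, 2, 4],
                          MemTotal_kB=[1020416, 1020416, 2040832, 4081664]))

# Count, mean, std, min, quartiles and max for every numeric column.
print(specs.describe())
```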

Conclusion

This post has only scratched the surface of how using classes and DataFrames in conjunction with each other can ease your life.

Be sure to check out the Jupyter notebook in the GitHub repo to fill in some coding gaps I've alluded to in this post.

The next time you are doing data wrangling with pandas I encourage you to take a step back and consider wrapping some of the functionality you need in a Class and seeing how that could improve your workflow.

Once written, you can always reuse the Class in your subsequent analysis or productionise it with your code.

As the Python mindset goes: "Don't reinvent the wheel every time."

