Strata SF day 2 Highlights: AI and Politics, Chatbots Insights, Forecasting Uncertainty, Scalable Video Analysis, and more

The frequency with which words appear in a language follows Zipf’s law: the most frequent word appears about twice as often as the second most frequent word, three times as often as the third, and so on.

The same distribution curve holds true not just for words but also for phrases and sentences.
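As a rough illustration of the rank-frequency relationship described above, here is a minimal Python sketch. The toy corpus and the helper names (`rank_frequency`, `zipf_expected`) are made up for this example and are not from the talk; real text only approximately follows the idealized curve.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, token, count) tuples, most frequent token first."""
    counts = Counter(tokens)
    return [(rank, tok, n)
            for rank, (tok, n) in enumerate(counts.most_common(), start=1)]

def zipf_expected(top_count, rank, s=1.0):
    """Idealized Zipf count at a given rank: top_count / rank**s."""
    return top_count / rank ** s

# Hypothetical toy corpus, constructed so that it follows Zipf's law exactly.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3

table = rank_frequency(tokens)
top_count = table[0][2]
for rank, tok, n in table:
    # Compare the observed count with the idealized 1/rank prediction.
    print(rank, tok, n, zipf_expected(top_count, rank))
```

On a real corpus the observed counts would scatter around the prediction, and the exponent `s` would typically need to be fitted rather than fixed at 1.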

95% of first input to all chat bots is covered by just 1800 words.
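A coverage figure like this one can be computed from a log of inputs by sorting distinct items by frequency and counting how many are needed to reach the target share. The sketch below assumes a hypothetical list of first inputs; the function name and the data are illustrative only.

```python
from collections import Counter
from itertools import accumulate

def items_needed_for_coverage(inputs, target=0.95):
    """Smallest number of distinct items whose combined count reaches
    `target` (a fraction between 0 and 1) of all inputs."""
    counts = sorted(Counter(inputs).values(), reverse=True)
    total = sum(counts)
    for k, running in enumerate(accumulate(counts), start=1):
        if running >= target * total:
            return k
    return len(counts)

# Hypothetical log of first inputs (100 messages, 5 distinct values).
first_inputs = (["hi"] * 50 + ["hello"] * 30 + ["hey"] * 16
                + ["what's up"] * 2 + ["yo"] * 2)

print(items_needed_for_coverage(first_inputs, target=0.95))
```

Because the distribution is so heavily skewed, a small head of frequent items covers most of the traffic, which is the point the talk was making.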

The average branching factor (the number of distinct words that follow a given prefix) decreases with each successive word, implying that people are boring.

We walk around saying the same things most of the time.

[Image: top 50 inputs across all English-speaking chat bots on the platform (universal domain), regardless of use case]

Pandorabots has open-sourced chat libraries so that developers do not need to reinvent the wheel.

She mentioned that 10,000 is the magic number of responses/rules required on average to cover a specific domain, which boils down to about a month of work prior to launch.

Mike Olson, Chief Strategy Officer, Cloudera, delivered a keynote on “The enterprise data cloud”.

Disruptive advances in technology such as Big Data, powerful analytics, and public cloud services have fundamentally changed our expectations of technology: it should be fast, simple, and flexible to use.

He explained the key capabilities an enterprise data cloud system requires, and why the future belongs to multi-cloud (or hybrid cloud).

He shared that for large enterprises the cloud isn’t going to happen; it has already happened.

The data center is not dead.

Virtually every large enterprise still operates a data center.

What they would really like is one consistent picture of the data, regardless of where it happens to live.

They would like to set, enforce, and monitor security policies and data governance policies to ensure they are doing the right thing.

We will see data privacy evolve as a critical requirement across the globe.

The main lesson that we’ve learned from the cloud is that “ease” seriously matters.

These systems have to spin up and spin down on demand.

The users must be able to self-service their way to any analytic framework they run.

Theresa Johnson, Product Manager, Metrics & Forecasting Products, Airbnb, gave a compelling keynote on Forecasting Uncertainty.

She outlined the challenges in forecasting.

She mentioned that it’s actually hard (or, some say, impossible) to come up with a theory of business that allows you to map input metrics and how they flow together.

As data scientists, we can start to model flow to business activities.

That would lead to a better theory of the case.

At each step of the process we can quantify the uncertainty.

Decision trees require theory, but building a theoretical model of a business can be daunting.

For example, when trying to forecast how many pool cleaners there are in Chicago, Airbnb’s way of solving it is to create a model of how to get pool cleaners using both a supply side and a demand side.

Bringing the two sides together in this two-sided marketplace of pool cleaners gives the final number of pool cleaners in Chicago.

An interesting aspect here is that when each individual metric is forecast separately, we know where the error/uncertainty lies around each specific part of the overall forecast.

We can titrate and figure out how that would impact the outcome.
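One common way to make this concrete is a Monte Carlo simulation: draw each input from its own uncertainty range, combine the draws into the final number, and then pin one input at a time to see how much of the overall spread it accounts for. The sketch below is not Airbnb’s model. The three inputs, their means and standard deviations, and the simple demand-side formula are all hypothetical values chosen only to show the mechanics.

```python
import random
import statistics

# Hypothetical (mean, standard deviation) for each separately forecast input.
INPUTS = {
    "pools": (20_000, 4_000),
    "cleanings_per_pool_per_week": (1.0, 0.1),
    "cleanings_per_cleaner_per_week": (40.0, 5.0),
}

def simulate(n_draws=10_000, seed=0, fix=None):
    """Monte Carlo estimate of cleaners needed.

    `fix` names one input to pin at its mean, which shows how much that
    input's uncertainty contributes to the spread of the outcome.
    Returns (mean, standard deviation) of the simulated outcomes.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_draws):
        draw = {
            # Clamp to a tiny positive value to avoid division by zero.
            name: mean if name == fix else max(rng.gauss(mean, sd), 1e-9)
            for name, (mean, sd) in INPUTS.items()
        }
        outcomes.append(
            draw["pools"] * draw["cleanings_per_pool_per_week"]
            / draw["cleanings_per_cleaner_per_week"]
        )
    return statistics.mean(outcomes), statistics.stdev(outcomes)

baseline = simulate()
for name in INPUTS:
    _, spread = simulate(fix=name)
    print(f"{name}: spread {spread:.0f} vs baseline {baseline[1]:.0f}")
```

The input whose pinning shrinks the spread the most is the one where better data or a better sub-forecast would pay off first.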

Alex Poms and Will Crichton, Ph.D. students from Stanford University, gave an interesting talk on “Scanner: Efficient video analysis at scale”.

An increasing number of applications depend on processing large amounts of video.

More video datasets are available in a world instrumented with cameras, such as millions of unlabeled YouTube videos and millions of labeled action clips.

They outlined how big video data differs from traditional big data, and the goals they had in mind while architecting Scanner.

Video frames are big, i.e., 1 hour of video can be equivalent to 12 Wikipedias / 12 GB of pixels.

They gave a live demo of Scanner and covered its programming model and dataflow in some detail.

The event also included a prize ceremony for the Strata Data Awards, which recognize outstanding advances in the fields of data science and machine learning.

This year’s winners were decided by a team of judges along with audience votes.