# Data science for weather forecast: how to prove a funny theory

????Automation with a web scraping scriptBasically, the idea is that the script visits each page independently and looks for the data we want, calculates the average temperature of the summer, captures the temperatures of the 3 days in May, and repeats the process for each year starting from 1945.

But how exactly can we do that?This is the URL of the page you just saw above.

As you can see, it is specified in the URL the city and the date you want to access.

So we could tell the script which URL to visit, for each day we were interested in.

But here comes another issue.

Once the script is on the page, how does it detect the temperature that we want?Well, as you might know, each webpage is written in HTML format, which means that each element that you see on screen belongs to a specific HTML tag.

And, luckily for us, each page of that website was structured in the exact same way.

So the only thing that we needed to do is identify in which HTML tag was the daily mean temperature stored and tell the script to fetch that specific value.

(For those interested, we used the python library Beautiful Soup)The script was then able to do all the nasty calculations for us and return for each year the average summer temperature and the individual temperatures of the 3 days of May, all bundled in a nice Excel sheet ????.

what our dataset looks like now (temperatures are in Fahrenheit).

The first 3 columns are the 3 days of May, and the last one is the average summer temperatureBut (there’s always a “but”), that was not enough.

In fact, when you think about it, each line of our Excel sheet represented 1 year (average temp of summer + 3 days of May).

So, even if we went back to 1945, that represented only 73 lines… which is far too little data to pretend to do any sort of reliable analysis or prediction ( a couple hundreds would be much better).

So we decided to repeat the exact same process for 4 other cities of northern California around San Jose which were subject to the same type of weather but were far enough not to have redundant data (taking San Francisco for instance, which is by the sea would have biased everything, and taking Milpitas, which is in San Jose suburbs wouldn’t have added any relevant data).

We now have 370 measurements, which is not ideal, but sufficient to start doing some analysis.

Let the analysis begin!Data TransformationNow let’s try to simplify our dataset to make it easier to analyze.

To start things off, we pulled the Excel file data into Alteryx, a data science tool to create end-to-end data pipelines.

This will help us prepare and analyze the data all along the experiment.

Ingested Data: we decided to add 2 columns which indicated the city and the year of the measurementWe aimed to visualize the data using Tableau, which is one of the most commonly used Business Intelligence (BI) tools.

Hence, we needed to transform the data in a format that is easily and efficiently consumed by Tableau.

It is worth mentioning that we scraped the data in a format that was already structured, and, therefore, very little data cleaning was required.

We merely reordered and reformatted some columns and checked that there were no null values.

the end-to-end data transformation pipelineThe picture above is a summary of all the steps Alteryx goes through to transform our data.

As you can see, the first and last steps are just data import/export.

The second and fourth change the data format to handle it more easily.

But what is the third step in the process?Well, we noticed that the respective temperatures of the 19th, 20th, and 21st of May were very similar across all cities and years, and therefore decided to combine them by calculating their average, and thus reduce the number of features.

In fact, by combining them, we were able to directly compare the temperature for these 3 days with the temperature of the summer.

Our dataset after data transformationNote that there are different solutions to transform the data (script, Excel…).

We chose to use Alteryx because we were familiar with the tool and we knew we could complete the job in a very timely manner.

In fact, we wanted to minimize the time spent on preparing the data in order to maximize the time spent on analyzing the data and solving the problem.

The actual analysis (for real this time ????)Now we will walk you through the step-by-step approach that we have used to explore and analyze the data, and eventually draw conclusions with statistical significance:The initial graphWe started by creating a scatter plot to get a general idea of the relationship between the two variables.

As a reminder, our objective is to show that the temperature of the 19th, 20th and 21st of May is a good indicator at predicting the temperature of the summer.

By looking at this graph, there appears to be a positive linear relationship between the two variables: as the temperature of the three days increases, so does the temperature of the summer.

This begs the question: how can we demonstrate this relationship?.????This is where regression analysis comes into the place.

This is a statistical method that can be used to study the relationship between two continuous variables.

A very simple regression analysis model that we used is the linear model, which uses a simple linear equation to fit the data.

We used this method because it is relatively simple to interpret and it is supported by Tableau.

Linear regressionThe trend line that we have plotted is actually showing us the general direction that the group of points seems to follow.

We can now confirm that there is a positive linear relationship between the two variables: as the temperature of the three days increases, so does the temperature of the summer.

For the more statistics-savvy ones reading this, we can even demonstrate this relationship using some measures that describe our linear model:R-squared = 0.

29 merely indicates that 29% of the variance in the summer temperature can be explained by the temperature of the 3 daysP-value < 0.

0001 simply indicates that the model is significant and that the results are reliableThus, these results clearly show that there is a correlation between the temperature of the 3 days (19th, 20th, and 21st of May) and the temperature of the summer.

Although these results are already very interesting, is it possible to improve the significance of the model?Disparity in the dataIt looks like the data is separated into two big chunks (as highlighted in the figure), which definitely affects the fitting of the line.

Is it possible to explain this disparity in the data?Remember that the dataset includes the temperatures of different cities which are evenly distributed around San Jose.

So, let’s see what happens if we colour the data points by city.

Here is the link to the dashboard if you want to play with the dashboard ????This graph clearly shows that even if cities are subject to the same type of weather, the results are very different from city to city.

In fact, Sacramento and San Jose, which are up north, are colder than Fresno or Bakersfield which are more down south.

We can also notice that Bakersfield, Fresno and Redding show almost no variation in summer temperature compared to San Jose, whose data points seem to follow a positive trend line.

Hence, it would be more significant to analyze the relationship between the two variables for each city individually.

By fitting a model for each city, we realize that in the case of Sacramento, Redding, Fresno and Bakersfield, there is clearly no relationship between the two variables.

This is statistically evidenced by an R-squared value that is close to 0 and a P-value that is above 0.

70.

Redding trend line (similar to Sacramento, Fresno and Bakersfield)However, we can see that in the case of San Jose, there is a correlation between the two variables (R-squared = 0.

316 and P-value < 0.

0001).

Note that this time, 31.

6% of the variance in the summer temperature can be explained by the average temperature of the 3 days.

San Jose trend lineSo what’s the conclusion?The results demonstrate a correlation between the temperature of the 3 days (19th, 20th, and 21st of May) and the temperature of the Summer with a 99.

99% confidence level (p-value < 0.

0001).

However, this theory is only true for the city of San Jose, which is very surprising since all these cities are in California and should, therefore, show similar temperature trends.

Now that we have done the hard work, it will definitely be interesting to see if our theory is confirmed this year????‍???? .

We will apply our linear model to 2019 data and see if it correctly predicts the average summer temperature.

Although the results will certainly not be very accurate, it will still be a good indicator of whether the summer will be hot or not.

In the future, we want to strengthen this analysis by looking into other factors that might explain this phenomenon and improve our model by using different analysis methods.