Something You don’t know about data File if you are new to Data Science, Import data File from the web: Part 1

Sahil Dhankhar, Mar 23

To master data science, you have to understand how to manage your data and import it from the web, because approximately 90% of real-world data comes straight from the internet.

Data Engineer Life (Source: Agula)

If you are new to the data science field, you must be working hard to learn the concepts fast. You are in the right place to learn something faster.

In my experience of meeting many people who are new to data science or still learning it, I have noticed one thing.

All of these people are in a hurry to cover the distance from London to New York in a few seconds. Wait for it: even Elon Musk's Hyperloop would still take 4.67 hours to cover that 5,585 km, and I think this technology is still a work in progress.

Elon Musk Hyperloop Concept (Source: DailyExpress)

You, too, are trying hard to cover this distance in one hour. You are not an AI that can read all of a website's content in a few minutes and produce something new from it.

Data science is a field where you have to build strategies, as in Wealthing Like Rabbits by Robert R. Brown (by the way, a great book to read if you want to know more about the financial world and would love to save some money for your future).

Let's come back to the point: data science is a field where you need to spend time to gain useful, in-depth knowledge.

In this article, we are going to cover some of the different data formats that we all use every day in our projects. As data scientists, we spend much of our time preparing data (about 70% of the time goes to preparing and cleaning data and handling missing values). Most data people will know what I'm talking about. Yup, it's OK, guys; that's our life nowadays.

We can all appreciate the challenge of working with very different kinds of data. Sometimes handling different data types makes you want to pull your hair out, and I haven't even mentioned unstructured or semi-structured data yet.

Data makes me mad (Source: Critique Cricle)

For any data scientist or data engineer, handling different data formats can become a tiresome task. Real-world data is really messy; you rarely get clean tabular data. So it is essential for a data scientist to be aware of the different data formats, the challenges of handling each of them, and the best way to handle this data in real life.

We will discuss some of the file formats that you will find useful in the data science field.

However, most of the data-import skills you learn first (CSV and text files, pickled files, Excel spreadsheets, and MATLAB files) deal with files that sit in your local environment. As a data scientist, much of the time these skills are not quite enough, because most of the time you are going to work with data imported from the World Wide Web.

For example, say you need to import the wine dataset from the UCI Machine Learning Repository. How do you get this data from the web? You could try using your favorite web browser to download the dataset.

Wine Data-Set (Source: uci.edu)

You navigate to the proper URL, then point and click on the appropriate hyperlinks to download the file. However, this method has a couple of severe problems.

Firstly, it isn't written in code, so it poses reproducibility issues. If anybody wanted to reproduce your workflow, they would necessarily have to do so outside Python.

Secondly, it is not scalable: if you want to download 50, 100 or 1000 such files, it will take 50, 100 or 1000 times as long, because you have to fetch each one by hand. If you use Python, your workflow can be both reproducible and scalable.
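To make the scalability point concrete, here is a minimal sketch of downloading many files in a loop. The helper name download_all and the default folder name "data" are my own choices for illustration, and the sketch assumes every URL ends in a usable file name.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_all(urls, dest_dir="data"):
    """Download every URL in `urls` into `dest_dir` and return the saved paths.

    Hypothetical helper for illustration; it assumes each URL path
    ends in a file name such as ".../winequality-red.csv".
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last part of the URL path as the local file name.
        filename = Path(urlparse(url).path).name
        target = dest / filename
        urlretrieve(url, target)  # one GET request per file
        saved.append(target)
    return saved
```

Whether the list holds 5 URLs or 1000, the code stays the same, and anyone can rerun it to reproduce the downloads.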

We are going to use Python in this part to learn how to import datasets from the World Wide Web (WWW) and save them locally. We are also going to load the same kind of datasets into pandas DataFrames directly from the web, whether they are flat files or not.

Then you will place these skills in the broader context of making HTTP requests. In particular, you will make HTTP GET requests, which in plain English means getting data from the web. You will use these new request skills to learn the basics of scraping HTML from the internet. Moreover, you will use the wonderful Python package Beautiful Soup to parse the HTML and turn it into data.


Now, there are some great packages to help us import web data. You are going to use and get familiar with urllib. This package provides a high-level interface for retrieving data across the World Wide Web. In particular, its urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of file names.

Let's now dive straight into importing from the web with an example: importing the Wine Quality dataset for red wine. To do this, we import a function called urlretrieve from the request subpackage of the urllib package and use it to save the file locally.
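The original code for this step is not shown here, so the following is only a sketch of how it might look. The URL in WINE_URL is my assumption about where the red wine file lives and may have moved; the helper name download_csv is also hypothetical.

```python
import csv
from urllib.request import urlretrieve

# Assumed location of the red wine quality file; replace it if the file has moved.
WINE_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)


def download_csv(url, filename):
    """Save the file at `url` to `filename` and return its header row."""
    # urlretrieve performs the download and writes the bytes to disk.
    urlretrieve(url, filename)
    with open(filename, newline="") as f:
        # The wine quality file separates its columns with semicolons.
        reader = csv.reader(f, delimiter=";")
        return next(reader)
```

Calling download_csv(WINE_URL, "winequality-red.csv") should leave a local copy of the file and return its column names. From there, pd.read_csv("winequality-red.csv", sep=";") loads the saved file into a DataFrame.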

Flat File from Web (Open and Read):

We have just learned how to import a file from the internet, save it locally, and store it in a DataFrame. What if we want to load the file without saving it locally? We can do that too with pandas: we use the function pd.read_csv() with the URL as the first argument.
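As a rough sketch of that idea, the snippet below reads a delimited file straight from a URL. The URL is the same assumed location of the red wine file as before, and load_csv_from_url is just an illustrative name.

```python
import pandas as pd

# Assumed location of the red wine quality file; replace it if the file has moved.
WINE_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)


def load_csv_from_url(url, sep=";"):
    """Read a delimited text file at `url` straight into a DataFrame."""
    # pandas fetches the URL itself, so nothing is saved to disk.
    # The default separator is ";" because the wine file uses semicolons.
    return pd.read_csv(url, sep=sep)
```

With a working connection, load_csv_from_url(WINE_URL).head() would show the first five rows of the dataset.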


Non-Flat File from Web:

The pandas function pd.read_csv() is very useful for loading flat files, and it has a close relative that helps us load other kinds of files. In the next exercise, we use pd.read_excel() to import an Excel spreadsheet.
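Here is a small sketch of what that could look like. The function names are my own, and reading Excel files also needs an Excel engine such as openpyxl installed next to pandas.

```python
import pandas as pd


def load_all_sheets(url):
    """Read every sheet of the Excel workbook at `url`."""
    # sheet_name=None asks pandas for all sheets; the result is a dict
    # that maps each sheet name to its own DataFrame.
    return pd.read_excel(url, sheet_name=None)


def summarize_sheets(sheets):
    """Return the (rows, columns) shape of each sheet, keyed by sheet name."""
    return {name: df.shape for name, df in sheets.items()}
```

For a workbook at some address such as https://example.com/data.xlsx (a placeholder, not a real dataset), summarize_sheets(load_all_sheets(url)) would list each sheet with its size.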


HTTP Request (performing) using urllib:

HTTP is the HyperText Transfer Protocol, the foundation of data communication for the web. The urlretrieve() function we used earlier performs a GET request behind the scenes. In this exercise, you will ping the server yourself by sending a GET request with urllib.
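A minimal sketch of pinging a server with urllib might look like this. The function name ping is hypothetical; Request and urlopen are the standard urllib.request tools.

```python
from urllib.request import Request, urlopen


def ping(url):
    """Send a GET request to `url` and return the response headers as a dict."""
    request = Request(url)  # package the GET request
    # The with-block sends the request and closes the connection afterwards.
    with urlopen(request) as response:
        print(type(response))  # show what kind of object came back
        return dict(response.headers.items())
```

For a web page, the printed type is an HTTP response object, and the returned headers describe what the server sent back, such as the content type.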


HTTP Request (Printing) using urllib:

We already know how to ping the server. Now we are going to extract the response and print the HTML.
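One way to sketch this step, assuming the page is encoded as UTF-8 (the function name fetch_html is my own):

```python
from urllib.request import Request, urlopen


def fetch_html(url, encoding="utf-8"):
    """Send a GET request to `url` and return the body as text."""
    request = Request(url)
    with urlopen(request) as response:
        raw = response.read()  # the body arrives as bytes
    # Decode the bytes into a string; UTF-8 is an assumption here.
    return raw.decode(encoding)
```

print(fetch_html(url)) then shows the HTML of whichever page you point it at.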


HTTP Request (performing) using requests:

Now that you have the full idea of how to make HTTP requests with the urllib package, you are going to figure out how to do the same thing at a higher level with the requests package. Again, we are going to ping the server; compare this program with the previous one (HTTP Request (performing) using urllib). This time, we do not have to close the connection ourselves.
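For comparison, here is a sketch of the same request with the requests package. It is a third-party library, so it has to be installed first; the function name fetch_text and the 10-second timeout are my own choices.

```python
import requests


def fetch_text(url, timeout=10):
    """Send a GET request with requests and return the response body as text."""
    # requests.get packages the request, sends it, and catches the response
    # in a single call; there is no connection for us to close by hand.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx or 5xx replies
    return response.text
```

The whole exchange fits in one call, which is what working at a higher level means here.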

I think you now have some idea of how to extract data from the web. In the next part, we are going to learn about Beautiful Soup and how to use APIs in data science, and then we will talk about the Twitter API in a full series. So, please follow me on Medium and Twitter so you don't miss any of these important topics on your data science journey.

Welcome to the Data Science World.








