Web Scraping Apartment Listings in Stockholm

This problem is solved by the function below, which takes the same URL argument and calculates how many pages we have, based on the premise that each page contains at most 38 objects.

```python
import requests
import numpy as np
from bs4 import BeautifulSoup

def Booli_findNumberOfPagesData(url):
    request = requests.get(url)
    soup = BeautifulSoup(request.text, 'lxml')
    data = soup.findAll('div', class_='search-list__pagination-summary')
    numberOfObjectsPerPage = 38
    try:
        # The pagination summary ends with "... av <total>"; slice out the total
        numberOfObjects = int(data[0].text[-(len(data[0].text) - 3 - data[0].text.rfind("av")):])
    except (IndexError, ValueError):
        numberOfObjects = 0
    numberOfPages = int(np.ceil(numberOfObjects / numberOfObjectsPerPage))
    return numberOfPages, numberOfObjects
```

Having the two functions above, we can write a simple function that loops through each URL and stores the information in a Pandas data frame. The function follows:

```python
def loopThroughRegions(data_url, m2_max, m2_min, maxListPrice, minListPrice):
    object_info = []
    region = []
    length = [0]
    for index, row in data_url.iterrows():
        # Base URL
        url = ("https://www.booli.se/{}/{}/?maxListPrice={}"
               "&maxLivingArea={}&minListPrice={}&minLivingArea={}"
               "&objectType=L%C3%A4genhet&page=1&upcomingSale=").format(
                   row["Region"], row["RegionID"], maxListPrice,
                   m2_max, minListPrice, m2_min)
        object_info = Booli_ScrapeObjects(url, object_info)
        numberOfPages, numberOfObjects = Booli_findNumberOfPagesData(url)
        # Pages 2..numberOfPages (page 1 was scraped above)
        for page in range(2, numberOfPages + 1):
            url = ("https://www.booli.se/{}/{}/?maxListPrice={}"
                   "&maxLivingArea={}&minListPrice={}&minLivingArea={}"
                   "&objectType=L%C3%A4genhet&page={}&upcomingSale=").format(
                       row["Region"], row["RegionID"], maxListPrice,
                       m2_max, minListPrice, m2_min, page)
            object_info = Booli_ScrapeObjects(url, object_info)
        length.append(len(object_info))
        # Build a simple vector that repeats the region name once per
        # object stored for that region
        for i in range(0, length[-1] - length[-2]):
            region.append(row["Region"])
    return object_info, region
```

In order to store the Pandas data frame in a regular Microsoft SQL Server database, we need to
clean the data. For example, we would like to rewrite objects that have 1½ rooms as 1.5, and so on. A simple function that performs this cleaning can be found below:

```python
def cleaningData(object_info):
    for index, row in object_info.iterrows():
        # "50+5" style areas: sum the two parts
        if row["m2"].find("+") != -1:
            m2s = row["m2"].split("+")
            newM2 = int(m2s[0]) + int(m2s[1])
            object_info.at[index, "m2"] = newM2
        # "1½" style room counts: convert to 1.5
        if row["Number of rooms"].find("½") != -1:
            rooms = row["Number of rooms"].split("½")
            if rooms[0] == "":
                newRooms = 0.5
            else:
                newRooms = float(0.5) + float(rooms[0])
            object_info.at[index, "Number of rooms"] = newRooms
        # A missing rent is shown as an em dash; replace it with 0
        if row["Rent"].find("—") != -1:
            newRent = 0
            object_info.at[index, "Rent"] = newRent
        else:
            # Strip the whitespace used as a thousands separator
            newRent = "".join(row["Rent"].split())
            object_info.at[index, "Rent"] = newRent
    return object_info
```

1.2 The database storing part

Finally we have all the data in a structured format, so now it's time to store it in a convenient way, namely in a database. We have all the data stored in a Pandas data frame that looks like this:

[Image: Pandas data frame containing cleaned, scraped object data]

So we need to construct a database table that can store each column. This can easily be done via the following SQL code (given that you use MS SQL Server, though the syntax is similar for other databases):

[Image: Table creation in MS SQL Server]

where a primary key of "Link" would be sufficient. Foolishly enough, I made a clustered key by combining "Link" with "DateInserted". Now let's write a simple function that can connect to my local "on-prem" database using the pyodbc package. The function follows:

```python
import pyodbc

def mssql_connect(server, database, driver):
    cnxn = pyodbc.connect('DRIVER=' + driver +
                          ';SERVER=' + server +
                          ';DATABASE=' + database +
                          ';Trusted_Connection=yes')
    cursor = cnxn.cursor()
    return cnxn, cursor
```

Using this function, combined with a query that inserts the data row by row after first making sure that the primary key combination doesn't already exist, we finally reach our goal (the query could preferably be written using a simple MERGE
statement):

```python
from datetime import date

# SQL input parameters
pyodbc.pooling = False
server = 'server'
database = 'database'
driver = '{ODBC Driver 13 for SQL Server}'
cnxn, cursor = mssql_connect(server, database, driver)

# result is the cleaned data frame from the previous step
data = result.values.tolist()
for i, item in enumerate(data):
    # Insert the row only if the (Link, DateInserted) combination is new
    insert_query = ("IF NOT EXISTS (SELECT * FROM [Booli].[UpcomingSales] "
                    "WHERE [Link] = '" + str(item[0]) + "' "
                    "AND [DateInserted] = '" + str(date.today()) + "') "
                    "BEGIN INSERT INTO [Booli].[UpcomingSales] VALUES ('" +
                    str(item[0]) + "'," + str(item[1]) + "," + str(item[2]) + "," +
                    str(item[3]) + ",'" + str(item[4]) + "'," + str(item[5]) + "," +
                    str(item[6]) + ",'" + str(item[7]) + "','" +
                    str(date.today()) + "') END")
    cursor.execute(insert_query)

# Cleanup
cnxn.commit()
cursor.close()
cnxn.close()
```

Now we can obtain the result by simply talking to the database:

[Image: Database call to the table that holds the object data]

2. Visualizing data from sold objects

It is easy to adjust the code to scrape sold objects instead of objects up for sale. I won't go over how to do that here; instead I will show some simple visualizations of scraped data from sold objects. First off, we start by calling the database with a SQL query, similar to the storing part, in order to obtain the data.
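As a minimal sketch of this read-back step, the example below uses the standard-library sqlite3 module as a stand-in for pyodbc and MS SQL Server, since the cursor-based pattern is the same; the table name and the columns shown here are assumptions mirroring the insert statement above, and in the real setup the query would go through the cursor returned by mssql_connect:

```python
import sqlite3

# Stand-in for the MS SQL Server table (schema is an assumption
# mirroring the insert statement above, trimmed to three columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UpcomingSales (Link TEXT, ListPrice REAL, DateInserted TEXT)")
conn.execute("INSERT INTO UpcomingSales VALUES ('https://www.booli.se/annons/1', 2500000, '2019-01-01')")

# The 'database call' from the post: a plain SELECT through the cursor,
# with a parameter placeholder instead of string concatenation
rows = conn.execute(
    "SELECT Link, ListPrice FROM UpcomingSales WHERE ListPrice < ?",
    (3000000,),
).fetchall()
print(rows)  # → [('https://www.booli.se/annons/1', 2500000.0)]
conn.close()
```

The same SELECT, run through pyodbc against the real table, is what feeds the visualizations below; note that the `?` placeholder style also works with pyodbc and avoids the injection risk of concatenating values into the query string.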
