# sgCarMart Web Scraping and Exploratory Data Analysis Project

(Note the minor inaccuracy in both KDE plots for the range of mileage < 0.

The KDE plot is an estimate that is fitted to the distribution, and is not representative of the actual distribution.

)Depreciation Histograms for Direct Owner Sales vs Dealer SalesThe KDE plot of depreciation for dealer sales has a kink to the left of the peak.

However, mileage and depreciation have an inverse relationship — we would have expected a second, lower peak to the right of the tallest peak, but the KDE plot shows otherwise.

Therefore we can reasonably assume that dealers have a higher tendency to tamper with the mileage figures of the vehicles.

Now going back to the question of whether to buy from a direct owner or a dealer: Instead of making a definitive choice, it would be more practical for a buyer to be extra careful with a dealer vehicle that has a low mileage but also a low depreciation.

Similarly, a buyer should always negotiate with a dealer for a lower depreciation for a vehicle model that is also available from a direct owner in a similar condition.

Question 2: Is sgCarMart Forgoing Revenue from the Lower Segments?Another interesting observation from Tables 3 and 4 is that vehicles sold by the direct owners generally have a higher engine capacity, with a mean of 2,066.

5cc and a median of 1,984.

0cc vs a mean of 1,863.

2cc and a median of 1,598.

0cc for dealer-sold vehicles.

Engine capacity has a positive relationship with depreciation due to two reasons — (1) A premium vehicle commands a high depreciation and usually comes with a large engine.

(2) A vehicle will command a higher depreciation than the same model of a lower engine capacity, e.

g.

a 2,500cc Toyota Camry vs a 2,000cc Toyota Camry.

This suggests that dealers sell a higher proportion of lower segment vehicles, and direct owners sell a higher proportion of higher segment vehicles.

To investigate this claim, we do a count of direct owner vs dealer sales for the various segments:Table 5: Count of Vehicle Sales by Direct Owner vs Dealers, for the Various SegmentsAs seen in Table 5, the ratio of dealer-direct owner sales increases dramatically as we go down the segments from Ultra Luxury to Economy.

The previous claim is indeed true.

Owners of premium vehicles are more likely to conduct the sale in their own capacity.

This could be due to dealers offering these owners poor prices, in a bid to protect themselves (dealers) against high repair costs, an issue common among premium, continental brands.

Owners of premium segment vehicles could be more savvy too, hence are able to conduct the sale on their own by listing on sgCarMart.

On the flip side, owners of lower segment vehicles in the Economy segment have a higher tendency to trade in or sell their vehicles to a dealer, rather than conduct the sale on their own.

From sgCarMart’s perspective, it matters if a car owner decides to sell the car on sgCarMart or directly to a dealer.

There are three possible ways to a transaction:Owner sells directly to dealer without listing on sgCarMart -> dealer lists car on sgCarMart -> car sold to private buyer (car listed once)Owner lists car on sgCarMart -> dealer buys the car -> dealer lists car on sgCarMart -> car sold to private buyer (car listed twice)Owner lists car on sgCarMart -> car sold to private buyer (car listed once)Based on Table 5, there are too many Economy segment car owners transacting as per option 1 above.

Economy segment vehicles sold by dealers also outnumber the combined total of all other vehicles.

sgCarMart can increase its revenue significantly if more Economy segment car owners choose to list themselves, as the same car can possibly be listed a second time if it was sold to a dealer.

As of now, it charges a flat \$58 for a standard ad that remains listed until the car is sold.

Comparison of Direct Owner Revenue and Dealer Markup for 2 Sales Sequences (estimated, not to scale)The figure above shows the estimated revenue and markup breakdown for dealer sales that are preceded and not preceded by a direct owner sale.

By encouraging car owners to list themselves, some of these cars will inevitably be sold to private buyers.

However, dealers are very likely buyers too, as they are aggressive in scouting for and closing deals on cars that can be flipped for a profit.

If there were more of type 2 dealer sales and less of type 1, sgCarMart will capture higher ad revenue while reducing the markup dealers receive.

The direct owner is also likely to receive a higher price as he will receive offers from more dealers.

Without information on car owners, we can only postulate 3 reasons why a car owner would sell directly to a dealer instead of conducting the sale himself:Quick sale, urgent need of cash — difficult to capture due to pressure to sell vehicle quicklyConvenience, no need to deal with fussy buyers — difficult to capture too, as that would require sgCarMart to offer consignment servicesNot technologically savvy, i.

e.

unable to list the car on his own or never heard of sgCarMart before — target groupThe third group of car owners can be captured by sgCarMart through offering a service that helps these owners create their listings.

Once the vehicles are posted, the owners will only need to manage phone calls and texts from prospective buyers.

As of now, sgCarMart does not appear to have such a service, with the listing process all done online by the car owner.

sgCarMart already offers a service called sgCarMart Connect, which helps buyers and sellers settle the paperwork associated with a transaction.

By helping tech-challenged owners post their listings, sgCarMart can provide a truly end-to-end service that captures a wider range of customers.

Question 3: Premium Ads — Do They Work?According to a friend who works at sgCarMart, premium ads come with more images and have higher search priority.

Only dealers are given the option to use premium ads, and there is a limit to how many premium ads dealers can use.

A premium ad on sgCarMart stands out visually in many ways, for instance having a green label when it appears in the search results as follows:Sample Search Results from sgCarMartNow let’s create a new DataFrame excluding vehicles with Mileage = 0, before splitting it into two DataFrames for premium and non-premium ads.

Then we run the descriptive statistics:Table 6: Premium Ad Descriptive Statistics (Excl.

0 Mileage)Based on the Tables 6 and 7 above, premium ads generally result in faster sales — on average taking 11.

2 days to sell, with a median of 7.

8 days to sell, with a median of 9.

0 days.

One might expect the faster sales for premium ads to happen at lower prices, but the descriptive statistics indicate otherwise.

Furthermore, premium ads on average and at the median have lower mileage, longer periods of COE left, and were manufactured later.

In other words, premium listings contained vehicles that were newer and in better condition.

In short, premium ads are effective as they result in faster sales on average.

Would premium ads show any interesting patterns when broken down into the various segments?Table 8: Count of Vehicle Sales by Premium vs Non-Premium Ads, for the Various SegmentsThere is no meaningful relationship between the segment and whether the ad is premium or not.

The ratio of non-premium to premium ads are in the high 2's for 4 out of 6 of the segments.

Question 4: What Keywords I Should Use in Describing the Vehicle?Every sgCarMart listing has 3 fields of descriptive text, namely Features, Accessories and Description.

The goal is to find out if the text has any relationship with the speed of the sale (days taken to sell the vehicle) by fitting a few types of regression models, namely support vector regression (SVR), decision tree regression and linear regression.

The first thing we do is to create a new column in the main DataFrame df that merges the contents of all of the three fields:Then we import the relevant modules and clean up the text:Next, the bag of words model was created.

ngram_range for TfidfVectorizer() was set as (1,2) to capture single words as well as word-pairs which may be semantically important:A regression model will be explored with the independent variable being the tf-idf matrix and the dependent variable being the number of days it takes to sell a vehicle.

We then split the dataset into training (85%) and test sets (15%):Then we start off by performing grid search on SVR for the optimal parameters, with the mean square error (MSE) as the scoring criterion:The poly kernel was excluded as it failed to converge to a solution.

The same was done for the linear kernel for the error penalty term C = 10, 100 and 1000 for the same reason.

Next, the results are generated:Table 9: Grid Search Results on the SVR ModelThe best parameters for SVR are kernel = ‘rbf’ and C = 10, with the lowest MSE of 152.

55.

That translates to an RMSE of 12.

35, which is a very poor fit considering that Days to Sell has a mean of 12.

48 and a median of 8.

00.

Next, we fit SVR, decision tree and linear regression models to the tf-idf vectoriser, calculating the mean RMSE over 10 iterations:Table 10: Mean RMSE for the 3 Regression ModelsAll 3 models have massive RMSEs and are not useful in predicting the number of days a car takes to get sold.

Let’s investigate the linear regression coefficients of the word/word-pairs for any insights.

The linear regression model is re-run 10 times to generate the features and their corresponding coefficients:Tf-idf Features and Their Coefficients for 10 Linear Regression IterationsNext we do a quick and dirty average of the coefficients and sort the averages to see which coefficients may be important:The coefficient values ranged from -2.

33 to 3.

61.

Let us look at the 10 features with the lowest coefficients and the 10 features with the highest coefficients:Table 11: Top Linear Regression Features Based on Coefficient MagnitudeSome of the feature names look odd as they were reduced to their word stem by the PorterStemmer before the Text column in the database was vectorised.

Table 11 includes their translations and the contexts in which they apply.

The features with high absolute coefficients are represented at both the positive and negative ends of the coefficients.

These features are ignored.

Let’s find out if the 5 non-repeated features (in orange above) affect the number of days taken to sell a car.

We create in the main DataFrame df 5 new columns for the 5 features, search through the Text column for matches, and insert 1 if the features appear and 0 if they do not.

Fortunately sgCarMart capitalises the first letter of ALL words in its text fields.

Thus the number of word-permutations that have to be done is kept low.

Next, we plot histograms for a quick visual check if the distribution of Days to Sell is affected by the 5 features:Key Tf-idf Features and Their Impact on the Number of Days Taken to Sell a VehicleThe two histograms for each feature appear very similar.

Note that the ‘sta vicom’ feature provides a poor comparison due to the small number of samples that contain that feature.

The histograms offer no insight.

Let us look at the descriptive statistics in a combined pandas DataFrame:Table 12: Descriptive Statistics on 5 Key Tf-idf FeaturesFor the features ‘fog’ and ‘evalu’, there is little difference in the mean and the general distribution of the days taken to sell.

Samples that contained the ‘non oblig’ feature had a higher mean but the feature had a negative coefficient in the linear regression model.

This is likely due to the multi-collinearity of the 1,000 text features as multiple features can be used together frequently.

Looking at a single word or word-pair on its own is unlikely to yield any meaningful insight.

Lastly, samples that contain the ‘drive away’ feature had a higher mean and median than those without.

This is in-line with its positive coefficient, but is likely just coincidence given the multi-collinearity problem.

In short, the multi-collinearity problem results in the td-idf + regression analysis yielding limited insight.

We are unable to tell which words result in a faster sale.

There is likely no relation between the text and the days taken to sell a car.

The tf-idf analysis might have been more useful in a classification problem instead of a regression problem.

Conclusion of the Exploratory Data AnalysisLet’s summarise the key takeaways for the various parties:sgCarMart: help non-savvy car owners create their listings, so as to capture a part of the massive chunk of Economy segment owners who are not listing with sgCarMartCar owners selling: no real takeaway given the fuzzy tf-idf analysis resultsCar buyers: beware of cheap, low-mileage cars from dealers; always negotiate with a dealer for a lower price than what is listedDealers: select wisely your premium ads as they sell 12.

5% faster than non-premium ads on average; if you are tampering with the odometers, think twice!Part 4: Closing RemarksThis project yielded more insights that I had initially expected.

While I would love to have direct access to sgCarMart’s database for a larger dataset with more features, I am satisfied with the outcome of this project given its fairly tight timeline of 3 months.

Perhaps a longer data collection period would have contributed to a better exploratory analysis.

Nevertheless, I look forward to any discussion or feedback on areas of improvement.

Feel free to message me on LinkedIn or simply drop a comment below.

You may also find the code of this project on my Github.