Using GitLab’s CI for Periodic Data Mining

At this point, we have a script which downloads and stores the news that we need and runs every hour.

However, artifacts are split per job run, so we need to write another script that downloads all our JSON artifacts and aggregates them into a single dataset.

Aggregating Artifacts

Since we will be using GitLab’s API for downloading the artifacts, we need some initial information: the project’s ID and an access token for HTTP requests.

To find the project’s ID, just navigate to the project’s GitLab page. To create a new access token, go to your profile settings from the top-right corner, click on the Access Tokens tab, fill in a token name and click Create personal access token. The token should be displayed at the top of the page.

Save that somewhere because we will need it for the next steps.
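If you want a quick sanity check that the ID and token work together, a one-off request against GitLab’s projects endpoint will do (the values below are placeholders):

```python
import requests

PROJECT_ID = "<project-id>"      # placeholder, use your own
ACCESS_TOKEN = "<access-token>"  # placeholder, use your own

# A 200 response carrying the project's name confirms both values are valid.
response = requests.get(
    f"https://gitlab.com/api/v4/projects/{PROJECT_ID}",
    headers={"PRIVATE-TOKEN": ACCESS_TOKEN},
)
print(response.status_code, response.json().get("name"))
```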

With these, you can use the script below to download all artifacts, extract them into a directory and load them into memory. Make sure that you have replaced the project-id and access-token values in the CONFIG class before running.
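A minimal sketch of such a script, built on GitLab’s jobs and job-artifacts API endpoints, might look like the following; the artifact layout (JSON files, each holding a list of entries) and the title-based deduplication are assumptions:

```python
import io
import json
import zipfile
from pathlib import Path

import requests
from progress.bar import Bar


class CONFIG:
    # Replace these with your own values (placeholders here).
    PROJECT_ID = "<project-id>"
    ACCESS_TOKEN = "<access-token>"
    BASE_URL = f"https://gitlab.com/api/v4/projects/{PROJECT_ID}"
    ARTIFACTS_DIR = Path("artifacts")


def fetch_successful_jobs():
    """Return all successful jobs of the project, following pagination."""
    jobs, page = [], 1
    while True:
        response = requests.get(
            f"{CONFIG.BASE_URL}/jobs",
            params={"scope[]": "success", "per_page": 100, "page": page},
            headers={"PRIVATE-TOKEN": CONFIG.ACCESS_TOKEN},
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:
            return jobs
        jobs.extend(batch)
        page += 1


def download_artifacts(jobs):
    """Download each job's artifacts archive and extract it into ARTIFACTS_DIR."""
    CONFIG.ARTIFACTS_DIR.mkdir(exist_ok=True)
    bar = Bar("Downloading", max=len(jobs))
    for job in jobs:
        response = requests.get(
            f"{CONFIG.BASE_URL}/jobs/{job['id']}/artifacts",
            headers={"PRIVATE-TOKEN": CONFIG.ACCESS_TOKEN},
        )
        # Jobs whose artifacts have expired return 404; skip them.
        if response.status_code == 200:
            with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
                archive.extractall(CONFIG.ARTIFACTS_DIR / str(job["id"]))
        bar.next()
    bar.finish()


def load_news():
    """Load every extracted JSON file; deduplicate entries by title (assumed key)."""
    news = {}
    for path in CONFIG.ARTIFACTS_DIR.glob("**/*.json"):
        for entry in json.loads(path.read_text()):
            news[entry["title"]] = entry
    return list(news.values())


if __name__ == "__main__":
    jobs = fetch_successful_jobs()
    download_artifacts(jobs)
    dataset = load_news()
    print(f"Loaded {len(dataset)} unique news entries")
```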

Additionally, the progress package is needed as an extra dependency, so go ahead and install it:

pip install progress

And that was the last piece needed for this tutorial, folks.

After waiting for a couple of days I ran my aggregation script and already had 340 unique news entries in my dataset. Neat!

Recap

If you have followed all the steps from the previous sections you should end up with the following files:

- feed_miner.py
- requirements.txt
- aggregate.py
- .gitlab-ci.yml

These include:

- A script which downloads and stores an RSS feed to a JSON file.
- A GitLab CI configuration file that defines a pipeline to install Python dependencies and run the miner script, scheduled to run every hour.
- An aggregation script that downloads all artifacts from the successful jobs, extracts them and reads all news records into memory while removing duplicates.

With all this in place, you can sit back and relax while the data is mined for you and stored in your GitLab repository.

A potential improvement would be to create another pipeline which runs the aggregation script every week or so and creates a CSV file, but further data processing is entirely up to you.
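As a sketch of that CSV step (the field names here are assumptions about what the feed entries contain):

```python
import csv

def export_csv(entries, path="news.csv"):
    """Write the aggregated entries to a CSV file."""
    # Assumed entry fields; adjust to whatever the miner actually stores.
    fieldnames = ["title", "link", "published"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(entries)
```

Running something like this at the end of the aggregation script would leave you with a ready-to-analyze dataset.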

I hope you enjoyed the tutorial, folks! You can find the complete code in my GitHub repository here.
