This Job Posting Does Not Exist

I set out to train a model that could generate job postings for a wide variety of job titles.

This post will walk through how to fine-tune GPT-2 on a custom dataset, from acquiring the data via web scraping to tweaking the settings and running the model, with relevant explanations and examples along the way.

I use job posting data for my particular application; however, the content from this post can be applied to other text generation tasks and datasets.

Background on GPT-2

If you’re not familiar with GPT-2, it’s the latest language model released by OpenAI and the successor to OpenAI’s original GPT model.

The GPT-2 model was trained on over 8 million web pages, approximately 40 GB of text from outbound Reddit links, which encompass a wide variety of topics.

While this model performs well on general text generation thanks to this pretraining, you’ll still need a small dataset of your own to perform transfer learning and fine-tune the model’s weights if you want to use it for your own specific task.

As part of creating my own dataset, I resorted to web scraping to quickly get the data I wanted.

Show Me The Data

Dataset statistics:

* Total size: ~210 MB
* Number of job titles: ~450
* Number of postings per job title: ~100
* Time to get data: ~1 hour

Since I wanted GPT-2 to create job posting text for all kinds of jobs, I gathered 450 job titles from this blog and wrote the values to a CSV file.
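If you’re curious what that step might look like, here’s a minimal sketch using Python’s built-in csv module. The filename, column name, and sample titles are illustrative, not taken from my actual scripts.

# Minimal sketch: write scraped job titles to a CSV file.
import csv

job_titles = ["Data Scientist", "Accountant", "Data Engineer", "Hairdresser"]

with open("job_titles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["job_title"])                     # header row
    writer.writerows([title] for title in job_titles)  # one title per row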

I chose to get approximately 100 job postings for each job title since I wanted to test the minimum amount of data required to effectively train the model.

To get this somewhat comprehensive dataset relatively quickly, I used the BeautifulSoup and asyncio libraries together with a bash script that uses multiprocessing to scrape Indeed, a site that permits web scraping.

You can check out the scripts I used to acquire the data here and here as examples of what web scraping scripts with BeautifulSoup, asyncio, and multiprocessing might look like.

If you decide to create your own dataset via web scraping as well, I highly recommend using asyncio and multiple processes to cut down the time; a minimal sketch of the async fetching follows.
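This sketch assumes the aiohttp library for HTTP requests; the URLs and the div selector are hypothetical, since real sites need site-specific parsing. You’d typically fan this out across multiple processes, one per job title.

# Minimal sketch: fetch several pages concurrently and parse with BeautifulSoup.
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_posting(session, url):
    async with session.get(url) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical selector; inspect the target page to find the real one.
    body = soup.find("div", class_="job-description")
    return body.get_text(separator="\n") if body else ""

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_posting(session, u) for u in urls))

postings = asyncio.run(fetch_all([
    "https://example.com/job/1",
    "https://example.com/job/2",
]))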

Let’s Get To The Good Stuff

Once you have a dataset ready, I would suggest forking this brilliant repo, which already provides a way to train the GPT-2 model in TensorFlow.

Since I don’t have my own personal GPU and don’t want to spend any money training this if I don’t have to, I decided to use my local computer.

While there are alternatives that offer free GPU usage (Google Colab or Kaggle kernels), each of those has time limits on its usage.

Since I wanted to train the model overnight, I didn’t want to worry about the GPU shutting down and opted for training the model locally.

In order to use the code within the repository, I had to follow the instructions and set up the environment, which involved installing the required libraries and downloading the models (117M and 345M).

From here on out, this post will refer to the scripts within that repo.

When training the model, I chose to use the version with 117 million parameters and the dataset I acquired from scraping Indeed, stored in the data/job_postings directory (not included in the repo because of the size of the data).
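If you’re preparing your own data, one reasonable layout (and roughly what I’m assuming here; the postings_by_title dict is illustrative) is one plain-text file per job title, with individual postings separated by GPT-2’s <|endoftext|> token. Double-check the repo’s README for exactly what its dataset loader accepts.

# Hedged sketch: lay out scraped postings as .txt files for training.
from pathlib import Path

postings_by_title = {"data_scientist": ["posting one...", "posting two..."]}

out_dir = Path("data/job_postings")
out_dir.mkdir(parents=True, exist_ok=True)
for title, postings in postings_by_title.items():
    text = "\n<|endoftext|>\n".join(postings)
    (out_dir / f"{title}.txt").write_text(text)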

In order to train the model, I used the following command from the main directory of the repo.

PYTHONPATH=src ./train.py --model_name ./models/117M --dataset ./data/job_postings --run_name job_postings_6_20_19

Here I specify which model I’m using (117M), the dataset, and the name of the directory in which the model weights will be stored (job_postings_6_20_19).

Since training the model until it would generate decent results would probably take a while, I let the model run overnight for approximately 8 hours.

Afterwards, I copied the model weights from the job_postings_6_20_19 run directory into the models/117M directory so the sampling script would pick them up.
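That copy step is just file shuffling; a hedged sketch is below. The checkpoint/job_postings_6_20_19 path and the model-* filename pattern are assumptions about the repo’s layout, so adjust to wherever your run actually saved its weights.

# Hedged sketch: copy fine-tuned checkpoint files over the base 117M model.
import shutil
from pathlib import Path

run_dir = Path("checkpoint/job_postings_6_20_19")  # assumed run directory
model_dir = Path("models/117M")

for f in run_dir.glob("model-*"):              # .index / .meta / .data files
    shutil.copy(f, model_dir)
shutil.copy(run_dir / "checkpoint", model_dir)  # points TF at the new weights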

I then tested the model with the following command.

python ./src/interactive_conditional_samples.py --top_p 0.9 --model_name 117M

Here, I used the --top_p flag for nucleus sampling, which is like top-k sampling with a dynamic k.

Top-k sampling is a method in which, every time you generate a word, you choose from the k words with the highest probability of coming next.

The default for the --top_k flag is 40, but you could easily use a higher value if you want more variability, or a smaller value if you would like the output to be more deterministic.

With nucleus (top-p) sampling, you instead set a probability threshold p and consider however many words are needed for their cumulative probability to reach that threshold.

In this way, you may consider only one word in scenarios where the model knows exactly what it wants to choose next, but many more when the model is uncertain.

Nucleus sampling helps prevent the model from generating overly repetitive sequences.
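If the difference between the two strategies is hazy, here’s an illustrative NumPy sketch of both over a toy next-word distribution. This is not the repo’s TensorFlow implementation, just the idea in miniature.

# Illustrative sketches of top-k and nucleus (top-p) sampling.
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k=40):
    # Keep the k most probable words, renormalize, then sample.
    top = np.argsort(probs)[::-1][:k]
    return rng.choice(top, p=probs[top] / probs[top].sum())

def top_p_sample(probs, p=0.9):
    # Keep the smallest set of words whose cumulative probability reaches p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    top = order[:cutoff]
    return rng.choice(top, p=probs[top] / probs[top].sum())

peaked = np.array([0.95, 0.03, 0.01, 0.01])
print(top_p_sample(peaked))  # the set collapses to one candidate: always word 0
flat = np.ones(10) / 10
print(top_p_sample(flat))    # a flat distribution keeps many candidates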

After typing in that command, the following prompt appeared:

Model prompt >>>

At that point, you can input whatever starting text you want.

In my case, I fed in a job title to reap the rewards of my model.

Below are some select results.

Results

Here, I walk through four examples where the model performed well and poorly.

So when I input the prompt Data Scientist, this is the result I got.

Gather and analyze all outputs of activities and operations. Provides guidance to ensure timely calibration, understanding and calibration of data, and to support overall execution of execution plan. Provides guidance on various operational goals, critical metrics, and activities carried out through the conversion of data (data sheets, charts, graphics, charts and tables) to and from mainframes and delivery schedules. High level of expertise and service delivery at a cost efficient manner. Prior experience under management associated with data support contracts.

Benefits Receive support from multiple disciplines, including Clinical, BPO, Healthcare, Information Technology, CRM, In-GPO, IBM, NASA, Data System Engineering, and others. Compensation & Benefits In a typical first year position, Microsoft offers a reasonable compensation package (down to $30,000-$40,000 depending on the type of position). Basic benefits include Medical, Disability, and Pension. If this option is available for you, you will receive fair compensation and benefits, including up to two full-time and part-time days in a year.

We are looking for someone who: Relies on excellence in the HR and data science fields Works hand in hand with colleagues and with Microsoft’s colleagues to enable innovation in a company-wide, engineering-driven culture Leadership and leadership skills in achieving goals, keeping customers interested, and helping each other The Microsoft HR team is a team based in Redmond, WA and provides full-time, part-time and occasional holiday leave. The team is comprised of engineering, business and other development specialists. Maintains strict confidentiality and confidentiality policies while dealing with employees in all activities related to the company’s external business. Provides technical support to external vendors in connection with the Company’s internal business activities.

Conflicts of interest / Ethics Maintains a strict ethical code of conduct, including the following: No (non-indirect) contact with the employee, including but not limited to, a publication of unsolicited business tactics, actions or representations. No appearance of undue influence on our internal or external vendors, vendors, staff, or the Executive team No unlawful communication of information to the employee, including posting on LinkedIn in an effort to influence the employee to choose those indicated by the employee in conversations with the Executive team.

At the very least, the model generated a somewhat convincing albeit wildly unappealing job posting (because of the low compensation).

Here’s what I got when I put in the title Accountant.

Benefits & Duties Estimates and analyzes information received from users, employees, clients and vendors. Makes recommendations to manage and improve the business, financial status and business operations of the enterprise and manages and provides service to individual client/ vendor groups. Assesses future compliance requirements to ensure success for the company. Comprehensive Coordination of Strategy, Product Management, Business Development and Process Development, Demography, Manager Development, Information Assurance, Remediation, Cost Accounting and Risk Management. Conducts budgeting audits to ensure alignment with overall company budgets and objectives. Manages transactional relationships with other departments and programs. Is helpful with weekly recurring updates and Budgeting etc. Acts as a liaison with senior team leaders. Acts as the information or research advisor to all members of the company. Acts as a front-line caseworker or an internal consultant on any product, plan, initiative or project.

At least 4 years of experience dealing with varied budgets. Experience with Salesforce, Salesforce Hadoop or Equation, CRM, MySQL, Puppet or Redshift, CRM Software, CRM Storage, Data SaaS or Any other software or scripting language. Experience with SQL. Experience with the SQL Server Standard Database (SSD). Experience with assisting vendors in data warehousing. Proven record in the ability to manage and address the following activities for a variety of operational tasks as well as tasks and processes within an enterprise.

Benefits We offer monthly or weekly 401k plan, 24/7 live chat, Senior management and the ability to setup a monthly IRA. The opportunity for advancement in this position is open to all employees with similar experience and ability to perform the duties as required. Job Type: Full-time Salary: $60,000.00 to $60,000.00 /year Experience: Scheduled B2B or B2C: 3 years (Preferred) Location: San Francisco, CA (Preferred) Work authorization: United States (Required) Benefits offered: Paid time off Health insurance Dental insurance Retirement benefits or accounts Commuting/travel assistance Paid Training: Yes Pay Frequency: Bi weekly or Twice monthly<|endoftext|>

That actually looks like a plausible post for an Accountant with maybe the exception of the technical requirements (MySQL, Puppet, Redshift, etc.).

Unfortunately, not all of the prompts I gave the model were as successful.

Here’s what I got when I put in the title Data Engineer.

Keywords: Customer Relations, Relationship Management, Sr. Service, Sr. Education, Service, Sr. Data Management, Sr. Data Analysis, Sr. Technical Support, Sr. User Experience, Sr. Service, Sr. Customer Relationship Management, Sr. Data Analysis, Sr. Quality Assurance, Sr. Products & Services, Sr. Business Development, Sr. Management, Sr. Director, Sr. Legal, Sr. Manager, Sr. Counsel, Sr. Account Executives, Sr. Employees and Accounts, Sr. Regulations and Rules, Sr. Business Development, Sr. Business Development, Sr. Business Development, Sr. Business Development, Sr. Business Development, Sr. Business Development, […“Sr. Business Development” repeats dozens more times…] Sr. Business

That… definitely doesn’t seem right.

Unfortunately, the nucleus sampling didn’t seem to help here. Guess the job REALLY wanted to emphasize the Sr. Business Development part of Data Engineering.

Here’s a job posting for the title of Hairdresser.

(Bon Appétit)Weekends:Monday-Friday 8:30am-6:00pm 9:30am-6:00pmPacific Crest:Monday-Friday:9am-5pmSaturday:8:30am-5:00pmSunday:Sunday (Saturdays)LeasingPlace:DeKalb CityWest RdSuite 1Roof and Mountainen AvenueRoofloomTalk, Portland (Parker and Union)Classic per H&R Block & Perk Parking standing space,hence a classification plan on top of downtown. Signed off on per week specials for entire week:Shipping directly to the studio, resupply on weekends through outship. Trash coasters, pots, microwaves, polishing and several other applications. Comfortable with special requests, and have a personable attitude to being so passionate about who we are, what we do and what we do not do. Registered Shift Coordinator with little or no day time schedule. Problem solve immediately after shift departure is preferred. One hour shift and are paid $10 an hour. This is a great opportunity for a respected, experienced shift coordinator who has a sense of humor and a creative spirit. If you are a quick learner (starting at age 5) who can handle a lot of other people or live in a city with a lot of low-paid companies, this is the job for you.

Profit Sharing expense. Is paying a server a 50% or 75% discount on pizza when you are a client. The Best Option for someone who wants to start in the world. We are your dream employer. All sales are privately owned and open to all candidates and we're a super fun job. We provide full service for both our clients and staff. We treat each client with dignity and respect. We are welcoming and responsible for all activities and expenses. Ensure reasonable accommodations for other clients and employees when applying. We work hard to give you the best possible business experience that reflects our highest standards. We have a tough time accepting positive but not hostile feedback. We can do better than that. Please send an email to the ideal candidate.

Required:If you are not currently an Employer with your private hire preference, please send this info along with your preferred picker:Yes (Please Subject Link an Email with the relevant info). Please include:Yes (Please Subject Link an Email with the relevant info). Please include: Yes (Please Subject Link an Email with the relevant info). Rec

This is one of my personal favorites just for how ridiculous it is.

I particularly love the lines “We are your dream employer.” and “If you are a quick learner (starting at age 5) who can handle a lot of other people or live in a city with a lot of low-paid companies, this is the job for you.”

They’re hilarious parodies of some of the lines that are in job postings; if I didn’t know any better, I would say that whoever wrote this has a future in comedy.

Thoughts & Recommendations

While I don’t have any rigorous statistics on why certain job postings didn’t fare so well, I suspect it was either a lack of data or not enough training.

The list of job titles had quite a few titles with the word ‘Engineer’ in them, including Biological Engineer, Chemical Engineer, DevOps Engineer, and Chief Engineer amongst others.

I suspect that the model wasn’t able to distinguish between the various engineering positions with the amount of data that was provided.

Alternatively, the data may have been enough, but training for ~8 hours on a CPU may not have been enough.

If you decide to train your own GPT-2 model, I would suggest obtaining more data than I did, or perhaps training longer than I did, for better results.

At the very least, GPT-2 seems to have learned the general structure and a surface-level semantic understanding of how a job posting looks, but it lacks a deeper, human-level understanding of what is appropriate or out of place in a job posting.

While these select samples are definitely not better than what’s already online, a few of them would be hard to identify as fake, and the concerns about GPT-2’s use in broader society seem fairly justified.

So it seems the answers to my original questions are (anecdotally at least) yes, GPT-2 can produce a coherent job posting and yes, GPT-2 can produce even more ridiculous job postings than what’s already out there.

If you want more detail on how GPT-2 works under the hood, check out this blog post, which has a pretty digestible explanation and more resources on the technical details.

As usual, please let me know if you have any questions or suggestions to make this better.
