How to Get Access to an Unpublished Dataset

Because the “real world” isn’t Kaggle…

Kaggle has made sizable real-world datasets accessible to data scientists in a transparent and unsurpassed way. Unfortunately, as more new data scientists rely on Kaggle to hand them data, they miss out on developing one of the most important skill sets in our field: hunting for and acquiring hard-to-find or privately held datasets.

This post will help you discover unpublished datasets and will teach you a simple framework to “ask and acquire”.

When will I ever need to find a way to get a dataset?

Eventually you will need data and a Google search won’t be enough to get it. It happened to me this weekend. IT WILL HAPPEN TO YOU.

As an Employee: Data within companies is often sequestered away and held by various individuals or departments. Knowing how to identify and persuasively contact the key stakeholders who have the access and authority to share this data is critical to accomplishing your tasks in a timely manner.

As an Entrepreneur: For the data-driven entrepreneur looking to validate a product idea, the data required to make informed decisions often already exists in some form, whether academic or professional, but it sometimes sits behind costly paywalls or is only publicly available in a limited scope. Knowing how to find it and get access saves time and money.

As an Academic: Within academia, the drive to discover, document, and publish previously unknown knowledge requires diligently questioning every assumption and assertion to identify opportunities for a breakthrough. Without basic communication skills, the academic risks “discovering” something that is already known, or wasting time and research resources.

This weekend I had one of these experiences.
At dinner a friend asked: “Is it possible to remove the alcohol from bourbon without losing the complex flavors and odors?”

As I began to Google, I quickly realized that the only answer I would find would be one I determined on my own. With that question I was on my way down the rabbit hole.

Phase 1: Identify the question/data you need

Define your question

This is the most important part. The question you are asking often defines the information you need. It also limits the scope of the data you are trying to acquire, which helps when you are asking someone who is in no way obligated to respond to your request.

Google it or do an Intranet Search

As data scientists we are already pretty damn good at Googling and performing heuristic-based searches for information. In a corporate setting, use your internal tools and documentation. For academics and entrepreneurs, Google Scholar can be invaluable.

Performing a thorough search before asking others is essential: it makes sure your question is solid and shows people that you respect their time. It also helps build trust when it comes time to ask for access.

My search led to a few promising studies on the molecular weight of compounds found in bourbon. As a result, my question evolved. The search also helped me identify one study and dataset that looked particularly promising and useful.

Use Human Interaction

Actually having to talk to people can be intimidating, but it often makes the difference in how quickly we can gain access to quality information.
A simple phone call, email, or DM is the place to start.

Honestly, as long as you are polite and to the point, there’s almost no wrong way to do it.

You JUST have to DO IT.

Once you’ve defined your question and identified who has access to the data you need (or who may know someone who does), you can move to Phase 2.

Phase 2: Ask those you find and find by asking

Ask & Find

Whether by email or phone, the template is pretty much the same: craft and deliver an “ask” message that is polite and straightforward.

My “Ask” email to the bourbon study researcher.

Remember:

Convey that you are interested in their work and the data they have worked so hard to collect. It gets them stoked and helps them give a damn about you.

Be human. Write each email for that person specifically. You don’t want your email coming across as robotic or devoid of emotion.

Don’t ask people to do anything that would get them in trouble. You should also clearly and truthfully state what you intend to do with the data.

Follow up the next day if you get no response. Simply say “Hi ___, did you receive my email?” You’re not trying to sell them anything, so don’t hesitate to be persistent. A simple, polite follow-up is key.

Simple emails often get quick replies. If you receive a quick response, assume their time is scarce and follow up immediately.

The individuals with the data are an invaluable resource. Time is their limited resource.

Find & Ask

Once you’ve struck up a conversation, you’re in the final stretch. You’ll find that people are very helpful at this stage. You may end up emailing back and forth with them to find the right data, but this process has the benefit of helping you determine what data you really need.

Success!
A better dataset than I could have ever found on my own.

If you hit a conversation dead end or get a “no”, be sure to follow up and ask whether they can direct you to any other useful datasets or know of any other resources for locating the data you need. People will almost always try to be helpful, as it softens having to tell someone “no”.

Don’t be afraid to ask simple questions about the dataset to see what additional data they have. With a little bit of luck they’ll get you the exact dataset you need and save you tons of time.

Some asks will lead you to a dataset and some will lead you to a new person to ask, but regardless of who or what you find, always remember to say thank you.

The most important email: the Thank You.

My own search for the bourbon dataset ended in success because of these few tips.

Just remember, you are asking someone to share something they have likely spent 1000x more time working on and looking at than you have.

If you bring enthusiasm for their work, it will be reciprocated.

If you have any other tips for writing cold emails as a data scientist, please share them in the comments.
