So you want to be a Data Scientist

by Guest Post on July 29, 2016

in SQL Server

In the dark days of the last millennium, data scientists were serious statisticians who used exotic hardware and expensive software, and were largely isolated from the rest of the organisations they worked for. Today the cloud and open source languages like R and Python have made this technology available to anyone curious enough to be interested in it.

So what do you need to get started? – My Top Six would be:

Curiosity

Are you the sort of person who is always questioning why things are the way they are in your organisation instead of taking matters on trust? Are you keen on experimenting with new technologies? Are you interested in the art of the possible? If the answer is yes, then data science might be for you, if you are also...

Evidence Based

After a presentation I gave, I got a great e-mail from a lady called Charisma from Ghana: “In God we trust; everyone else, bring data”. Data scientists need to be scientists, using their creativity and curiosity to come up with hypotheses and then prove them.

Statistics

Yes, you do need to understand the basics of statistics, but with the amazing tooling out there and great online learning resources like edX and Coursera, you can acquire the principles and then quickly apply them to your problem domain.

Data Skills

Hopefully you are reading this because you are some sort of data professional, e.g. from a BI or data background. You know about joins, you get data quality, and you worry about missing data, Cartesian products and how annoying working with dates can be.

Business knowledge

Hopefully you understand the domain you are working in, and what you do is closely aligned to it. This is your sanity check: when a piece of analysis suggests a cause and effect, it should actually make sense and not merely be a couple of random statistics that happen to share the same curve. For example, per capita cheese consumption is not really related to deaths from getting entangled in bed sheets, as per Spurious Correlations by Tyler Vigen:

image
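To see how easily two unrelated series can "share the same curve", here is a minimal Python sketch. The two series are numbers I made up purely for illustration (they are not Tyler Vigen's figures), and `pearson` and `changes` are helper names of my own. Both series simply drift upwards over time, which is enough to give a correlation close to 1; compare the year-on-year changes instead and most of that apparent relationship goes away.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def changes(series):
    """Year-on-year differences, a simple way to take out a shared trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Made-up yearly figures (assumption: illustrative only, not real data).
# The only thing the two series have in common is that both rise over time.
series_a = [10, 12, 13, 15, 18, 19, 22, 23, 26, 28]
series_b = [100, 105, 107, 108, 112, 116, 118, 120, 124, 127]

r_levels = pearson(series_a, series_b)
r_changes = pearson(changes(series_a), changes(series_b))

print(f"correlation of the raw series:        {r_levels:.2f}")
print(f"correlation of year-on-year changes:  {r_changes:.2f}")
```

A high correlation between two trending series tells you very little on its own, which is exactly where knowing the business has to step in.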

Ethics

Just because we can doesn’t mean we should. In the work I do I am trying to make a difference, so my data science dojo is a hack with a charity trying to improve a given situation, be that tackling environmental issues, saving lives or at least improving the quality of those lives. I also mention this because computers don’t have ethics per se, any more than a newborn baby does; they are imbued with behaviour patterns by humans. My litmus test is: would my reputation improve if my use of analytics and big data was made public?

Is data science for you?

I mention all this because demand for data scientists is outstripping supply, and while many people like me are working with academia to drive and promote the next generation of data scientists, we have a huge hole today. Now I am not going to suggest you can just mug up on some of this stuff and put data scientist on your CV. Rather I am suggesting there are places in the data science world for people like me who have come from a data/BI background, as many of our skills are transferable. For example, our knowledge of the business and our data skills are still very valuable.

However, we are likely to be light on statistics, and even if we studied it to some extent we have forgotten a lot of it; what matters now is applying those statistical theories to our data. For example, which algorithm should I use to select which attributes of a patient and their doctor's appointment are most likely to influence whether they attend that appointment? In Azure Machine Learning (MAML), if you use the filter based feature selection module to do this you are presented with seven options, such as Chi Squared, Spearman, Pearson, Fisher and Kendall, but which to use? The answer is a post in itself, and the good thing about data science is that there are lots of resources online. The bad thing about the online content is that while much of it is written in English, it is English-Stats not English-GB, so it can be very hard to decode as so much prior knowledge of stats is needed. One thing I use a lot is this article on which algorithm to use for what in MAML, and even if you are in R or Python or some other technology it is still useful.
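To give a feel for what filter based feature selection is doing, here is a rough Python sketch of the general idea: score each candidate attribute against the outcome on its own, then rank the attributes by the strength of that score. This is not how the MAML module is implemented. The patient data is entirely made up, and names such as `rank_features`, `days_waited` and `patient_age` are my own assumptions for illustration. It compares just two of the scoring options, Pearson (linear relationships) and Spearman (any consistently rising or falling relationship, because it works on ranks).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = sqrt(var_x * var_y)
    return cov / denominator if denominator else 0.0


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks of the data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def rank_features(features, target, score):
    """Score every feature against the target; strongest relationship first."""
    scored = [(name, score(values, target)) for name, values in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical appointment data (assumption: invented for this sketch).
features = {
    "days_waited": [1, 2, 3, 5, 8, 4, 13, 21, 34, 55],
    "patient_age": [34, 61, 25, 47, 52, 58, 70, 38, 44, 29],
}
attended = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]  # 1 = turned up, 0 = did not

by_pearson = rank_features(features, attended, pearson)
by_spearman = rank_features(features, attended, spearman)

for label, ranking in (("Pearson", by_pearson), ("Spearman", by_spearman)):
    print(label)
    for name, value in ranking:
        print(f"  {name}: {value:+.2f}")
```

In this made-up data the waiting time is heavily skewed, so the rank-based Spearman score picks up the relationship with attendance more strongly than Pearson does. That is the kind of difference hiding behind the choice of option, and why it pays to understand what each one is measuring.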

So a question – Do you want to get into data science?

To test this hypothesis for yourself, I would encourage you to look at the Microsoft data science degree course on edX. This can be done for free to see if you like this stuff, but if you like what you are seeing then you can actually get a degree. To do that you have to pay for a verified certificate for each of the ten modules you must do, and there are options at various stages.

image

The last module is an actual project where you apply what you have learnt, and on a side note my good friend Amy Nicholson (@amykatenicho) has just made the module on Developing Intelligent Applications so watch out for tough questions on this!

So let’s do some science: see if the course helps, and feed back any comments you have as you learn.

Note: none of these degree courses is going to go deep on the stats I have been talking about, but there are courses out there for that too. My own journey has been greatly helped by Prof Andy Field and his book on learning statistics using R. He’s also all over YouTube...

image

You have been warned!


Insufficient data from Andrew Fryer
