Data Science

I got another couple of subscriptions to the Climate Data Service (see the end of the post if you want to sign up), and one of them included an interesting question:

I’d also like to ask a question. I’m thinking of making a career change to data science, but am not really sure where to start. I currently work as an analyst for the Department of Defense, so I’m interested in the national security aspects of climate change as well as the human and economic changes coming. (my background is in physics so I’m not new to quantitative analysis)

How did you get your start and what would you recommend as a good path to get there? Thanks!

I got my start doing data analysis (surprise, surprise) in astronomy. It took me some time to acquire some of the needed skills, because when I was a math major the only math class I wasn’t really interested in was: statistics! Of course, back in those days we weren’t blessed with a powerful computer in every laptop, so statistics, and data analysis, weren’t the same as they are today.

When I see job ads for data scientists, they’re usually looking for people who don’t necessarily have stats expertise, but do have some knowledge of data analysis and a background that’s definitely quantitative — those with training in math, computer science, physical science. I have a feeling there’s a lot of “on-the-job learning” going on, due to the demand in this field.

So: what skills would you want to develop? I’m hardly the greatest expert here, but I’ll give you some opinions.

First, let me tell you the definition of statistics: A mathematical science concerning the collection, analysis, interpretation, and presentation of data. So, statistics itself is the (mathematical) science of data. That doesn’t mean it’s focused on what career people do as “data science.” But it does tell you that one of the key aspects is: statistics. Most who train in math and/or science neglect this discipline (lots of programs don’t even require you to study it at all), so get a solid foundation in basic stats. Your employer won’t care about it that much, and it won’t be “most of the job” by any stretch, but you’ll be glad you did.

Another thing to learn is model-building, especially predictive models. A lot of this is classification by machine learning methods. That involves certain types of mathematical models, which you should become familiar with. You don’t have to learn all methods (they’re inventing new ones all the time anyway), but there are some you should be familiar with. These include least-squares regression (which you probably already know) and logistic regression (good for binary classification problems), as well as newer methods, including “tree” methods. Those include classification trees/regression trees, and a variant (which is en vogue right now because it’s so good) called “random forests.” You could add support vector machines to the list, and some make good use of neural nets (although I dislike their “black-box” nature, but that’s just me). Anyway, learn a variety of types of models and how they work.
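To make the first two of those concrete, here is a minimal sketch in plain Python (no libraries) of a least-squares line fit and a one-variable logistic regression fitted by gradient descent. The toy data, the learning rate, and the step count are all assumptions picked for illustration; in practice you would more likely reach for an established statistics or machine learning package.

```python
import math

def fit_least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, rate=0.1, steps=5000):
    """Fit P(label = 1 | x) = sigmoid(w * x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the average negative log-likelihood.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b

# Toy data (assumed): an exact line y = 2x + 1 for the regression ...
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_least_squares(xs, ys)

# ... and overlapping 0/1 labels for the classifier, so the weights stay finite.
labels = [0, 0, 1, 0, 1, 1]
w, b = fit_logistic(xs, labels)
prob_low, prob_high = sigmoid(w * 0.0 + b), sigmoid(w * 5.0 + b)
```

The regression recovers the slope and intercept of the toy line, and the classifier assigns a higher probability of class 1 to large x than to small x.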

Whoever hires you will want you to do it on a computer. If it’s a business venture, they’ll probably want programs developed to implement it. So, learn a high-level language that’s common: the ones I see most often are C#, Java (not JavaScript), and Python. That doesn’t mean the data scientist is the programmer — but if you can, you’re more employable.

Many jobs, especially in finance/marketing, talk about “big data.” Basically, it’s when there’s so much data you need more than one computer (distributed processing) to handle the workload. You’ll have a leg up if you know some of the tools for that: Hadoop and MapReduce.
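The MapReduce idea itself is simple enough to sketch on a single machine. The following is only an illustration of the pattern in plain Python, not Hadoop's actual API; the function names and the sample text chunks are made up. Each "map" call handles one chunk independently (which is what lets the work be spread over many computers), the "shuffle" groups results by key, and "reduce" combines each group.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Hypothetical chunks; on a cluster each would live on a different machine.
chunks = ["big data big models", "data beats models", "big data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Here `counts` ends up holding the word totals across all chunks.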

Last, but not least (perhaps most), work with data. I’m a data junkie; I can’t help myself. Because of that, I knew about model-building and statistics and stuff before I ever heard of “data science.” And, much knowledge and wisdom comes from experience. A lot of kids straight out of college have trained specifically to be data scientists, but they make “rookie mistakes” because there are so many pitfalls they’ve never seen before. And, they’re usually well-trained in model-building and programming, but weak in statistics. For top positions, I’ll take a seasoned pro.

These days they seem to be offering considerable on-line training courses through EdX. Look into that; “data science” is one of their more popular offerings.

If you want to learn statistics, consider taking the course I’m going to offer soon (I hope). It’s too often neglected, and doesn’t seem to be “sexy” enough to be as regularly offered in online courses. Also, I think I teach it better than they do.

Finally: for those interested in the Climate Data Service, the price will go up soon so you might want to subscribe now. To do so, step 1: donate $25 at Peaseblossom’s Closet, step 2: post a comment here (which I will not make public) so I know who you are and where to email it.

This blog is made possible by readers like you; join others by donating at Peaseblossom’s Closet.


35 responses to “Data Science”

  1. “back in those days we weren’t blessed with a powerful computer in every laptop”

    Yes, I remember using a slipstick (slide rule) in freshman physics. And I remember the nixie-tube displays on the four-terminal Wang electronic calculator that they installed in a small room on the top floor of the physics-astronomy building. The “suitcase” under the table was about four times as big as the suitcase shown in

    That small room quickly became much more popular than the larger (and noisier) room with all the Monroe clunker-bangers (see an example at

  2. Zeke Hausfather

    If you want to become a data scientist, move to San Francisco. As the old joke goes, a data scientist is an analyst who lives on the West Coast.

    (On a more serious note, if you want to work in data science for a tech company, learn Python and how to handle large datasets remotely. Machine learning is a big area of interest as well at the moment.)

    [Response: Or Boston.]

  3. Statistics, yes. Some computer science, yes. Numerical analysis, definitely. Curiosity, essential. Patience, essential.

    Python and its libraries, scipy, numpy, pandas, are good to know. One problem with Python is that there are Pythons, either 2.7 or 3.4+. Sometimes the choice is made by who you work with, or what’s on the platforms they run. I’m lucky that I can install Anaconda Python 3 at my whim. One problem with this is that not all modules are compatible with both Python 2.7 and Python 3.

    But, basically, I do the great majority of my work in R. You can do R at big scale, via Revolution Analytics and the packages MPI and parallel. The great thing about R is its huge set of packages, which are richer than Python’s, at least for numerical work. But, also, using R for a lot of the work depends upon having an organization that’s committed to supporting it. So, for instance, some of the time I’ll use R for exploration and algorithm development, and then rewrite in Python, or in ANSI C. Or someone else gets the latter job.

    This is important, because to make a living at data science, you want to write as little code as possible. And, warning, warning, a lot of the time you’ll spend will be in cleaning up and normalizing datasets, and learning about them.

    Who am I? I’m a statistician, quantitative software engineer, and data scientist at Akamai Technologies in Cambridge. And, yes, we’re hiring.

  4. Asking for someone close to me who is what I call a feral Excel guru — math major, self-educated in Excel and consistently over the years doing analyses that get “oh … wow …” comments from business- and science-educated Excel experts — pulling info out of several cooperating organizations’ inconsistently built data sets, unexpectedly fast and well, and getting more and more questions handed to her as a result.

    When Excel starts to bog down and takes 2 hours to save, and you need all the data available to work with, and you’ve already been given the biggest baddest Windows machine the IT people believe in, short of liquid cooling — what do you go to next?

    [Response: It’s hard to be sure what the context is, but — my first instinct is to urge her to learn R. It’s easy.]

  5. Hank: Python or R. Or maybe Tableau, depending on what the analysis is. In R, the data.table package is, once you get familiar with it, a-w-e-s-o-m-e.

  6. I use the R-based freeware package Gretl.

  7. You might want to take a look at

    Lots of material there.

  8. Aside:
    Testing gaming dies for fairness with an automated system: tip bucket, flip die, photograph, repeat: do statistical analysis ….
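For the "do statistical analysis" step, one standard choice is a chi-square goodness-of-fit test on the face counts. Here is a minimal sketch in plain Python; the counts, the loaded-die weights, and the seed are invented for illustration, and 11.070 is the 5% critical value for 5 degrees of freedom.

```python
import random

CRITICAL_5PCT_DF5 = 11.070  # chi-square critical value: 5 degrees of freedom, 5% level

def chi_square_stat(counts):
    """Chi-square statistic against the hypothesis that all faces are equally likely."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def roll_counts(n_rolls, weights, seed):
    """Simulate a six-sided die with the given face weights and tally each face."""
    rng = random.Random(seed)
    faces = rng.choices(range(6), weights=weights, k=n_rolls)
    return [faces.count(f) for f in range(6)]

# Hypothetical tallies from 60 photographed rolls, heavy on the last face.
observed = [5, 8, 9, 8, 10, 20]
stat = chi_square_stat(observed)
looks_unfair = stat > CRITICAL_5PCT_DF5

# A simulated die (assumed weights) whose last face is three times as likely.
loaded_stat = chi_square_stat(roll_counts(6000, [1, 1, 1, 1, 1, 3], seed=1))
```

A statistic above the critical value means counts that lopsided would be rare for a fair die.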

  9. Xavier Onnasis

    “If you want to learn statistics, consider taking the course I’m going to offer soon (I hope). ”

    more details please???

  10. Couple of updates:

    NSIDC has resumed plotting NH sea ice extent on the sea ice page, now extending into May. New data from April 1 is not yet available pending calibration. Extent is currently running at or near record lows (same for other sea ice indexes), but note the qualifiers on the web page.

    CU has just updated global sea level index to include the first 60 days of 2016. Small uptick now has GMSL at highest in the record (based on 60 day smooth).

    Sea level is highly correlated with the MEI ENSO index. El Nino has raised sea levels quite a bit over the past year.

    • But is the quantitative amount that ENSO contributes to SLR comparable to, in excess of, below that contributed by: (a) thermosterics, (b) ice sheets, (c) local variation such as AMOC (Gulf Stream) moving northwards against Northeast coast? ‘Til those facts are supplied your comment is interesting, but not significant.

      • Interesting is what I was aiming for. But I’ll try to answer your questions.

        At the time-scales ENSO fluctuations occur, ENSO is a stronger contributor to GMSL variation (similar to its influence on lower atmosphere/surface temperatures) than AMOC or ice sheets. AMOC fluctuations primarily impact local sea level variation. Thermosteric change in the ENSO region has the largest variance of the ocean regions. The largest contributing factor in ENSO fluctuations may be exchange of water between ocean and land.

        Removing the seasonal variations (Figure 6) reveals significant interannual variations that are correlated with El Nino and La Nina events (Figure 7)…. The correlation is significant, at the 95% level.

        Source: nerem_etal_2010_SLR_topex_jason.pdf

        The paper was written by the CU sea level compilers. There is a graph similar to Figure 7 at the bottom of the CU homepage.

        Further reading:

        ENSO thermosteric contribution

        Ocean hydrology and ENSO

      • Thank you so much, Barry!

    • Thanks, barry, a useful update/pointer. This melt season promises great ‘interest’.

    • Hopefully some people will comment on ENSO’s up and down on GMSL… I think it would wash out? This graphic shows the complete reversal of the Western Pacific bulge, which is now an Eastern Pacific bulge, during the current El Nino:

    • Yr welcome, Doc. It’s been an interesting year for climate data junkies.

      JCH, if Tamino permits, I posted a bit on ENSO/sea level in reply to hypergeometric just above.

      • The ENSO signal is indeed buried in tidal gauge signals such as in Sydney harbor. Called the inverse barometric effect. Just have to remove the annual and semi-annual portion.

        And then you will also find that ENSO follows the forcing of known angular momentum changes:

      • barry – the AVISO website allows several ways to look at the satellite sea level data.

        Saral with seasonal signal:

        Seasonal signal removed:

      • Cool, JCH. But let me put on my Capt. Obvious hat, and say it sure looks like the graphs are reversed: that is, the bottom one (supposedly with seasonal signal removed) really looks like it has a strong annual cycle to it.

        Oh, and hotels are great places to stay if you’re away from home.

      • Sorry, yes, I have them backwards.


  11. Wayne Fowler

    Hi there, I’m just leaving a comment here because, well, it seems more related than more recent ones. So I’ve started to see a new guy being cited by deniers: a Jamal Munshi. He’s put out a spate of papers on Researchgate. Tonnes of statistical studies. They are all showing no correlation between human emissions and CO2 rise, ocean acidification, C14 changes, you name it. So I’ve asked him a few questions about the no correlation between human emissions and CO2 rise. I’ve pointed out the obvious, like why is the rise less than half of human emissions, or since the land is greening and oceans acidifying, where else is this CO2 supposed to be coming from. I also pointed out the stupidity of the following line in his abstract: “The results have important implications for the theory of anthropogenic global warming because empirical support for the theory that links warming to fossil fuel emissions rests entirely on a correlation between cumulative values.”
    However I would love a quick insight into his statistical ways. Not something I know about.

    His basic method seems to be described here:
    “Monte Carlo simulation is used to compare the correlation between normally distributed random numbers with that between their cumulative values under various conditions. The Microsoft Excel function RAND() generates uniformly distributed random numbers from zero to one. The Excel function NORMSINV(RAND()) serves to create normally distributed numbers with RAND() serving as the probability value. Monte Carlo simulation requires a large number of random numbers to be generated.
    Typically 10,000 values are generated for each variable”

    The full paper is here:

    Not expecting anyone to waste much time but any clues would be appreciated. Thanks
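The kind of simulation described in that quote can be sketched in a few lines of Python. This is not the paper's code: `random.gauss` stands in for Excel's NORMSINV(RAND()), and the series length (200), the number of trials (300), and the seed are assumptions chosen to keep it quick.

```python
import math
import random
from itertools import accumulate

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def one_trial(rng, n):
    """Return |r| for two independent normal series and for their running totals."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    raw = abs(pearson(x, y))
    cumulative = abs(pearson(list(accumulate(x)), list(accumulate(y))))
    return raw, cumulative

rng = random.Random(42)  # fixed seed (assumed) so the run is repeatable
trials = [one_trial(rng, 200) for _ in range(300)]
mean_raw = sum(t[0] for t in trials) / len(trials)
mean_cum = sum(t[1] for t in trials) / len(trials)
```

Independent series give correlations near zero, while their running totals often correlate strongly by chance. That is the well-known spurious-correlation behavior of random walks; the sketch says nothing about whether the paper's conclusions follow from it.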

    • “empirical support for the theory that links warming to fossil fuel emissions rests entirely on a correlation between cumulative values” is simply his imagination and has no relation to reality.

      For example, isotopic analysis of atmospheric CO2 indicates the fossil fuel origin of the carbon. The relative changes in CO2 and O2 concentrations in the atmosphere indicate burning of the fossil carbon with atmospheric oxygen. A similar analysis can tell you if a hamburger came from a grass-fed or a corn-fed cow (there are different photosynthetic pathways involved, which results in a measurable carbon-isotope “preference” in the plant).

      The basic causal mechanisms involved in global warming are pretty well-known physics and chemistry going back to the early 19th century and have nothing to do with “a correlation between cumulative values.”

      It really doesn’t matter whether his statistics are sound or not. That one single “because” clause in the abstract tells you it is a bunch of pseudoscientific garbage.

      • I very much agree that the isotopic analysis clinches it, both of Carbon and of the Oxygen. But in the same spirit in which Tamino debunks bad regressions and interpretations of time series, it’s kind of in my wheelhouse to take apart poor statistical reasoning.

    • This is very much up my alley. I’ll have a go at dissecting it, but I may not get to this until Friday of this week, because of other commitments (including some civil disobedience against an explosive methane pipeline build, and other, unrelated work).

  12. I appreciate you taking the time to do this, whenever (or never) is of course fine. I understand all the empirical evidence regarding why we know it’s our CO2. I’m just interested in his statistical angle. I’ve already painted him into a corner… he accepts oceans are acidifying due to CO2 and land is greening due to CO2 absorption so I’m waiting on his explanation for where the added CO2 is coming from. He says we can’t be certain enough of CO2 flows to pin it on man yet he has accepted that land and oceans are obviously absorbing. The denialist mind is quite a thing to behold.

    BTW, he wrote another paper saying the isotopic connection does not correlate with fossil fuel emissions.

    • Ummm, it’s not a correlation, it’s a fingerprint.

      • Wayne Fowler

        So you wouldn’t expect the rate of change in the ratio to have any relationship to the rate of emission?

    • Where words fail, numbers are better.

      That’s depletion of C13 relative to C12 in CO2 due to dilution by plant-derived sources. This is one half of the fingerprint.

      This is a global phenomenon.

      That’s depletion of C14 relative to C12 in CO2 due to dilution by ancient sources. This is the other half of the fingerprint. It is a global phenomenon.

      That CH4 uptick doesn’t look too cool, either.

      The C13 content relative to C12 of CH4 is also decreasing, but the series is so short, as the scholars report, there is not a clear explanation yet, based upon only the series data. There are no series widely available of C14 content relative to C12 in CH4. I do not know, but perhaps there is an experimental constraint, since CH4 is rarer than CO2, and the ordinary fraction of C14 relative to C12 in (even) CO2 is small.

      These are all available at these sources.