Suggested courses

See below for course numbers of relevant seminars!

Note that not all courses presented here are still offered and not all courses that are “relevant” are included in this list (i.e. some information is outdated or incomplete).

First semester recommendations

  • MA students:
    • Students seeking the Master of Arts degree usually need to take STAT 201A (probability) and STAT 201B (statistics) if they haven’t taken master’s-level probability and statistics before. During the first fall semester, students usually take one additional class, which could be PB HLTH 252D (causal inference) or any of the PB HLTH 240 series (B: survival analysis; C/D: computational biology, etc.). PB HLTH 252D is offered every fall semester. PB HLTH 240B, C, and D are offered every other year during the fall semester. STAT 243 (statistical computing), offered every fall semester, is also a good choice.
    • For students with less stat/math background who may find STAT 201A/B too difficult, STAT 134 and STAT 135 are good substitutes. Also, STAT 133 is a good alternative to STAT 243 for students with less programming experience.
    • Other notes:
      • Enrollment in STAT 201A, 201B, and 243 can be problematic since they are reserved for statistics master’s students. If the restriction prevents you from enrolling, simply talk to the professors on the first day of class.
      • STAT 134 and 135 are also hard to enroll in due to their popularity.
      • There will be a background quiz at the beginning of STAT 201A. It covers basic probability concepts, for example calculating an expectation and drawing a standard normal curve.

Seminars

Seminars are typically worth 2 credit hours and often involve minimal time commitment but high levels of exposure to advanced topics in a low-pressure environment. Many can (or must) be taken pass/fail. Some seminars may be worth 3 credit hours and will likely involve slightly more work outside of class time, though not an extreme amount. Sometimes these course listings can be difficult to find in the online schedule of classes, but a good rule of thumb when searching is that they’re typically numbered between 270 and 299 in whatever department is hosting them. The following are a few seminars that are popular among biostatistics students:

  • PB HLTH 295 (section varies): Statistics and Genomics Seminar (not offered during Fall 2014)
  • PB HLTH 296 (section varies): Causal seminar / Semi-Parametric Models seminar
  • CMPBIO 201: Classics in Computational Biology
  • STAT 278B (section varies): Neyman seminar
  • STAT 278B (section varies): Probability seminar

General notes

  • 12 or more units is a “normal” courseload for graduate students at UC Berkeley. This is also the number of units a student must maintain to qualify for GSI or GSR funding.
  • Classes actually begin 10 minutes after the start time listed with the registrar (students and faculty often refer to this as being on “Berkeley Time,” but the practice is not unique to UC Berkeley).

List of relevant courses

Biostatistics (Public Health)

  • C240A: Introduction to Modern Biostatistical Theory and Practice (Hubbard; Spring)
  • C240B: Biostatistical Methods: Survival Analysis and Causality
  • C240C: Computational Statistics with Applications in Biology and Medicine (Dudoit; Fall)
  • C240D: Computational Statistics with Applications in Biology and Medicine II (Dudoit; Fall)
  • C240E: Statistical Genomics I
  • C240F: Statistical Genomics II (Dudoit; Spring)
  • 243D: Adaptive Designs (van der Laan)
  • C246A: Censored Longitudinal Data and Causality (van der Laan)
  • 252D: Causal Inference (Petersen; Fall)
  • 290-013: Big Data: A Public Health Perspective (Li)
  • 295 (section varies): Statistics and Genomics Seminar (Dudoit)
  • 296 (section varies): Advanced Topics in Causal Inference / Semi-Parametric Models Seminar (van der Laan/Hubbard)

Statistics

  • 133: Concepts in Computing with Data
  • 150: Stochastic Processes (Spring)
  • 152: Sampling Surveys (Spring)
  • 153: Time Series (Fall)
  • 154: Modern Statistical Prediction and Machine Learning (Fall/Spring)
  • 201A: Intro to probability at an advanced level (Fall)
  • 201B: Intro to statistics at an advanced level (Fall)
  • 204: Probability for Applications
  • 205A: Theoretical Probability (Fall)
  • 205B: Probability Theory (Spring)
  • 210A: Theoretical statistics (Fall)
  • 210B: Theoretical statistics (Spring)
  • 212: Topics in Theoretical Stats / Semiparametric models
  • 215A: Statistical models (Fall)
  • 215B: Statistical models (Spring)
  • 230A: Linear models (Spring)
  • 232: Experimental Design
  • 240: Non-parametric & robust methods (Fall)
  • 241A: Statistical learning theory (Fall/Spring)
  • 241B: Advanced Topics in Learning and Decision Making (Spring)
  • 243: Introduction to Statistical Computing (Fall)
  • 244: Statistical Computing
  • 248: Time series (Fall/Spring)
  • 272: Statistical consulting (Fall/Spring)
  • 278B: Neyman seminar, Probability seminar

Other

  • ARE 212: Econometrics: Multiple Equation Estimation (Spring)
  • ARE 213: Applied Econometrics (Fall)
  • POL SCI C236A / STAT C239A: The Statistics of Causal Inference in the Social Sciences (Sekhon; Fall)
  • POL SCI 236B / STAT 239B: Quantitative Methodology in the Social Sciences Seminar (Sekhon; Spring)
  • ECON 240B: Econometrics (Spring)
  • ECON 241A: Econometrics (Spring)
  • ECON 270D: Research Transparency Practices in the Social Sciences
  • MATH 104: Introduction to Analysis
  • PB HLTH 144A/144B: SAS Programming
  • PB HLTH C242C: Longitudinal Data Analysis (Fall/Spring)
  • PB HLTH 248: Statistical/Computer Analysis using R
  • PB HLTH 250A/B/C: Epi Methods & Theory
  • PB HLTH 252A: Applied Sampling and Survey Design and Analysis (Fall)
  • PB HLTH 254: Occupational and Environmental Epidemiology
  • AY 250: Python for data scientists (Fall)
  • CS 9H: Python for Programmers (Fall/Spring)
  • CS 294-001: Behavioral data mining
  • INFO 290: Special topic courses; topics vary by semester
  • EDUC 275G: Hierarchical and Longitudinal Modeling
  • CMPBIO 201: Classics in Computational Biology (seminar)

Employment

Teaching

Internal

External

Research

Internal

External

Computing

See below to learn about free software for UC Berkeley students!

Internet Connectivity

The secure wifi on campus is called AirBears2. Visit “Campus Wi-Fi Options” for details on setting it up on your laptop or mobile device.

Software

Students at UC Berkeley may download and install on a personal computer, free of charge, Adobe Creative Suite (CS) Design Premium and Microsoft Office. Visit the Adobe section of the Software Central website and the Microsoft section of the Software Central website for more details. Additional campus-provided software (including Microsoft Windows) is available with a CalNet login at Software Central.

High Performance Computing

Biostatistics Compute Cluster

The biostat cluster is managed by Systems Administrator Burke Bundy. Please contact him to request an account and login details. He is also an excellent source of information regarding the cluster and other computing issues. His tutorials on cluster usage serve as the basis of the information presented below, with supplemental details added as appropriate.

The biostat cluster runs a popular grid computing job scheduler called Sun Grid Engine (SGE). All processes that are executed on the cluster should be run through this interface. Code can be run interactively or as a batch job. To run processes interactively, use either the qlogin or qrsh command. (For interested parties, the fine details of the differences between these two commands are explained in the SGE documentation. For most purposes, either will suffice.) After creating an interactive session with one of these two commands, proceed as usual by, for example, opening a command-line R session.

To run code as a batch job, first create a bash script containing code that should be run in the shell, and then submit that job script using the qsub command. An example job script, call it example.sh, might look like the following:

#!/bin/bash
#
#$ -cwd
#$ -V
#$ -j y
#$ -S /bin/bash
#$ -m beas
#$ -M your.email@berkeley.edu
#

R --vanilla < example.R > example.Rout

This script differs from a typical shell script because it contains options to be read by qsub (the lines beginning with #$). The example options above can be interpreted as follows:

  • -cwd : Executes job from the user’s current working directory.
  • -V : Compute node on which job is run inherits your environment. Omitting this may lead to the job script not being able to find installed software or libraries, etc.
  • -j y : Error output from the job will appear in the standard output stream.
  • -S /bin/bash : Specifies that bash shell should be used as the interpreting shell for the job.
  • -m beas : A notification email will be sent at all of the following points: at the beginning and end of the job and if/when the job is aborted, rescheduled, or suspended.
  • -M your.email@berkeley.edu : The server will send notification emails to your.email@berkeley.edu.

Additional qsub options can be reviewed in the documentation. To submit the above script to a compute node:

$ qsub example.sh
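
When several R scripts each need their own job script, the template above can be generated rather than copied by hand. The following is a minimal sketch under stated assumptions: write_job_script is a hypothetical helper (it is not an SGE command), the R file name is assumed to end in .R, and the qsub options are exactly those from the example above.

```shell
# write_job_script: hypothetical helper, not part of SGE.
# Prints a job script for the given R file to standard output.
# Arguments: $1 = R file name (assumed to end in .R), $2 = notification email.
write_job_script() {
    r_file="$1"
    email="$2"
    base="${r_file%.R}"   # strip the .R extension to name the output file
    # The heredoc is unquoted so that ${...} expands; \$ keeps a literal "$"
    # in the "#$" option lines that qsub reads.
    cat <<EOF
#!/bin/bash
#
#\$ -cwd
#\$ -V
#\$ -j y
#\$ -S /bin/bash
#\$ -m beas
#\$ -M ${email}
#

R --vanilla < ${r_file} > ${base}.Rout
EOF
}

# Example use on the cluster (file names are placeholders):
#   write_job_script analysis.R your.email@berkeley.edu > analysis.sh
#   qsub analysis.sh
```

Printing to standard output rather than writing a file directly makes it easy to inspect the generated script before submitting it.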

After a job has been submitted, its progress can be reviewed like so:

$ qstat

A cryptic but important piece of information in the output of qstat is the job state. This column is labeled state, and its contents can be interpreted as follows:

  • d : a call to qdel has been used to initiate job d(eletion).
  • E : a job in E(rror) state couldn’t be started due to job properties.
  • h : job pending; currently not eligible for execution due to a h(old) state assigned to it or because the job is waiting for completion of other jobs on which it depends.
  • r : job is r(unning) or executing.
  • R : job was R(estarted) because of job migration or for other reasons.
  • s : job s(uspended) via the qmod command.
  • S : queue containing the job is S(uspended) and therefore the job is also suspended.
  • t : t(ransferring) indicates that a job is about to be executed.
  • T : at least one suspend T(hreshold) of the corresponding queue was exceeded and so the job has been suspended.
  • w : w(aiting) job pending.
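
The state column often shows several of these letters combined (for example, a held and waiting job). As a rough illustration, the list above can be turned into a small lookup. This is only a sketch: describe_state is a hypothetical helper name, and the q entry is an addition not in the list above, included on the assumption that pending jobs commonly appear as qw (queued and waiting); check the qstat documentation for your installation.

```shell
# describe_state: hypothetical helper that explains each letter of a
# qstat state string (e.g. "r", "hqw"), one letter per output line.
describe_state() {
    state="$1"
    while [ -n "$state" ]; do
        rest="${state#?}"           # everything after the first character
        letter="${state%"$rest"}"   # the first character only
        case "$letter" in
            d) echo "d: deletion requested via qdel" ;;
            E) echo "E: error, job could not be started" ;;
            h) echo "h: hold" ;;
            q) echo "q: queued (assumed; not in the list above)" ;;
            r) echo "r: running" ;;
            R) echo "R: restarted" ;;
            s) echo "s: suspended via qmod" ;;
            S) echo "S: queue suspended" ;;
            t) echo "t: transferring, about to be executed" ;;
            T) echo "T: suspend threshold exceeded" ;;
            w) echo "w: waiting" ;;
            *) echo "$letter: not in the list above" ;;
        esac
        state="$rest"
    done
}

# Example: describe_state hqw
```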

One option for the qstat command that may be useful is -f, which causes qstat to display “full” or detailed information on running jobs:

$ qstat -f

To see only your own jobs:

$ qstat -u username

substituting in your own cluster login for username. The qstat command has many more options for customizing output which can be looked up in the documentation.

To delete a running job, run:

$ qdel job#

substituting in the job number for job#, which can be retrieved using qstat.
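
When many jobs are listed, picking job numbers out of the qstat output by eye gets tedious. The sketch below filters job numbers by state. It assumes the default qstat layout (two header lines, then one row per job with the job number in column 1 and the state in column 5); job_ids_in_state is a hypothetical helper name, and the layout should be checked against the output on your cluster before relying on it.

```shell
# job_ids_in_state: hypothetical helper.
# Reads qstat output on standard input and prints the job numbers whose
# state column exactly matches the first argument.
# Assumes two header lines, job number in column 1, state in column 5.
job_ids_in_state() {
    awk -v want="$1" 'NR > 2 && $5 == want { print $1 }'
}

# Example use on the cluster (username is a placeholder):
#   qstat -u username | job_ids_in_state r
# Each printed number can then be passed to qdel as shown above.
```

Keeping the qdel step manual is deliberate here: printing the job numbers first gives a chance to confirm the selection before deleting anything.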

Statistics Compute Servers

The statistics department manages several compute servers as well as a cluster. Biostatistics graduate students are allowed accounts on the compute servers (but not the cluster); details on requesting an account are available here. More information on running your process on one of the compute servers is available here.

Preparing reports/reproducible research

Basics

The document preparation systems most commonly used by UC Berkeley biostat students are LaTeX and Markdown.

LaTeX is a typesetting package and markup language that excels at displaying symbols and mathematical equations and gives the user intricate control over the formatting of complex documents. You can read more about its advantages here.

There are many different ways to interface with LaTeX on your computer, so finding the best setup for your situation may require some trial and error. First, install a TeX distribution (or update the one you already have). Options include TeX Live (cross-platform), MiKTeX (Windows only), and MacTeX (Mac OS only). Next, choose an editor. Popular cross-platform choices are LyX, Sublime Text version 2 or 3 with the LaTeXing plugin, Emacs with the AUCTeX package, and Texmaker or its fork TeXstudio. Finally, be sure to have software installed that can read whatever type of document you compile to from LaTeX; this is usually PDF but sometimes DVI or PS.

An enormous amount of documentation, tutorial, and help (1, 2) material is available online for LaTeX users. A good way to get started is to customize a template; howtoTeX.com provides a fairly sophisticated guide and template package you can use to do just that. Also, writing even a fairly simple document with LaTeX may require calling on several packages. (More on a couple of those below.) A few popular packages to keep in your back pocket are listed here and here.

Some additional tools that may be handy when using LaTeX:

Bibliography management

The most widely accepted way to integrate references into a document is to add the references to a .bib (BibTeX) file and reference the .bib within the document’s source. However, there exist – as with most things TeX – many versions of this recipe. Rather than going into those details here, we instead point toward some resources to get you started:

  • A video tutorial (based in writeLaTeX but applicable to a local TeX installation) presenting the creation of a simple bibliography in a LaTeX document and the use of Google Scholar as a quick citation generator. (Also note that many reference managers support exporting citations to BibTeX.)
  • A more detailed introduction, along with a reminder that sometimes multiple compilations of the document will be necessary once citations are introduced. (Some “smart” editors/compilation scripts may take care of these issues automatically.)
  • For the interested reader, these responses (1, 2) explain some of the finer differences between bibliography packages and backends.
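
The "multiple compilations" mentioned above usually follow a fixed pattern when BibTeX is the backend: compile once so the citations are recorded, run bibtex to build the bibliography, then compile twice more so the references resolve. The sketch below prints that sequence for a given document. It assumes pdflatex and bibtex are the tools in use (a biber-based setup differs), and latex_build_commands is a hypothetical helper name.

```shell
# latex_build_commands: hypothetical helper.
# Prints the usual pdflatex/bibtex compilation sequence for a document,
# given its base name without the .tex extension (no spaces assumed).
latex_build_commands() {
    base="$1"
    printf '%s\n' \
        "pdflatex ${base}" \
        "bibtex ${base}" \
        "pdflatex ${base}" \
        "pdflatex ${base}"
}

# Example: review the commands, then run them by piping to a shell.
#   latex_build_commands paper
#   latex_build_commands paper | sh
```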

Presentations

The most commonly used presentation-style package in LaTeX is beamer (official CTAN page). To get you started, here are a few (of many!) resources to look over:

You will likely want to customize the look of your presentation. To choose a beamer theme, the beamer theme matrix and the beamer theme gallery sites may come in handy. There are also custom beamer themes available.

Posters

Although it may not be the “best” method for some users, it is possible to use LaTeX to produce a research poster. Here are some popular packages/classes that can be used for that purpose:

Reproducible research

The ability to reproduce a report from start (data) to finish (text, tables, figures, etc.) is made possible via the combined forces of R, LaTeX, and an R library such as knitr or its predecessor, Sweave. Presentations can also be made reproducible in the same manner by using knitr in beamer slides. Use of knitr is strongly recommended whenever possible! Detailed documentation and many examples for using knitr are available on the official sites linked to above, as well as on the knitr R library page and in the 2013 book Dynamic Documents with R and knitr.
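
As a rough sketch of what this looks like from the command line, a knitr source file (conventionally with the .Rnw extension) is first knitted into a .tex file, which is then compiled as usual. The helper below only prints the two commands; knit_build_commands is a hypothetical name, and the sketch assumes that Rscript and pdflatex are on the PATH, that the knitr library is installed, and that the base name contains no spaces.

```shell
# knit_build_commands: hypothetical helper.
# Prints the commands that turn <base>.Rnw into <base>.tex with knitr
# and then compile the result to PDF.
knit_build_commands() {
    base="$1"
    printf '%s\n' \
        "Rscript -e \"knitr::knit('${base}.Rnw')\"" \
        "pdflatex ${base}.tex"
}

# Example: review the commands, then run them by piping to a shell.
#   knit_build_commands report
#   knit_build_commands report | sh
```

If the document also has a bibliography, the extra compilation passes described under bibliography management still apply after the knitting step.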

Theses and dissertations

The ucbthesis LaTeX document class is available from the UC Berkeley Math department.

Recent Berkeley Biostatistics MA graduate Steven Pollack developed an R package in May 2014 that builds on the ucbthesis document class and includes knitr and R Markdown templates in addition to a LaTeX template. The package is available on CRAN and GitHub.

Teaching

Below is a list of courses that Biostat students have GSI’ed in the past.

  • PH 142
  • PH 241
  • PH 245
  • Stat 2, 20, 21, 131A
  • Stat 133
  • Stat 135
  • Psych 101