Math and Data Science
MATP 4961/6961/CS 4961/6961 Spring 2006
The ability to collect large amounts of data and to gain understanding
from it is becoming essential in
science, engineering, and business. In
the past, data was generated to investigate specific hypotheses. With the explosion of the capacity to
generate, store, and share data, data is now frequently collected
independently of any hypothesis. In
this course, we will examine how such data can be transformed into information
for decision making. We will examine
how mathematical models of data from statistics and machine learning can be
used for tackling compelling problems in science, engineering and business such
as fighting infectious diseases, designing drugs, and screening spam. Students will conduct research projects in
data science. The course will be
targeted toward advanced undergraduates but graduate credit may be obtained through
extra assignments and a more advanced project.
Prerequisites: Multivariable Calculus and a course in
probability and/or statistics, or instructor permission.
Instructor: Kristin Bennett
Bennek
at rpi dot edu
Office Hours:
Tuesday 3 to 5, AE327 or by appointment
Place:
Monday and Thursday 2 to 4 Science
2C13
Evaluation: Graduate:
Homework/Commentaries 28%, Class Presentation 14%,
Research Presentation 14%, Research Project 28%, Participation 16%.
Undergraduate:
Homework/Commentaries 28%, Research Presentation 20%, Research Project
28%, Participation 24%.
· discuss an important idea/result in a paper,
· explain why the idea/result is important,
· give thoughts on possible limitations of the work and/or how the work could be extended or applied.
The commentary should be your analysis of the
paper, not a simple restatement of its contents. You can assume the reader has read the paper
and is familiar with its contents. Do
not simply restate the abstract. Use
your own words. Correct grammar is
important and will constitute a major portion of the commentary grade. Commentaries must be typed. The
final grade will be based on the best 4 (for undergrads) or 5 (for grads)
commentaries handed in, so you may submit as many as you like. Note that some commentaries are mandatory. See syllabus.
Your
grade will be based on how correctly and completely you address the points
above as well as on readability (clarity, flow, grammar, spelling, punctuation,
etc.). Here is a rough grading guide
(grades are 0-3):
· 3: Excellent. Thoughtful, clear use
of concepts, clear evidence of incorporating ideas from the reading, creative thoughts on limitations and/or
extensions, all points are developed and supported, all requirements above
adhered to. Minimal summarizing, maximal
presentation of your thoughts. Few or no
mechanical problems (grammar, flow, etc.).
· 2: Good. Thoughtful and clear, but
connections with the reading and thoughts on limitations and/or extensions not
as strong as for a “3”. Most points
reasonably well-supported, most requirements adhered to. May replace some presentation of your
thoughts with some summarizing. May have
some mechanical problems.
· 1: Adequate. Basic response with
little or no depth, and very little evidence of careful reading of the text or
creative thought. Requirements above are probably not adhered to, and may have
substantial mechanical problems.
Resources:
Matlab tutorial :
http://www.math.ufl.edu/help/matlab-tutorial/
Wikipedia: www.wikipedia.org
Texts:
J. Ecker and M.
Kupferschmid, Introduction to Operations
Research, Krieger, 1991. Several
chapters of this book will be used, so you should buy it.
T. Mitchell, Machine
Learning, McGraw Hill, 1997.
R. Durbin, S. Eddy, A.
Krogh, and G. Mitchison, Biological
Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
Class
Schedule (subject to change):
1. 1/19/06 Challenges in Data Science, with
special focus on drug discovery.
Computers
Replace Petri Dishes,
http://news.com.com/Vision+Series+Computers+replace+petri+dishes/2030-1070_3-998622.html
Presentation/Discussion
Leader – Kristin Bennett slides
2. 1/23/06 Why is drug development so expensive?
About drug development, FAQ, PPDI.com
http://www.ppdi.com/corporate/faq/about_drug_development/home.htm
Merck estimates $2.5B impact from pulling Vioxx
plug,
Julie Appleby and Matt Krantz,
G. Banik, In silico ADME-Tox prediction: the more the merrier, Current Drug
Discovery, 2004.
http://www.currentdrugdiscovery.com/pdf/2004/537275.pdf
http://www.argentadiscovery.com/news/pdf/cdd_2003_article.pdf
Presentation/Discussion
Leader - Kristin Bennett slides
Mandatory
Commentary Due: 1/23
3. 1/26/06 How
to estimate the similarity of molecules?
Nikolova N., J. Jaworska. Approaches
to Measure Chemical Similarity - a Review. QSAR Comb. Sci. 22, No. 9-10,
1006-1026, 2003.
http://ambit.acad.bg/nina/publications/Similarity%20-%20reprint.pdf
W. Jorgensen, The Many Roles of Computation in Drug Discovery, Science, 2004
http://www.rpi.edu/~bennek/class/mds/JorgensenComputationDrugDiscovery.pdf
ADME-TOX
Outlook, Winter 05.
http://www.admetoxoutlook.com/editions/adme_outlook_nl_i01_winter_05.pdf
Guest
Presentation - N. Sukumar slides
4. 1/30/06 Regression
models: least squares models
http://en.wikipedia.org/wiki/Least_squares
Presentation/Discussion
Leader - Kristin Bennett slides
Commentary Due
5. 2/2/06 Linear
programming based models
Linear Programming, Chapter 2, in J. Ecker and
M. Kupferschmid, Introduction to
Operations Research, Krieger, 1991; pay
special attention to pp. 24-29.
Presentation/Discussion
Leader - Kristin Bennett slides
6. 2/6/06 Kernel Methods:
Nonlinear: Chapter 2, “Kernel Methods: an overview”, in J.
Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
Presentation/Discussion
Leader - Kristin Bennett
If you have not
written a commentary by now, consider this one mandatory.
7. 2/9/06
Support Vector Machine
K. Bennett and C. Campbell, “Support Vector Machines: Hype or Hallelujah?”, SIGKDD Explorations, 2:2, 2000, 1-13.
Background reading
Nonlinear Programming, Chapter 9, in J. Ecker and
M. Kupferschmid, Introduction to
Operations Research, Krieger, 1991
Presentation/Discussion
Leader – Kristin Bennett slides
8. 2/13/06
Duality
Presentation/Discussion
Leader – Kristin Bennett slides
9.
2/16/06 SVM methods for chemometrics
Presentation/Discussion
Leader - Kristin Bennett slides
10. 2/21/06
Background Mathematics and Computer Lab
HERG
research project --- models in action
Presentation/Discussion
Leader – Kristin Bennett
NOTE: no class 2/20; class meets on Tuesday instead.
11.
2/23/06 Bioinformatics and Gene
Microarrays
A Scientific Primer,
www.ncbi.nlm.nih.gov/About/primer/
Biology 101 -- revisited
http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html What is a cell?
http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html What is a genome?
Bioinformatics
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html Bioinformatics
http://www.ncbi.nlm.nih.gov/About/primer/microarrays.html Microarray Technology
Presentation/Discussion
Leader - Kristin Bennett
12.
2/27/06
Mathematical Challenges in Bioinformatics
R. Karp, Mathematical
Challenges from Genomics and Molecular Biology, Notices of
the AMS, 49(5), 544-553, 2002. http://www.cs.chalmers.se/Cs/Education/Kurser/algfk/karp.pdf
Presentation/Discussion
Leader - John P.
Commentary Due
13. 3/2/06
SVM approaches to Microarrays
Knowledge-based
analysis of microarray gene expression data by using support vector machines,
Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini,
Charles Walsh Sugnet, Terence S. Furey, Manuel Ares, Jr., David Haussler, Proc.
Natl. Acad. Sci. USA, vol. 97, pages 262-267
http://www.pnas.org/cgi/reprint/97/1/262.pdf
Presentation/Discussion
Leader – TBD
14.
3/6/06 Principal Component Analysis
A tutorial on Principal Components Analysis, Lindsay I Smith,
February 26, 2002.
www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Presentation/Discussion
Leader – Mike and Jed
15. 3/9/06
Principal components analysis to summarize microarray experiments: application to sporulation time …
Presentation/Discussion
Leader – Jed
Project Proposal
Deadline: 3/9
16. 3/20/06
Baby intro to SPAM + Naïve Bayes
A Plan for Spam, Paul Graham,
http://www.paulgraham.com/spam.html
Bayesian
Learning, Chapter 6, in T. Mitchell, Machine Learning, McGraw Hill,
1997.
Part
1: pg 154-171 Bayesian Learning background
Bayesian
Learning, Chapter 6, in T. Mitchell, Machine Learning, McGraw Hill,
1997.
Part
2: pg 177-184 Naïve Bayes
17. 3/23/06
Bayesian SPAM filters
Mehran
Sahami, Susan Dumais, David Heckerman and Eric Horvitz. “A Bayesian Approach
to Filtering Junk E-Mail.” Proceedings of AAAI-98 Workshop on Learning for
Text Categorization.
http://research.microsoft.com/pubs/view.aspx?pubid=278
Presentation/Discussion
Leader – Wenhui
18. 3/27/06 Tuberculosis Intro + EM
Algorithm for Mixture Models
Bayesian
Learning, Chapter 6, in T. Mitchell, Machine Learning, McGraw Hill,
1997.
Part
3: pg 191-199 EM algorithm
Presentation/Discussion
Leader for EM – TBD
Presentation/Discussion
Leader for TB – Prof Bennett
19. 3/30/06
Inna Vitol, Jeffrey Driscoll, Barry Kreiswirth, Natalia Kurepina, Kristin P. Bennett,
"Identifying Mycobacterium tuberculosis Complex Strain Families using Spoligotypes", Infection, Genetics, and Evolution,
to appear 2006. The SPOTCLUST program that goes with this can be found at www.rpi.edu/~bennek/EpiResearch.
Presentation/Discussion
Leader – Inna Vitol
21.
4/6/06 Integer Programming
Nonlinear Programming, Chapter 9, in J. Ecker and
M. Kupferschmid, Introduction to
Operations Research, Krieger, 1991
Presentation/Discussion
Leader – Jingye
Project Status
Report DUE
22. 4/10/06 An integer programming approach to
Sudoku
Presentation/Discussion
Leader – Susan/Alicia
23. 4/13/06
Crafting a good machine learning paper: how do you know your method is working?
24. 4/17/06
Presentation Abstract Deadline:
4/17
25. 4/20/06
Catch-up day
Math in Data Science Mini Conference: Participant presentations
4/24, 4/27, 5/1, 5/4
(Undergrads get first pick of dates)
Final Project Due: Wednesday
5/3, 5 p.m., in Prof. Bennett’s box