Reviewing a Book for Computational Statistics

Thank you very much for your interest in reviewing a book for the journal "Computational Statistics." The journal is published by Springer-Verlag, currently four times per year. More information can be obtained from the journal's web pages:
Journal Home Pages
http://comst.wiwi.hu-berlin.de/
Springer-Verlag

If you are interested in reviewing a book, please contact the book review editor, Andreas Krause, at
firstname@elmo.ch (replacing firstname by andreas).

Generally you can suggest any book in the area of computational statistics. If the book review editor agrees that the book is of interest to the readers of the journal, he will arrange for the book to be sent to you. You can keep the book as an honorarium for a review that you are expected to return within a few months.

Some books are available directly. Please just contact me to review one of these publications:

Guidance on Layout and Formatting

You can write the review in plain text, LaTeX or MS Word. After a review by the book review editor it will be reformatted to match the journal's style in layout. You will get the ready-to-be-published contribution for a final review.

Make sure that you provide the following information in the header:

Book title (edition)
Book author(s)
Publisher
ISBN
List price (you could, for example, check amazon.com)
Your name as you would like it to appear in the book review
Your postal address
Your email address

If you are used to LaTeX and would like to use it, send an email to the book review editor and ask for the LaTeX style file.

You will get a PDF file as it will look in the publication for final review.

Some Guidance on the Contents

A typical book review in "Computational Statistics" is about two pages long. It can be shorter or longer if you feel the length is appropriate. A narrative is preferred to a dry listing of facts. Formulae should only be used if really necessary.

Generally, follow your own style and ideas. There is no recipe for a good book review. You are certainly encouraged to jot down your impressions and comments. A mere listing of chapter names is not very interesting. One can get that from other sources like amazon.com as well.

The following provides some guidance that you can use if you like. Some basic ingredients of a review are:

The targeted readership
Students, professionals, disciplines, etc.
The intended coverage of material
Exhaustive or focused summary of what particular topics the book covers
The overall intent of the book
What do the authors try to achieve?
Is the intent stated in the preface or on the cover of the book?
The contents
A brief listing of chapter or section titles, possibly with specific comments of yours
Your comments
Is the targeted readership appropriate? The material can be too difficult to read, too basic, not suitable for lectures, etc., or more suitable for other readers.
Is it a narrative or scientific style, lots of formulae or none at all?
How did you like reading the book?
Can you give a broader context? Can you compare the book to other books covering similar topics? Is the book a "first of its kind"?
Any comments, even personal ones, can be very useful to the readers. I would like to strongly encourage you to go beyond a mere listing of facts, such as repeating what is on the cover and listing chapter names. Give any comment you like and think about what you as a reader might like to know.
Be reasonably critical and do not recommend the book to every statistician if it is not appropriate. If you think the book is not a good book, say so and let people know why you think so.
A book review does not have to be "dead serious."
Provide bibliographic references for the sources you cite.

Example Reviews

Example 1

"Bayesian Data Analysis," Second Edition
by A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin
Chapman & Hall/CRC, 2004

Jerry R. Nedelman
Novartis Pharmaceuticals
One Health Plaza, East Hanover, NJ 07936-1080, USA
Jerry.Nedelman@pharma.novartis.com

This is a friendly book: not polemical; only as technical as it has to be, and gradually so; replete with examples that entertain and educate; with generous bibliographic notes packaged at the ends of chapters. The authors give useful, down-to-earth advice in plain language. Consider the following (p. 281):

Almost no computer program works the first time. A good way to speed the debugging process is to start with a smaller dataset, perhaps randomly sampled from the original data. This speeds computations and also makes it easier to inspect the data and inferences for possible problems. In many cases, a sample of 10 or 20 data points is enough to take us quickly through the initial stage of debugging. A related piece of advice, when running iterative simulations, is to run for only a few iterations at first, possibly 10 or 100, to check that the algorithm seems to be on the right track. There is no point in waiting an hour for the computer to get highly precise computations for the wrong model.

But the book also develops necessary theory with appropriate rigor. So it meets the authors' objectives of being simultaneously a handbook on methods and a graduate text.

The 668-page book divides into five sections plus an appendix. The authors themselves, setting their friendly, readable style already in the preface, describe the contents of the book better than a reviewer can independently (pp. xxii-xxiii):

Part I introduces the fundamental Bayesian principle of treating all unknowns as random variables and presents basic concepts, standard probability models, and some applied examples. In Chapters 1 and 2, simple familiar models using the normal, binomial, and Poisson distributions are used to establish this introductory material, as well as to illustrate concepts such as conjugate and noninformative prior distributions, including an example of a nonconjugate model. Chapter 3 presents the Bayesian approach to multiparameter problems. Chapter 4 introduces large-sample asymptotic results that lead to normal approximations to posterior distributions.

Part III covers Bayesian computation, which can be viewed as a highly specialized branch of numerical analysis: given a posterior distribution function (possibly implicitly defined), how does one extract summaries such as quantiles, moments, and modes, and draw random samples of values? We emphasize iterative methods -- the Gibbs sampler and Metropolis algorithm for drawing random samples from the posterior distribution.

Part IV discusses regression models, beginning with a Bayesian treatment of classical regression illustrated using an example from the study of elections that has both causal and predictive aspects. The subsequent chapters give general principles of hierarchical linear models, generalized linear models, and robust models.

Part V presents a range of other Bayesian probability models in more detail, with examples of multivariate models, mixtures, and nonlinear models. We conclude with methods for missing data and decision analysis, two practical concerns that arise implicitly or explicitly in many statistical problems.

Throughout, we illustrate in examples the three steps of Bayesian statistics: (1) setting up a full probability model using substantive knowledge, (2) conditioning on observed data to form a posterior inference, and (3) evaluating the fit of the model to substantive knowledge and observed data.

Appendixes provide a list of common distributions with their basic properties, a sketch of a proof of the consistency and limiting normality of Bayesian posterior distributions, and an extended example of Bayesian computation in the statistical packages Bugs and R.

Beginning Bayesians will find this book an excellent introduction to the theory and especially the practical methods of the discipline. Veterans may find value in the examples, references, data on distributions, and up-to-date coverage of modern computational techniques. The latter topic, including the 18-page appendix describing the use of Bugs and R to analyze one of the book's running examples on educational testing, may be especially attractive to readers of this journal. At a list price of $59.95 and available discounts of up to 20%, this book should be owned by every statistician. It will remain on readers' desks or nearby shelves as a frequently sought handbook.

Example 2

"Principles of Data Mining"
by David Hand, Heikki Mannila, and Padhraic Smyth
Massachusetts Institute of Technology, 2001

Don X. Sun
Statistics Research, Bell Labs
dxsun@research.bell-labs.com

This book provides a comprehensive introduction to the field of data mining. It covers the basic aspects of a typical data mining problem in data collection, data quality assurance, data organization, database management, data visualization, data analysis, and modeling as well as a list of popular data mining algorithms.

This book is an excellent source of information for those who would like to know what data mining is about. However, most of the discussions on individual methods, techniques, and tools are brief, and readers need to find other sources for more in-depth discussion on when, why, and how these tools can be used in the context of their problems. Fortunately, the "Further Reading" section in each chapter provides an excellent list of related references for readers to get further details.

Chapter 4 on "Data Analysis and Uncertainty" gives a very good summary of the fundamental concepts of statistical estimation and hypothesis testing as well as the explanation of the iterative nature of data analysis between data collection, model building, and model validation. In the discussion of statistical estimation, the two major approaches of maximum likelihood estimation and Bayesian estimation are well explained with carefully chosen examples. Readers without substantial statistical background will find this chapter very informative.

Similarly, Chapter 9 on "Descriptive Modeling" has an excellent discussion on the topic of cluster analysis. The three major types of clustering algorithms, i.e., partition-based clustering, hierarchical clustering, and probabilistic model-based clustering, are well explained with a good combination of mathematical formulation, algorithm, and illustrative examples and diagrams.

The discussion on text retrieval in Chapter 14 provides a good balance of fundamental principles and implementation details in solving the problem. Many of the concepts are well illustrated by real examples. In contrast, the discussions on image retrieval and sequence retrieval in the same chapter are relatively brief without thorough details.

Throughout the many chapters of this book, a common weakness is the missing link between the discussions of data mining methods and their applications to real-world problems. In other words, there is a lack of discussion on how the methods and tools can be applied in practical data mining tasks. Most of the discussions leave the reader unclear about the scope, applicability, and limitations of the methods and tools in dealing with large data sets. For example, in the description of the CART algorithm in Chapter 5, there is not much discussion on CART's limitations in handling large data sets. One of the assumptions mentioned is that the data should all fit in main memory. But does it matter if there are more variables in one case or more observations in another case?

The discussion of data visualization in Chapter 3 is on the simplistic side, and it does not provide convincing evidence to readers that visualization methods are actually suitable for dealing with large data sets. This viewpoint may discourage the usage of data visualization as a vital technique in data mining. In reality, visualization methods have been extremely important in analyzing massive data sets. The question is how to use them. In the data mining literature, it is not uncommon to see effective visualization techniques in displaying telephone network traffic with millions of call flows, retail transaction records from millions of customers, Internet traffic data with billions of packets, etc. The weakness of the discussion lies in the fact that it is extremely difficult to teach data mining of massive data sets from examples of small toy data sets.

This book covers an enormous number of methods in this area, and it serves as a good reference in enumerating a broad range of existing methods. Its weakness is the lack of in-depth discussion. It would be more informative if it provided more discussion of the pros and cons of the methods, the context in which they are applicable, and their practical limitations. One such example is the discussion of the hidden Markov model (HMM) in Chapter 6 for structured data. With only a very brief description of the HMM, it is difficult for a reader to grasp the essence of the method, the reason for applying such a model, and when and how to use it.

Another example is the description of the "Branch-and-Bound" algorithm. The one-page discussion in Chapter 8 is so abstract that a reader without prior knowledge of this algorithm can hardly understand what is being talked about. The important question is how much a reader can learn from the discussion beyond knowing that an algorithm called "Branch-and-Bound" exists.

In summary, this book is very well written and enjoyable to read. A great benefit of reading this book for beginners in this field is the comprehensive knowledge of the various terminologies and methods, which can greatly enhance one's ability to participate in discussions on data mining projects.

Example 3

"Applied Smoothing Techniques for Data Analysis: 
The Kernel Approach with S-Plus Illustrations" 
by Adrian W. Bowman and Adelchi Azzalini

A. J. Rossini
Department of Biostatistics, University of Washington 

Unlike most book titles, this book's title has the wonderful quality of describing its contents accurately. The focus is on kernel methods for density estimation and non-parametric regression, and for the latter, the main technique described is local-linear regression. The ordering of the book's subjects flows nicely, in that it starts with density estimation, beginning with description and exploration and ending with inference, moving to the related subject of non-parametric regression and covering subtopics in a similar manner, and finally covering the topics of parametric model evaluation, time-series methods, and semiparametric regression and additive models.

The stated goal of this book, to provide a practical introduction to smoothing methods for a statistically literate audience, is achieved. There is good motivation for the techniques as well as practical suggestions (using S-Plus) for obtaining the results. Data sets are available, and the routines (which can run under the freely available R statistical language) make the contents and methods universally accessible. I was extremely pleased with the level of the book, which manages to convey much information about smoothing methods in ways that allow them to be applied quickly.

There are a few minor problems, in terms of content and presentation. Smoothing of longitudinal data, an area of recent interest, is barely covered. I also wish for a bit more discussion on adaptive or variable bandwidth methods. In addition, the software doesn't use the more common models formulation in S-Plus. The use of "lowess" and "loess" as equivalent and interchangeable smoothing techniques is not correct.

In general, I enjoyed reading and reviewing this book, and feel that it makes a nice applied reference book. It is not appropriate for those looking for a mathematical reference to assist with research into smoothing methods. However, it serves excellently as a reference for those who need an occasional reminder on the appropriate use of smoothing methods for data analysis as well as providing a good introduction to the subject for statistically literate novices, and the provided problems are appropriate for assisting the reader in working through the material.

Example 4

"Data Mining Using SAS Applications"
by George Fernandez
Chapman & Hall/CRC, with separate CD-ROM, 2002

John Warner
Novartis Pharmaceuticals
One Health Plaza, East Hanover, NJ 07936-1080, USA
John.Warner@pharma.novartis.com

The text and accompanying CD describe and document 13 SAS macros for performing various aspects of data mining. The macros allow the user to:

- read Excel data sets into SAS (EXCEL)
- split a data set into training and validation samples (RANSPLIT)
- produce frequency data tables (FREQ) and univariate analyses (UNIVAR)
- carry out unsupervised learning using factor analysis (FACTOR) and disjoint clustering (CLUSTER)
- carry out supervised learning / scoring using linear regression (REGDIAG, RSCORE) and logistic regression (LOGISTIC, LSCORE)
- create lift charts (LIFT)
- carry out supervised learning for classification problems using discriminant analysis (DISCRIM) and classification trees (CHAID)

There is also brief discussion of (but no software for) some newer data mining techniques including neural networks, market basket analysis, and the use of data warehouses, but anyone seeking a solid introduction to these topics should look elsewhere.

Generally, all the macros except CHAID provide a friendlier front end and better presentation graphics for existing SAS procedures. The CHAID macro on the other hand fills a gap in the standard SAS offering (i.e. an implementation of classification trees) which I would argue ought to have been filled by SAS Inc. long ago.

The macros would appear to be useful for situations where a few standard analyses are performed on many very similar data sets. Perhaps in this context the author's claim (stated in the preface) that "a complete analysis can be performed in 10 minutes" may be justified, but I would not recommend that analyses of this sort be treated as anything more than quick screens, to be followed with more detailed work if interesting results are found.

There seems to be an assumption made in this book, and in other introductory works on data mining, that data mining is somehow simpler than other branches of statistics and can therefore be carried out in a cookbook fashion by people with little technical training. On the contrary, I am inclined to believe that data mining is likely to emerge as one of the more challenging subfields within statistics and that data mining tools should be built with more sophisticated users in mind.

The graphical output may be the macros' most important selling point. Graphs are generally more attractive and more informative than would be produced by most SAS users. More flexible approaches to graphical analysis will be needed, however, if data mining analyses are to be turned into actionable business intelligence.

The macros are run by executing SAS macro call files from the SAS program editor. A screen then pops up in which one fills in the blanks for all input parameters required by the analysis. The author promises that "No SAS programming experience is required." This may be true in some sense; however, some knowledge of the SAS user interface does seem to be a prerequisite. What is more, experienced SAS users will probably prefer to cut out the interactive portion of this interface in favor of calling the macros directly from a SAS command file. This latter approach has the advantages of leaving a permanent record of what has been done, reducing the number of times parameter values must be entered (and re-entered) on the fly, and allowing many similar analyses to be performed in a large batch.

The reader should be warned that carefully following the directions in the book does not guarantee that the book's example problems will work. In particular, the CHAID macro will not run from the test template on page 335 because that template contains file names with more than 6 characters, which the macro does not allow. I had to write to the author to find the fix for this bug. I was told that an explanation of this fact appears somewhere on the book's web page, but I have not been able to find it.

On the positive side, the book provides a welcome contrast to treatments of data mining that focus on only the most novel aspects of the subject. Dr. Fernandez is quite right in pointing out that a lot of data mining can be carried out by standard statistical methods in familiar packages. The book also has a healthy emphasis on the use of cross validation (a hallmark of data mining). This and other concepts are well illustrated with numerous examples. Finally, the book demonstrates that the fancy (and expensive) user interfaces sported by many data mining workbenches are not essential to the data mining enterprise and might even be counterproductive.


Andreas Krause, Book review editor of "Computational Statistics"
Last modified: August 2005