Case Studies Using Stata

"Stata makes a difference at the Health Policy Institute of Ohio"
"A Commentary on Stata for Marketing Analytics"
"Stata makes a difference at the National Data Bank for Rheumatic Diseases"
"A Commentary on Stata for Business"
"Stata makes a difference at the World Bank: Automated poverty analysis"


Stata makes a difference at the Health Policy Institute of Ohio

The Health Policy Institute of Ohio (HPIO) is an independent, nonpartisan, statewide center that fosters sound health policy within the state by forecasting health trends, analysing key health issues, and communicating current research to policymakers, state agencies, and other decision makers. HPIO promotes and facilitates health policy research among research centers, universities, and other organizations. It identifies gaps in health policy research and data; designs studies; leads the development of a statewide health policy research agenda; promotes collaboration among researchers; develops research projects to address health problems; and, as necessary, undertakes research directly. In addition, HPIO assists researchers in presenting important findings and serves as a network facilitator among health researchers and practitioners.

Examples of research topics include the uninsured and underinsured, health systems capacities, health safety net capacities, determinants of health, health disparities, health care reform, public health systems, family violence prevention, poverty, community health status, health information technologies, and behavioral health. Findings from various research topics are readily available at HPIO’s website, in public presentations, and by request. For all this work, HPIO primarily depends upon Stata as its analytical weapon of choice.

The main reasons HPIO uses Stata are its highly intuitive interface, its support for complex survey data, its epidemiology commands, and its support for various types of biostatistical, social-science, and econometric analyses. Some examples of how Stata has helped HPIO in its analytic needs are in analysis of the 2008 Ohio Family Health Survey (OFHS)—a complex, dual-framed survey of health systems, behaviors, and demographics of 50,944 Ohio adults—and in the Medicaid Atlas Project, which uses approximately 2,200,000 cases to examine Medicaid use in Ohio’s 88 counties.

For both projects, the expanded datasets are very large: the OFHS is approximately 300 megabytes and the Medicaid dataset is approximately 1.3 gigabytes. In the 1990s, analysing such datasets was difficult because of software and equipment limitations. Provided that enough memory is allocated at program startup, Stata/MP 11.1 easily handles the analysis of such datasets. For the OFHS, code to model the uninsured in Ohio is easily written using ado-files and do-files. The OFHS is the main source of Ohio-specific population-based health system information provided to the state’s legislators, agency heads, and health system stakeholders. Analysis of the OFHS provides Ohioans with information relating to how federal health reform will affect Ohio. Areas of interest include:

  • the characteristics of Ohio’s 1.3 million uninsured;
  • the degree to which Ohio’s uninsured will be eligible for various coverage expansions, including criteria of income, duration of being uninsured, family composition, chronic and extreme health conditions, etc.;
  • the potential costs of covering the newly insurance eligible in Ohio; and
  • select issues, such as potential crowdout, which occurs when currently insured individuals move their health insurance to a government-sponsored program.

Examination and modeling of these types of issues relies on Stata’s survey commands, which allow us to incorporate the design characteristics of the survey.
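
To illustrate why design characteristics matter, the following is a minimal sketch in Python (not Stata, and not HPIO's code) of a design-weighted proportion. The respondent flags and weights are hypothetical, and the function name weighted_proportion is introduced here only for illustration. A full survey analysis would also use strata and sampling units to compute standard errors, which this sketch does not attempt.

```python
def weighted_proportion(flags, weights):
    """Share of the population with flag == True, using sampling weights.

    Each weight is the number of people in the population that the
    respondent is assumed to represent.
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    flagged_weight = sum(w for f, w in zip(flags, weights) if f)
    return flagged_weight / total_weight


# Hypothetical data: four respondents, the first sampled at a higher rate
# (so each such respondent represents fewer people).
uninsured = [True, False, False, True]
weights = [100.0, 300.0, 300.0, 300.0]

unweighted = sum(uninsured) / len(uninsured)
weighted = weighted_proportion(uninsured, weights)
```

In this made-up example the unweighted share treats every respondent equally, while the weighted share discounts the oversampled respondent, so the two estimates differ.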

The Medicaid Atlas Project analyses Medicaid billing information to determine issues such as total Medicaid use per county and the number of physicians serving Medicaid patients in each county. The project also uses this data to monitor expenditures per Medicaid utilization category and to project the growth in average expenditure per category. Additionally, procedures are used to model relative-risk profiles of Medicaid enrollees versus nonenrollees and to model relative-risk profiles of Medicaid managed-care enrollees versus fee-for-service enrollees. Because health policy stakeholders are large contributors of health services in the state, determining the overall population-based health impact of Ohio Medicaid is very important to them. For example, Ohio has experienced a prolonged economic downturn, having lost over 560,000 jobs since 2000.

During this period, because of the State Children’s Health Insurance Program (SCHIP), Ohio’s rate of uninsured children actually decreased while the adult rate increased. Using internal data from Medicaid mixed with state-specific external data from surveys allowed us to estimate the risk buffering of children’s access to health care that is attributable to Medicaid in hard economic times.

Finally, Stata’s web-enabled interactive search capacities are often indispensable for figuring out complex data setup and analysis issues.

The Stata community, including researchers at universities, research institutes, and government agencies, is an excellent resource for figuring out problems. As an example, HPIO is participating in a project to test a concept for examining simulated benefit models for dual-frame surveys—surveys where samples are drawn independently from two overlapping sampling frames to cover the population of interest (e.g., respondents to a survey of households with landline telephones and households with both cell phones and landline telephones). The research team intends to develop a program in Stata that will enable survey researchers to determine whether to develop dual-frame or single-frame surveys rather than making sampling decisions based upon convenience.

In summary, Stata allows the Health Policy Institute of Ohio and its partners to keep an analytical edge on very complex health issues. The program is robust enough to handle very large datasets, fast enough in its MP versions to use high-end computers, and thorough enough to address epidemiology, social-science, and econometric analyses.

Timothy R. Sahr, Director of Research, The Health Policy Institute of Ohio

Director of Research, The Ohio Colleges of Medicine Government Resource Center

Reproduced with permission from The Stata News Vol 25, No 3, September 2010

Stata makes a difference at the National Data Bank for Rheumatic Diseases

Fred Wolfe, MD, leads the National Data Bank for Rheumatic Diseases (NDB). The NDB collects self-reported information directly from patients using 28-page questionnaires mailed at six-month intervals, gathering information on use of services, medical costs, financial status, functional ability, quality of life, psychological status, treatments received and their side effects, and long-term outcomes pertaining to illness, work, and death. Patients are typically referred to the NDB by their rheumatologist, and critical medical events are validated by obtaining medical records.


Unlike other data banks that collect data primarily from administrative sources such as Medicare or insurance companies, or from physicians, hospitals, and laboratories, the NDB sources its data directly from patients. This allows researchers to answer questions that are most germane to patients but that cannot be answered from the information in other databases. These questions include treatment efficacy in the community rather than in randomized clinical trials, whether patients use the treatments, whether patients report less pain, and how the disease affects patients’ daily lives.

NDB staff input the nearly 10,000 variables of data into a Microsoft SQL database. Wolfe and his team then use Stata programs to build and update the dataset, a step that is done nightly and takes about 6 hours to run. Stata programs then check data consistency. “Immortalized” datasets are created upon the completion of a six-month survey phase. The current immortalized dataset contains nearly 600,000 observations on 89,000 patients. Auxiliary programs allow database managers to apply value labels and account for missing values and allow users to extract, manipulate, and process variables of interest. All told, the NDB uses over 1,000 programs and do-files, mostly written by Wolfe or his colleague, Kaleb Michaud, a senior analyst at NDB and assistant professor of medicine at the University of Nebraska Medical Center.

The database is used to publish research about rheumatic diseases in peer-reviewed journals. Roughly 125 papers that rely on this data have been published. Because the data bank is complex and accessed by Stata commands, researchers using the NDB data typically work with a member of Wolfe’s staff. Michaud adds, “When research collaborators want to work with the data, we highly recommend that they use Stata for the analysis; serious medical students, residents, and fellows who take on research projects with me all use it.”

The NDB also maintains safety registries, long-term observational studies that monitor adverse events among patients receiving new drugs.

Getting started with Stata

Learning that many colleagues were switching to Stata, Wolfe was tempted when he discovered that data management in Stata would free him from the arduous task of writing SAS loops. Finding SAS to be expensive and bloated, and S-Plus and (later) R to be hard to learn and unwieldy for data management, he quickly fell for Stata in 1995. Wolfe writes, “Stata programs were one of the things that made Stata great for us.” Stata’s flexible programming language allowed for all sorts of contingencies in the data and facilitated reporting. Wolfe continues, “In a data bank that was always changing, Stata allowed us to do data management in a flexible, useful, cost-effective way that we couldn’t do otherwise.”

Wolfe also credits the Stata community, including the Statalist email group, the Stata Journal and Stata Technical Bulletin, and the Statistical Software Components (SSC) archive. He says that postings written by Nick Cox, a long-time member of the Stata community, taught him more about programming than he could have learned anywhere else. Wolfe summarizes the Stata community: “Today, between the manuals, the archives, and Googling Stata issues, Stata is a continuing teacher. I certainly learned what not to do.”

Stata’s role—data management, analysis, and reporting

The NDB relies heavily on Stata’s data-management facilities, including its support for ODBC connectivity, extensive macro manipulation features, and commands like egen and merge. Statistical techniques such as linear regression, logit modeling, fixed- and random-effects estimation, mixed modeling, and survival analysis are all carried out using Stata. Stata’s comprehensive graphics capabilities are a vital component of the NDB’s data-management and reporting tasks. Wolfe asks, “How could I live without Stata and the Statalist?”


Research produced with NDB data has led to important insights and real changes in recommended treatments. Among their many findings, NDB researchers were the first to show that methotrexate reduced mortality in rheumatoid arthritis (RA) patients. They demonstrated or confirmed the association between RA and heart attacks, stroke, skin cancer, and lymphomas, and they showed there is no increase in cancer and cardiovascular risk from biologic therapy of RA. The NDB documented rates of work disability among RA patients and identified predictors thereof. They published the first longitudinal study on joint replacement in RA patients, and they published the classic and definitive papers on the erythrocyte sedimentation rate in rheumatic diseases. Recently, they determined the rate of retinal toxicity of the common arthritis and lupus drug hydroxychloroquine, and their findings are now being turned into recommendations for treatment monitoring.

The NDB also used Stata to develop clinical assessments, including the HAQ-II functional questionnaire and the fibromyalgia diagnostic questionnaire. They showed that patient questionnaires could be used in clinical settings and were important in predicting outcomes. Their database has also allowed them to learn lessons that are more broadly applicable. For example, the NDB’s extensive questionnaires have led them to understand rates of nonresponse and what makes a survey question good or bad. As Wolfe describes the process, “The data bank is an epidemiologic textbook on data collection and errors, missing data, biases, causality, and on and on. Stata made it easy to learn these things.”


Stata plays a crucial role at the NDB under the direction of Fred Wolfe. From managing raw survey data to integrity checking, to advanced statistical analyses and report generation, Stata provides the tools the NDB needs to get the job done.

Brian Poi, Executive Editor and Senior Economist

Reproduced with permission from The Stata News Vol 25, No 1, March 2010

Stata makes a difference at the World Bank: Automated poverty analysis

The World Bank supports the United Nations Millennium Development Goals of eliminating poverty and providing for sustained development. To ameliorate poverty in an area, one must first know who is most affected by poverty and how poverty is distributed among society’s members.

Poverty Assessments are key to the World Bank’s poverty-reduction strategy. These reports are routinely produced for virtually every country the Bank studies. Each Poverty Assessment includes various statistics on poverty and income inequality and reports on how well each country is achieving its poverty-reduction targets. Historically, producing a Poverty Assessment for a country would involve hiring a consultant, often a newly minted PhD or a graduate student. The consultant would learn the principles of poverty analysis and write Stata programs to produce the requisite tables and graphs. This approach was prone to error because no standards were in place; instead, Stata programs and documentation were produced by people with varying degrees of skill. Methodologies and assumptions were often vague, and results were difficult to replicate. Maintaining the code and preparing data were costly procedures. To do an analysis similar to an existing one, a researcher would often have to start from scratch rather than reuse the existing code.

Poverty Analysis Toolkit

To rectify and streamline the process of producing Poverty Assessments, Michael Lokshin, a lead economist in the Development Research Group at the World Bank, and his team, including Sergiy Radyakin and Zurab Sajaia, wrote a set of ado-files to implement various poverty measurement and analysis algorithms. These ado-files eventually became known as the Poverty Analysis Toolkit, which was widely used throughout the World Bank. The popular user-written command xml_tab, available from the Statistical Software Components (SSC) archive, also grew out of this work; xml_tab allows users to save Stata results in a format that is easily incorporated into Microsoft Excel spreadsheets.

The Poverty Analysis Toolkit includes several programs for dynamic policy analysis, including commands for plotting growth incidence curves; for plotting poverty incidence, deficit, and severity curves; and for analysing the changes in poverty over time that are due to sectoral and population changes, and growth and redistribution.
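
Poverty incidence, deficit, and severity are commonly measured with the Foster-Greer-Thorbecke (FGT) family of indices, with exponent 0, 1, and 2 respectively. The following is a minimal sketch in Python (not Stata, and not the Toolkit's code) of that family, assuming the standard FGT definition. The incomes and the poverty line are hypothetical, and the function name fgt is introduced here only for illustration. The Toolkit's own implementation is not described in this article and may differ.

```python
def fgt(incomes, poverty_line, alpha):
    """Foster-Greer-Thorbecke index.

    Averages ((z - y) / z) ** alpha over the whole population, where z is
    the poverty line and y is income; people at or above the line
    contribute zero.
    """
    if poverty_line <= 0:
        raise ValueError("poverty_line must be positive")
    if not incomes:
        raise ValueError("incomes must not be empty")
    total = 0.0
    for y in incomes:
        if y < poverty_line:
            total += ((poverty_line - y) / poverty_line) ** alpha
    return total / len(incomes)


# Hypothetical incomes and poverty line, for illustration only.
incomes = [50, 75, 100, 150, 200]
z = 100

incidence = fgt(incomes, z, 0)  # share of people below the line
deficit = fgt(incomes, z, 1)    # average shortfall as a fraction of the line
severity = fgt(incomes, z, 2)   # squared shortfall, weighting the poorest more
```

Evaluating the same function over a range of poverty lines, rather than a single value of z, gives one way to trace out curves of the kind the Toolkit plots.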

The Toolkit greatly simplified the study of poverty at the World Bank by making available a standard set of Stata commands that researchers could use without having to reinvent the wheel. However, having a collection of programs instead of a single interface steepened the learning curve for new researchers and limited researchers’ ability to produce standard output that could easily be included in reports.


To make the Poverty Analysis Toolkit appeal to a wider audience, Lokshin and his team decided to combine the separate routines and to provide a single easy-to-use graphical interface. The Toolkit was renamed ADePT (which stands for Automated DEC Poverty Tables) and was quickly adopted by researchers around the world. In contrast to the Toolkit, the ADePT software, available at, is no longer a set of isolated components, but rather an integrated platform. Having an integrated platform allows the components to work together and simplifies the development of additional modules.

ADePT was developed using a combination of Stata’s ado-language, Mata, and dialog programming language, and comprises over 150,000 lines of code. Certain routines were also developed in C++ and assembly language for maximum performance and used Stata’s plug-in facilities. One example of such a routine is the usespss command, which is available from the SSC archive; this command allows Stata users to read datasets in SPSS format.

Different modules within ADePT perform an array of statistical analyses, from simple cross-tabulations to estimation of simultaneous equations via maximum likelihood. Routines allow one to estimate standard errors for many poverty and inequality measures. ADePT even makes running complex simulations easy.

To use early versions of ADePT, a researcher had to have Stata installed on his or her computer. This requirement posed an impediment to the widespread adoption of ADePT among users who did not already have Stata, especially in low-income and developing countries, where poverty research is most critical.

Numerics by Stata

In 2009, StataCorp announced Numerics by Stata, an embedded version of Stata that allows software developers to create applications in the language of their choice and to call on Numerics by Stata to perform the same types of computations and analyses that Stata users have come to depend on. Organizations use Numerics by Stata to create in-house applications with user interfaces that match their needs while still garnering all the analytic power of Stata. Those applications do not require that end-users have Stata installed on their machines.

In late 2009, the World Bank harnessed the power of Numerics by Stata and modified ADePT to work with it. As a result, researchers across the globe who did not own Stata and therefore could not use earlier versions of ADePT became able to download the new version from the Bank’s web site and begin using it immediately.

Stata’s Automation features allowed developers to create the user interface in C# using standard Windows components while continuing to perform statistical analyses using Stata’s programming language. This separation of the front end from the back end also allowed developers to create localized versions of ADePT. In addition to the English version, Bahasa Indonesian and Russian language versions are also available; work is underway to produce Spanish, Bulgarian, Romanian, Georgian, Portuguese, and other versions as well. Having versions of ADePT in local languages makes adopting ADePT easier for researchers.

Localization is not limited to changing the language used in the interface. For example, in November 2009, the Indonesia Poverty Team collaborated with ADePT developers to customize the software so that analyses could be conducted at the kabupaten (district) level. That customization allowed researchers to answer policy questions specific to Indonesia using nationally representative household survey data.

Lokshin estimates that ADePT currently has more than 1,500 registered users. As the number of users continues to grow, more testing and refinement of the software occurs. For example, when users encounter difficult cases and exceptions, the ADePT team is able to modify the code to be more robust in response to those situations. The development team also uses an automated testing procedure to ensure that results from newer versions of the program agree with those from earlier versions.

ADePT reduces the time needed for data analysis, giving users more time to focus on results. Users can also simulate results to explore the possible outcomes from various policy alternatives such as cash transfers or subsidies. Most importantly, ADePT is easy to use and, with appropriate data preparation, improves accuracy and consistency.


The World Bank is committed to reducing poverty worldwide, and Stata has played a pivotal role in this regard for many years. Stata’s flexibility helped make the Poverty Analysis Toolkit possible and improved the efficiency of researchers worldwide. World Bank developers used Stata’s programmable dialog boxes, Mata, and class system to create ADePT, which elevated poverty analysis to a higher level. The recently announced Numerics by Stata lets poverty researchers around the world conduct their analyses effectively and economically.

Brian Poi, Executive Editor and Senior Economist

Reproduced with permission from The Stata News Vol 25, No 2, June 2010
