The database seminar is a weekly gathering of students and faculty interested in database systems. Each meeting lasts for an hour and consists of presentations of recent research papers, practice talks, or invited talks. Local speakers may choose to present their own research work or recent publications from database conferences or related fields. Students interested in database systems are strongly encouraged to attend the seminar.
In recent years, there has been an explosion of large-scale real-time analytics needs, and a plethora of streaming systems have been developed to support such applications. These systems are able to continue stream processing even when faced with hardware and software failures. However, these systems do not address some crucial challenges facing their operators: the manual, time-consuming, and error-prone tasks of tuning various configuration knobs to achieve service level objectives (SLOs), as well as the maintenance of SLOs in the face of sudden, unpredictable load variations and hardware or software performance degradation. In this talk, we introduce the notion of self-regulating streaming systems and the key properties that they must satisfy. We then present the design and evaluation of Dhalion, a system that provides self-regulation capabilities to underlying streaming systems. We describe our implementation of the Dhalion framework on top of Twitter Heron, as well as a number of policies that automatically reconfigure Heron topologies to meet throughput SLOs, scaling resource consumption up and down as needed. We are in the process of open-sourcing Dhalion as part of the Heron project.
Avrilia Floratou is a Senior Scientist at Microsoft's Cloud and Information Services Lab (CISL). Her research interests broadly lie in the area of data management, with a recent focus on large-scale stream processing and real-time analytics. She is also an active contributor to Heron, collaborating with Twitter. Prior to her current role, she was a research scientist at IBM Almaden Research Center, working on SQL-on-Hadoop engines and incorporating her research into the IBM Big SQL product offering. She received her Ph.D. and M.Sc. in Computer Science from the University of Wisconsin-Madison and her B.S. from the University of Athens in Greece.
Federated stream processing systems, which utilize nodes from multiple independent domains, are found increasingly in multi-provider cloud deployments, internet-of-things systems, collaborative sensing applications, and large-scale grid systems. To pool resources from several sites and take advantage of local processing, submitted queries are split into query fragments, which are executed collaboratively by different sites. When supporting many concurrent users, however, queries may exhaust the available processing resources, thus requiring constant load shedding. Given that individual sites have autonomy over how they allocate query fragments on their nodes, it is an open challenge to ensure global fairness in the processing quality experienced by queries in a federated scenario.
We describe THEMIS, a federated stream processing system for resource-starved, multi-site deployments. It executes queries in a globally fair fashion and provides users with constant feedback on the experienced processing quality for their queries. THEMIS associates stream data with its source information content (SIC), a metric that quantifies the contribution of that data towards the query result, based on the amount of source data used to generate it. We provide the BALANCE-SIC distributed load shedding algorithm that balances the SIC values of result data. Our evaluation shows that the BALANCE-SIC algorithm yields balanced SIC values across queries, as measured by Jain’s Fairness Index. Our approach also incurs a low execution time overhead.
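The fairness metric used in the evaluation, Jain's Fairness Index, is simple to state. Here is a minimal sketch over per-query SIC totals (the numeric values below are invented for illustration, not taken from the paper):

```python
def jains_fairness_index(values):
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2).
    Ranges from 1/n (one query receives everything) to 1.0 (perfect balance)."""
    n = len(values)
    return sum(values) ** 2 / (n * sum(v * v for v in values))

# Illustrative SIC totals accumulated by the results of four queries.
balanced = [0.25, 0.24, 0.26, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]
print(jains_fairness_index(balanced))  # close to 1.0
print(jains_fairness_index(skewed))    # noticeably lower
```

A load shedder that balances SIC values across queries drives this index toward 1.0, which is how the BALANCE-SIC evaluation quantifies global fairness.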
Harshad is a PhD candidate in the Database Systems Group working with Prof. Jignesh M. Patel on the Quickstep project. His primary research interest is in resource management for in-memory database systems, specifically in designing query schedulers.
Query optimizers and query execution engines cooperate to deliver high performance on complex analytic queries. Typically, the optimizer searches through the plan space and sends a selected plan to the execution engine. However, optimizers may at times miss the optimal plan, sometimes with disastrous impact on performance. In this paper, we develop the notion of robustness of a query evaluation strategy with respect to a space of query plans. We also propose a novel query execution strategy called Lookahead Information Passing (LIP) that is robust with respect to the space of (fully pipeline-able) left-deep query plan trees for in-memory star schema data warehouses. LIP ensures that execution times for the best and the worst case plans are far closer than without LIP. In fact, under certain assumptions of independent and uniform distributions, any plan in that space is theoretically guaranteed to execute in near-optimal time. We also evaluate these claims using workloads that include skew and correlation. With LIP we make an initial foray into a novel way of thinking about robustness from the perspective of query evaluation, where we develop strategies (like LIP) that collapse plan sub-spaces in the overall global plan space.
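The core LIP idea can be sketched in a few lines, using exact hash sets in place of the succinct Bloom filters a real engine would use; the star schema, tables, and column names below are invented for illustration:

```python
# Lookahead Information Passing (LIP), minimal sketch: build a filter over each
# dimension table's qualifying join keys, then apply all filters during the
# fact-table scan, before any join executes. Because every filter is applied
# at the scan, the join order chosen by the optimizer matters far less.

def build_filter(dim_rows, predicate, key):
    """Set of join keys from dimension rows that satisfy the local predicate."""
    return {row[key] for row in dim_rows if predicate(row)}

def lip_scan(fact_rows, filters):
    """Scan the fact table, dropping rows that miss any dimension filter.
    Probing the most selective (smallest) filter first minimizes work."""
    ordered = sorted(filters.items(), key=lambda kv: len(kv[1]))
    for row in fact_rows:
        if all(row[fk] in keys for fk, keys in ordered):
            yield row

# Hypothetical star schema: fact table `sales`, dimensions `date` and `product`.
date_dim = [{"d_key": i, "year": 2000 + i % 3} for i in range(10)]
prod_dim = [{"p_key": i, "cat": "toys" if i % 2 else "books"} for i in range(10)]
sales = [{"d_fk": i % 10, "p_fk": (i * 3) % 10, "amt": i} for i in range(100)]

filters = {
    "d_fk": build_filter(date_dim, lambda r: r["year"] == 2001, "d_key"),
    "p_fk": build_filter(prod_dim, lambda r: r["cat"] == "toys", "p_key"),
}
survivors = list(lip_scan(sales, filters))
```

In practice the filters are Bloom filters (so they stay small even for large dimension tables), and their probe order is adapted at run time based on observed hit rates, which is what collapses the gap between good and bad join orders.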
Jianqiao Zhu is a 5th-year Ph.D. student in the database group at UW-Madison, working with Prof. Jignesh Patel. His research interests include query languages, query optimization, query processing system design, and data visualization.
Navneet Potti is a 4th-year graduate student at UW-Madison, working with Prof. Jignesh Patel on the Quickstep project. His primary research interests broadly lie in all aspects of developing high-performance data analytics platforms, ranging from main memory database engines to machine learning. He is currently affiliated with Microsoft Gray Systems Lab and has spent summers doing internships at IBM Almaden as well as at Pivotal.
Decide the logistics of the spring 2017 database seminar.
Most current data cleaning solutions solve only certain steps of the cleaning pipeline, not the entire pipeline. For example, many cleaning pipelines consist of two parts: a machine part that employs a cleaning algorithm, and a human part that employs a user to manually clean up the output of the algorithm. Current work has addressed only the machine part. This raises three serious problems. First, without guidance, users often end up doing something ad hoc, suboptimal, and incorrect. Second, the machine part may optimize for the wrong goal, causing more work in the human part. Finally, focusing only on the machine part makes it very difficult to address the brittleness of the cleaning algorithm.
To address these problems, we argue for more attention to end-to-end data cleaning, focusing not just on the machine, but also on the human part. We propose an RDBMS-style solution strategy, which defines basic human operations, combines them with cleaning algorithms to form hybrid plans, estimates plan costs (e.g., in terms of human effort), then selects the best plan. As a case study, we examine the important problem of value normalization and develop an effective solution. Our work thus advocates for end-to-end data cleaning and shows that it is possible to apply an RDBMS-style approach to build complex machine-human solutions to this problem.
Motivated by a growing market that involves buying and selling data over the web, we study pricing schemes that assign value to queries issued over a database. Previous work studied pricing mechanisms that compute the price of a query by extending a data seller's explicit prices on certain queries, or investigated the properties that a pricing function should exhibit without detailing a generic construction. In this work, we present a formal framework for pricing queries over data that allows the construction of general families of pricing functions, with the main goal of avoiding arbitrage. We consider two types of pricing schemes: instance-independent schemes, where the price depends only on the structure of the query, and answer-dependent schemes, where the price also depends on the query output. Our main result is a complete characterization of the structure of pricing functions in both settings, by relating it to properties of a function over a lattice.
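To make the arbitrage condition concrete, here is a toy, instance-independent check in which "a bundle of queries determines query Q" is modeled crudely as set containment over hypothetical information items that each query reveals. This model and all names are invented for illustration; the framework in the talk defines determinacy and the lattice structure formally:

```python
from itertools import combinations

# Toy arbitrage check for a pricing scheme: if some bundle of queries reveals
# at least as much as query q but costs less than q, a buyer can arbitrage by
# purchasing the bundle instead of q.

revealed = {                      # query -> information items it reveals (toy)
    "q_age": {"age"},
    "q_zip": {"zip"},
    "q_age_zip": {"age", "zip"},
}

def has_arbitrage(price):
    """True if any query is priced above some bundle that determines it."""
    queries = list(revealed)
    for q in queries:
        for r in range(1, len(queries) + 1):
            for bundle in combinations(queries, r):
                union = set().union(*(revealed[b] for b in bundle))
                if revealed[q] <= union and sum(price[b] for b in bundle) < price[q]:
                    return True
    return False

fair = {"q_age": 2, "q_zip": 3, "q_age_zip": 4}
bad = {"q_age": 2, "q_zip": 3, "q_age_zip": 6}  # cheaper to buy the parts
```

In the `bad` scheme, buying `q_age` and `q_zip` for a total of 5 reveals everything `q_age_zip` does at price 6, which is exactly the kind of loophole an arbitrage-free pricing function must rule out.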
Shaleen is a second year grad student working with Prof. Paris Koutris. His interest is in the area of database theory.
Leaders from Intuit's Innovation & Advanced Technology Group look at the horizon in Artificial Intelligence and share what Intuit is doing in the space. With 37 million customers across personal finance, accounting, and tax products, Intuit is uniquely positioned to address this topic. Not only did Intuit use AI to create the "Tax Knowledge Engine", a system that understands the American tax code by analyzing the thirty million tax returns facilitated through TurboTax annually, but the company is also exploring a wide variety of new modes of artificial intelligence and conversational user interfaces.
Bharath Kadaba leads Intuit's Innovation and Advanced Technology group and serves as Intuit's Chief Innovation Officer. During Bharath's eight years with Intuit, he has led various teams across Intuit in exploring and creating right-for-me offerings for Global Small Business, Accountants and Consumers. Prior to Intuit, he was the SVP of Media Engineering at Yahoo, VP/GM of Application Integration at Siebel Systems, and led engineering at two high profile Internet startups in Silicon Valley. Early in his career he led research and development on Gigabit Internet Routers at the IBM T.J. Watson Labs. Bharath earned a Ph.D. in Information Theory and Networks from the University of Hawaii.
Michael J. Radwin leads Intuit's Innovation & Advanced Technology Data Science team, where he turns forward-looking data ideas into working prototypes and products. Prior to Intuit, Radwin was VP Engineering of Anchor Intelligence, which used machine learning ensemble methods to fight online advertising fraud. Earlier, Radwin was Director of Engineering at Yahoo!, where he built ad-targeting and personalization algorithms with neural networks and naive Bayesian classifiers, and scaled web platform technologies Apache and PHP.
Pari Lingampally is an alumnus of UW-Madison, who graduated with a degree in Computer Engineering in 2014. At Intuit, he has worked on multiple projects, ranging from a novel identity management system to an in-production algorithm used in TurboTax to help catch user-typed errors. More recently, he has worked on a project that experimented with using Google's TensorFlow for user prediction based on TurboTax clickstream data.
Byte-addressable, non-volatile memory (NVRAM) with close to DRAM speed is becoming a reality. Low-capacity DIMMs (10s of MBs) are already available, and high-capacity DIMMs (100s of GB) are expected in 2017. This talk is about how database systems, in particular main-memory databases, can benefit from NVRAM. It will begin with an outline of the characteristics of different types of NVRAM and how operating systems provide applications access to NVRAM. Ensuring that data structures such as indexes in NVRAM can be recovered to a consistent state without data or memory loss after a crash is challenging. The talk will discuss what causes the difficulties and how they can be overcome. It will then show how NVRAM can be used to greatly reduce the latency of commit processing and replication. By storing a main-memory database, including indexes, in NVRAM it is possible to achieve near-instant recovery after a crash. The final part of the talk will discuss how this can be achieved.
Paul has conducted research in the database field for over 35 years. He served as a Professor in the Department of Computer Science at the University of Waterloo for 15 years and as a Principal Researcher at Microsoft Research for close to 20 years. Paul is a Fellow of the ACM. He has worked in a variety of areas: file structures, materialized views, query processing, query optimization, column stores, and main-memory databases, among others. Paul collaborated closely with the SQL Server team to drastically improve SQL Server performance by adding column store indexes, a novel main-memory engine (Hekaton), and support for real-time analytics.
How far away are we from a future where a data management system sits in the critical path of everything we do? Already today we often go through a data system to perform basic tasks, e.g., to pay at the grocery store, to book a flight, to find out where our friends are, and even to get coffee. Businesses and sciences are increasingly recognizing the value of storing and analyzing vast amounts of data. Beyond the expected explosion in the number of data-driven businesses and scientific scenarios in the next few years, in this talk we also envision a future where data becomes readily available and its power can be harnessed by everyone. What both scenarios have in common is a need for new kinds of data systems that "just work": they are easy to design, easy to tune, and easy to use. We will discuss this vision and our recent efforts towards 1) adaptive data systems that can adapt to data and access patterns on-the-fly, 2) self-designing data systems that make it easy to spin off and test new data system architectures, and 3) curious data systems that make it easy to explore data even if we do not know what queries to ask.
Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory at Harvard SEAS. Stratos works on data system architectures with emphasis on how we can make it easy to design efficient data systems as applications and hardware keep evolving and on how we can make it easy to use these systems even for non-experts. For his doctoral work on Database Cracking, Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterprise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF CAREER award. In 2015 he was awarded the IEEE TCDE Early Career Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.
Cloud computing has become one of the most active areas of computer science research, in large part because it allows computing to behave like a general utility that is always available on demand. While existing cloud infrastructures and services significantly reduce application development time, cloud data management applications still require significant effort to manage their monetary cost, because this cost often depends on a number of decisions including, but not limited to, performance goals, resource provisioning, and workload allocation. These tasks depend on the application-specific workload characteristics and performance objectives, and today their implementation burden falls on the application developers.
We argue for a substantial shift away from human-crafted solutions and towards leveraging machine learning algorithms to address the above challenges. These algorithms can be trained on application-specific properties and customized performance goals to automatically learn how to provision resources as well as schedule the execution of incoming query workloads at low cost. Towards this vision, we have developed WiSeDB, a learning-based cost management service for cloud-deployed data management applications. In this talk, I will discuss how WiSeDB uses (a) supervised learning to automatically learn cost-effective models for guiding query placement, scheduling, and resource provisioning decisions for batch processing, and (b) reinforcement learning to offer low-cost online processing solutions, while being adaptive to resource availability and decoupled from notoriously inaccurate performance prediction models.
Olga Papaemmanouil has been an Assistant Professor in the Department of Computer Science at Brandeis University since January 2009. Her research interest lies in the area of data management, with a recent focus on cloud databases, data exploration, query optimization, and query performance prediction. She is the recipient of an NSF CAREER Award (2013), an ACM SIGMOD Best Demonstration Award (2005), and a Paris Kanellakis Fellowship from Brown University (2002). She received an undergraduate degree in Computer Engineering and Informatics from the University of Patras, Greece, in 1999, an Sc.M. in Information Systems from the Athens University of Economics and Business, Greece, in 2002, and a Ph.D. in Computer Science from Brown University in 2008.
Work done by Goetz Graefe and Harumi Kuno at HP and Bernhard Seeger (U Marburg).
The integrity of B-tree structures can become compromised for many reasons. Even if all programming errors in a database management system could be avoided, failures in the various layers of hardware and software beneath the database kernel can result in inconsistencies. Since these inconsistencies manifest themselves in unpredictable and sometimes disastrous ways, all commercial database management systems include mechanisms to verify the integrity and trustworthiness of an individual index and of a set of related indexes, and all vendors recommend index verification as part of regular database maintenance similar to regular backups.
Instead, or in addition, we suggest verifying the B-tree structure in each traversal. With a carefully designed tree structure and node contents, a root-to-leaf pass can verify all nodes along its path comprehensively, i.e., it can verify all B-tree invariants, including each node's consistency constraints with respect to its siblings and neighbors. Thus, instead of verifying the B-tree structure and contents offline, i.e., instead of the traditional approach, normal application execution verifies the structure frequently and efficiently. Frequent verification and fast access to log records permit automatic, reliable, and efficient recovery of the correct, up-to-date page contents. Our contribution is an index structure that gives reliable and efficient access to all relevant log records and thus enables root-cause analysis during testing and automatic recovery after deployment.
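A simplified model of in-traversal verification uses low/high fence keys in every node; during a single root-to-leaf descent, each visited node is checked against the key range its parent routes to it. The node layout and field names below are invented for illustration, and a real implementation verifies further invariants (e.g., checksums and sibling pointers):

```python
# Sketch: verify B-tree invariants during a single root-to-leaf descent.
# Each node carries low/high fence keys. A child's fences must equal the key
# range its parent routes to it, and all keys must be sorted within the fences.

class Node:
    def __init__(self, keys, children=None, low=None, high=None):
        self.keys = keys          # separator keys (internal) or entries (leaf)
        self.children = children  # None for leaf nodes
        self.low, self.high = low, high  # fence keys bounding this node

def verified_search(node, key):
    """Descend to the leaf responsible for `key`, checking invariants en route."""
    # Invariant 1: keys are sorted, with no duplicates.
    assert all(a < b for a, b in zip(node.keys, node.keys[1:])), "unsorted keys"
    # Invariant 2: every key falls within the node's fences.
    assert all((node.low is None or node.low <= k) and
               (node.high is None or k < node.high)
               for k in node.keys), "key outside fences"
    if node.children is None:
        return key in node.keys
    # Route to child i such that keys[i-1] <= key < keys[i].
    i = sum(k <= key for k in node.keys)
    child = node.children[i]
    # Invariant 3: the child's fences match the range the parent routes to it.
    expect_low = node.keys[i - 1] if i > 0 else node.low
    expect_high = node.keys[i] if i < len(node.keys) else node.high
    assert child.low == expect_low and child.high == expect_high, "fence mismatch"
    return verified_search(child, key)

leaves = [Node([1, 3], low=None, high=5), Node([5, 8], low=5, high=None)]
root = Node([5], children=leaves, low=None, high=None)
```

Because every lookup already touches a root-to-leaf path, these checks turn normal query execution into continuous, comprehensive structure verification at very little extra cost.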
Goetz Graefe recently joined Google's Madison office after almost 10 years with HP Labs, over 10 years with Microsoft SQL Server, and 7 years in academia. He graduated with an M.S. in 1984 and a Ph.D. in 1987, both from UW-Madison. He has published on database query optimization, query execution, indexing, transactions, recovery, and concurrency control; and has meandered all over from the ivory tower to release management.
The work presented today was done at HP Labs.
I will talk about three simple combinatorial ideas to speed up parallel and distributed learning algorithms. We will start off with serial equivalence in asynchronous parallel machine learning, its significance, and how we can efficiently achieve it using a recent result on graph phase transitions. We will continue on the issue of stragglers (i.e., slow nodes) in distributed systems; here, we will use erasure codes to robustify gradient based algorithms against delays. In our last example, we will reduce the high communication cost of data-shuffling in distributed learning, using the seemingly unrelated notion of coded caching. For all the above, we will see real world experiments that indicate how these simple ideas can significantly improve the speed and accuracy of large-scale learning.
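The straggler idea can be illustrated with the simplest possible erasure code, replication: each data partition is placed on two workers, so the full gradient remains recoverable despite any single slow or failed worker. This is a toy sketch assuming a least-squares model; the coding schemes in the talk are more compute- and communication-efficient than plain replication:

```python
import numpy as np

# Toy straggler-tolerant gradient aggregation via replication (a trivial
# erasure code). Each partition lives on two workers, so the full gradient
# survives any single straggler, at the cost of 2x redundant computation.

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 4))            # features, split into 3 partitions
y = rng.normal(size=9)
w = np.zeros(4)                        # current model parameters

parts = [slice(0, 3), slice(3, 6), slice(6, 9)]
placement = {0: [0, 1], 1: [1, 2], 2: [2, 0]}  # worker -> its partitions

def partial_gradient(p):
    """Least-squares gradient contribution of partition p at the current w."""
    Xp, yp = X[parts[p]], y[parts[p]]
    return Xp.T @ (Xp @ w - yp)

def aggregate(responding_workers):
    """Recover the full gradient from whichever workers responded in time."""
    grads = {}
    for wk in responding_workers:
        for p in placement[wk]:
            grads.setdefault(p, partial_gradient(p))
    assert len(grads) == len(parts), "too many stragglers: a partition was lost"
    return sum(grads.values())

full = aggregate([0, 1, 2])
without_straggler = aggregate([0, 1])  # worker 2 straggles; same gradient
```

The interesting research question, which the talk addresses with erasure codes, is how to get the same straggler tolerance with far less than 2x redundancy.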
I am an assistant professor of Electrical and Computer Engineering at the University of Wisconsin-Madison, and a faculty affiliate with the Optimization group at the Wisconsin Institute for Discovery. My research lies in the intersection of machine learning, coding theory, and distributed systems. I am particularly interested in coordination-avoiding algorithms for large-scale machine learning, and on using erasure codes to speed up distributed computation.
Before coming to Madison, I spent two wonderful years as a postdoc at UC Berkeley. At Berkeley, I was a member of the AMPLab and BLISS, and had the pleasure to collaborate with Kannan Ramchandran and Ben Recht, on several fun projects. I received my Ph.D. in 2014 from UT Austin, where I was fortunate to be advised by Alex Dimakis. Before UT, I spent 3.5 years as a grad student at USC. Before all that, I received my M.Sc. (2009) and ECE Diploma (2007) from the Technical University of Crete (TUC), located in the beautiful city of Chania.
Data Integration (DI) has long been an important topic in data management, and will become even more so in the age of Big Data and data science. Most DI work so far has focused on developing algorithms. In this talk I argue that we should devote far more effort to building DI systems, in order to truly advance the field. Toward this goal, I begin by drawing on my recent industrial experience to discuss the limitations of current DI systems. Next, I propose an agenda to build a new kind of DI system that addresses these limitations. These systems guide users through the DI workflow, step by step. They provide tools that address the "pain points" of the steps, tools that are built on top of the Python data science and big data ecosystem (PyData). I discuss how to foster an ecosystem of such tools within PyData, then use it to build DI systems for collaborative/cloud/crowd/lay user settings. As an example of such systems, I discuss "hands-off" DI systems, which solve the entire DI task using only crowdsourcing. Finally, I discuss ongoing work at Wisconsin, which suggests that these DI systems are highly promising and that building them raises many interesting research challenges. If time permits, I will discuss how this DI agenda can be built on to develop a broader R&D agenda for data science.
Rogers is a second year grad student working with Dr. Jignesh Patel. His interests are broadly in the areas of Machine Learning, NLP, and Data Science. He hopes to add some more to this bio during his time at Wisconsin.
Navneet Potti is a 4th-year graduate student at UW-Madison, working with Prof. Jignesh Patel on the Quickstep project. His primary research interests broadly lie in all aspects of developing high-performance data analytics platforms, ranging from algorithmic problems in approximate computation and robust query optimization to engineering challenges in storage and compute optimization. He also works on developing systems to simplify data management and democratize data analytics. He graduated from IIT Madras in 2010 and then worked for Goldman Sachs until 2013. He is currently affiliated with Microsoft Gray Systems Lab and has spent summers doing internships at IBM Almaden, working on their SQL-on-Hadoop system Big SQL, as well as at Pivotal, working on their Greenplum database.
In this talk, I will give a brief overview of two research topics: (1) How can we price relational data? (2) How can we design parallel algorithms for query processing tasks? If time permits, I will also talk about one of my favorite (and still open) problems: finding a fast parallel algorithm for computing connected components in a graph. Blackboard only!
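For context on the connected-components teaser, the standard data-parallel baseline is label propagation: every vertex repeatedly adopts the smallest label among itself and its neighbors, and each round can be executed in parallel over all edges. The sketch below simulates those rounds sequentially; the open problem concerns doing provably better than this style of algorithm:

```python
# Label propagation for connected components. Each round scans all edges and
# lowers labels; rounds repeat until no label changes. Within a round, every
# edge can be processed in parallel, which is why this pattern is the baseline
# in many distributed graph systems.

def connected_components(n, edges):
    labels = list(range(n))            # start: every vertex labels itself
    changed = True
    while changed:
        changed = False
        new = labels[:]
        for u, v in edges:             # one "round": all edges, in parallel
            lo = min(labels[u], labels[v])
            if new[u] > lo:
                new[u], changed = lo, True
            if new[v] > lo:
                new[v], changed = lo, True
        labels = new
    return labels

# Two components: {0, 1, 2} and {3, 4}
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```

On a path graph this takes a number of rounds proportional to the graph's diameter, which is exactly the kind of cost a faster parallel algorithm would need to beat.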
I am an assistant professor who joined the Department of Computer Sciences in fall 2015. My research focuses on the theoretical aspects of data management and, more specifically, on problems that arise in modern applications and systems for processing big data. I am particularly interested in theoretical models for parallel query processing, uncertainty in data management, and data pricing. Prior to moving to Madison, I completed my Ph.D. in Computer Science and Engineering at the University of Washington, supervised by Dan Suciu. Before that, I received a Diploma in Electrical and Computer Engineering from the National Technical University of Athens and an M.Sc. degree in Logic, Algorithms and Computation from the University of Athens.