The database seminar is a weekly gathering of students and faculty interested in database systems. Each meeting lasts for an hour and consists of presentations of recent research papers, practice talks, or invited talks. Local speakers may choose to present their own research work or recent publications from database conferences or related fields. Students interested in database systems are strongly encouraged to attend the seminar.
In this talk, I will give a brief overview of two research topics: (1) How can we price relational data? (2) How can we design parallel algorithms for query processing tasks? If time permits, I will also talk about one of my favorite (and still open) problems: finding a fast parallel algorithm for computing connected components in a graph. Blackboard only!
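To make the open problem concrete, here is a minimal sketch (not the speaker's algorithm) of round-based label propagation for connected components: each vertex repeatedly adopts the smallest label among itself and its neighbors. Every round is trivially parallelizable, but the number of rounds can grow with the graph diameter, which is exactly why finding a genuinely fast parallel algorithm remains hard.

```python
def connected_components(num_vertices, edges):
    """Label propagation: simulate synchronous parallel rounds serially."""
    labels = list(range(num_vertices))  # each vertex starts in its own component
    changed = True
    while changed:  # one "parallel round" per iteration
        changed = False
        new_labels = labels[:]
        for u, v in edges:
            low = min(labels[u], labels[v])
            if new_labels[u] > low:
                new_labels[u] = low
                changed = True
            if new_labels[v] > low:
                new_labels[v] = low
                changed = True
        labels = new_labels
    return labels

# Two components: {0, 1, 2} and {3, 4}
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```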
I am an assistant professor who joined the Department of Computer Sciences in fall 2015. My research focuses on the theoretical aspects of data management and, more specifically, on problems that arise in modern applications and systems for processing big data. I am particularly interested in theoretical models for parallel query processing, uncertainty in data management, and data pricing. Prior to moving to Madison, I completed my Ph.D. in Computer Science and Engineering at the University of Washington, supervised by Dan Suciu. Before that, I received a Diploma in Electrical and Computer Engineering from the National Technical University of Athens and an M.Sc. degree in Logic, Algorithms and Computation from the University of Athens.
Rogers is a second-year grad student working with Dr. Jignesh Patel. His interests are broadly in the areas of Machine Learning, NLP, and Data Science. He hopes to add some more to this bio during his time at Wisconsin.
Navneet Potti is a fourth-year graduate student at UW-Madison, working with Prof. Jignesh Patel on the Quickstep project. His primary research interests broadly lie in all aspects of developing high-performance data analytics platforms, ranging from algorithmic problems in approximate computation and robust query optimization to engineering challenges in storage and compute optimization. He also works on developing systems to simplify data management and democratize data analytics. He graduated from IIT Madras in 2010 and then worked for Goldman Sachs until 2013. He is currently affiliated with the Microsoft Gray Systems Lab and has spent summers doing internships at IBM Almaden, working on their SQL-on-Hadoop system Big SQL, as well as at Pivotal, working on their Greenplum database.
Data Integration (DI) has long been an important topic in data management, and will become even more so in the age of Big Data and data science. Most DI work so far has focused on developing algorithms. In this talk I argue that we should devote far more effort to building DI systems in order to truly advance the field. Toward this goal, I begin by drawing on my recent industrial experience to discuss the limitations of current DI systems. Next, I propose an agenda to build a new kind of DI system that addresses these limitations. These systems guide users through the DI workflow, step by step. They provide tools to address the "pain points" of the steps, and these tools are built on top of the Python data science and big data ecosystem (PyData). I discuss how to foster an ecosystem of such tools within PyData, then use it to build DI systems for collaborative/cloud/crowd/lay user settings. As an example of such systems, I discuss "hands-off" DI systems, which solve the entire DI task using only crowdsourcing. Finally, I discuss ongoing work at Wisconsin, which suggests that these DI systems are highly promising and that building them raises many interesting research challenges. If time permits, I will discuss how this DI agenda can be extended into a broader R&D agenda for data science.
I will talk about three simple combinatorial ideas to speed up parallel and distributed learning algorithms. We will start off with serial equivalence in asynchronous parallel machine learning, its significance, and how we can efficiently achieve it using a recent result on graph phase transitions. We will continue on the issue of stragglers (i.e., slow nodes) in distributed systems; here, we will use erasure codes to robustify gradient based algorithms against delays. In our last example, we will reduce the high communication cost of data-shuffling in distributed learning, using the seemingly unrelated notion of coded caching. For all the above, we will see real world experiments that indicate how these simple ideas can significantly improve the speed and accuracy of large-scale learning.
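The straggler idea above can be illustrated with the simplest possible erasure code, replication (a toy sketch, not the speaker's actual construction): each of three data partitions is stored on two of three workers, so the full gradient sum is recoverable from any two responding workers, and one arbitrarily slow node never delays the update.

```python
# worker -> data partitions it holds (each partition replicated twice)
ASSIGNMENT = {
    "w1": [0, 1],
    "w2": [1, 2],
    "w3": [2, 0],
}

def aggregate(partition_grads, responses):
    """Sum per-partition gradients using only the responding workers."""
    total, covered = 0.0, set()
    for worker in responses:
        for p in ASSIGNMENT[worker]:
            if p not in covered:
                total += partition_grads[p]
                covered.add(p)
    if len(covered) != len(partition_grads):
        raise RuntimeError("too many stragglers: some partition uncovered")
    return total

grads = [1.0, 2.0, 4.0]  # per-partition gradients (full sum = 7.0)
print(aggregate(grads, ["w1", "w2"]))  # w3 straggles -> still 7.0
print(aggregate(grads, ["w2", "w3"]))  # w1 straggles -> still 7.0
```

More sophisticated gradient codes combine partitions linearly at each worker to cut the storage overhead; replication is just the easiest scheme to see the recoverability guarantee.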
I am an assistant professor of Electrical and Computer Engineering at the University of Wisconsin-Madison, and a faculty affiliate with the Optimization group at the Wisconsin Institute for Discovery. My research lies in the intersection of machine learning, coding theory, and distributed systems. I am particularly interested in coordination-avoiding algorithms for large-scale machine learning, and on using erasure codes to speed up distributed computation.
Before coming to Madison, I spent two wonderful years as a postdoc at UC Berkeley. At Berkeley, I was a member of the AMPLab and BLISS, and had the pleasure to collaborate with Kannan Ramchandran and Ben Recht, on several fun projects. I received my Ph.D. in 2014 from UT Austin, where I was fortunate to be advised by Alex Dimakis. Before UT, I spent 3.5 years as a grad student at USC. Before all that, I received my M.Sc. (2009) and ECE Diploma (2007) from the Technical University of Crete (TUC), located in the beautiful city of Chania.
Work done by Goetz Graefe and Harumi Kuno (HP) and Bernhard Seeger (U. Marburg).
The integrity of B-tree structures can become compromised for many reasons. Even if all programming errors in a database management system could be avoided, failures in the various layers of hardware and software beneath the database kernel can result in inconsistencies. Since these inconsistencies manifest themselves in unpredictable and sometimes disastrous ways, all commercial database management systems include mechanisms to verify the integrity and trustworthiness of an individual index and of a set of related indexes, and all vendors recommend index verification as part of regular database maintenance similar to regular backups.
Instead, or in addition, we suggest verifying the B-tree structure during each traversal. With carefully designed tree structure and node contents, a root-to-leaf pass can verify all nodes along its path comprehensively, i.e., it can verify all B-tree invariants, including consistency constraints with respect to siblings and neighbors. Thus, instead of verifying the B-tree structure and contents offline, i.e., the traditional approach, normal application execution verifies the structure frequently and efficiently. Frequent verification and fast access to log records permit automatic, reliable, and efficient recovery of the correct, up-to-date page contents. Our contribution is an index structure that gives reliable and efficient access to all relevant log records and thus enables root cause analysis during testing and automatic recovery after deployment.
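A minimal sketch of the in-traversal idea, in the spirit of fence keys (names and layout here are illustrative, not the authors' exact design): each node carries low/high fences, and during a root-to-leaf descent we check that keys are sorted, lie within the fences, and that the chosen child's fences match the enclosing separator keys in its parent.

```python
class Node:
    def __init__(self, keys, children=None, low=None, high=None):
        self.keys, self.children = keys, children  # children is None => leaf
        self.low, self.high = low, high            # fence keys (None = unbounded)

def verified_search(node, key):
    """Root-to-leaf lookup that verifies invariants on every node it visits."""
    assert node.keys == sorted(node.keys), "keys out of order"
    assert all((node.low is None or node.low <= k) and
               (node.high is None or k < node.high)
               for k in node.keys), "key outside fences"
    if node.children is None:                      # reached a leaf
        return key in node.keys
    # pick child i such that keys[i-1] <= key < keys[i]
    i = sum(1 for k in node.keys if k <= key)
    child = node.children[i]
    # the child's fences must equal the enclosing separators of the parent
    expect_low = node.keys[i - 1] if i > 0 else node.low
    expect_high = node.keys[i] if i < len(node.keys) else node.high
    assert child.low == expect_low and child.high == expect_high, \
        "fence keys inconsistent with parent separators"
    return verified_search(child, key)

leaf1 = Node([1, 3], low=None, high=5)
leaf2 = Node([5, 8], low=5, high=None)
root = Node([5], children=[leaf1, leaf2])
print(verified_search(root, 8))  # True, with every node on the path checked
```

The point of the design is that this checking is nearly free: the descent touches exactly the same nodes it would touch anyway, so every normal lookup doubles as a partial structural audit.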
Goetz Graefe recently joined Google's Madison office after almost 10 years with HP Labs, over 10 years with Microsoft SQL Server, and 7 years in academia. He graduated with an MS in 1984 and a PhD in 1987, both from UW-Madison. He has published on database query optimization, query execution, indexing, transactions, recovery, and concurrency control, and has meandered all over from the ivory tower to release management.
The work presented today was done at HP Labs.
Cloud computing has become one of the most active areas of computer science research, in large part because it allows computing to behave like a general utility that is always available on demand. While existing cloud infrastructures and services significantly reduce application development time, cloud data management applications must still expend significant effort to manage their monetary cost, since this cost depends on a number of decisions, including but not limited to performance goals, resource provisioning, and workload allocation. These tasks depend on application-specific workload characteristics and performance objectives, and today their implementation burden falls on the application developers.
We argue for a substantial shift away from human-crafted solutions and towards leveraging machine learning algorithms to address the above challenges. These algorithms can be trained on application-specific properties and customized performance goals to automatically learn how to provision resources and schedule the execution of incoming query workloads at low cost. Towards this vision, we have developed WiSeDB, a learning-based cost management service for cloud-deployed data management applications. In this talk, I will discuss how WiSeDB uses (a) supervised learning to automatically learn cost-effective models for guiding query placement, scheduling, and resource provisioning decisions for batch processing, and (b) reinforcement learning to offer low-cost online processing solutions that adapt to resource availability while remaining decoupled from notoriously inaccurate performance prediction models.
Olga Papaemmanouil has been an Assistant Professor in the Department of Computer Science at Brandeis University since January 2009. Her research interests lie in the area of data management, with a recent focus on cloud databases, data exploration, query optimization, and query performance prediction. She is the recipient of an NSF CAREER Award (2013), an ACM SIGMOD Best Demonstration Award (2005), and a Paris Kanellakis Fellowship from Brown University (2002). She received an undergraduate degree in Computer Engineering and Informatics from the University of Patras, Greece, in 1999, an Sc.M. in Information Systems from the Athens University of Economics and Business, Greece, in 2002, and a Ph.D. in Computer Science from Brown University in 2008.
How far away are we from a future where a data management system sits in the critical path of everything we do? Already today we often go through a data system in order to do several basic tasks, e.g., to pay at the grocery store, to book a flight, to find out where our friends are and even to get coffee. Businesses and sciences are increasingly recognizing the value of storing and analyzing vast amounts of data. Other than the expected path towards an exploding number of data-driven businesses and scientific scenarios in the next few years, in this talk we also envision a future where data becomes readily available and its power can be harnessed by everyone. What both scenarios have in common is a need for new kinds of data systems that "just work"; they are easy to design, easy to tune, and easy to use. We will discuss this vision and our recent efforts towards 1) adaptive data systems that can adapt to data and access patterns on-the-fly, 2) self-designing data systems that make it easy to spin-off and test new data system architectures and 3) curious data systems that make it easy to explore data even if we do not know what queries to ask.
Stratos Idreos is an assistant professor of Computer Science at Harvard University, where he leads DASlab, the Data Systems Laboratory at Harvard SEAS. Stratos works on data system architectures, with emphasis on how we can make it easy to design efficient data systems as applications and hardware keep evolving, and on how we can make it easy to use these systems even for non-experts. For his doctoral work on Database Cracking, Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterprise System Recognition Award, a VLDB Challenges and Visions best paper award, and an NSF CAREER award. In 2015 he was awarded the IEEE TCDE Early Career Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.
Byte-addressable, non-volatile memory (NVRAM) with close to DRAM speed is becoming a reality. Low-capacity DIMMs (tens of MBs) are already available, and high-capacity DIMMs (hundreds of GBs) are expected in 2017. This talk is about how database systems, in particular main-memory databases, can benefit from NVRAM. It will begin with an outline of the characteristics of different types of NVRAM and how operating systems provide applications access to NVRAM. Ensuring that data structures such as indexes in NVRAM can be recovered to a consistent state without data or memory loss after a crash is challenging. The talk will discuss what causes the difficulties and how they can be overcome. It will then show how NVRAM can be used to greatly reduce the latency of commit processing and replication. By storing a main-memory database, including indexes, in NVRAM it is possible to achieve near-instant recovery after a crash. The final part of the talk will discuss how this can be achieved.
Paul has conducted research in the database field for over 35 years. He served as a Professor in the Department of Computer Science at the University of Waterloo for 15 years and as a Principal Researcher at Microsoft Research for close to 20 years. Paul is a Fellow of the ACM. He has worked in a variety of areas: file structures, materialized views, query processing, query optimization, column stores, and main-memory databases, among others. Paul collaborated closely with the SQL Server team to drastically improve SQL Server performance by adding column store indexes, a novel main-memory engine (Hekaton), and support for real-time analytics.
Leaders from Intuit's Innovation & Advanced Technology Group look at the horizon in Artificial Intelligence and share what Intuit is doing in the space. With 37 million customers across personal finance, accounting, and tax products, Intuit is uniquely positioned to address this topic. Not only did Intuit use AI to create the "Tax Knowledge Engine", a system that understands the American tax code by analyzing thirty million tax returns facilitated through TurboTax annually, but also the company is exploring a wide variety of new modes of artificial intelligence and conversational user interfaces.
Bharath Kadaba leads Intuit's Innovation and Advanced Technology group and serves as Intuit's Chief Innovation Officer. During Bharath's eight years with Intuit, he has led various teams across Intuit in exploring and creating right-for-me offerings for Global Small Business, Accountants, and Consumers. Prior to Intuit, he was the SVP of Media Engineering at Yahoo, VP/GM of Application Integration at Siebel Systems, and led engineering at two high-profile Internet startups in Silicon Valley. Early in his career he led research and development on Gigabit Internet routers at the IBM T.J. Watson Research Center. Bharath earned a Ph.D. in Information Theory and Networks from the University of Hawaii.
Michael J. Radwin leads Intuit's Innovation & Advanced Technology Data Science team, where he turns forward-looking data ideas into working prototypes and products. Prior to Intuit, Radwin was VP Engineering of Anchor Intelligence, which used machine learning ensemble methods to fight online advertising fraud. Earlier, Radwin was Director of Engineering at Yahoo!, where he built ad-targeting and personalization algorithms with neural networks and naive Bayesian classifiers, and scaled web platform technologies Apache and PHP.
Pari Lingampally is an alumnus of UW-Madison who graduated with a degree in Computer Engineering in 2014. At Intuit, he has worked on multiple projects, ranging from a novel identity management system to an in-production algorithm used in TurboTax to help catch user-typed errors. More recently, he has worked on a project that experimented with using Google's TensorFlow to do user prediction based on TurboTax clickstream data.
Motivated by a growing market that involves buying and selling data over the web, we study pricing schemes that assign value to queries issued over a database. Previous work studied pricing mechanisms that compute the price of a query by extending a data seller's explicit prices on certain queries, or investigated the properties that a pricing function should exhibit without detailing a generic construction. In this work, we present a formal framework for pricing queries over data that allows the construction of general families of pricing functions, with the main goal of avoiding arbitrage. We consider two types of pricing schemes: instance-independent schemes, where the price depends only on the structure of the query, and answer-dependent schemes, where the price also depends on the query output. Our main result is a complete characterization of the structure of pricing functions in both settings, by relating it to properties of a function over a lattice.
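An illustrative sketch of the no-arbitrage requirement (a toy example, not the talk's construction): if the seller explicitly prices a few views, then any query determined by a bundle of those views cannot be priced above the cheapest such bundle, or a buyer would simply purchase the bundle instead. The determinacy relation is taken as given here; reasoning about it is the hard part of the actual framework.

```python
from itertools import combinations

def arbitrage_free_price(view_prices, determines, query):
    """view_prices: {view: price}; determines(bundle, query) -> bool.

    Returns the cheapest total price over all bundles of priced views
    that determine (i.e., suffice to answer) the query.
    """
    views = list(view_prices)
    best = float("inf")
    for r in range(1, len(views) + 1):
        for bundle in combinations(views, r):
            if determines(set(bundle), query):
                best = min(best, sum(view_prices[v] for v in bundle))
    return best

# Toy example: query "Q" is answerable from {A, B} together or from {C} alone.
prices = {"A": 3.0, "B": 2.0, "C": 6.0}
det = lambda bundle, q: {"A", "B"} <= bundle or "C" in bundle
print(arbitrage_free_price(prices, det, "Q"))  # 5.0 -- cheaper than C alone
```

Charging "Q" anything above 5.0 creates arbitrage, since the buyer can reconstruct its answer from A and B for 5.0; the framework in the talk characterizes exactly which price assignments avoid this.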
Shaleen is a second-year grad student working with Prof. Paris Koutris. His interest is in the area of database theory.
Most current data cleaning solutions solve only certain steps of the cleaning pipeline, not the entire pipeline. For example, many cleaning pipelines consist of two parts: a machine part that employs a cleaning algorithm, and a human part that employs a user to manually clean up the output of the algorithm. Current work has addressed only the machine part. This raises three serious problems. First, without guidance, users often end up doing something ad hoc, suboptimal, and incorrect. Second, the machine part may optimize for the wrong goal, causing more work in the human part. Finally, focusing only on the machine part makes it very difficult to address the brittleness of the cleaning algorithm.
To address these problems, we argue for more attention to end-to-end data cleaning, focusing not just on the machine, but also on the human part. We propose an RDBMS-style solution strategy, which defines basic human operations, combines them with cleaning algorithms to form hybrid plans, estimates plan costs (e.g., in terms of human effort), then selects the best plan. As a case study, we examine the important problem of value normalization and develop an effective solution. Our work thus advocates for end-to-end data cleaning and shows that it is possible to apply an RDBMS-style approach to build complex machine-human solutions to this problem.
Decide the logistics of the spring 2017 database seminar.